working on small-worldness

main
Luca Lombardo 2 years ago
parent baf5d0ad8f
commit 1ad3590acb

@ -41,7 +41,110 @@
"G_brighkite_friends = nx.read_gpickle(os.path.join('data', 'brightkite', 'brightkite_friendships_graph.gpickle'))\n",
"G_gowalla_friends = nx.read_gpickle(os.path.join('data', 'gowalla', 'gowalla_friendships_graph.gpickle'))\n",
"G_foursquareEU_friends = nx.read_gpickle(os.path.join('data', 'foursquare', 'foursquareEU_friendships_graph.gpickle'))\n",
"G_foursquareIT_friends = nx.read_gpickle(os.path.join('data', 'foursquare', 'foursquareIT_friendships_graph.gpickle'))"
"G_foursquareIT_friends = nx.read_gpickle(os.path.join('data', 'foursquare', 'foursquareIT_friendships_graph.gpickle'))\n",
"\n",
"checkins_graphs = [G_brighkite_checkins, G_gowalla_checkins, G_foursquareEU_checkins, G_foursquareIT_checkins]\n",
"friendships_graph = [G_brighkite_friends, G_gowalla_friends, G_foursquareIT_friends, G_foursquareEU_friends]\n",
"\n",
"graphs_all = checkins_graphs + friendships_graph\n",
"\n",
"analysis_results = pd.read_pickle('analysis_results_acc.pkl')\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction\n",
"\n",
"## Graph Theory\n",
"\n",
"---\n",
"\n",
"## Aim of the project\n",
"\n",
"Given a social network, which of its nodes are more central? This question has been asked many times in sociology, psychology and computer science, and a whole plethora of centrality measures (a.k.a. centrality indices, or rankings) were proposed to account for the importance of the nodes of a network. \n",
"\n",
"These networks, typically generated directly or indirectly by human activity and interaction (and therefore hereafter dubbed social”), appear in a large variety of contexts and often exhibit a surprisingly similar structure. One of the most important notions that researchers have been trying to capture in such networks is “node centrality”: ideally, every node (often representing an individual) has some degree of influence or importance within the social domain under consideration, and one expects such importance to surface in the structure of the social network; centrality is a quantitative measure that aims at revealing the importance of a node.\n",
"\n",
"Among the types of centrality that have been considered in the literature, many have to do with distances between nodes. Take, for instance, a node in an undirected connected network: if the sum of distances to all other nodes is large, the node under consideration is peripheral; this is the starting point to define Bavelas's closeness centrality \\cite{closeness}, which is the reciprocal of peripherality (i.e., the reciprocal of the sum of distances to all other nodes). \n",
"\n",
"The role played by shortest paths is justified by one of the most well-known features of complex networks, the so-called small-world phenomenon. A small-world network is a graph where the average distance between nodes is logarithmic in the size of the network, whereas the clustering coefficient is larger (that is, neighborhoods tend to be denser) than in a random Erdős-Rényi graph with the same size and average distance. The fact that social networks (whether electronically mediated or not) exhibit the small-world property is known at least since Milgram's famous experiment \\cite{} and is arguably the most popular of all features of complex networks. For instance, the average distance of the Facebook graph was recently established to be just $4.74$.\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# The Erdős-Rényi model\n",
"\n",
"Before 1960, graph theory mainly dealt with the properties of specific individual graphs. In the 1960s, Paul Erdős and Alfred Rényi initiated a systematic study of random graphs. Random graph theory is, in fact, not the study of individual graphs, but the study of a statistical ensemble of graphs (or, as mathematicians prefer to call it, a \\emph{probability space} of graphs). The ensemble is a class consisting of many different graphs, where each graph has a probability attached to it. A property studied is said to exist with probability $P$ if the total probability of a graph in the ensemble possessing that property is $P$ (or the total fraction of graphs in the ensemble that has this property is $P$). This approach allows the use of probability theory in conjunction with discrete mathematics for studying graph ensembles. A property is said to exist for a class of graphs if the fraction of graphs in the ensemble which does not have this property is of zero measure. This is usually termed as a property of \\emph{almost every (a.e.)} graph. Sometimes the terms “almost surely” or “with high probability” are also used (with the former usually taken to mean that the residual probability vanishes exponentially with the system size). \n",
"\n",
"\n",
"## Erdős-Rényi graphs\n",
"\n",
"Two well-studied graph ensembles are $G_{N,M}$, the ensemble of all graphs with $N$ nodes and $M$ edges, and $G_{N,p}$, the ensemble of all graphs with $N$ nodes and probability $p$ of any two nodes being connected. These two families, initially studied by Erdős and Rényi, are known to be similar if $M = \\binom{N}{2} p$, so as long $p$ is not too close to $0$ or $1$ they are referred to as ER graphs. \n",
"\n",
"An important attribute of a graph is the average degree, i.e., the average number of edges connected to each node. We will denote the degree of the ith node by $k_i$ and the average degree by $ \\langle r \\rangle $ . $N$-vertex graphs with $\\langle k \\rangle = O(N^0)$ are called sparse graphs. \n",
"\n",
"An interesting characteristic of the ensemble $G_{N,p}$ is that many of its properties have a related threshold function, $p_t(N)$, such that the property exists, in the “thermodynamic limit” of $N \\to \\infty$ with probability 0 if $p < p_t$ , and with probability $1$ if $p > p_t$ . This phenomenon is the same as the physical concept of a percolation phase transition. \n",
"\n",
"Another property is the average path length between any two nodes, which in almost every graph of the ensemble (with $\\langle k \\rangle > 1$ and finite) is of order $\\ln N$ . The small, logarithmic distance is actually the origin of the “small-world” phenomena that characterize networks.\n",
"\n",
"\n",
"## Scale-free networks\n",
"\n",
"The Erdős-Rényi model has traditionally been the dominant subject of study in the field of random graphs. Recently, however, several studies of real-world networks have found that the ER model fails to reproduce many of their observed properties. One of the simplest properties of a network that can be measured directly is the degree distribution, or the fraction P(k) of nodes having k connections (degree $k$). A well-known result for ER networks is that the degree distribution is Poissonian,\n",
"\n",
"\\begin{equation}\n",
" P(k) = \\frac{e^{z} z^k}{k!}\n",
"\\end{equation}\n",
"\n",
"Where $z = \\langle k \\rangle$. is the average degree. Direct measurements of the degree distribution for real networks show that the Poisson law does not apply. Rather, often these nets exhibit a scale-free degree distribution:\n",
"\n",
"\\begin{equation}\n",
" P(k) = ck^{-\\gamma} \\quad \\text{for} \\quad k = m, ... , K\n",
"\\end{equation}\n",
"\n",
"Where $c \\sim (\\gamma -1)m^{\\gamma - 1}$ is a normalization factor, and $m$ and $K$ are the lower and upper cutoffs for the degree of a node, respectively. The divergence of moments higher then $\\lceil \\gamma -1 \\rceil$ (as $K \\to \\infty$ when $N \\to \\infty$) is responsible for many of the anomalous properties attributed to scale-free networks. \n",
"\n",
"All real-world networks are finite and therefore all their moments are finite. The actual value of the cutoff K plays an important role. It may be approximated by noting that the total probability of nodes with $k > K$ is of order $1/N$\n",
"\n",
"\\begin{equation}\n",
" \\int_K^\\infty P(k) dk \\sim \\frac{1}{N}\n",
"\\end{equation}\n",
"\n",
"This yields the result\n",
"\n",
"\\begin{equation}\n",
" K \\sim m N^{1/(\\gamma -1)}\n",
"\\end{equation}\n",
"\n",
"The degree distribution alone is not enough to characterize the network. There are many other quantities, such as the degree-degree correlation (between connected nodes), the spatial correlations, the clustering coefficient, the betweenness or central-ity distribution, and the self-similarity exponents.\n",
"\n",
"# Diameter and fractal dimension\n",
"\n",
"Regular lattices can be viewed as networks embedded in Euclidean space, of a well-defined dimension, $d$. This means that $n(r)$, the number of nodes within a distance $r$ from an origin, grows as $n(r) \\sim r^d$ (for large $r$). For fractal objects, $d$ in the last relation may be a non-integer and is replaced by the fractal dimension $d_f$ \n",
"\n",
"An example of a network where the above power laws are not valid is the Cayley tree (also known as the Bethe lattice). The Cayley tree is a regular graph, of fixed degree $z$, and no loops. An infinite Cayley tree cannot be embedded in a Euclidean space of finite dimensionality. The number of nodes at $l$ is $n(l) \\sim (z - 1)^l$ . Since the exponential growth is faster than any power law, Cayley trees are referred to as infinite-dimensional systems. \n",
"\n",
"In most random network models, the structure is locally tree-like (since most loops occur only for $n(l) \\sim N$), and since the number of nodes grows as $n(l) \\sim \\langle k - 1 \\rangle^l$, they are also infinite dimensional. As a consequence, the diameter of such graphs (i.e., the minimal path between the most distant nodes) scales as $D \\sim \\ln N$. Many properties of ER networks, including the logarithmic diameter, are also present in Cayley trees. This small diameter in ER graphs and Cayley trees is in contrast to that of finite-dimensional lattices, where $D \\sim N^{1/d_l}$. \n",
"\n",
"Similar to ER, percolation on infinite-dimensional lattices and the Cayley tree yields a critical threshold $p_c = 1/(z - 1)$. For $p > p_c$, a “giant cluster” of order $N$ exists, whereas for $p < pc$,only small clusters appear. For infinite-dimensional lattices (similar to ER networks) at criticality, $p =\n",
"p_c$ , the giant component is of size $N^{2/3}$. This last result follows from the fact that percolation on lattices in dimension $d \\geq d_c = 6$ is in the same universality class as infinite-dimensional percolation, where the fractal dimension of the giant cluster is $d_f = 4$, and therefore the size of the giant cluster scales as $N^{d_f/d_c} = N^{2/3}$. The dimension $d_c$ is called the “upper critical dimension.” Such an upper critical dimension exists not only in percolation phenomena, but also in other physical models, such as in the self-avoiding walk model for polymers and in the Ising model for magnetism; in both these cases $d_c = 4$.\n",
"\n",
"Watts and Strogatz suggested a model that retains the local high clustering of lattices (i.e., the neighbors of a node have a much higher probability of being neighbors than in random graphs) while reducing the diameter to $D \\sim \\ln N$ . This so-called, “small-world network” is achieved by replacing a fraction $\\varphi$ of the links in a regular lattice with random links, to random distant neighbors. \n",
"\n",
"## Random graphs as a model of real networks\n",
"\n",
"Many natural and man-made systems are networks, i.e., they consist of objects and interactions between them. These include computer networks, in particular the Internet, logical networks, such as links between WWW pages, and email networks, where a link represents the presence of a person's address in another person's address book. Social interactions in populations, work relations, etc. can also be modeled by a network structure. Networks can also describe possible actions or movements of a system in a configuration space (a phase space), and the nearest configurations are connected by a link. All the above examples and many others have a graph structure that can be studied. Many of them have some ordered structure, derived from geographical or geometrical considerations, cluster and group formation, or other specific properties. However, most of the above networks are far from regular lattices and are much more complex and random in structure. Therefore, it can be assumed (with a lot of precaution) that they maintain many properties of the appropriate random graph model. \n",
"\n",
"In many aspects scale-free networks can be regarded as a generalization of ER networks. For large $\\gamma$ (usually, for $\\gamma > 4$) the properties of scale-free networks, such as distances, optimal paths, and percolation, are the same as in ER networks. In contrast, for $\\gamma < 4$, these properties are very different and can be regarded as anomalous. The anomalous behavior of scale-free networks is due to the strong heterogeneity in the degree of the nodes, which breaks the node-to-node translational homogeneity (symmetry) that exists in the classical\n",
"homogeneous networks, such as lattices, Cayley trees, and ER graphs. The small variation of the degrees in the ER model or in scale-free networks with large $gamma$ is insufficient to break this symmetry, and therefore many results for ER networks are the same as for Cayley trees, where the degree of each node is the same.\n",
"\n",
"---"
]
},
{
@ -328,11 +431,11 @@
"source": [
"### Introduzione da scrivere\n",
"\n",
"qualcosa\n",
"To help us visualize the results of our analysis we can create a dataframe and fill it with all the information that we will retrive from our networks in this section.\n",
"\n",
"---\n",
"As we'll see in the cells below, the full networks are very big, even after the filtering that we did. This leads to long run times for the functions that we are going to use. To avoid this, we are going to use a sub-sample of the networks. Depending on how much we want to sample, our results will be more or less accurate. \n",
"\n",
"To help us visualize the results of our analysis we can create a dataframe and fill it with all the information that we will retrive from our networks in this section."
"What I suggest to do while reviewing this network is to use higher values for the sampling rate, so that you can see the results faster. This will give you a general idea of how the implemented functions work. Then, at the end of this section I have provided a link from my GitHub repository where you can download the results obtained with very low sampling rates. In this way you can test the functions with mock-networks and see if they work as expected, then we can proceed with the analysis using the more accurate results."
]
},
{
@ -371,38 +474,7 @@
"The degree distribution, $P(k)$, is the fraction of sites having degree $k$. We know from the literature that many real networks do not exhibit a Poisson degree distribution, as predicted in the ER model. In fact, many of them exhibit a distribution with a long, power-law, tail, $P(k) \\sim k^{-\\gamma}$ with some $γ$, usually between $2$ and 3$.\n",
"\n",
"For know, we will just compute the average degree of our networks and add it to the dataframe.\n",
"\n",
"<!-- The Erdős-Rényi model has traditionally been the dominant subject of study in the field of random graphs. Recently, however, several studies of real-world networks have found that the ER model fails to reproduce many of their observed properties. One of the simplest properties of a network that can be measured directly is the degree distribution, or the fraction $P(k)$ of nodes having k connections (degree $k$). A well-known result for ER networks is that the degree distribution is Poissonian,\n",
"\n",
"\\begin{equation}\n",
" P(k) = \\frac{e^{z} z^k}{k!}\n",
"\\end{equation}\n",
"\n",
"Where $z = \\langle k \\rangle$. is the average degree. Direct measurements of the degree distribution for real networks show that the Poisson law does not apply. Rather, often these nets exhibit a scale-free degree distribution:\n",
"\n",
"\\begin{equation}\n",
" P(k) = ck^{-\\gamma} \\quad \\text{for} \\quad k = m, ... , K\n",
"\\end{equation}\n",
"\n",
"Where $c \\sim (\\gamma -1)m^{\\gamma - 1}$ is a normalization factor, and $m$ and $K$ are the lower and upper cutoffs for the degree of a node, respectively. The divergence of moments higher then $\\lceil \\gamma -1 \\rceil$ (as $K \\to \\infty$ when $N \\to \\infty$) is responsible for many of the anomalous properties attributed to scale-free networks. \n",
"\n",
"All real-world networks are finite and therefore all their moments are finite. The actual value of the cutoff K plays an important role. It may be approximated by noting that the total probability of nodes with $k > K$ is of order $1/N$\n",
"\n",
"\\begin{equation}\n",
" \\int_K^\\infty P(k) dk \\sim \\frac{1}{N}\n",
"\\end{equation}\n",
"\n",
"This yields the result\n",
"\n",
"\\begin{equation}\n",
" K \\sim m N^{1/(\\gamma -1)}\n",
"\\end{equation}\n",
"\n",
"The degree distribution alone is not enough to characterize the network. There are many other quantities, such as the degree-degree correlation (between connected nodes), the spatial correlations, the clustering coefficient, the betweenness or central-ity distribution, and the self-similarity exponents.\n",
"\n",
"---\n",
"\n",
"Let's see if our networks are scale-free or not. We can use the `degree_distribution` function from the `utils` module to plot the degree distribution of a graph. It takes a networkx graph object as input and returns a plot of the degree distribution. We expect to see a power-law distribution and not a Poissonian one. -->"
"\n"
]
},
{
@ -413,7 +485,9 @@
"source": [
"for G in graphs_all:\n",
" avg_deg = np.mean([d for n, d in G.degree()])\n",
" analysis_results.loc[analysis_results['Graph'] == G.name, 'Average Degree'] = avg_deg"
" analysis_results.loc[analysis_results['Graph'] == G.name, 'Average Degree'] = avg_deg\n",
"\n",
"analysis_results"
]
},
{
@ -477,7 +551,7 @@
"for graph in checkins_graphs:\n",
" print(\"\\nComputing average clustering coefficient for the {}...\".format(graph.name))\n",
" start = time.time()\n",
" avg_clustering = average_clustering_coefficient(graph, 0.6)\n",
" avg_clustering = average_clustering_coefficient(graph, 0.3)\n",
" end = time.time()\n",
"\n",
" print(\"\\tAverage clustering coefficient: {}\".format(avg_clustering))\n",
@ -487,20 +561,16 @@
"for graph in friendships_graph:\n",
" print(\"\\nComputing average clustering coefficient for the {}...\".format(graph.name))\n",
" start = time.time()\n",
" avg_clustering = average_clustering_coefficient(graph, 0.2)\n",
" avg_clustering = average_clustering_coefficient(graph, 0.1)\n",
" end = time.time()\n",
"\n",
" print(\"\\tAverage clustering coefficient: {}\".format(avg_clustering))\n",
" print(\"\\tCPU time: \" + str(round(end-start,1)) + \" seconds\")\n",
" analysis_results.loc[analysis_results['Graph'] == graph.name, 'Average Clustering Coefficient'] = avg_clustering"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can use our formula to compute the clustering coefficient in a small world network"
" analysis_results.loc[analysis_results['Graph'] == graph.name, 'Average Clustering Coefficient'] = avg_clustering\n",
"\n",
"analysis_results\n",
"# save the results as pandas dataframe object\n",
"analysis_results.to_pickle('analysis_results.pkl')"
]
},
{
@ -601,7 +671,7 @@
" print(\"\\nComputing average shortest path length for graph: \", graph.name)\n",
"\n",
" start = time.time()\n",
" average_shortest_path_length = average_shortest_path(graph, 0.6)\n",
" average_shortest_path_length = average_shortest_path(graph, 0.3)\n",
" end = time.time()\n",
"\n",
" print(\"\\tAverage shortest path length: {}\".format(round(average_shortest_path_length,2)))\n",
@ -614,14 +684,18 @@
" print(\"\\nComputing average shortest path length for graph: \", graph.name)\n",
"\n",
" start = time.time()\n",
" average_shortest_path_length = average_shortest_path(graph, 0.3)\n",
" average_shortest_path_length = average_shortest_path(graph, 0.1)\n",
" end = time.time()\n",
"\n",
" print(\"\\tAverage shortest path length: {}\".format(round(average_shortest_path_length,2)))\n",
" print(\"\\tCPU time: \" + str(round(end-start,1)) + \" seconds\")\n",
"\n",
" \n",
" analysis_results.loc[analysis_results['Graph'] == graph.name, 'Average Shortest Path Length'] = average_shortest_path_length"
" analysis_results.loc[analysis_results['Graph'] == graph.name, 'Average Shortest Path Length'] = average_shortest_path_length\n",
"\n",
"analysis_results\n",
"# save the results as pandas dataframe object\n",
"analysis_results.to_pickle('analysis_results.pkl')"
]
},
{
@ -633,7 +707,7 @@
"\n",
"The importance of a node in a network depends on many factors. A website may be important due to its content, a router due to its capacity. Of course, all of these properties depend on the nature\n",
"of the studied network, and may have very little to do with the graph structure of the network. We are particularly interested in the importance of a node (or a link) due to its topological function in the network. It is reasonable to assume that the topology of a network may dictate some intrinsic importance for different nodes. One measure of centrality can be the degree of a\n",
"node. The higher the degree, the more the node is connected, and therefore, the higher is its centrality in the network. However, the degree is not the only factor determining a node's importance \\s\n",
"node. The higher the degree, the more the node is connected, and therefore, the higher is its centrality in the network. However, the degree is not the only factor determining a node's importance \n",
"\n",
"One of the most accepted definitions of centrality is based on counting paths going through a node. For each node, i, in the network, the number of “routing” paths to all other nodes (i.e., paths through which data flow) going through i is counted, and this number determines the centrality i. The most common selection is taking only\n",
"the shortest paths as the routing paths. This leads to the following definition: the \\emph{betweenness centrality} of a node, i, equals the number of shortest paths between all pairs of nodes in the network going through it, i.e.,\n",
@ -680,7 +754,7 @@
"for graph in checkins_graphs:\n",
" print(\"\\nComputing the approximate betweenness centrality for the {}...\".format(graph.name))\n",
" start = time.time()\n",
" betweenness_centrality = np.mean(list(betweenness_centrality_parallel(graph, 6, k = 0.7).values()))\n",
" betweenness_centrality = np.mean(list(betweenness_centrality_parallel(graph, 6, k = 0.3).values()))\n",
" end = time.time()\n",
" print(\"\\tBetweenness centrality: {} \".format(betweenness_centrality))\n",
" print(\"\\tCPU time: \" + str(round(end-start,1)) + \" seconds\")\n",
@ -690,24 +764,286 @@
"for graph in friendships_graph:\n",
" print(\"\\nComputing the approximate betweenness centrality for the {}...\".format(graph.name))\n",
" start = time.time()\n",
" betweenness_centrality = np.mean(list(betweenness_centrality_parallel(graph, 6, k = 0.3).values()))\n",
" betweenness_centrality = np.mean(list(betweenness_centrality_parallel(graph, 6, k = 0.1).values()))\n",
" end = time.time()\n",
" print(\"\\tBetweenness centrality: {} \".format(betweenness_centrality))\n",
" print(\"\\tCPU time: \" + str(round(end-start,1)) + \" seconds\")\n",
"\n",
" analysis_results.loc[analysis_results['Graph'] == graph.name, 'betweenness centrality'] = betweenness_centrality\n",
" \n",
"analysis_results\n",
"# save the results as pandas dataframe object\n",
"analysis_results.to_pickle('analysis_results.pkl')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"acc_res = \"some urls\"\n",
"\n",
"# download the results with wget\n",
"\n",
"# open the dataframe object\n",
"analysis_results = pd.read_pickle('analysis_results_acc.pkl')"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# The Small-World Model"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"<!-- ## The Small-World Model\n",
"\n",
"It should be clarified that real networks are not random. Their formation and development are dictated by a combination of many different processes and influences. These influencing conditions include natural limitations and processes, human considerations such as optimal performance and robustness, economic considerations, natural selection and many others. Controversies still exist regarding the measure to which random models represent real-world networks. However, in this chapter we will focus on random network models and attempt to show if their properties may still be used to study properties of real-world networks. \n",
"\n",
"Many real-world networks have many properties that cannot be explained by the ER model. One such property is the high clustering observed in many real-world networks. This led Watts and Strogatz to develop an alternative model, called the “small-world” model. Their idea was to begin with an ordered lattice, such as the \\emph{k-}ring (a ring where each site is connected to its $2k$ nearest neighbors - k from each side) or the two-dimensional lattice. For each site, each of the links emanating from it is removed with probability $\\varphi$ and is rewired to a randomly selected site in the network. A variant of this process is to add links rather than rewire, which simplifies the analysis without considerably affecting the results. The obtained network has the desirable properties of both an ordered lattice (large clustering) and a random network (small world), as we will discuss below.\n",
"\n",
"\n",
"## Clustering in a small-world network\n",
"\n",
"The simplest way to treat clustering analytically in a small-world network is to use the link addition, rather than the rewiring model. In the limit of large network size, $N \\to \\infty$, and for a fixed fraction of shortcuts $\\phi$, it is clear that the probability of forming triangle vanishes as we approach $1/N$, so the contribution of the shortcuts to the clustering is negligible. Therefore, the clustering of a small-world network is determined by its underlying ordered lattice. For example, consider a ring where each node is connected to its $k$ closest neighbors from each side. A node's number of neighbors is therefore $2k$, and thus it has $2k(2k - 1)/2 = k(2k - 1)$ pairs of neighbors. Consider a node, $i$. All of the $k$ nearest nodes on $i$'s left are connected to each other, and the same is true for the nodes on $i$'s right. This amounts to $2k(k - 1)/2 = k(k - 1)$ pairs. Now consider a node located $d$ places to the left of $k$. It is also connected to its $k$ nearest neighbors from each side. Therefore, it will be connected to $k - d$ neighbors on $i$'s right side. The total number of connected neighbor pairs is\n",
"\n",
"\\begin{equation}\n",
" k(k-1) + \\sum_{d=1}^k (k-d) = k(k-1) + \\frac{k(k-1)}{2} = \\frac{{3}{2}} k (k-1)\n",
"\\end{equation}\n",
"\n",
"and the clustering coefficient is:\n",
"\n",
"\\begin{equation}\n",
" C = \\frac{\\frac{3}{2}k(k-1)}{k(2k-1)} =\\frac{3 (k-1)}{2(2k-1)}\n",
"\\end{equation}\n",
"\n",
"For every $k > 1$, this results in a constant larger than $0$, indicating that the clustering of a small-world network does not vanish for large networks. For large values of $k$, the clustering coefficient approaches $3/4$, that is, the clustering is very high. Note that for a regular two-dimensional grid, the clustering by definition is zero, since no triangles exist. However, it is clear that the grid has a neighborhood structure.\n",
"\n",
"## Distances in a small-world network}\n",
"\n",
"The second important property of small-world networks is their small diameter, i.e., the small distance between nodes in the network. The distance in the underlying lattice behaves as the linear length of the lattice, L. Since $N \\sim L^d$ where $d$ is the lattice dimension, it follows that the distance between nodes behaves as:\n",
"\n",
"\\begin{equation}\n",
" l \\sim L \\sim N^{1/d}\n",
"\\end{equation}\n",
"\n",
"Therefore, the underlying lattice has a finite dimension, and the distances on it behave as a power law of the number of nodes, i.e., the distance between nodes is large. However, when adding even a small fraction of shortcuts to the network, this behavior changes dramatically. \n",
"\n",
"Let's try to deduce the behavior of the average distance between nodes. Consider a small-world network, with dimension d and connecting distance $k$ (i.e., every node is connected to any other node whose distance from it in every linear dimension is at most $k$). Now, consider the nodes reachable from a source node with at most $r$ steps. When $r$ is small, these are just the \\emph{r-th} nearest neighbors of the source in the underlying lattice. We term the set of these neighbors a “patch”. the radius of which is $kr$ , and the number of nodes it contains is approximately $n(r) = (2kr)d$. \n",
"\n",
"We now want to find the distance r for which such a patch will contain about one shortcut. This will allow us to consider this patch as if it was a single node in a randomly connected network. Assume that the probability for a single node to have a shortcut is $\\Phi$. To find the length for which approximately one shortcut is encountered, we need to solve for $r$ the following equation: $(2kr)^d \\Phi = 1$. The correlation length $\\xi$ defined as the distance (or linear size of a patch) for which a shortcut will be encountered with high probability is therefore,\n",
"\n",
"\\begin{equation}\n",
" \\xi = \\frac{1}{k \\Phi^{1/d}}\n",
"\\end{equation}\n",
"\n",
"Note that we have omitted the factor 2, since we are interested in the order of magnitude. Let us denote by $V(r)$ the total number of nodes reachable from a node by at most $r$ steps, and by $a(r)$, the number of nodes added to a patch in the \\emph{r-th} step. That is, $a(r) = n(r) - n(r-1)$. Thus,\n",
"\n",
"\\begin{equation}\n",
" a(r) \\sim \\frac{\\text{d} n(r)}{\\text{d} r} = 2kd(2kr)^{d-1}\n",
"\\end{equation}\n",
"\n",
"When a shortcut is encountered at the r step from a node, it leads to a new patch \\footnote{It may actually lead to an already encountered patch, and two patches may also merge after some steps, but this occurs with negligible probability when $N \\to \\infty$ until most of the network is reachable}. This new patch occurs after $r'$ steps, and therefore the number of nodes reachable from its origin is $V (r - r')$. Thus, we obtain the recursive relation\n",
"\n",
"\\begin{equation} \n",
" V(r) = \\sum_{r'=0}^r a(r') [1 + \\xi^{-d}V(r-r')]\n",
"\\end{equation}\n",
"\n",
"where the first term stands for the size of the original patch, and the second term is derived from the probability of hitting a shortcut, which is approximately $\\xi -d $ for every new node encountered. To simplify the solution of \\ref{eq:recursion}, it can be approximated by a differential equation. The sum can be approximated by an integral, and then the equation can be differentiated with respect to $r$ . For simplicity, we will concentrate here on the solution for the one-dimensional case, with $k = 1$, where $a(r) = 2$. Thus, one obtains\n",
"\n",
"\\begin{equation}\n",
" \\frac{\\text{d} V(r)}{\\text{d} r} = 2 [1 + V(r)/\\xi]\n",
"\\end{equation}\n",
"\n",
"the solution of which is:\n",
"\n",
"\\begin{equation} \n",
" V(r) = \\xi \\left(e^{2r/\\xi} -1\\right)\n",
"\\end{equation}\n",
"\n",
"For $r \\ll \\xi$, the exponent can be expanded in a power series, and one obtains $V(r) \\sim 2r = n(r)$, as expected, since usually no shortcut is encountered. For $r \\ gg \\xi$, $V(r)$. An approximation for the average distance between nodes can be obtained by equating $V(r)$ from to the total number of nodes, $V(r) = N$. This results in\n",
"\n",
"\\begin{equation}=\n",
" r \\sim \\frac{\\xi}{2} \\ln \\frac{N}{\\xi}\n",
"\\end{equation}\n",
"\n",
"As apparent from \\ref{eq:average distance}, the average distance in a small-world network behaves as the distance in a random graph with patches of size $\\xi$ behaving as the nodes of the random graph. -->\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Detecting Small-Worldness"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As we have seen, many real technological, biological, social, and information networks fall into the broad class of _small-world_ networks, a middle ground between regular and random networks: they have high local clustering of elements, like regular networks, but also short path lengths between elements, like random networks. Membership of the small-world network class also implies that the corresponding systems have dynamic properties different from those of equivalent random or regular networks. \n",
"\n",
"However, the existing _small-world_ definition is a categorical one, and breaks the continuum of network topologies into the three classes of regular, random, and small-world networks, with the latter being the broadest. It is unclear to what extent the real-world systems in the small-world class have common network properties and to what specific point in the \\emph{middle-ground} (between random and regular) a network generating model must be tuned to genuinely capture the topology of such systems. \n",
"\n",
"The current _state of the art_ algorithm in the field of small-world network analysis is based on the idea that small-world networks should have some topological structure, reflected by properties such as an high clustering coefficient. On the other hand, random networks (as the Erd ̋os-Rényi model) have no such structure and, usually, a low clustering coefficient. The current \\emph{state of the art} algorithms can be empirically described in the following steps:\n",
"\n",
"\n",
"* Compute the average shortest path length $L$ and the average clustering coefficient $C$ of the target system.\n",
"* Create an ensemble of random networks with the same number of nodes and edges as the target system. Usually, the random networks are generated using the Erd ̋os-Rényi model.\n",
"* Compute the average shortest path length $L_r$ and the average clustering coefficient $C_r$ of each random network in the ensemble.\n",
"* Compute the normalized average shortest path length $\\lambda := L/L_n$ and the normalized average clustering coefficient $\\gamma := C/C_n$\n",
"* If $\\lambda$ and $\\gamma$ are close to 1, then the target system is a small-world network.\n",
"\n",
"\n",
"One of the problems with this interpretations is that we have no information on how the average shortest path scales with the network size. Specifically, a small-world network is defined to be a network where the typical distance $L$ between two randomly chosen nodes (the number of steps required) grows proportionally to the logarithm of the number of nodes $N$ in the network.\n",
"$$ L \\propto N $$\n",
"But since we are working with a real-world network, there is no such thing as \"same network with different number of nodes\". So this definition, can't be applied in this case. \n",
"\n",
"Furthermore, let's try to take another approach. We can consider a definition of small-world network that it's not directly depend of $\\gamma$ and $\\lambda$, e.g:\n",
"\n",
"> _A small-world network is a spatial network with added long-range connections_\n",
"\n",
"Then we still cannot make robust implications as to whether such a definition is fulfilled just using $\\gamma$ and $\\lambda$ (or in fact other network measures). The interpretation of many studies assumes that all networks are a realization of the Watts-Strogatz model for some rewiring probability, which is not justified at all! We know many other network models, whose realizations are entirely different from the Watts-Strogatz model. \n",
"\n",
"The above method is not robust to measurement errors. Small errors when establishing a network from measurements suffice to make, e.g., a lattice look like a small-world network. \n",
"\n",
"<!-- See \\cite{https://doi.org/10.48550/arxiv.1111.4570} and \\cite{10.3389/fnhum.2016.00096}. -->"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# open the dataframe object\n",
"analysis_results = pd.read_pickle('analysis_results.pkl')\n",
"analysis_results"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Distribution of Degree\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The Erdős-Rényi model has traditionally been the dominant subject of study in the field of random graphs. Recently, however, several studies of real-world networks have found that the ER model fails to reproduce many of their observed properties. One of the simplest properties of a network that can be measured directly is the degree distribution, or the fraction $P(k)$ of nodes having k connections (degree $k$). A well-known result for ER networks is that the degree distribution is Poissonian,\n",
"\n",
"\\begin{equation}\n",
" P(k) = \\frac{e^{z} z^k}{k!}\n",
"\\end{equation}\n",
"\n",
"Where $z = \\langle k \\rangle$. is the average degree. Direct measurements of the degree distribution for real networks show that the Poisson law does not apply. Rather, often these nets exhibit a scale-free degree distribution:\n",
"\n",
"\\begin{equation}\n",
" P(k) = ck^{-\\gamma} \\quad \\text{for} \\quad k = m, ... , K\n",
"\\end{equation}\n",
"\n",
"Where $c \\sim (\\gamma -1)m^{\\gamma - 1}$ is a normalization factor, and $m$ and $K$ are the lower and upper cutoffs for the degree of a node, respectively. The divergence of moments higher then $\\lceil \\gamma -1 \\rceil$ (as $K \\to \\infty$ when $N \\to \\infty$) is responsible for many of the anomalous properties attributed to scale-free networks. \n",
"\n",
"All real-world networks are finite and therefore all their moments are finite. The actual value of the cutoff K plays an important role. It may be approximated by noting that the total probability of nodes with $k > K$ is of order $1/N$\n",
"\n",
"\\begin{equation}\n",
" \\int_K^\\infty P(k) dk \\sim \\frac{1}{N}\n",
"\\end{equation}\n",
"\n",
"This yields the result\n",
"\n",
"\\begin{equation}\n",
" K \\sim m N^{1/(\\gamma -1)}\n",
"\\end{equation}\n",
"\n",
"---\n",
"\n",
"Let's see if our networks are scale-free or not. We can use the `degree_distribution` function from the `utils` module to plot the degree distribution of a graph. It takes a networkx graph object as input and returns a plot of the degree distribution. We expect to see a power-law distribution and not a Poissonian one."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# save the results as pandas dataframe object\n",
"analysis_results.to_pickle('analysis_results.pkl')"
"for G in checkins_graphs:\n",
" degree_distribution(G)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for graph in friendships_graph:\n",
" degree_distribution(graph)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can clearly see from the graphs obtained, the degree distribution of the networks is not Poissonian, but rather scale-free. This is a good indication that the networks are not random, but rather small-world.\n",
"\n",
"Let's try to plot the distribution degree of a random Erdos-Renyi graph with the same number of nodes and a probability of edge creation equal to the number of edges of the network divided by the number of possible edges. We expect to see a Poissonian distribution.\n",
"\n",
"> This is a time saving approach, NOT a rigorous one. If we want to be rigorous, should follow the algorithm proposed by Maslov and Sneppen, implemented in the the networkx function `random_reference`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# for each network, create a erdos-renyi graph with the same number of nodes and edges \n",
"\n",
"for graph in checkins_graphs:\n",
" G = nx.erdos_renyi_graph(graph.number_of_nodes(), graph.number_of_nodes()/graph.number_of_edges())\n",
" G.name = graph.name + \" Erdos-Renyi\"\n",
" print(G.name)\n",
" print(\"Number of nodes: \", G.number_of_nodes())\n",
" print(\"Number of edges: \", G.number_of_edges())\n",
" degree_distribution(G)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for graph in friendships_graph:\n",
" G = nx.erdos_renyi_graph(graph.number_of_nodes(), graph.number_of_nodes()/graph.number_of_edges())\n",
" G.name = graph.name + \" Erdos-Renyi\"\n",
" print(G.name)\n",
" print(\"Number of nodes: \", G.number_of_nodes())\n",
" print(\"Number of edges: \", G.number_of_edges())\n",
" degree_distribution(G)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a Poissonian distribution, as expected."
]
},
{
@ -715,7 +1051,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Small-Worldness"
"The degree distribution alone is not enough to characterize the network. There are many other quantities, such as the degree-degree correlation (between connected nodes), the spatial correlations, the clustering coefficient, the betweenness or central-ity distribution, and the self-similarity exponents."
]
}
],
@ -735,7 +1071,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
"version": "3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0]"
},
"orig_nbformat": 4,
"vscode": {

@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
@ -28,9 +28,170 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 3,
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Graph</th>\n",
" <th>Number of Nodes</th>\n",
" <th>Number of Edges</th>\n",
" <th>Average Degree</th>\n",
" <th>Average Clustering Coefficient</th>\n",
" <th>log N</th>\n",
" <th>Average Shortest Path Length</th>\n",
" <th>betweenness centrality</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Brightkite Checkins Graph</td>\n",
" <td>7191</td>\n",
" <td>3663807</td>\n",
" <td>1018.997914</td>\n",
" <td>0.702854</td>\n",
" <td>8.880586</td>\n",
" <td>2.411011</td>\n",
" <td>0.00022</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Gowalla Checkins Graph</td>\n",
" <td>10702</td>\n",
" <td>303104</td>\n",
" <td>56.644366</td>\n",
" <td>0.505597</td>\n",
" <td>9.278186</td>\n",
" <td>5.222903</td>\n",
" <td>0.000301</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Foursquare EU Checkins Graph</td>\n",
" <td>20282</td>\n",
" <td>7430376</td>\n",
" <td>732.706439</td>\n",
" <td>0.597097</td>\n",
" <td>9.917489</td>\n",
" <td>2.2843</td>\n",
" <td>0.000089</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Foursquare IT Checkins Graph</td>\n",
" <td>3730</td>\n",
" <td>629749</td>\n",
" <td>337.667024</td>\n",
" <td>0.683565</td>\n",
" <td>8.224164</td>\n",
" <td>2.185477</td>\n",
" <td>0.000428</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Brightkite Friendship Graph</td>\n",
" <td>5928</td>\n",
" <td>34673</td>\n",
" <td>11.698043</td>\n",
" <td>0.219749</td>\n",
" <td>8.687442</td>\n",
" <td>5.052162</td>\n",
" <td>0.000448</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>(Filtered) Gowalla Friendship Graph</td>\n",
" <td>8396</td>\n",
" <td>29122</td>\n",
" <td>6.937113</td>\n",
" <td>0.217544</td>\n",
" <td>9.035511</td>\n",
" <td>4.558532</td>\n",
" <td>0.000357</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Foursquare IT Friendship Graph</td>\n",
" <td>2073</td>\n",
" <td>6217</td>\n",
" <td>5.99807</td>\n",
" <td>0.148489</td>\n",
" <td>7.636752</td>\n",
" <td>19.530752</td>\n",
" <td>0.000879</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Foursquare EU Friendship Graph</td>\n",
" <td>16491</td>\n",
" <td>59419</td>\n",
" <td>7.206234</td>\n",
" <td>0.167946</td>\n",
" <td>9.710570</td>\n",
" <td>23.713864</td>\n",
" <td>0.000272</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Graph Number of Nodes Number of Edges \\\n",
"0 Brightkite Checkins Graph 7191 3663807 \n",
"1 Gowalla Checkins Graph 10702 303104 \n",
"2 Foursquare EU Checkins Graph 20282 7430376 \n",
"3 Foursquare IT Checkins Graph 3730 629749 \n",
"4 Brightkite Friendship Graph 5928 34673 \n",
"5 (Filtered) Gowalla Friendship Graph 8396 29122 \n",
"6 Foursquare IT Friendship Graph 2073 6217 \n",
"7 Foursquare EU Friendship Graph 16491 59419 \n",
"\n",
" Average Degree Average Clustering Coefficient log N \\\n",
"0 1018.997914 0.702854 8.880586 \n",
"1 56.644366 0.505597 9.278186 \n",
"2 732.706439 0.597097 9.917489 \n",
"3 337.667024 0.683565 8.224164 \n",
"4 11.698043 0.219749 8.687442 \n",
"5 6.937113 0.217544 9.035511 \n",
"6 5.99807 0.148489 7.636752 \n",
"7 7.206234 0.167946 9.710570 \n",
"\n",
" Average Shortest Path Length betweenness centrality \n",
"0 2.411011 0.00022 \n",
"1 5.222903 0.000301 \n",
"2 2.2843 0.000089 \n",
"3 2.185477 0.000428 \n",
"4 5.052162 0.000448 \n",
"5 4.558532 0.000357 \n",
"6 19.530752 0.000879 \n",
"7 23.713864 0.000272 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# import the graphs from the saved files\n",
"G_brighkite_checkins = nx.read_gpickle(os.path.join('data', 'brightkite', 'brightkite_checkins_graph.gpickle'))\n",
@ -41,7 +202,19 @@
"G_brighkite_friends = nx.read_gpickle(os.path.join('data', 'brightkite', 'brightkite_friendships_graph.gpickle'))\n",
"G_gowalla_friends = nx.read_gpickle(os.path.join('data', 'gowalla', 'gowalla_friendships_graph.gpickle'))\n",
"G_foursquareEU_friends = nx.read_gpickle(os.path.join('data', 'foursquare', 'foursquareEU_friendships_graph.gpickle'))\n",
"G_foursquareIT_friends = nx.read_gpickle(os.path.join('data', 'foursquare', 'foursquareIT_friendships_graph.gpickle'))"
"G_foursquareIT_friends = nx.read_gpickle(os.path.join('data', 'foursquare', 'foursquareIT_friendships_graph.gpickle'))\n",
"\n",
"# open the dataframe object\n",
"analysis_results = pd.read_pickle('analysis_results.pkl')\n",
"analysis_results"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The first thing that we want to do is very simple, create a random reference for each graph"
]
}
],

@ -537,3 +537,112 @@ def average_clustering_coefficient(G: nx.Graph, k=None) -> float:
G_copy.remove_nodes_from(random.sample(list(G_copy.nodes()), int((k)*G_copy.number_of_nodes())))
print("\tNumber of nodes after removing {}% of nodes: {}" .format((k)*100, G_copy.number_of_nodes()))
return nx.average_clustering(G_copy)
# ------------------------------------------------------------------------#
def create_random_graphs(G: nx.Graph, model = None, save = True) -> nx.Graph:
"""Create a random graphs of the same model of the original graph G.
Parameters
----------
G : nx.Graph
The original graph.
model : str
The model to use to generate the random graphs. It can be one of the following: "erdos", "barabasi", "watts_strogatz", "newman_watts_strog
save: bool
If True, the random graph is saved in the folder data/random/model
Returns
-------
G_random : nx.Graph
"""
if model is None:
model = "erdos"
if model == "erdos":
G_random = nx.erdos_renyi_graph(G.number_of_nodes(), nx.density(G))
print("\tNumber of edges in the original graph: {}" .format(G.number_of_edges()))
print("\tNumber of edges in the random graph: {}" .format(G_random.number_of_edges()))
G_random.name = G.name + " erdos"
if save:
# check if the folder exists, otherwise create it
if not os.path.exists(os.path.join('data', 'random', 'erdos')):
os.makedirs(os.path.join('data', 'random', 'erdos'))
nx.write_gpickle(G_random, os.path.join('data', 'random', 'erdos', "erdos_" + str(G.number_of_nodes()) + "_" + str(G_random.number_of_edges()) + ".gpickle"))
print("\tThe file graph has been saved in the folder data/random/erdos with the syntax erdos_n_nodes_n_edges.gpickle")
return G_random
elif model == "watts_strogatz":
p = G.number_of_edges() / (G.number_of_nodes())
avg_degree = int(np.mean([d for n, d in G.degree()]))
G_random = nx.watts_strogatz_graph(G.number_of_nodes(), avg_degree, p)
print("\tNumber of edges in the original graph: {}" .format(G.number_of_edges()))
print("\tNumber of edges in the random graph: {}" .format(G_random.number_of_edges()))
G_random.name = G.name + " watts_strogatz"
if save:
# check if the folder exists, otherwise create it
if not os.path.exists(os.path.join('data', 'random', 'watts_strogatz')):
os.makedirs(os.path.join('data', 'random', 'watts_strogatz'))
nx.write_gpickle(G_random, os.path.join('data', 'random', 'watts_strogatz', "watts_strogatz_" + str(G.number_of_nodes()) + "_" + str(G_random.number_of_edges()) + ".gpickle"))
print("\tThe file graph has been saved in the folder data/random/watts_strogatz with the syntax watts_strogatz_n_nodes_n_edges.gpickle")
return G_random
# elif model == "regular":
# G_random = nx.random_regular_graph(1, G.number_of_nodes())
# print("\tNumber of edges in the original graph: {}" .format(G.number_of_edges()))
# print("\tNumber of edges in the random graph: {}" .format(G_random.number_of_edges()))
# G_random.name = G.name + "regular"
# if save:
# # check if the folder exists, otherwise create it
# if not os.path.exists(os.path.join('data', 'random', 'regular')):
# os.makedirs(os.path.join('data', 'random', 'regular'))
# nx.write_gpickle(G_random, os.path.join('data', 'random', 'regular', "regular_" + str(G.number_of_nodes()) + "_" + str(G_random.number_of_edges()) + ".gpickle"))
# print("\tThe file graph has been saved in the folder data/random/regular with the syntax regular_n_nodes_n_edges.gpickle")
# return G_random
# elif model == "reference":
# G_random = nx.random_reference(G)
# print("\tNumber of edges in the original graph: {}" .format(G.number_of_edges()))
# print("\tNumber of edges in the random graph: {}" .format(G_random.number_of_edges()))
# G_random.name = G.name + "reference"
# if save:
# # check if the folder exists, otherwise create it
# if not os.path.exists(os.path.join('data', 'random', 'reference')):
# os.makedirs(os.path.join('data', 'random', 'reference'))
# nx.write_gpickle(G_random, os.path.join('data', 'random', 'reference', "reference_" + str(G.number_of_nodes()) + "_" + str(G_random.number_of_edges()) + ".gpickle"))
# print("\tThe file graph has been saved in the folder data/random/reference with the syntax reference_n_nodes_n_edges.gpickle")
# return G_random
# #lattice
# elif model == "lattice":
# G_random = nx.lattice_reference(G, 1)
# print("\tNumber of edges in the original graph: {}" .format(G.number_of_edges()))
# print("\tNumber of edges in the random graph: {}" .format(G_random.number_of_edges()))
# G_random.name = G.name + "lattice"
# if save:
# # check if the folder exists, otherwise create it
# if not os.path.exists(os.path.join('data', 'random', 'lattice')):
# os.makedirs(os.path.join('data', 'random', 'lattice'))
# nx.write_gpickle(G_random, os.path.join('data', 'random', 'lattice', "lattice_" + str(G.number_of_nodes()) + "_" + str(G_random.number_of_edges()) + ".gpickle"))
# print("\tThe file graph has been saved in the folder data/random/lattice with the syntax lattice_n_nodes_n_edges.gpickle")
# return G_random
