From 3b4d0e3200877bb9dcac741e3caa11f549419ad6 Mon Sep 17 00:00:00 2001 From: Luca Lombardo Date: Sun, 12 Feb 2023 22:34:05 +0100 Subject: [PATCH] final version to send --- main.ipynb | 2217 +++++++++++++++++++------------------- omega_parallel_server.py | 175 +-- omega_sampled_server.py | 97 +- utils.py | 92 +- 4 files changed, 1288 insertions(+), 1293 deletions(-) diff --git a/main.ipynb b/main.ipynb index 0e81790..e7ad565 100644 --- a/main.ipynb +++ b/main.ipynb @@ -9,12 +9,7 @@ "%load_ext autoreload\n", "%autoreload 2\n", "\n", - "import os\n", "import time\n", - "import wget\n", - "import numpy as np\n", - "import pandas as pd\n", - "import networkx as nx\n", "import geopandas as gpd\n", "from utils import *\n", "\n", @@ -23,24 +18,79 @@ "warnings.filterwarnings(\"ignore\")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Table of contents** \n", + "- [Aim of the project](#toc1_1_) \n", + "- [Introduction: theoretical background](#toc2_) \n", + " - [Definition of a random graph](#toc2_1_) \n", + " - [ Erdős-Rényi graphs](#toc2_2_) \n", + " - [Scale-free networks](#toc2_3_) \n", + " - [Diameter and fractal dimension (infinite-dimensional networks)](#toc2_3_1_) \n", + " - [Watts-Strogatz model](#toc2_4_) \n", + " - [Random graphs as a model of real networks](#toc2_5_) \n", + "- [Discovering the datasets](#toc3_) \n", + " - [Brightkite](#toc3_1_1_) \n", + " - [Gowalla](#toc3_1_2_) \n", + " - [Foursquare](#toc3_1_3_) \n", + " - [Building the networks](#toc3_2_) \n", + " - [Check-ins networks](#toc3_2_1_) \n", + " - [Friendship network](#toc3_2_2_) \n", + "- [Properties of the networks](#toc4_) \n", + " - [Average Degree](#toc4_1_) \n", + " - [Clustering coefficient](#toc4_2_) \n", + " - [Average Path Length](#toc4_3_) \n", + " - [Betweenness Centrality](#toc4_4_) \n", + " - [Download the dataframe with accurate results](#toc4_5_) \n", + "- [Analysis of the results](#toc5_) \n", + " - [Distribution of Degree](#toc5_1_1_) \n", + " - [The 
Small-World Model](#toc5_2_) \n", + " - [Small-Worldness](#toc5_3_) \n", + " - [Identifying small-world networks](#toc5_4_) \n", + " - [A first approach: the $\\sigma$ coefficient](#toc5_4_1_) \n", + " - [A more solid approach: the $\\omega$ coefficient](#toc5_4_2_) \n", + " - [Lattice network construction](#toc5_4_2_1_) \n", + " - [Limitations](#toc5_4_2_2_) \n", + " - [Omega coefficient computation: standard procedure](#toc5_4_3_) \n", + " - [Omega coefficient computation: parallelization approach (experimental)](#toc5_4_4_) \n", + " - [Why experimental?](#toc5_4_4_1_) \n", + " - [Are our networks small-world?](#toc5_5_) \n", + " - [Degree distribution](#toc5_6_) \n", + " - [Betweenness centrality](#toc5_7_) \n", + " - [Clustering coefficient](#toc5_8_) \n", + " - [Conclusions: the omega coefficient](#toc5_9_) \n", + "- [References](#toc6_) \n", + "\n", + "\n", + "" + ] + }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "## Aim of the project\n", + "## [Aim of the project](#toc0_)\n", "\n", - "Given a social network, which of its nodes are more central? This question has been asked many times in sociology, psychology and computer science, and a whole plethora of centrality measures (a.k.a. centrality indices, or rankings) were proposed to account for the importance of the nodes of a network. \n", + "\n", "\n", - "---\n", + "This project aims to study the small-world phenomenon in the context of geographical social networks. We will consider a large number of centrality measures and try to understand how the small-world phenomenon manifests itself in each of them. We will also try to understand how the small-world phenomenon is affected by the choice of centrality measure and how the peculiar topology of a network affects our results.\n", "\n", - "We aim to study the small-world phenomenon in the context of social networks, and to do so we will consider a large number of centrality measures. 
We will use 3 real-world datasets, trying to understand how the small-world phenomenon manifests itself in each of them. We will also try to understand how the small-world phenomenon is affected by the choice of centrality measure." + "Generally speaking, a small-world network is a graph where the average distance between nodes is logarithmic in the size of the network, whereas the clustering coefficient is larger (that is, neighborhoods tend to be denser) than in a random Erdős-Rényi graph with the same size and average distance. The fact that social networks (whether electronically mediated or not) exhibit the small-world property is known at least since Milgram's famous experiment and is arguably the most popular of all features of complex networks. For instance, the average distance of the Facebook graph was recently established to be just $4.74$." ] }, { @@ -48,64 +98,82 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Random Networks: The Erdős-Rényi model\n", - "\n", - "\n", + "# [Introduction: theoretical background](#toc0_)\n", "\n", - "Prior to the 1960s, graph theory primarily focused on the characteristics of individual graphs. In the 1960s, Paul Erdős and Alfred Rényi introduced a systematic approach to studying random graphs, which involves analyzing a collection, or ensemble, of many different graphs. Each graph in the ensemble is assigned a probability, and a property is said to hold with probability $P$ if the total probability of the graphs in the ensemble possessing that property is $P$, or if the fraction of graphs in the ensemble with the property is $P$. This method allows for the application of probability theory in conjunction with discrete math to study ensembles of graphs. A property is considered to hold for a class of graphs if the fraction of graphs in the ensemble without the property has zero measure, which is typically referred to as being true for \"almost every\" graph in the ensemble. 
`[2]`\n", + "Prior to the 1960s, graph theory primarily focused on the characteristics of individual graphs. In the 1960s, Paul Erdős and Alfred Rényi introduced a systematic approach to study random graphs, which involves analyzing a collection, or ensemble, of many different graphs. \n", "\n", - "## Definition of a random graph\n", - "\n", - "Let $E_{n,N}$ denote the set of alla graphs having $n$ given labelled vertices $V_1,V_2, \\dots, V_n$ and $N$ edges [1]. The graphs considered are supposed to be not oriented, without parallel edges and without slings. Thus a graph belonging to $E_{n,N}$ is obtained by choosing $N$ out of the $\\binom{n}{2}$ possible edges between the points $V_1,V_2, \\dots, V_n$, and therefore the number of elements of $E_{n,N}$ is given by the binomial coefficient $\\binom{\\binom{n}{2}}{N}$. \n", - "\n", - "A random graph $\\Gamma_{n,N}$ can be defined as a element of $E_{n,N}$ chosen at random, so that each of the elements of $E_{n,N}$ has the same probability of being chosen, namely $\\frac{1}{\\binom{\\binom{n}{2}}{N}}$.\n", + "Each graph in the ensemble is assigned a probability, and a property is said to hold with probability $P$ if the total probability of the graphs in the ensemble possessing that property is $P$, or if the fraction of graphs in the ensemble with the property is $P$. This method allows for the application of probability theory in conjunction with discrete math to study ensembles of graphs. A property is considered to hold for a class of graphs if the fraction of graphs in the ensemble without the property has zero measure, which is typically referred to as being true for \"almost every\" graph in the ensemble. `[2]`" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## [Definition of a random graph](#toc0_)\n", "\n", - "Let's try to modify this point of view and use a bit of probability theory. 
_We may consider the formation of a random graph as a stochastic process_ defined as follows: At time $t=1$ we choose out of the $\\binom{n}{2}$ possible edges between the points $V_1,V_2, \\dots, V_n$ $N$ edges, each of this edges having the same probability of being chosen; let this edge be denoted as $e_1$. At time $t=2$ we choose one of the possible $\\binom{n}{2} -1$, different from $e_1$, all this being equiprobable. Continuing this process at time $t=k+1$ we choose one of the possible $\\binom{n}{2} -k$, different from $e_1, e_2, \\dots, e_k$, all this being equiprobable, i.e having the probability $\\frac{1}{\\binom{n}{2} -k}$. We denote $\\Gamma_{n,N}$ the graph obtained by choosing $N$ edges in this way.\n", + "Let $E_{n,N}$ stand for the collection of graphs that have $n$ designated labeled vertices $V_1, V_2, \\dots, V_n$ and $N$ edges. The graphs in consideration are undirected, have no parallel edges and do not loop. Therefore, a graph within $E_{n,N}$ can be obtained by selecting $N$ from the $\\binom{n}{2}$ possible connections between $V_1, V_2, \\dots, V_n$. As a result, the cardinality of $E_{n,N}$ is calculated by the binomial coefficient $\\binom{\\binom{n}{2}}{N}$.\n", "\n", - "> NOTE: the two definitions are equivalent, but the second one is more convenient for the study of the properties of random graphs. According to this interpretation we may study the evolution of random graphs, i.e. the step-by-step unraveling of the structure of the graph when $N$ increases. This will be an essential point in our study of the properties of small-worldness.\n", + "A random graph $\\Gamma_{n,N}$ can be described as a random choice from $E_{n,N}$ such that every element of $E_{n,N}$ has an equal probability of being selected, which is $\\frac{1}{\\binom{\\binom{n}{2}}{N}}$.\n", "\n", + "Let's take a different approach and use some probability theory. We can consider the formation of a random graph as a stochastic process. 
This process is defined as follows: at time $t=1$, we pick one of the $\\binom{n}{2}$ potential connections between $V_1, V_2, \\dots, V_n$, each with the same probability of being chosen; we designate this edge as $e_1$. At time $t=2$, we choose one of the $\\binom{n}{2}-1$ remaining possibilities, excluding $e_1$, each with equal probability. The process continues in the same way: at time $t=k+1$ we choose one of the remaining $\\binom{n}{2} - k$ possibilities, excluding $e_1, e_2, \\dots, e_k$, each with probability $\\frac{1}{\\binom{n}{2} - k}$. The graph obtained by selecting $N$ edges in this manner is denoted as $\\Gamma_{n,N}$.\n", "\n", - "## Erdős-Rényi graphs\n", "\n", - "There are two well-known ensembles of graphs that have been extensively studied: the ensemble of all graphs with $N$ nodes and $M$ edges, denoted $G_{N,M}$, and the ensemble of all graphs with $N$ nodes and a probability $p$ of any two nodes being connected, denoted $G_{N,p}$. These ensembles, initially studied by Erdős and Rényi, are similar when $M = \\binom{N}{2} p$, and are therefore referred to as ER graphs when $p$ is not too close to $0$ or $1$. `[2]`\n", + "> NOTE: the two definitions are equivalent, but the second one is more convenient for the study of the properties of random graphs. According to this interpretation we may study the evolution of random graphs, i.e. the step-by-step unraveling of the structure of the graph when $N$ increases. This will be an essential point in our study of the properties of small-worldness." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## [ Erdős-Rényi graphs](#toc0_)\n", "\n", - "An important feature of a graph is its average degree, or the average number of edges connected to each node. We will denote the degree of the $i$-th node by $k_i$ and the average degree by $\\langle r \\rangle$. 
Graphs with $N$ nodes and $\\langle k \\rangle = O(N^0)$ are called sparse graphs.\n", + "There are two famous groups of graphs that have been extensively analyzed: the set of all graphs with $N$ nodes and $M$ edges, referred to as $G_{N,M}$, and the set of all graphs with $N$ nodes and a connection probability of $p$ between any two nodes, denoted $G_{N,p}$. These sets, first studied by Erdős and Rényi, are alike when $M = \\binom{N}{2} p$ and are therefore referred to as ER graphs if $p$ is not too close to either 0 or 1. `[2]`\n", "\n", - "One interesting property of the ensemble $G_{N,p}$ is that many of its characteristics have a corresponding threshold function, $p_t(N)$, such that the property exists with probability 0 if $p < p_t$ and with probability 1 if $p > p_t$ in the \"thermodynamic limit\" of $N \\to \\infty$. This is similar to the physical concept of a percolation phase transition.\n", + "A crucial aspect of a graph is its average degree, or the average number of edges attached to each node. The degree of the $i$-th node is represented by $k_i$ and the average degree is denoted by $\\langle k \\rangle$. Graphs with $N$ nodes and $\\langle k \\rangle = O(N^0)$ are called sparse graphs.\n", "\n", - "Another property of interest is the average path length between any two nodes, which is typically of order $\\ln N$ in almost every graph of the ensemble (with $\\langle k \\rangle > 1$ and finite). This small, logarithmic distance is the source of the \"small-world\" phenomena that are characteristic of networks.\n", + "One fascinating property of the $G_{N,p}$ set is that several of its features have a corresponding threshold function, $p_t(N)$, such that the property is present with probability 0 if $p < p_t$ and with probability 1 if $p > p_t$ in the \"thermodynamic limit\" of $N \\to \\infty$. 
This is similar to the physical concept of a percolation phase transition.\n", "\n", - "Another property of interest is the average path length between any two nodes, which is typically of order $\\ln N$ in almost every graph of the ensemble (with $\\langle k \\rangle > 1$ and finite). This small, logarithmic distance is the source of the \"small-world\" phenomena that are characteristic of networks.\n", + "> Check out this [impressive paper](https://arxiv.org/abs/2203.17207) that showcases properties of random graphs and their threshold functions. In this concise 6-page article, the Kahn-Kalai Conjecture is proven in a highly elegant manner.\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## [Scale-free networks](#toc0_)\n", "\n", - "## Scale-free networks\n", + "The Erdős-Rényi model `[4]` has long been the primary focus of research in the field of random graphs. However, recent studies of real-world networks have shown that the ER model does not accurately capture many of their observed properties. One such property that can be easily measured is the degree distribution, or the fraction $P(k)$ of nodes with $k$ connections (degree $k$). A well-known result `[2]` for ER networks is that the degree distribution follows a Poisson distribution, given by \n", "\n", - "The Erdős-Rényi model `[4]` has long been the primary focus of research in the field of random graphs. However, recent studies of real-world networks have shown that the ER model does not accurately capture many of their observed properties. One such property that can be easily measured is the degree distribution, or the fraction $P(k)$ of nodes with $k$ connections (degree $k$). 
A well-known result for ER networks is that the degree distribution follows a Poisson distribution, given by `[2]`\n", + "The Erdős-Rényi model `[4]` has long been the primary focus of research in the field of random graphs. However, recent studies of real-world networks have shown that the ER model does not accurately capture many of their observed properties. One such property that can be easily measured is the degree distribution, or the fraction $P(k)$ of nodes with $k$ connections (degree $k$). A well-known result `[2]` for ER networks is that the degree distribution follows a Poisson distribution, given by \n", "\n", - "\\begin{equation}\n", - "P(k) = \\frac{e^{z} z^k}{k!}\n", - "\\end{equation}\n", + "$$ P(k) = \\frac{e^{-z} z^k}{k!} $$\n", "\n", "where $z = \\langle k \\rangle$ is the average degree `[13]`. However, measurements of the degree distribution for real networks often show that the Poisson law does not hold, instead exhibiting a scale-free degree distribution of the form\n", "\n", - "\\begin{equation}\n", - "P(k) = ck^{-\\gamma} \\quad \\text{for} \\quad k = m, ... 
, K\n", - "\\end{equation}\n", + "$$ P(k) = ck^{-\\gamma} \\quad \\text{for} \\quad k = m, ... , K $$\n", "\n", "where $c \\sim (\\gamma -1)m^{\\gamma - 1}$ is a normalization factor, and $m$ and $K$ are the lower and upper cutoffs for the degree of a node, respectively. The divergence of moments higher than $\\lceil \\gamma -1 \\rceil$ (as $K \\to \\infty$ when $N \\to \\infty$) is responsible for many of the unusual properties attributed to scale-free networks.\n", "\n", "It is important to note that all real-world networks are finite, so all of their moments are finite as well. The actual value of the cutoff $K$ plays a significant role, and can be approximated by noting that the total probability of nodes with $k > K$ is approximately $1/N$ `[14]`, or\n", "\n", - "\\begin{equation}\n", - "\\int_K^\\infty P(k) dk \\sim \\frac{1}{N}\n", - "\\end{equation}\n", + "$$ \\int_K^\\infty P(k) dk \\sim \\frac{1}{N} $$\n", "\n", "This gives the result\n", "\n", - "\\begin{equation}\n", - "K \\sim m N^{1/(\\gamma -1)}\n", - "\\end{equation}\n", + "$$ K \\sim m N^{1/(\\gamma -1)} $$\n", "\n", - "The degree distribution is not the only characteristic that can be used to describe a network. Other quantities, such as the degree-degree correlation (between connected nodes), spatial correlations, clustering coefficient, betweenness or centrality distribution, and self-similarity exponents, can also provide insight into the network's structure and behavior.\n", + "The degree distribution is not the only characteristic that can be used to describe a network. Other quantities, such as the degree-degree correlation (between connected nodes), spatial correlations, clustering coefficient, betweenness or centrality distribution, and self-similarity exponents, can also provide insight into the network's structure and behavior." 
+ ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### [Diameter and fractal dimension (infinite-dimensional networks)](#toc0_)\n", "\n", - "# Diameter and fractal dimension\n", + "> This section is only a brief introduction to the topic and is somewhat unrelated to the rest of the project. I decided to include it anyway since it gives a more complete picture when we introduce the Watts-Strogatz model. However, it can be skipped without any loss of information.\n", "\n", "Regular lattices can be viewed as networks embedded in Euclidean space of a defined dimension $d$, meaning that $n(r)$, the number of nodes within a distance $r$ from an origin, grows as $n(r) \\sim r^d$ for large $r$. For fractal objects, the dimension $d$ in this relation may be a non-integer and is replaced by the fractal dimension $d_f$. `[2]`\n", "\n", @@ -117,19 +185,31 @@ "\n", "Many random network models have locally tree-like structure (since most loops occur only when $n(l) \\sim N$), and since the number of nodes grows as $n(l) \\sim \\langle k - 1 \\rangle^l$, they are also infinite dimensional. As a result, the diameter of such graphs (i.e., the shortest path between the most distant nodes) scales as $D \\sim \\ln N$ `[13]`. Many properties of ER networks, including the logarithmic diameter, are also present in Cayley trees. This small diameter is in contrast to that of finite-dimensional lattices, where $D \\sim N^{1/d_l}$.\n", "\n", - "Like ER networks, percolation on infinite-dimensional lattices and the Cayley tree exhibits a critical threshold $p_c = 1/(z - 1)$. For $p > p_c$, a \"giant cluster\" of size $N$ exists, while for $p < p_c$, only small clusters are present. At criticality ($p = p_c$) in infinite-dimensional lattices (similar to ER networks), the giant component is of size $N^{2/3}$. 
This result follows from the fact that percolation on lattices in dimension $d \\geq d_c = 6$ is in the same universality class as infinite-dimensional percolation, where the fractal dimension of the giant cluster is $d_f = 4$, resulting in a size of the giant cluster that scales as $N^{d_f/d_c} = N^{2/3}$. The dimension $d_c$ is known as the \"upper critical dimension,\" and this concept exists not only in percolation phenomena, but also in other physical models such as the self-avoiding walk model for polymers and the Ising model for magnetism, in both of which $d_c = 4$. `[2]`\n", - "\n", - "### Watts-Strogatz model\n", + "Like ER networks, percolation on infinite-dimensional lattices and the Cayley tree exhibits a critical threshold $p_c = 1/(z - 1)$. For $p > p_c$, a \"giant cluster\" of size $N$ exists, while for $p < p_c$, only small clusters are present. At criticality ($p = p_c$) in infinite-dimensional lattices (similar to ER networks), the giant component is of size $N^{2/3}$. This result follows from the fact that percolation on lattices in dimension $d \\geq d_c = 6$ is in the same universality class as infinite-dimensional percolation, where the fractal dimension of the giant cluster is $d_f = 4$, resulting in a size of the giant cluster that scales as $N^{d_f/d_c} = N^{2/3}$. The dimension $d_c$ is known as the \"upper critical dimension,\" and this concept exists not only in percolation phenomena, but also in other physical models such as the self-avoiding walk model for polymers and the Ising model for magnetism, in both of which $d_c = 4$. `[2]`" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## [Watts-Strogatz model](#toc0_)\n", "\n", "In the year 1998, Watts and Strogatz presented a novel model for small-world networks in their seminal work `[3]`. 
This model preserves the high degree of local clustering, which is a characteristic of lattice structures where the neighbors of a node are more likely to be neighbors with each other than in random graphs. The model achieves a reduction in the diameter of the network to $D \\sim \\ln N$ by randomly rewiring a fraction $\\varphi$ of the links in a regular lattice to connect to distant nodes. The rewiring procedure is based on a probability $p$ assigned to each edge. If an edge is selected for rewiring, it is substituted with a new edge chosen at random with uniform probability. The resulting network is characterized by $N$ nodes, $k$ nearest neighbors, and an average distance of $\\log(N)/\\log(k)$.\n", "\n", - "More details on the Watts-Strogatz model can be found in `[3]`.\n", - "\n", - "## Random graphs as a model of real networks\n", + "More details on the Watts-Strogatz model can be found in `[3]`." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## [Random graphs as a model of real networks](#toc0_)\n", "\n", "Many physical and man-made systems can be represented as networks, which consist of objects and the interactions between them. Some examples include computer networks, such as the Internet, and logical networks, including the links between web pages and email networks, where the presence of an individual's address in another person's address book is represented by a link. Additionally, social interactions in populations or work relationships and movements of a system in a configuration space can also be described using a network. These examples, along with many others, possess a graph structure that can be studied. Although many of these networks exhibit some ordered structure, such as cluster and group formation, geographical or geometrical considerations, or specific properties, most of them possess complex and random structures that deviate from regular lattices. 
As a result, it is often assumed, with caution, that they share properties with random graph models.\n", "\n", - "Scale-free networks can be considered as a generalization of Erdős-Rényi (ER) networks. When $\\gamma > 4$ for large $\\gamma$, the properties of scale-free networks, such as distances, optimal paths, and percolation, are similar to those in ER networks. Conversely, when $\\gamma < 4$, these properties exhibit anomalous behavior due to the strong heterogeneity in the degrees of nodes, which disrupts the node-to-node translational homogeneity (symmetry) present in classical homogeneous networks such as lattices, Cayley trees, and ER graphs. `[2]`\n", + "Scale-free networks can be considered as a generalization of Erdős-Rényi (ER) networks. When $\\gamma > 4$, the properties of scale-free networks, such as distances, optimal paths, and percolation, are similar to those in ER networks. Conversely, when $\\gamma < 4$, these properties exhibit anomalous behavior due to the strong heterogeneity in the degrees of nodes, which disrupts the node-to-node translational homogeneity (symmetry) present in classical homogeneous networks such as lattices, Cayley trees, and ER graphs. `[2]`\n", "\n", "---" ] @@ -139,7 +219,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In this section of the notebook, I will delve into the technical aspect of analyzing the properties of real-world networks that were discussed previously. I will make use of the networkx library, a Python-based tool for constructing, manipulating, and studying the structure, dynamics, and functions of complex networks. However, some algorithms required manual implementation and can be found in the utils.py file for further information.\n", + "In this section of the notebook, we'll delve into the technical aspect of analyzing the properties of real-world networks that were discussed previously. We will use the [networkx](https://networkx.org/) library as much as possible. 
However, some algorithms required manual implementation and can be found in the `utils.py` module, which I strongly recommend reading.\n", "\n", "The computations were executed on an Arch Linux machine with a AMD Ryzen 5 2600 processor (6 cores and 12 threads) and 16 GB of RAM. The code was written in Python 3.10.9, and the required packages can be installed by executing the following command in the terminal:\n", "\n", @@ -147,15 +227,14 @@ "pip3 install -r requirements.txt\n", "```\n", "\n", - "I have made efforts to ensure that the code is widely compatible, but I was unable to test it on a Windows machine. In the event that any issues are encountered, please ~~install Linux~~ inform me so I can work towards resolving them." + "I have made efforts to ensure that the code is widely compatible, but I was unable to test it on a Windows or macOS machine since I have neither of those operating systems installed on any of my personal devices. In the event that any issues are encountered, please inform me so I can work towards resolving them." ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "# Discovering the datasets\n", + "# [Discovering the datasets](#toc0_)\n", "\n", "To perform our analysis, we will use the following datasets:\n", "\n", @@ -179,14 +258,9 @@ " └── gowalla_friends_edges.txt\n", "```\n", "\n", - "If any of the datasets is already downloaded, it will not be downloaded again. For further details about the function below, please refer to the `utils` module.\n", - "\n", - "> NOTE: the Stanford servers tends to be slow, so it may take a while to download the datasets. It's gonna take about 3 minutes to download all the datasets.\n", + "For further details about the function below, please refer to the `utils` module.\n", "\n", - "---\n", - "\n", - "### A deeper look at the datasets\n", - "\n" + "> **NOTE:** the Stanford servers tend to be slow, so downloading the datasets may take a few minutes." 
] }, { @@ -246,17 +320,16 @@ "source": [ "# this is a long and boring function to automatically download, extract, rename and save the datasets in a clean way. If you want to have a deeper look at the code, you can find it in utils.py\n", "\n", - "download_datasets() # it takes about 3-4 minutes to download and extract the datasets with a fiber connection" + "download_datasets() # it takes about 3-4 minutes to download and extract the datasets with a fiber connection\n", + "\n", + "## If you want to run it again, delete the data folder and run the function again" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "Let's have a deeper look at them.\n", - "\n", - "## Brightkite\n", + "### [Brightkite](#toc0_)\n", "\n", "[Brightkite](http://www.brightkite.com/) was a location-based social networking service that allowed users to share their locations by checking in. The friendship network data was collected using the Brightkite public API. There are two datasets available for analysis: \n", "\n", @@ -293,7 +366,7 @@ }, { "data": { - "image/png": "", + "image/png": "", "text/plain": [ "
" ] @@ -310,9 +383,9 @@ " header=None,\n", " names=['user id', 'check-in time', 'latitude', 'longitude', 'location id'],\n", " parse_dates=['check-in time'],\n", - " engine='pyarrow')\n", + " engine='pyarrow') # multi-threaded engine (sometimes it's faster)\n", "\n", - "# take only the dates from 2009\n", + "# take only data from 2009\n", "df_brighkite = df_brighkite[df_brighkite['check-in time'].dt.year == 2009]\n", "\n", "# convert the dataframe to geopandas dataframe\n", @@ -345,7 +418,7 @@ }, { "data": { - "image/png": "", + "image/png": "", "text/plain": [ "
" ] @@ -397,11 +470,10 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "## Gowalla\n", + "### [Gowalla](#toc0_)\n", "\n", "Gowalla is a location-based social networking website where users share their locations by checking-in. The friendship network is undirected and was collected using their public API. As for Brightkite, we will work with two different datasets. This is how they look like after being filtered by the `download_dataset` function:\n", "\n", @@ -409,7 +481,6 @@ "\n", "- `data/gowalla/gowalla_friends_edges.txt`: the friendship network, a tsv file with 2 columns of users ids. This file it's in the form of a graph edge list. \n", "\n", - "--- \n", "\n", "Let's have a more clear view of where our data have been generated" ] @@ -428,7 +499,7 @@ }, { "data": { - "image/png": "", + "image/png": "", "text/plain": [ "
" ] @@ -445,7 +516,7 @@ " parse_dates=['check-in time'],\n", " engine='pyarrow')\n", "\n", - "# take only the dates from 2009\n", + "# take only data from 2009\n", "df_gowalla = df_gowalla[df_gowalla['check-in time'].dt.year == 2009]\n", "\n", "# convert the dataframe to geopandas dataframe\n", @@ -461,7 +532,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This is still a bit too much, to help us in the next sections, let's take a subset of the European area" + "This is still a bit too much, let's take a subset of the European area" ] }, { @@ -478,7 +549,7 @@ }, { "data": { - "image/png": "", + "image/png": "", "text/plain": [ "
" ] @@ -527,13 +598,12 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "## Foursquare\n", + "### [Foursquare](#toc0_)\n", "\n", - "[Foursquare](https://foursquare.com/) is a location-based social networking website where users share their locations by checking-in. This dataset includes long-term (about 22 months from Apr. 2012 to Jan. 2014) global-scale check-in data collected from Foursquare, and also two snapshots of user social networks before and after the check-in data collection period (see more details in the reference paper). We will work with three different datasets `[15]`:\n", + "[Foursquare](https://foursquare.com/) is a location-based social networking website where users share their locations by checking-in. This dataset includes long-term (about 22 months from Apr. 2012 to Jan. 2014) global-scale check-in data collected from Foursquare, and also two snapshots of user social networks before and after the check-in data collection period (see more details in the reference paper `[15]`). We will work with three different datasets:\n", "\n", "- `foursquare_checkins_full.txt`: a tsv file with 4 columns: `User ID`, `Venue ID`, `UTC time`, `Timezone offset in minutes` \n", "\n", @@ -543,7 +613,7 @@ "\n", "--- \n", "\n", - "The check-in dataset in consideration, with a size that surpasses that of the other three datasets obtained, comprises of [22,809,624] check-ins made by [114,324] users at [3,820,891] venues. Additionally, the social network data consists of [607,333] friendships. As previously indicated, the need for sub-sampling arises due to the size of the full network. In this instance, we shall restrict our analysis to data generated in Italy in the year 2012. Given the substantial size of the full network, plotting it would likely result in an unfavorable outcome, as the available RAM may become exhausted and the kernel may be forced to terminate the process." 
+ "The check-in dataset under consideration is by far the largest in this study: it has `22809624` check-ins made by `114324` users at `3820891` venues. Additionally, the social network data consists of `607333` friendships. As previously indicated, sub-sampling is necessary due to the size of the full network. In this instance, we restrict our analysis to data generated in Italy in the year 2012. Plotting the full network would likely exhaust the available RAM and force the kernel to terminate the process." ] }, { @@ -555,13 +625,12 @@ "name": "stdout", "output_type": "stream", "text": [ - "Starting to plot\n", "Number of unique users in Italy: 2555\n" ] }, { "data": { - "image/png": "", + "image/png": "", "text/plain": [ "
" ] @@ -585,7 +654,7 @@ " dtype={'user id': str, 'venue id': str, 'UTC time': str, 'offset': int},\n", " engine='c')\n", "\n", - "# Take only the data with IT ISO code\n", + "# Take only data with IT ISO code\n", "df_foursquare_POIS = df_foursquare_POIS[df_foursquare_POIS['ISO code'] == 'IT']\n", "\n", "# Take only the checkins that are in the POIs (filtered by ISO code) and viceversa\n", @@ -602,7 +671,6 @@ "gdf_foursquare_POIS = gpd.GeoDataFrame(df_foursquare_POIS, geometry=gpd.points_from_xy(df_foursquare_POIS.longitude, df_foursquare_POIS.latitude))\n", "\n", "# plot the geopandas dataframe\n", - "print(\"Starting to plot\")\n", "gdf_foursquare_POIS.plot(marker='o', color='red', markersize=1)\n", "print('Number of unique users in Italy: ', len(df_foursquare_checkins['user id'].unique()))\n", "\n", @@ -630,7 +698,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Building the networks" + "## [Building the networks](#toc0_)" ] }, { @@ -638,9 +706,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### [Check-ins networks](#toc0_)\n", + "\n", "The construction of networks for the three datasets will be accomplished by representing them as an undirected graph $M = (V, E)$, with $V$ denoting the set of nodes and $E$ denoting the set of edges. The nodes will correspond to the users and the edges will indicate the presence of at least one instance where two individuals visited the same location.\n", "\n", - "Since the check-ins files of the three datasets are not in the format of a graph edge list, it is necessary to manipulate them. Thus, we will examine the number of lines in each file." + "Since the check-ins files of the three datasets are not in the format of a graph edge list, it is necessary to manipulate them." ] }, { @@ -690,7 +760,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We want to construct a graph from an edge list. To accomplish this, we will examine the users that have visited each venue. 
Subsequently, we will establish an edge between every pair of users who have visited the same venue, while avoiding duplications. This process can be executed efficiently (algorithmically speaking) in Python, although it may entail a slow computational time due the nature of the language itself. To mitigate this issue, we are considering subsampling the data sets. The methodology for creating this graph is illustrated below in the Python code snippet.\n", + "We want to construct a graph from an edge list. To accomplish this, we will examine the users that have visited each venue. Subsequently, we will establish an edge between every pair of users who have visited the same venue, while avoiding duplicates. This process is algorithmically efficient, although it may still run slowly in Python due to the nature of the language itself. The methodology for creating this graph is illustrated below in the Python code snippet.\n", "\n", "```python\n", "# let df be the dataframe [\"user_id\", \"venue_id\"] of the checkins\n", @@ -704,7 +774,7 @@ "\n", "The code makes use of a dataframe, `df`, which contains the `user_id` and `venue_id` information for each check-in. The code first groups the check-ins by the `venue_id` and applies a set function to the `user_id` values. Then, the code iterates through each set of users that visited the same venue and adds an edge between every pair of users.\n", "\n", - "I have included a function in the `utils.py` module that performs this process automatically. The function, named `create_graph_from_checkins`, takes as input the name of the data set and returns a graph object in the NetworkX library. By default, this function also writes the edge list to a file in the respective data set folder. The available options for the input data set are \"brightkite\", \"gowalla\", and \"foursquare\". An example of how to use this function is shown below."
+ "I have included a function in the `utils.py` module that performs this process automatically. The function, named `create_graph_from_checkins`, takes as input the name of the data set and returns a graph object in the NetworkX library. By default, this function also writes the edge list to a file in the respective dataset folder. The available options for the input dataset are \"brightkite\", \"gowalla\", and \"foursquare\". An example of how to use this function is shown below." ] }, { @@ -724,14 +794,14 @@ "name": "stderr", "output_type": "stream", "text": [ - "100%|██████████| 84831/84831 [00:00<00:00, 211146.01it/s]\n" + "100%|██████████| 84831/84831 [00:00<00:00, 310564.31it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ - "Done! The graph has 292973 edges and 6493 nodes\n", + "Done! The graph has 292973 edges and 6493 nodes\n", "\n", "Creating the graph for the dataset gowalla...\n" ] @@ -740,14 +810,14 @@ "name": "stderr", "output_type": "stream", "text": [ - "100%|██████████| 31095/31095 [00:00<00:00, 333977.49it/s]\n" + "100%|██████████| 31095/31095 [00:00<00:00, 331735.68it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ - "Done! The graph has 62790 edges and 3073 nodes\n", + "Done! The graph has 62790 edges and 3073 nodes\n", "\n", "Creating the graph for the dataset foursquare...\n" ] @@ -756,43 +826,40 @@ "name": "stderr", "output_type": "stream", "text": [ - "100%|██████████| 40650/40650 [00:00<00:00, 150057.87it/s]\n" + "100%|██████████| 40650/40650 [00:00<00:00, 147409.04it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ - "Done! The graph has 246702 edges and 2324 nodes\n" + "Done! 
The graph has 246702 edges and 2324 nodes\n" ] } ], "source": [ - "G_brighkite_checkins = create_graph_from_checkins('brightkite')\n", + "G_brighkite_checkins = create_graph_from_checkins('brightkite', create_file=True)\n", "G_brighkite_checkins.name = 'Brightkite Checkins Graph'\n", "\n", - "G_gowalla_checkins = create_graph_from_checkins('gowalla')\n", + "G_gowalla_checkins = create_graph_from_checkins('gowalla', create_file=True)\n", "G_gowalla_checkins.name = 'Gowalla Checkins Graph'\n", "\n", - "G_foursquare_checkins = create_graph_from_checkins('foursquare')\n", + "G_foursquare_checkins = create_graph_from_checkins('foursquare', create_file=True)\n", "G_foursquare_checkins.name = 'Foursquare Checkins Graph'" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "### Friendship network\n", + "### [Friendship network](#toc0_)\n", "\n", "\n", "We want to construct a friendship graph that represents the relationships between users in a social network. The concept of friendship will be modeled in accordance with the paradigm of Facebook, as opposed to Twitter. Consequently, the graph will be undirected and edges will not be weighted. Moreover, it is imperative to note that a user cannot be friends with himself, nor can he be friends with another user if the latter is not friends with him.\n", "\n", "The friendship graph will be generated using the function `create_friendships_graph` located in the `utils.py` module. The function takes as input the name of the dataset and returns a networkx graph object. By default, the edge list will also be written to a file within the respective dataset folder. The available options for the input dataset are _brightkite_, _gowalla_, and _foursquare_.\n", "\n", - "> It is worth mentioning that this function has been implemented in a manner that does not require the checkins graph to be loaded in memory. Instead, it utilizes the edge list file. 
This was done with the consideration that some users may only perform analysis on the friendship network, and as such, there is no need to load the checkins graph and waste memory. Furthermore, networkx has been observed to be significantly slow when loading a graph from an edge list file.\n", - "\n", - "In conclusion, the implementation and usage of the create_friendships_graph function is demonstrated as follows:" + "> It is worth mentioning that this function has been implemented in a manner that does not require the checkins graph to be loaded in memory. Instead, it utilizes the edge list file. This was done with the consideration that some users may only perform analysis on the friendship network, and as such, there is no need to load the checkins graph and waste memory. " ] }, { @@ -804,68 +871,23 @@ "name": "stdout", "output_type": "stream", "text": [ - "Computation done for Brightkite friendship graph\n", - "Computation done for Gowalla friendship graph\n", - "Computation done for Foursquare friendship graph\n" + "Created the graph for the dataset brightkite with 14690 edges and 5420 nodes\n", + "Created the graph for the dataset gowalla with 5548 edges and 2294 nodes\n", + "Created the graph for the dataset foursquare with 5323 edges and 1397 nodes\n" ] } ], "source": [ - "G_brighkite_friends = create_friendships_graph('brightkite')\n", - "print(\"Computation done for Brightkite friendship graph\")\n", + "G_brighkite_friends = create_friendships_graph('brightkite', create_file=True)\n", "G_brighkite_friends.name = 'Brightkite Friendship Graph'\n", "\n", - "\n", - "G_gowalla_friends = create_friendships_graph('gowalla')\n", - "print(\"Computation done for Gowalla friendship graph\")\n", + "G_gowalla_friends = create_friendships_graph('gowalla', create_file=True)\n", "G_gowalla_friends.name = 'Gowalla Friendship Graph'\n", "\n", - "\n", - "G_foursquare_friends = create_friendships_graph('foursquare')\n", - "print(\"Computation done for Foursquare friendship 
graph\")\n", + "G_foursquare_friends = create_friendships_graph('foursquare', create_file=True)\n", "G_foursquare_friends.name = 'Foursquare Friendship Graph'" ] }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now that we have our graphs, let's have a look at some basic information about them" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Brightkite Friendship Graph\n", - "Number of nodes: 5420\n", - "Number of edges: 14690\n", - "\n", - "Gowalla Friendship Graph\n", - "Number of nodes: 2294\n", - "Number of edges: 5548\n", - "\n", - "Foursquare Friendship Graph\n", - "Number of nodes: 1397\n", - "Number of edges: 5323\n", - "\n" - ] - } - ], - "source": [ - "for G in [G_brighkite_friends, G_gowalla_friends, G_foursquare_friends]:\n", - " print(G.name)\n", - " print('Number of nodes: ', G.number_of_nodes())\n", - " print('Number of edges: ', G.number_of_edges())\n", - " print()" - ] - }, { "attachments": {}, "cell_type": "markdown", @@ -878,7 +900,7 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 14, "metadata": {}, "outputs": [ { @@ -897,11 +919,11 @@ "for graph in friendships_graph:\n", " visualize_graphs(graph, k = None, connected=True)\n", "\n", - "# if we are curios about the checkins graphs, nothing prevents us to visualize them. Just uncomment the following lines\n", + "# if we are curious about the checkins graphs, nothing prevents us from visualizing them. Just uncomment the following lines. 
However, this has not been tested properly, so we might encounter some slow-downs with the checkins graphs\n", "\n", "# checkins_graph = [G_brighkite_checkins, G_gowalla_checkins, G_foursquare_checkins]\n", "# for graph in checkins_graph:\n", - "# visualize_graphs(graph, k = None, connected=True)" + "# visualize_graphs(graph, k = 0.7, connected=True)" ] }, { @@ -909,7 +931,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "On a unix environment, if firefox is installed, we visualize them by running the following command:\n", + "On a unix environment, if firefox is installed, we can visualize them by running the following command:\n", "\n", "```bash\n", "firefox html_graphs/*.html\n", @@ -925,20 +947,19 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "# Properties of the networks\n", + "# [Properties of the networks](#toc0_)\n", "\n", - "In order to effectively visualize the outcomes of our analysis, we will construct a dataframe that encapsulates all the information retrieved from the networks under examination. It should be noted that the full networks, despite having undergone filtering, are still substantial in size, which results in prolonged execution times for the functions utilized. To mitigate this issue, we will take sub-samples. It is important to keep in mind that the accuracy of the results is proportional to the size of the sample utilized.\n", + "In order to effectively visualize the outcomes of our analysis, we will construct a dataframe that encapsulates all the information retrieved from the networks under examination. It should be noted that the full networks, despite having undergone filtering, are still substantial in size, which results in prolonged execution times for the functions that we are going to use. To mitigate this issue, we will take sub-samples. 
Keep in mind that the accuracy of the results is proportional to the size of the sample utilized.\n", "\n", - "In light of these considerations, I recommend conducting an initial review of the notebook with higher values of the sampling rate to expedite the display of the results and gain an understanding of the functionality of the implemented functions. At the end of this section I provided a link to my GitHub repository, where the results obtained through lower sampling rates can be downloaded. This approach allows for a preliminary assessment of the functionality of the functions with mock-networks, before proceeding with the analysis using the more precise results that necessitate longer computation times." + "In light of these considerations, I recommend conducting an initial review of the notebook with higher values of the sampling rate (the default ones) to expedite the display of the results and gain an understanding of the functionality of the implemented functions. At the end of this section I provided a link to my GitHub repository, where the results obtained through lower sampling rates can be downloaded. This approach allows for a preliminary assessment of the functionality of the functions with mock-networks, before proceeding with the analysis using the more precise results that necessitate longer computation times." 
] }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 23, "metadata": {}, "outputs": [], "source": [ @@ -950,7 +971,7 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 24, "metadata": {}, "outputs": [ { @@ -1026,7 +1047,7 @@ " 3\n", " Brightkite Friendship Graph\n", " 1500\n", - " 1180\n", + " 1072\n", " NaN\n", " NaN\n", " 7.313220\n", @@ -1038,7 +1059,7 @@ " 4\n", " Gowalla Friendship Graph\n", " 1500\n", - " 2512\n", + " 2309\n", " NaN\n", " NaN\n", " 7.313220\n", @@ -1067,8 +1088,8 @@ "0 Brightkite Checkins Graph 6493 292973 NaN \n", "1 Gowalla Checkins Graph 3073 62790 NaN \n", "2 Foursquare Checkins Graph 2324 246702 NaN \n", - "3 Brightkite Friendship Graph 1500 1180 NaN \n", - "4 Gowalla Friendship Graph 1500 2512 NaN \n", + "3 Brightkite Friendship Graph 1500 1072 NaN \n", + "4 Gowalla Friendship Graph 1500 2309 NaN \n", "5 Foursquare Friendship Graph 1397 5323 NaN \n", "\n", " Average Clustering Coefficient log N Average Shortest Path Length \\\n", @@ -1088,7 +1109,7 @@ "5 NaN NaN " ] }, - "execution_count": 17, + "execution_count": 24, "metadata": {}, "output_type": "execute_result" } @@ -1108,18 +1129,17 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "## Average Degree\n", + "## [Average Degree](#toc0_)\n", "\n", "The concept of degree refers to the number of links that are connected to a particular node. While the average degree is a basic measure, it is not deemed to be of significant utility for our upcoming analysis, thus it will not be given extensive consideration. In contrast, the degree distribution, represented by $P(k)$, which represents the proportion of nodes that have a degree of $k$, is deemed to be a more meaningful metric. The literature on network analysis indicates that real-world networks often do not adhere to the Poisson degree distribution that is predicted by the ER model. 
Instead, many networks exhibit a degree distribution with a long-tailed, power-law distribution, such that $P(k) \\sim k^{-\\gamma}$, with a value of $\\gamma$ typically ranging from $2$ to $3$." ] }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 25, "metadata": {}, "outputs": [ { @@ -1166,12 +1186,12 @@ " \n", " 3\n", " Brightkite Friendship Graph\n", - " 1.573333\n", + " 1.429333\n", " \n", " \n", " 4\n", " Gowalla Friendship Graph\n", - " 3.349333\n", + " 3.078667\n", " \n", " \n", " 5\n", @@ -1187,12 +1207,12 @@ "0 Brightkite Checkins Graph 90.242723\n", "1 Gowalla Checkins Graph 40.865604\n", "2 Foursquare Checkins Graph 212.30809\n", - "3 Brightkite Friendship Graph 1.573333\n", - "4 Gowalla Friendship Graph 3.349333\n", + "3 Brightkite Friendship Graph 1.429333\n", + "4 Gowalla Friendship Graph 3.078667\n", "5 Foursquare Friendship Graph 7.620616" ] }, - "execution_count": 18, + "execution_count": 25, "metadata": {}, "output_type": "execute_result" } @@ -1206,11 +1226,10 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "## Clustering coefficient\n", + "## [Clustering coefficient](#toc0_)\n", "\n", "The Clustering Coefficient `[2]` refers to the concept of communities represented by local structures in a network. This notion is generally related to the number of triangles present in the network and considered high when two nodes sharing a common neighbor exhibit a high probability of being connected. 
There are two commonly accepted definitions of clustering: the global definition and the local definition.\n", "\n", @@ -1239,7 +1258,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 26, "metadata": {}, "outputs": [ { @@ -1249,22 +1268,22 @@ "\n", "Computing average clustering coefficient for the Brightkite Checkins Graph...\n", "\tAverage clustering coefficient: 0.7139988006862793\n", - "\tCPU time: 13.8 seconds\n", + "\tCPU time: 13.7 seconds\n", "\n", "Computing average clustering coefficient for the Gowalla Checkins Graph...\n", "\tAverage clustering coefficient: 0.5483724940778376\n", - "\tCPU time: 1.6 seconds\n", + "\tCPU time: 1.5 seconds\n", "\n", "Computing average clustering coefficient for the Foursquare Checkins Graph...\n", "\tAverage clustering coefficient: 0.6527297407924693\n", "\tCPU time: 17.5 seconds\n", "\n", "Computing average clustering coefficient for the Brightkite Friendship Graph...\n", - "\tAverage clustering coefficient: 0.0798461919826572\n", + "\tAverage clustering coefficient: 0.07238126177648738\n", "\tCPU time: 0.0 seconds\n", "\n", "Computing average clustering coefficient for the Gowalla Friendship Graph...\n", - "\tAverage clustering coefficient: 0.17020576105278323\n", + "\tAverage clustering coefficient: 0.15971676222947884\n", "\tCPU time: 0.0 seconds\n", "\n", "Computing average clustering coefficient for the Foursquare Friendship Graph...\n", @@ -1316,12 +1335,12 @@ " \n", " 3\n", " Brightkite Friendship Graph\n", - " 0.079846\n", + " 0.072381\n", " \n", " \n", " 4\n", " Gowalla Friendship Graph\n", - " 0.170206\n", + " 0.159717\n", " \n", " \n", " 5\n", @@ -1337,12 +1356,12 @@ "0 Brightkite Checkins Graph 0.713999\n", "1 Gowalla Checkins Graph 0.548372\n", "2 Foursquare Checkins Graph 0.65273\n", - "3 Brightkite Friendship Graph 0.079846\n", - "4 Gowalla Friendship Graph 0.170206\n", + "3 Brightkite Friendship Graph 0.072381\n", + "4 Gowalla Friendship Graph 0.159717\n", "5 Foursquare Friendship 
Graph 0.183485" ] }, - "execution_count": 19, + "execution_count": 26, "metadata": {}, "output_type": "execute_result" } @@ -1361,11 +1380,10 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "## Average Path Length\n", + "## [Average Path Length](#toc0_)\n", "\n", "In the context of our network analysis, it is important to note that networks are not embedded in physical space and thus, the geometrical distance between nodes becomes irrelevant. Instead, the most pertinent measure of distance in such networks is the minimal number of hops, also known as the chemical distance. This distance between two nodes is defined as the number of edges in the shortest path connecting the nodes.\n", "\n", @@ -1377,7 +1395,7 @@ "\n", "where $V$ represents the set of nodes in the graph, $n$ represents the number of nodes, and $d(s,t)$ is the shortest path length between nodes $s$ and $t$. The default algorithm used to calculate the shortest path length is the Dijkstra algorithm.\n", "\n", - "Given the size of the datasets, computing the average shortest path length for the entire graph is not feasible. To overcome this, we can use the `average_shortest_path` function from the utils module to compute the average shortest path length of a random subsample of the graph. This function requires the input of the networkx graph object and an optional parameter `k` which represents the percentage of nodes to remove from the graph. If `k` is set to None, the average shortest path length of each connected component is calculated using all the nodes of the component. The function returns the average shortest path length of the graph.\n", + "Given the size of the datasets, computing the average shortest path length for the entire graph is not feasible. To overcome this, we can use the `average_shortest_path` function from the utils module to compute the average shortest path length of a random subsample of the graph. 
This function requires the input of the networkx graph object and an optional parameter `k` which represents the percentage of nodes to remove from the graph. If `k` is set to `None`, the average shortest path length of each connected component is calculated using all the nodes of the component. The function returns the average shortest path length of the graph.\n", "\n", "The implementation involves first removing a random subsample of nodes from the graph, creating a list of connected components with at least 10 nodes, and then using the `average_shortest_path_length` function to calculate the average shortest path length. The choice of 10 nodes is arbitrary and based on empirical observations, as small communities with low average shortest path lengths can skew results. The value of `k` can be adjusted based on the available computing resources and time, with lower values providing more precise results but taking longer to compute and vice versa. However, in this case, the computation time is not overly excessive, so if we are willing to wait a few minutes, we can use the default value of `k` which is `None`.\n", "\n", @@ -1432,7 +1450,7 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 27, "metadata": {}, "outputs": [ { @@ -1441,46 +1459,46 @@ "text": [ "\n", "Computing average shortest path length for graph: Brightkite Checkins Graph\n", - "\tNumber of nodes after removing 60.0% of nodes: 2598\n", - "\tNumber of edges after removing 60.0% of nodes: 45813\n", + "\tNumber of nodes after removing 50.0% of nodes: 3247\n", + "\tNumber of edges after removing 50.0% of nodes: 72733\n", "\tNumber of connected components with more then 10 nodes: 1 \n", - "\tAverage shortest path length: 3.13ngth of connected component with 2281 nodes and 45750 edges \n", - "\tCPU time: 8.3 seconds\n", + "\tAverage shortest path length: 3.12ngth of connected component with 2925 nodes and 72656 edges \n", + "\tCPU time: 14.9 seconds\n", "\n", "Computing average 
shortest path length for graph: Gowalla Checkins Graph\n", - "\tNumber of nodes after removing 60.0% of nodes: 1230\n", - "\tNumber of edges after removing 60.0% of nodes: 10319\n", + "\tNumber of nodes after removing 50.0% of nodes: 1537\n", + "\tNumber of edges after removing 50.0% of nodes: 15177\n", "\tNumber of connected components with more then 10 nodes: 1 \n", - "\tAverage shortest path length: 3.61ngth of connected component with 1036 nodes and 10290 edges \n", - "\tCPU time: 1.4 seconds\n", + "\tAverage shortest path length: 3.73ngth of connected component with 1351 nodes and 15146 edges \n", + "\tCPU time: 2.3 seconds\n", "\n", "Computing average shortest path length for graph: Foursquare Checkins Graph\n", - "\tNumber of nodes after removing 60.0% of nodes: 930\n", - "\tNumber of edges after removing 60.0% of nodes: 37891\n", + "\tNumber of nodes after removing 50.0% of nodes: 1162\n", + "\tNumber of edges after removing 50.0% of nodes: 58703\n", "\tNumber of connected components with more then 10 nodes: 1 \n", - "\tAverage shortest path length: 2.23ngth of connected component with 893 nodes and 37890 edges \n", - "\tCPU time: 2.3 seconds\n", + "\tAverage shortest path length: 2.21ngth of connected component with 1109 nodes and 58698 edges \n", + "\tCPU time: 3.7 seconds\n", "\n", "Computing average shortest path length for graph: Brightkite Friendship Graph\n", - "\tNumber of nodes after removing 60.0% of nodes: 600\n", - "\tNumber of edges after removing 60.0% of nodes: 204\n", - "\tNumber of connected components with more then 10 nodes: 2 \n", - "\tAverage shortest path length: 7.01ngth of connected component with 14 nodes and 15 edges s \n", + "\tNumber of nodes after removing 50.0% of nodes: 750\n", + "\tNumber of edges after removing 50.0% of nodes: 212\n", + "\tNumber of connected components with more then 10 nodes: 3 \n", + "\tAverage shortest path length: 11.69gth of connected component with 24 nodes and 27 edges \n", "\tCPU time: 0.0 
seconds\n", "\n", "Computing average shortest path length for graph: Gowalla Friendship Graph\n", - "\tNumber of nodes after removing 60.0% of nodes: 600\n", - "\tNumber of edges after removing 60.0% of nodes: 374\n", + "\tNumber of nodes after removing 50.0% of nodes: 750\n", + "\tNumber of edges after removing 50.0% of nodes: 580\n", "\tNumber of connected components with more then 10 nodes: 3 \n", - "\tAverage shortest path length: 12.07gth of connected component with 79 nodes and 108 edges \n", + "\tAverage shortest path length: 11.87gth of connected component with 11 nodes and 11 edges s \n", "\tCPU time: 0.0 seconds\n", "\n", "Computing average shortest path length for graph: Foursquare Friendship Graph\n", - "\tNumber of nodes after removing 60.0% of nodes: 559\n", - "\tNumber of edges after removing 60.0% of nodes: 860\n", + "\tNumber of nodes after removing 50.0% of nodes: 699\n", + "\tNumber of edges after removing 50.0% of nodes: 1636\n", "\tNumber of connected components with more then 10 nodes: 1 \n", - "\tAverage shortest path length: 3.65ngth of connected component with 279 nodes and 786 edges \n", - "\tCPU time: 0.1 seconds\n" + "\tAverage shortest path length: 3.74ngth of connected component with 468 nodes and 1563 edges \n", + "\tCPU time: 0.2 seconds\n" ] }, { @@ -1512,32 +1530,32 @@ " \n", " 0\n", " Brightkite Checkins Graph\n", - " 3.125381\n", + " 3.117784\n", " \n", " \n", " 1\n", " Gowalla Checkins Graph\n", - " 3.614332\n", + " 3.726131\n", " \n", " \n", " 2\n", " Foursquare Checkins Graph\n", - " 2.234452\n", + " 2.213298\n", " \n", " \n", " 3\n", " Brightkite Friendship Graph\n", - " 7.01471\n", + " 11.694492\n", " \n", " \n", " 4\n", " Gowalla Friendship Graph\n", - " 12.070457\n", + " 11.865207\n", " \n", " \n", " 5\n", " Foursquare Friendship Graph\n", - " 3.64609\n", + " 3.739097\n", " \n", " \n", "\n", @@ -1545,27 +1563,27 @@ ], "text/plain": [ " Graph Average Shortest Path Length\n", - "0 Brightkite Checkins Graph 3.125381\n", - "1 
Gowalla Checkins Graph 3.614332\n", - "2 Foursquare Checkins Graph 2.234452\n", - "3 Brightkite Friendship Graph 7.01471\n", - "4 Gowalla Friendship Graph 12.070457\n", - "5 Foursquare Friendship Graph 3.64609" + "0 Brightkite Checkins Graph 3.117784\n", + "1 Gowalla Checkins Graph 3.726131\n", + "2 Foursquare Checkins Graph 2.213298\n", + "3 Brightkite Friendship Graph 11.694492\n", + "4 Gowalla Friendship Graph 11.865207\n", + "5 Foursquare Friendship Graph 3.739097" ] }, - "execution_count": 20, + "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "# if you want just to test it out, leave k = 0.6, it will only take a few seconds. More accurate results will be available to download after\n", + "# if you want just to test it out, leave k = 0.5, it will only take a few seconds. More accurate results will be available to download after\n", "\n", "for graph in graphs_all:\n", " print(\"\\nComputing average shortest path length for graph: \", graph.name)\n", "\n", " start = time.time()\n", - " average_shortest_path_length = average_shortest_path(graph, k = 0.6)\n", + " average_shortest_path_length = average_shortest_path(graph, k = 0.5)\n", " end = time.time()\n", "\n", " print(\"\\tAverage shortest path length: {}\".format(round(average_shortest_path_length,2)))\n", @@ -1581,7 +1599,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Betweenness Centrality\n", + "## [Betweenness Centrality](#toc0_)\n", "\n", "In a network, the significance of a node is dependent on a multitude of factors. The importance of a website could stem from its content while that of a router could stem from its capacity. However, these properties are contingent upon the type of network being studied and may have limited correlation with the graph structure of the network. We are interested on the importance of a node or link in terms of its topological function within the network. 
It is reasonable to infer that the topology of a network may intrinsically dictate the significance of different nodes. One measure of centrality is the degree of a node, where the higher the degree, the greater the node's connectivity, and thus, its centrality in the network. Nevertheless, the degree is not the sole determinant of a node's significance.\n", "\n", @@ -1595,20 +1613,18 @@ "\n", "--- \n", "\n", - "The networkx library, which is a commonly used library for network analysis, includes a function for computing the betweenness centrality of all nodes in a network. This function is based on the algorithm proposed by Ulrik Brandes in `[16]`, which involves the calculation of shortest paths between all pairs of nodes in the network and counting the number of shortest paths that pass through each node.\n", + "The networkx library includes a function for computing the betweenness centrality of all nodes in a network. This function is based on the algorithm proposed by Ulrik Brandes in `[16]`, which involves the calculation of shortest paths between all pairs of nodes in the network and counting the number of shortest paths that pass through each node.\n", "\n", - "However, the computation of this algorithm on large networks may not be feasible within a reasonable time frame due to the computational cost. To mitigate this issue, a sampling approach can be employed, which provides approximate results. Nevertheless, even with heavy sampling, the computation time remains prohibitively high. To avoid further sampling, which would introduce bias, we will use a parallelization approach to speed up the computation.\n", + "However, the computation of this algorithm on large networks may not be feasible within a reasonable time frame due to the computational cost. To mitigate this issue, we can use a sampling approach. Nevertheless, even with heavy sampling, the computation time remains prohibitively high. 
To avoid further sampling, which would introduce bias, we will use a parallelization approach to speed up the computation.\n", "\n", "In the `utils` module, I have implemented a function called `betweenness_centrality_parallel` that uses this approach. The function takes as input a networkx graph object, the number of processes to use for computation (default is 1, which uses the standard betweenness algorithm), and the percentage of nodes to remove from the graph (default is `None`, which uses all nodes of the connected component to compute the average shortest path length). The function divides the network into _chunks_ of nodes and computes their contribution to the betweenness centrality of the whole network in parallel, ultimately returning a dictionary of the betweenness centrality of each node.\n", "\n", - "In the `utils` module I implemented a function called `betweenness_centrality_parallel`. The function takes as input\n", - "\n", "Please note that for large graphs, it is advisable to not use more than 6 processes to avoid memory constraints. The number of processes to use can be determined based on the available time and the machine being used. For small graphs, more processes may be used. As for the percentage of nodes to remove, lower values provide more precise results but take longer to compute, while higher values result in less precise results but are faster to compute. It is suggested to start with `k=0.6` for a quick test and use `k=0.2` for a more precise result. For more information, refer to the function code in the `utils` module." 
] }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 20, "metadata": {}, "outputs": [ { @@ -1617,40 +1633,52 @@ "text": [ "\n", "Computing the approximate betweenness centrality for the Brightkite Checkins Graph...\n", - "\tNumber of nodes after removing 50.0% of nodes: 3247\n", - "\tNumber of edges after removing 50.0% of nodes: 67822\n", - "\tBetweenness centrality: 0.0005491138340434757 \n", - "\tCPU time: 15.2 seconds\n", + "\n", + "Graph is not connected. Taking the largest connected component\n", + "Number of nodes in the sampled graph: 3895\n", + "Number of edges in the sampled graph: 109595\n", + "\tBetweenness centrality: 0.000571670929493879 \n", + "\tCPU time: 98.5 seconds\n", "\n", "Computing the approximate betweenness centrality for the Gowalla Checkins Graph...\n", - "\tNumber of nodes after removing 50.0% of nodes: 1537\n", - "\tNumber of edges after removing 50.0% of nodes: 14783\n", - "\tBetweenness centrality: 0.001392101838155042 \n", - "\tCPU time: 2.4 seconds\n", + "\n", + "Graph is not connected. Taking the largest connected component\n", + "Number of nodes in the sampled graph: 1843\n", + "Number of edges in the sampled graph: 23885\n", + "\tBetweenness centrality: 0.0015210855257160798 \n", + "\tCPU time: 12.7 seconds\n", "\n", "Computing the approximate betweenness centrality for the Foursquare Checkins Graph...\n", - "\tNumber of nodes after removing 50.0% of nodes: 1162\n", - "\tNumber of edges after removing 50.0% of nodes: 62236\n", - "\tBetweenness centrality: 0.0009206912041184174 \n", - "\tCPU time: 5.1 seconds\n", + "\n", + "Graph is not connected. 
Taking the largest connected component\n", + "Number of nodes in the sampled graph: 1394\n", + "Number of edges in the sampled graph: 86059\n", + "\tBetweenness centrality: 0.0009135778105161979 \n", + "\tCPU time: 30.7 seconds\n", "\n", "Computing the approximate betweenness centrality for the Brightkite Friendship Graph...\n", - "\tNumber of nodes after removing 50.0% of nodes: 750\n", - "\tNumber of edges after removing 50.0% of nodes: 259\n", - "\tBetweenness centrality: 2.8891760612486286e-05 \n", + "\n", + "Graph is not connected. Taking the largest connected component\n", + "Number of nodes in the sampled graph: 900\n", + "Number of edges in the sampled graph: 381\n", + "\tBetweenness centrality: 0.024814375935463612 \n", "\tCPU time: 0.3 seconds\n", "\n", "Computing the approximate betweenness centrality for the Gowalla Friendship Graph...\n", - "\tNumber of nodes after removing 50.0% of nodes: 750\n", - "\tNumber of edges after removing 50.0% of nodes: 584\n", - "\tBetweenness centrality: 0.001038085242593214 \n", - "\tCPU time: 0.3 seconds\n", + "\n", + "Graph is not connected. Taking the largest connected component\n", + "Number of nodes in the sampled graph: 900\n", + "Number of edges in the sampled graph: 797\n", + "\tBetweenness centrality: 0.014602979710672098 \n", + "\tCPU time: 0.4 seconds\n", "\n", "Computing the approximate betweenness centrality for the Foursquare Friendship Graph...\n", - "\tNumber of nodes after removing 50.0% of nodes: 699\n", - "\tNumber of edges after removing 50.0% of nodes: 1336\n", - "\tBetweenness centrality: 0.0018510961526383627 \n", - "\tCPU time: 0.4 seconds\n" + "\n", + "Graph is not connected. 
Taking the largest connected component\n", + "Number of nodes in the sampled graph: 838\n", + "Number of edges in the sampled graph: 1860\n", + "\tBetweenness centrality: 0.00518979036959232 \n", + "\tCPU time: 0.8 seconds\n" ] }, { @@ -1682,32 +1710,32 @@ " \n", " 0\n", " Brightkite Checkins Graph\n", - " 0.000549\n", + " 0.000572\n", " \n", " \n", " 1\n", " Gowalla Checkins Graph\n", - " 0.001392\n", + " 0.001521\n", " \n", " \n", " 2\n", " Foursquare Checkins Graph\n", - " 0.000921\n", + " 0.000914\n", " \n", " \n", " 3\n", " Brightkite Friendship Graph\n", - " 0.000029\n", + " 0.024814\n", " \n", " \n", " 4\n", " Gowalla Friendship Graph\n", - " 0.001038\n", + " 0.014603\n", " \n", " \n", " 5\n", " Foursquare Friendship Graph\n", - " 0.001851\n", + " 0.00519\n", " \n", " \n", "\n", @@ -1715,15 +1743,15 @@ ], "text/plain": [ " Graph betweenness centrality\n", - "0 Brightkite Checkins Graph 0.000549\n", - "1 Gowalla Checkins Graph 0.001392\n", - "2 Foursquare Checkins Graph 0.000921\n", - "3 Brightkite Friendship Graph 0.000029\n", - "4 Gowalla Friendship Graph 0.001038\n", - "5 Foursquare Friendship Graph 0.001851" + "0 Brightkite Checkins Graph 0.000572\n", + "1 Gowalla Checkins Graph 0.001521\n", + "2 Foursquare Checkins Graph 0.000914\n", + "3 Brightkite Friendship Graph 0.024814\n", + "4 Gowalla Friendship Graph 0.014603\n", + "5 Foursquare Friendship Graph 0.00519" ] }, - "execution_count": 21, + "execution_count": 20, "metadata": {}, "output_type": "execute_result" } @@ -1732,7 +1760,7 @@ "for graph in graphs_all:\n", " print(\"\\nComputing the approximate betweenness centrality for the {}...\".format(graph.name))\n", " start = time.time()\n", - " betweenness_centrality = np.mean(list(betweenness_centrality_parallel(graph, 6, k = 0.5).values()))\n", + " betweenness_centrality = np.mean(list(betweenness_centrality_parallel(graph, 6, k = 0.4).values()))\n", " end = time.time()\n", " print(\"\\tBetweenness centrality: {} \".format(betweenness_centrality))\n",
" print(\"\\tCPU time: \" + str(round(end-start,1)) + \" seconds\")\n", @@ -1743,18 +1771,17 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "## Download the dataframe with accurate results\n", + "## [Download the dataframe with accurate results](#toc0_)\n", "\n", "All the results from the previous section are available in the following dataframe. Each function has been executed with as little sampling as possible; some of them took hours to complete. The dataframe is available in the block below" ] }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 33, "metadata": {}, "outputs": [ { @@ -1786,7 +1813,6 @@ " log N\n", " Average Shortest Path Length\n", " betweenness centrality\n", - " omega-coefficient\n", " \n", " \n", " \n", @@ -1800,7 +1826,6 @@ " 8.778480\n", " 3.157626\n", " 0.000533\n", - " -0.25\n", " \n", " \n", " 1\n", @@ -1812,7 +1837,6 @@ " 8.030410\n", " 3.827384\n", " 0.001395\n", - " -0.20\n", " \n", " \n", " 2\n", @@ -1824,7 +1848,6 @@ " 7.751045\n", " 2.217319\n", " 0.000964\n", - " -0.17\n", " \n", " \n", " 3\n", @@ -1836,7 +1859,6 @@ " 7.313220\n", " 5.921753\n", " 0.000039\n", - " -0.18\n", " \n", " \n", " 4\n", @@ -1848,7 +1870,6 @@ " 7.313220\n", " 8.975185\n", " 0.001483\n", - " -0.24\n", " \n", " \n", " 5\n", @@ -1860,7 +1881,6 @@ " 7.242082\n", " 3.868567\n", " 0.001803\n", - " -0.05\n", " \n", " \n", "\n", @@ -1883,34 +1903,38 @@ "4 0.174906 7.313220 8.975185 \n", "5 0.183485 7.242082 3.868567 \n", "\n", - " betweenness centrality omega-coefficient \n", - "0 0.000533 -0.25 \n", - "1 0.001395 -0.20 \n", - "2 0.000964 -0.17 \n", - "3 0.000039 -0.18 \n", - "4 0.001483 -0.24 \n", - "5 0.001803 -0.05 " + " betweenness centrality \n", + "0 0.000533 \n", + "1 0.001395 \n", + "2 0.000964 \n", + "3 0.000039 \n", + "4 0.001483 \n", + "5 0.001803 " ] }, - "execution_count": 22, + "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "# if not
os.path.exists(os.path.join('server_results', 'analysis_results.pkl')):\n", - "# print(\"Downloading the analysis results file...\")\n", - "# wget.download('https://github.com/lukefleed/small-worlds/raw/main/server_results/analysis_results.pkl', out='server_results')\n", + "if not os.path.exists(os.path.join('server_results', 'analysis_results.pkl')):\n", + " print(\"Downloading the analysis results file...\")\n", + "\n", + " if not os.path.exists('server_results'):\n", + " os.mkdir('server_results')\n", + "\n", + " wget.download('https://github.com/lukefleed/small-worlds/raw/main/server_results/analysis_results.pkl', out='server_results/analysis_results.pkl')\n", "\n", "analysis_results = pd.read_pickle('server_results/analysis_results.pkl')\n", - "analysis_results" + "analysis_results.iloc[:, :-1] # we do not need last column for now" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Analysis of the results" + "# [Analysis of the results](#toc0_)" ] }, { @@ -1918,18 +1942,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Distribution of Degree\n", + "### [Distribution of Degree](#toc0_)\n", "\n", - "In the preceding section, we established that a scale-free network exhibits a skewed distribution of node degrees, resulting from a few nodes possessing a significantly higher number of connections compared to the majority of nodes. Such networks contain \"hubs\" or high-degree nodes that play a disproportionate role in the structure and function of the network. Conversely, a random network showcases a more uniform distribution of node degrees with nodes possessing approximately the same number of connections.\n", + "In the previous section, we established that a scale-free network exhibits a skewed distribution of node degrees, resulting from a few nodes possessing a significantly higher number of connections compared to the majority of nodes. 
Such networks contain \"hubs\" or high-degree nodes that play a disproportionate role in the structure and function of the network. On the other hand, a random network showcases a more uniform distribution of node degrees with nodes possessing approximately the same number of connections.\n", "\n", "---\n", "\n", - "We shall now determine if our networks are scale-free or not. To this end, we utilize the `degree_distribution` function from the `utils` module to plot the degree distribution of a graph. The function accepts a networkx graph object as input and returns a plot of the degree distribution. We anticipate observing a power-law distribution, rather than a Poissonian distribution." + "We can now determine if our networks are scale-free or not. To this end, we utilize the `degree_distribution` function from the `utils` module to plot the degree distribution of a graph. The function accepts a networkx graph object as input and returns a plot of the degree distribution. We anticipate observing a power-law distribution, rather than a Poissonian distribution." ] }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 34, "metadata": {}, "outputs": [ { @@ -3854,9 +3878,9 @@ } }, "text/html": [ - "