{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%load_ext autoreload\n",
"%autoreload 2\n",
"\n",
"import os\n",
"import time\n",
"import wget\n",
"import zipfile\n",
"import numpy as np\n",
"import pandas as pd\n",
"import networkx as nx\n",
"import multiprocessing\n",
"import geopandas as gpd\n",
"import plotly.graph_objects as go\n",
"from collections import Counter\n",
"from src.utils import *\n",
"from tqdm import tqdm\n",
"\n",
"# ignore warnings\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Aim of the project\n",
"\n",
"Given a social network, which of its nodes are more central? This question has been asked many times in sociology, psychology and computer science, and a whole plethora of centrality measures (a.k.a. centrality indices, or rankings) were proposed to account for the importance of the nodes of a network. \n",
"\n",
"These networks, typically generated directly or indirectly by human activity and interaction (and therefore hereafter dubbed social”), appear in a large variety of contexts and often exhibit a surprisingly similar structure. One of the most important notions that researchers have been trying to capture in such networks is “node centrality”: ideally, every node (often representing an individual) has some degree of influence or importance within the social domain under consideration, and one expects such importance to surface in the structure of the social network; centrality is a quantitative measure that aims at revealing the importance of a node.\n",
"\n",
"Among the types of centrality that have been considered in the literature, many have to do with distances between nodes. Take, for instance, a node in an undirected connected network: if the sum of distances to all other nodes is large, the node under consideration is peripheral; this is the starting point to define Bavelas's closeness centrality, which is the reciprocal of peripherality (i.e., the reciprocal of the sum of distances to all other nodes). \n",
"\n",
"The role played by shortest paths is justified by one of the most well-known features of complex networks, the so-called **small-world phenomenon**. A small-world network is a graph where the average distance between nodes is logarithmic in the size of the network, whereas the clustering coefficient is larger (that is, neighborhoods tend to be denser) than in a random Erdős-Rényi graph with the same size and average distance. The fact that social networks (whether electronically mediated or not) exhibit the small-world property is known at least since Milgram's famous experiment and is arguably the most popular of all features of complex networks. For instance, the average distance of the Facebook graph was recently established to be just $4.74$.\n",
"\n",
"---\n",
"\n",
"We aim to study the small-world phenomenon in the context of social networks, and to do so we will consider a large number of centrality measures. We will use 3 real-world datasets, trying to understand how the small-world phenomenon manifests itself in each of them. We will also try to understand how the small-world phenomenon is affected by the choice of centrality measure."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Random Networks: The Erdős-Rényi model\n",
"\n",
"<!-- Before 1960, graph theory mainly dealt with the properties of specific individual graphs. In the 1960s, Paul Erdős and Alfred Rényi initiated a systematic study of random graphs. Random graph theory is, in fact, not the study of individual graphs, but the study of a statistical ensemble of graphs (or, as mathematicians prefer to call it, a probability space of graphs). The ensemble is a class consisting of many different graphs, where each graph has a probability attached to it. A property studied is said to exist with probability $P$ if the total probability of a graph in the ensemble possessing that property is $P$ (or the total fraction of graphs in the ensemble that has this property is $P$). This approach allows the use of probability theory in conjunction with discrete mathematics for studying graph ensembles. A property is said to exist for a class of graphs if the fraction of graphs in the ensemble which does not have this property is of zero measure. This is usually termed as a property of \\emph{almost every (a.e.)} graph. Sometimes the terms “almost surely” or “with high probability” are also used (with the former usually taken to mean that the residual probability vanishes exponentially with the system size). -->\n",
"\n",
"Prior to the 1960s, graph theory primarily focused on the characteristics of individual graphs. In the 1960s, Paul Erdős and Alfred Rényi introduced a systematic approach to studying random graphs, which involves analyzing a collection, or ensemble, of many different graphs. Each graph in the ensemble is assigned a probability, and a property is said to hold with probability $P$ if the total probability of the graphs in the ensemble possessing that property is $P$, or if the fraction of graphs in the ensemble with the property is $P$. This method allows for the application of probability theory in conjunction with discrete math to study ensembles of graphs. A property is considered to hold for a class of graphs if the fraction of graphs in the ensemble without the property has zero measure, which is typically referred to as being true for \"almost every\" graph in the ensemble. `[2]`\n",
"\n",
"## Definition of a random graph\n",
"\n",
"Let $E_{n,N}$ denote the set of alla graphs having $n$ given labelled vertices $V_1,V_2, \\dots, V_n$ and $N$ edges [1]. The graphs considered are supposed to be not oriented, without parallel edges and without slings. Thus a graph belonging to $E_{n,N}$ is obtained by choosing $N$ out of the $\\binom{n}{2}$ possible edges between the points $V_1,V_2, \\dots, V_n$, and therefore the number of elements of $E_{n,N}$ is given by the binomial coefficient $\\binom{\\binom{n}{2}}{N}$. \n",
"\n",
"A random graph $\\Gamma_{n,N}$ can be defined as a element of $E_{n,N}$ chosen at random, so that each of the elements of $E_{n,N}$ has the same probability of being chosen, namely $\\frac{1}{\\binom{\\binom{n}{2}}{N}}$.\n",
"\n",
"Let's try to modify this point of view and use a bit of probability theory. _We may consider the formation of a random graph as a stochastic process_ defined as follows: At time $t=1$ we choose out of the $\\binom{n}{2}$ possible edges between the points $V_1,V_2, \\dots, V_n$ $N$ edges, each of this edges having the same probability of being chosen; let this edge be denoted as $e_1$. At time $t=2$ we choose one of the possible $\\binom{n}{2} -1$, different from $e_1$, all this being equiprobable. Continuing this process at time $t=k+1$ we choose one of the possible $\\binom{n}{2} -k$, different from $e_1, e_2, \\dots, e_k$, all this being equiprobable, i.e having the probability $\\frac{1}{\\binom{n}{2} -k}$. We denote $\\Gamma_{n,N}$ the graph obtained by choosing $N$ edges in this way.\n",
"\n",
"> NOTE: the two definitions are equivalent, but the second one is more convenient for the study of the properties of random graphs. According to this interpretation we may study the evolution of random graphs, i.e. the step-by-step unraveling of the structure of the graph when $N$ increases. This will be an essential point in our study of the properties of small-worldness.\n",
"\n",
"\n",
"## Erdős-Rényi graphs\n",
"\n",
"There are two well-known ensembles of graphs that have been extensively studied: the ensemble of all graphs with $N$ nodes and $M$ edges, denoted $G_{N,M}$, and the ensemble of all graphs with $N$ nodes and a probability $p$ of any two nodes being connected, denoted $G_{N,p}$. These ensembles, initially studied by Erdős and Rényi, are similar when $M = \\binom{N}{2} p$, and are therefore referred to as ER graphs when $p$ is not too close to $0$ or $1$. `[2]`\n",
"\n",
"An important feature of a graph is its average degree, or the average number of edges connected to each node. We will denote the degree of the $i$-th node by $k_i$ and the average degree by $\\langle r \\rangle$. Graphs with $N$ nodes and $\\langle k \\rangle = O(N^0)$ are called sparse graphs.\n",
"\n",
"One interesting property of the ensemble $G_{N,p}$ is that many of its characteristics have a corresponding threshold function, $p_t(N)$, such that the property exists with probability 0 if $p < p_t$ and with probability 1 if $p > p_t$ in the \"thermodynamic limit\" of $N \\to \\infty$. This is similar to the physical concept of a percolation phase transition.\n",
"\n",
"Another property of interest is the average path length between any two nodes, which is typically of order $\\ln N$ in almost every graph of the ensemble (with $\\langle k \\rangle > 1$ and finite). This small, logarithmic distance is the source of the \"small-world\" phenomena that are characteristic of networks.\n",
"\n",
"## Scale-free networks\n",
"\n",
"The Erdős-Rényi model `[4]` has long been the primary focus of research in the field of random graphs. However, recent studies of real-world networks have shown that the ER model does not accurately capture many of their observed properties. One such property that can be easily measured is the degree distribution, or the fraction $P(k)$ of nodes with $k$ connections (degree $k$). A well-known result for ER networks is that the degree distribution follows a Poisson distribution, given by `[2]`\n",
"\n",
"\\begin{equation}\n",
"P(k) = \\frac{e^{z} z^k}{k!}\n",
"\\end{equation}\n",
"\n",
"where $z = \\langle k \\rangle$ is the average degree `[13]`. However, measurements of the degree distribution for real networks often show that the Poisson law does not hold, instead exhibiting a scale-free degree distribution of the form\n",
"\n",
"\\begin{equation}\n",
"P(k) = ck^{-\\gamma} \\quad \\text{for} \\quad k = m, ... , K\n",
"\\end{equation}\n",
"\n",
"where $c \\sim (\\gamma -1)m^{\\gamma - 1}$ is a normalization factor, and $m$ and $K$ are the lower and upper cutoffs for the degree of a node, respectively. The divergence of moments higher than $\\lceil \\gamma -1 \\rceil$ (as $K \\to \\infty$ when $N \\to \\infty$) is responsible for many of the unusual properties attributed to scale-free networks.\n",
"\n",
"It is important to note that all real-world networks are finite, so all of their moments are finite as well. The actual value of the cutoff $K$ plays a significant role, and can be approximated by noting that the total probability of nodes with $k > K$ is approximately $1/N$ `[14]`, or\n",
"\n",
"\\begin{equation}\n",
"\\int_K^\\infty P(k) dk \\sim \\frac{1}{N}\n",
"\\end{equation}\n",
"\n",
"This gives the result\n",
"\n",
"\\begin{equation}\n",
"K \\sim m N^{1/(\\gamma -1)}\n",
"\\end{equation}\n",
"\n",
"The degree distribution is not the only characteristic that can be used to describe a network. Other quantities, such as the degree-degree correlation (between connected nodes), spatial correlations, clustering coefficient, betweenness or centrality distribution, and self-similarity exponents, can also provide insight into the network's structure and behavior.\n",
"\n",
"# Diameter and fractal dimension\n",
"\n",
"Regular lattices can be viewed as networks embedded in Euclidean space of a defined dimension $d$, meaning that $n(r)$, the number of nodes within a distance $r$ from an origin, grows as $n(r) \\sim r^d$ for large $r$. For fractal objects, the dimension $d$ in this relation may be a non-integer and is replaced by the fractal dimension $d_f$. `[2]`\n",
"\n",
"One example of a network where these power laws do not hold is the Cayley tree, also known as the Bethe lattice, which is a regular graph of fixed degree $z$ with no loops. An infinite Cayley tree cannot be embedded in a Euclidean space of finite dimensionality. The number of nodes at level $l$ grows as $n(l) \\sim (z - 1)^l$, which is faster than any power law, making Cayley trees infinite-dimensional systems.\n",
"\n",
"![Photo from [2]](https://i.imgur.com/VLxR3AL.png)\n",
"_Taken from [2]_\n",
"\n",
"Many random network models have locally tree-like structure (since most loops occur only when $n(l) \\sim N$), and since the number of nodes grows as $n(l) \\sim \\langle k - 1 \\rangle^l$, they are also infinite dimensional. As a result, the diameter of such graphs (i.e., the shortest path between the most distant nodes) scales as $D \\sim \\ln N$ `[13]`. Many properties of ER networks, including the logarithmic diameter, are also present in Cayley trees. This small diameter is in contrast to that of finite-dimensional lattices, where $D \\sim N^{1/d_l}$.\n",
"\n",
"Like ER networks, percolation on infinite-dimensional lattices and the Cayley tree exhibits a critical threshold $p_c = 1/(z - 1)$. For $p > p_c$, a \"giant cluster\" of size $N$ exists, while for $p < p_c$, only small clusters are present. At criticality ($p = p_c$) in infinite-dimensional lattices (similar to ER networks), the giant component is of size $N^{2/3}$. This result follows from the fact that percolation on lattices in dimension $d \\geq d_c = 6$ is in the same universality class as infinite-dimensional percolation, where the fractal dimension of the giant cluster is $d_f = 4$, resulting in a size of the giant cluster that scales as $N^{d_f/d_c} = N^{2/3}$. The dimension $d_c$ is known as the \"upper critical dimension,\" and this concept exists not only in percolation phenomena, but also in other physical models such as the self-avoiding walk model for polymers and the Ising model for magnetism, in both of which $d_c = 4$. `[2]`\n",
"\n",
"### Watts-Strogatz model\n",
"\n",
"In the year 1998, Watts and Strogatz presented a novel model for small-world networks in their seminal work `[3]`. This model preserves the high degree of local clustering, which is a characteristic of lattice structures where the neighbors of a node are more likely to be neighbors with each other than in random graphs. The model achieves a reduction in the diameter of the network to $D \\sim \\ln N$ by randomly rewiring a fraction $\\varphi$ of the links in a regular lattice to connect to distant nodes. The rewiring procedure is based on a probability $p$ assigned to each edge. If an edge is selected for rewiring, it is substituted with a new edge chosen at random with uniform probability. The resulting network is characterized by $N$ nodes, $k$ nearest neighbors, and an average distance of $\\log(N)/\\log(k)$.\n",
"\n",
"More details on the Watts-Strogatz model can be found in `[3]`.\n",
"\n",
"## Random graphs as a model of real networks\n",
"\n",
"Many physical and man-made systems can be represented as networks, which consist of objects and the interactions between them. Some examples include computer networks, such as the Internet, and logical networks, including the links between web pages and email networks, where the presence of an individual's address in another person's address book is represented by a link. Additionally, social interactions in populations or work relationships and movements of a system in a configuration space can also be described using a network. These examples, along with many others, possess a graph structure that can be studied. Although many of these networks exhibit some ordered structure, such as cluster and group formation, geographical or geometrical considerations, or specific properties, most of them possess complex and random structures that deviate from regular lattices. As a result, it is often assumed, with caution, that they share properties with random graph models.\n",
"\n",
"Scale-free networks can be considered as a generalization of Erdős-Rényi (ER) networks. When $\\gamma > 4$ for large $\\gamma$, the properties of scale-free networks, such as distances, optimal paths, and percolation, are similar to those in ER networks. Conversely, when $\\gamma < 4$, these properties exhibit anomalous behavior due to the strong heterogeneity in the degrees of nodes, which disrupts the node-to-node translational homogeneity (symmetry) present in classical homogeneous networks such as lattices, Cayley trees, and ER graphs. `[2]`\n",
"\n",
"---"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"In this section of the notebook, I will delve into the technical aspect of analyzing the properties of real-world networks that were discussed previously. I will make use of the networkx library, a Python-based tool for constructing, manipulating, and studying the structure, dynamics, and functions of complex networks. However, some algorithms required manual implementation and can be found in the utils.py file for further information.\n",
"\n",
"The computations were executed on an Arch Linux machine with a AMD Ryzen 5 2600 processor (6 cores and 12 threads) and 16 GB of RAM. The code was written in Python 3.10.9, and the required packages can be installed by executing the following command in the terminal:\n",
"\n",
"```bash\n",
"pip3 install -r requirements.txt\n",
"```\n",
"\n",
"I have made efforts to ensure that the code is widely compatible, but I was unable to test it on a Windows machine. In the event that any issues are encountered, please ~~install Linux~~ inform me so I can work towards resolving them."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Discovering the datasets\n",
"\n",
"To perform our analysis, we will use the following datasets:\n",
"\n",
"- **Brightkite**\n",
"- **Gowalla**\n",
"- **Foursquare**\n",
"\n",
"We can download the datasets using the function `download_dataset` from the `utils` module. It will download the datasets in the `data` folder, organized in sub-folders in the following way:\n",
"\n",
"```\n",
"data\n",
"├── brightkite\n",
"│ ├── brightkite_checkins_full.txt\n",
"│ └── brightkite_friends_edges.txt\n",
"├── foursquare\n",
"│ ├── foursquare_checkins_full.txt\n",
"│ ├── foursquare_friends_edges.txt\n",
"│ └── raw_POIs.txt\n",
"└── gowalla\n",
" ├── gowalla_checkins_full.txt\n",
" └── gowalla_friends_edges.txt\n",
"```\n",
"\n",
"If any of the datasets is already downloaded, it will not be downloaded again. For further details about the function below, please refer to the `utils` module.\n",
"\n",
"> NOTE: the Stanford servers tends to be slow, so it may take a while to download the datasets. It's gonna take about 3 minutes to download all the datasets.\n",
"\n",
"---\n",
"\n",
"### A deeper look at the datasets\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# this is a long and boring function to automatically download, extract, rename and save the datasets in a clean way. If you want to have a deeper look at the code, you can find it in utils.py\n",
"\n",
"download_datasets() # it takes about 3-4 minutes to download and extract the datasets with a fiber connection"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's have a deeper look at them.\n",
"\n",
"## Brightkite\n",
"\n",
"[Brightkite](http://www.brightkite.com/) was a location-based social networking service that allowed users to share their locations by checking in. The friendship network data was collected using the Brightkite public API. There are two datasets available for analysis: \n",
"\n",
"- `brightkite_checkins_full.txt`, which contains check-in data in the form of a tab-separated file with five columns: `user id`, `check-in time`, `latitude`, `longitude`, and `location id`\n",
" \n",
"- `brightkite_friends_edges.txt`, which is a tab-separated file with two columns containing user IDs and representing the friendship network in the form of a graph edge list. \n",
"\n",
"The `brightkite_checkins_full.txt` dataset must be converted into a graph before it can be analyzed properly, while the `brightkite_friends_edges.txt` dataset is already in a usable form for graph analysis.\n",
"\n",
"Let's have a more clear view of where our data have been generated"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# It takes about 2 minutes to run this block\n",
"\n",
"df_brighkite = pd.read_csv(os.path.join('data', 'brightkite', 'brightkite_checkins_full.txt'), \n",
" sep='\\t', \n",
" header=None,\n",
" names=['user id', 'check-in time', 'latitude', 'longitude', 'location id'],\n",
" parse_dates=['check-in time'],\n",
" engine='pyarrow')\n",
"\n",
"# take only the dates from 2009\n",
"df_brighkite = df_brighkite[df_brighkite['check-in time'].dt.year == 2009]\n",
"\n",
"# convert the dataframe to geopandas dataframe\n",
"gdf_brightkite = gpd.GeoDataFrame(df_brighkite, geometry=gpd.points_from_xy(df_brighkite.longitude, df_brighkite.latitude))\n",
"\n",
"# plot the geopandas dataframe\n",
"print(\"Number of unique users: \", len(df_brighkite['user id'].unique()))\n",
"gdf_brightkite.plot(marker='o', color='blue', markersize=1)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Familiar shape, isn't it? As we can see there are ~35k nodes, a bit too much for our future computation. Let's take a subset, like Europe!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gdf_brightkite = gdf_brightkite[gdf_brightkite['latitude'] < 60]\n",
"gdf_brightkite = gdf_brightkite[gdf_brightkite['latitude'] > 35]\n",
"gdf_brightkite = gdf_brightkite[gdf_brightkite['longitude'] < 30]\n",
"gdf_brightkite = gdf_brightkite[gdf_brightkite['longitude'] > -10]\n",
"\n",
"gdf_brightkite.plot(marker='o', color='blue', markersize=1)\n",
"\n",
"# update the pandas dataframe with the new values\n",
"df_brighkite = gdf_brightkite\n",
"print(\"Number of unique users in Europe: \", len(df_brighkite['user id'].unique()))\n",
"\n",
"# remove from memory the geopandas dataframe, it was only used for plotting\n",
"del gdf_brightkite"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Perfect! Now we can create a new .txt file, only with the information that we need"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# update the file with the new values. Drop the columns that are not needed\n",
"df_brighkite.to_csv(\n",
" os.path.join('data', 'brightkite', 'brightkite_checkins.txt'), \n",
" sep='\\t', \n",
" header=False, \n",
" index=False, \n",
" columns=['user id', 'location id'])\n",
"\n",
"# I prefer not to delete the full dataset, since it's bad practice in my opinion. If you want to delete it, uncomment the following line\n",
"\n",
"# os.remove(os.path.join('data', 'brightkite', 'brightkite_checkins_full.txt'))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Gowalla\n",
"\n",
"Gowalla is a location-based social networking website where users share their locations by checking-in. The friendship network is undirected and was collected using their public API. As for Brightkite, we will work with two different datasets. This is how they look like after being filtered by the `download_dataset` function:\n",
"\n",
"- `data/gowalla/gowalla_checkins.txt`: the checkins, a tsv file with 5 columns: `user id`, `check-in time`, `latitude`, `longitude`, `location id`\n",
"\n",
"- `data/gowalla/gowalla_friends_edges.txt`: the friendship network, a tsv file with 2 columns of users ids. This file it's in the form of a graph edge list. \n",
"\n",
"--- \n",
"\n",
"Let's have a more clear view of where our data have been generated"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_gowalla = pd.read_csv(os.path.join('data', 'gowalla', 'gowalla_checkins_full.txt'),\n",
" sep='\\t', \n",
" header=None,\n",
" names=['user id', 'check-in time', 'latitude', 'longitude', 'location id'],\n",
" parse_dates=['check-in time'],\n",
" engine='pyarrow')\n",
"\n",
"# take only the dates from 2009\n",
"df_gowalla = df_gowalla[df_gowalla['check-in time'].dt.year == 2009]\n",
"\n",
"# convert the dataframe to geopandas dataframe\n",
"gdf_gowalla = gpd.GeoDataFrame(df_gowalla, geometry=gpd.points_from_xy(df_gowalla.longitude, df_gowalla.latitude))\n",
"\n",
"# plot the geopandas dataframe\n",
"gdf_gowalla.plot(marker='o', color='red', markersize=1)\n",
"print(\"Number of unique users: \", len(df_gowalla['user id'].unique()))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"This is still a bit too much, to help us in the next sections, let's take a subset of the European area"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gdf_gowalla = gdf_gowalla[gdf_gowalla['latitude'] < 60]\n",
"gdf_gowalla = gdf_gowalla[gdf_gowalla['latitude'] > 35]\n",
"gdf_gowalla = gdf_gowalla[gdf_gowalla['longitude'] < 30]\n",
"gdf_gowalla = gdf_gowalla[gdf_gowalla['longitude'] > -10]\n",
"\n",
"gdf_gowalla.plot(marker='o', color='red', markersize=1)\n",
"\n",
"df_gowalla = gdf_gowalla\n",
"print(\"Number of unique users in the EU area: \", len(df_gowalla['user id'].unique()))\n",
"\n",
"# remove from memory the geopandas dataframe, it was only used for plotting\n",
"del gdf_gowalla"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Perfect! Now we can create a new .txt file, only with the information that we need"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# update the file with the new values. Drop the columns that are not needed\n",
"df_gowalla.to_csv(\n",
" os.path.join('data', 'gowalla', 'gowalla_checkins.txt'), \n",
" sep='\\t', \n",
" header=False, \n",
" index=False, \n",
" columns=['user id', 'location id'])\n",
"\n",
"# os.remove(os.path.join('test_data', 'brightkite', 'brightkite_checkins_full.txt'))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Foursquare\n",
"\n",
"[Foursquare](https://foursquare.com/) is a location-based social networking website where users share their locations by checking-in. This dataset includes long-term (about 22 months from Apr. 2012 to Jan. 2014) global-scale check-in data collected from Foursquare, and also two snapshots of user social networks before and after the check-in data collection period (see more details in the reference paper). We will work with three different datasets `[15]`:\n",
"\n",
"- `foursquare_checkins_full.txt`: a tsv file with 4 columns: `User ID`, `Venue ID`, `UTC time`, `Timezone offset in minutes` \n",
"\n",
"- `foursquare_friends_edges.txt`: the friendship network, a tsv file with 2 columns of users ids. This is in the form of a graph edge list. \n",
"\n",
"- `raw_POIs.txt`: the POIS, a tsv file with 5 columns: `Venue ID`, `Latitude`, `Longitude`, `Venue category name`, `Country code (ISO)`.\n",
"\n",
"--- \n",
"\n",
"The check-in dataset in consideration, with a size that surpasses that of the other three datasets obtained, comprises of [22,809,624] check-ins made by [114,324] users at [3,820,891] venues. Additionally, the social network data consists of [607,333] friendships. As previously indicated, the need for sub-sampling arises due to the size of the full network. In this instance, we shall restrict our analysis to data generated in Italy in the year 2012. Given the substantial size of the full network, plotting it would likely result in an unfavorable outcome, as the available RAM may become exhausted and the kernel may be forced to terminate the process."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_foursquare_POIS = pd.read_csv(os.path.join('data', 'foursquare', 'raw_POIs.txt'), \n",
" sep='\\t',\n",
" header=None,\n",
" names=['venue id', 'latitude', 'longitude', 'venue category name', 'ISO code'],\n",
" dtype={'venue id': str, 'latitude': float, 'longitude': float, 'venue category name': str, 'ISO code': str},\n",
" engine='c')\n",
"\n",
"df_foursquare_checkins = pd.read_csv(os.path.join('data', 'foursquare', 'foursquare_checkins_full.txt'),\n",
" sep='\\t',\n",
" header=None,\n",
" names=['user id', 'venue id', 'UTC time', 'offset'],\n",
" dtype={'user id': str, 'venue id': str, 'UTC time': str, 'offset': int},\n",
" engine='c')\n",
"\n",
"# Take only the data with IT ISO code\n",
"df_foursquare_POIS = df_foursquare_POIS[df_foursquare_POIS['ISO code'] == 'IT']\n",
"\n",
"# Take only the checkins that are in the POIs (filtered by ISO code) and viceversa\n",
"df_foursquare_checkins = df_foursquare_checkins[df_foursquare_checkins['venue id'].isin(df_foursquare_POIS['venue id'])]\n",
"df_foursquare_POIS = df_foursquare_POIS[df_foursquare_POIS['venue id'].isin(df_foursquare_checkins['venue id'])]\n",
"\n",
"# Convert to datetime\n",
"df_foursquare_checkins['UTC time'] = pd.to_datetime(df_foursquare_checkins['UTC time'])\n",
"\n",
"# Take only the data from 2012\n",
"df_foursquare_checkins = df_foursquare_checkins[df_foursquare_checkins['UTC time'].dt.year == 2012]\n",
"\n",
"# convert the dataframe to geopandas dataframe\n",
"gdf_foursquare_POIS = gpd.GeoDataFrame(df_foursquare_POIS, geometry=gpd.points_from_xy(df_foursquare_POIS.longitude, df_foursquare_POIS.latitude))\n",
"\n",
"# plot the geopandas dataframe\n",
"print(\"Starting to plot\")\n",
"gdf_foursquare_POIS.plot(marker='o', color='red', markersize=1)\n",
"print('Number of unique users in Italy: ', len(df_foursquare_checkins['user id'].unique()))\n",
"\n",
"# delete from memory the geo dataframe, it was only used for plotting\n",
"del gdf_foursquare_POIS"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_foursquare_checkins.to_csv(\n",
" os.path.join('data', 'foursquare', 'foursquare_checkins.txt'),\n",
" sep='\\t',\n",
" header=False,\n",
" index=False,\n",
" columns=['user id', 'venue id'])\n",
"\n",
"# os.remove(os.path.join('test_data', 'foursquare', 'foursquare_checkins_full.txt'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Building the networks"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The construction of networks for the three datasets will be accomplished by representing them as an undirected graph $M = (V, E)$, with $V$ denoting the set of nodes and $E$ denoting the set of edges. The nodes will correspond to the users and the edges will indicate the presence of at least one instance where two individuals visited the same location.\n",
"\n",
"Since the check-ins files of the three datasets are not in the format of a graph edge list, it is necessary to manipulate them. Thus, we will examine the number of lines in each file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def count_lines_and_unique_elements(file):\n",
" df = pd.read_csv(file, sep='\\t', header=None)\n",
" print('Number of lines: ', len(df))\n",
" print('Number of unique elements: ', len(df[0].unique()))\n",
"\n",
"gowalla_path = os.path.join('data', 'gowalla', 'gowalla_checkins.txt')\n",
"brightkite_path = os.path.join('data', 'brightkite', 'brightkite_checkins.txt')\n",
"foursquare_path = os.path.join('data', 'foursquare', 'foursquare_checkins.txt')\n",
"\n",
"_ = [gowalla_path, brightkite_path, foursquare_path]\n",
"\n",
"for path in _:\n",
" print(path.split(os.sep)[-2])\n",
" count_lines_and_unique_elements(path)\n",
" print()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We want to construct a graph from an edge list. To accomplish this, we will examine the users that have visited each venue. Subsequently, we will establish an edge between every pair of users who have visited the same venue, while avoiding duplications. This process can be executed efficiently (algorithmically speaking) in Python, although it may entail a slow computational time due the nature of the language itself. To mitigate this issue, we are considering subsampling the data sets. The methodology for creating this graph is illustrated below in the Python code snippet.\n",
"\n",
"```python\n",
"# let df be the dataframe [\"user_id\", \"venue_id\"] of the checkins\n",
"\n",
"venues_users = df.groupby(\"venue_id\")[\"user_id\"].apply(set)\n",
"\n",
" for users in venues_users:\n",
" for user1, user2 in combinations(users, 2):\n",
" G.add_edge(user1, user2)\n",
"```\n",
"\n",
"The code makes use of a dataframe, `df`, which contains the `user_id` and `venue_id` information for each check-in. The code first groups the check-ins by the `venue_id` and applies a set function to the `user_id` values. Then, the code iterates through each set of users that visited the same venue and adds an edge between every pair of users.\n",
"\n",
"I have included a function in the `utils.py` module that performs this process automatically. The function, named `create_graph_from_checkins`, takes as input the name of the data set and returns a graph object in the NetworkX library. By default, this function also writes the edge list to a file in the respective data set folder. The available options for the input data set are \"brightkite\", \"gowalla\", and \"foursquare\". An example of how to use this function is shown below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"G_brighkite_checkins = create_graph_from_checkins('brightkite')\n",
"G_brighkite_checkins.name = 'Brightkite Checkins Graph'\n",
"\n",
"G_gowalla_checkins = create_graph_from_checkins('gowalla')\n",
"G_gowalla_checkins.name = 'Gowalla Checkins Graph'\n",
"\n",
"G_foursquare_checkins = create_graph_from_checkins('foursquare')\n",
"G_foursquare_checkins.name = 'Foursquare Checkins Graph'"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Friendship network\n",
"\n",
"\n",
"We want to construct a friendship graph that represents the relationships between users in a social network. The concept of friendship will be modeled in accordance with the paradigm of Facebook, as opposed to Twitter. Consequently, the graph will be undirected and edges will not be weighted. Moreover, it is imperative to note that a user cannot be friends with himself, nor can he be friends with another user if the latter is not friends with him.\n",
"\n",
"The friendship graph will be generated using the function `create_friendships_graph` located in the `utils.py` module. The function takes as input the name of the dataset and returns a networkx graph object. By default, the edge list will also be written to a file within the respective dataset folder. The available options for the input dataset are _brightkite_, _gowalla_, and _foursquare_.\n",
"\n",
"> It is worth mentioning that this function has been implemented in a manner that does not require the checkins graph to be loaded in memory. Instead, it utilizes the edge list file. This was done with the consideration that some users may only perform analysis on the friendship network, and as such, there is no need to load the checkins graph and waste memory. Furthermore, networkx has been observed to be significantly slow when loading a graph from an edge list file.\n",
"\n",
"In conclusion, the implementation and usage of the create_friendships_graph function is demonstrated as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"G_brighkite_friends = create_friendships_graph('brightkite')\n",
"print(\"Computation done for Brightkite friendship graph\")\n",
"G_brighkite_friends.name = 'Brightkite Friendship Graph'\n",
"\n",
"\n",
"G_gowalla_friends = create_friendships_graph('gowalla')\n",
"print(\"Computation done for Gowalla friendship graph\")\n",
"G_gowalla_friends.name = 'Gowalla Friendship Graph'\n",
"\n",
"\n",
"G_foursquare_friends = create_friendships_graph('foursquare')\n",
"print(\"Computation done for Foursquare friendship graph\")\n",
"G_foursquare_friends.name = 'Foursquare Friendship Graph'"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have our graphs, let's have a look at some basic information about them"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for G in [G_brighkite_friends, G_gowalla_friends, G_foursquare_friends]:\n",
" print(G.name)\n",
" print('Number of nodes: ', G.number_of_nodes())\n",
" print('Number of edges: ', G.number_of_edges())\n",
" print()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The networks under investigation are more abstract than previously observed networks as they lack geographical information. To gain insight into their structure, we can use a Newtonian gravity model to examine the interactions between the nodes. Given that the plot will be presented on a web page in the form of an html file, it is necessary to sample the network extensively. Although this process will not provide an exact representation of the network, it will provide a general understanding of the distribution of nodes. To achieve this, we will use a subsample of approximately 1000 nodes from the largest connected component. \n",
"\n",
"This task can be performed through the `visualize_graphs` function in the `utils.py` module. The function requires as inputs a networkx graph object, a `k` percentage of nodes to remove, and a `connected` boolean that specifies whether to only consider the largest connected component. By default, `k` is set to obtain a subsample of around 1000 nodes, and connected is set to `True`. The function outputs an html file that can be opened in a web browser. All files will be downloaded to the `html_graphs` folder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"friendships_graph = [G_brighkite_friends, G_gowalla_friends, G_foursquare_friends]\n",
"\n",
"for graph in friendships_graph:\n",
" visualize_graphs(graph, k = None, connected=True)\n",
"\n",
"# if we are curios about the checkins graphs, nothing prevents us to visualize them. Just uncomment the following lines\n",
"\n",
"# checkins_graph = [G_brighkite_checkins, G_gowalla_checkins, G_foursquare_checkins]\n",
"# for graph in checkins_graph:\n",
"# visualize_graphs(graph, k = None, connected=True)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"On a unix environment, if firefox is installed, we visualize them by running the following command:\n",
"\n",
"```bash\n",
"firefox html_graphs/*.html\n",
"```\n",
"\n",
"Otherwise, I have already made this computations and the results can be visualized following the links below\n",
"\n",
"- [Brightkite](https://lukefleed.xyz/graphs/brightkite_friendship_graph.html)\n",
"- [Gowalla](https://lukefleed.xyz/graphs/gowalla_friendship_graph.html)\n",
"- [Foursquare](https://lukefleed.xyz/graphs/foursquare_friendship_graph.html)\n",
"\n",
"> **EXTRA:** If you want to see a visualization of a complete different graph, here you can check che collaboration network of the actors on the IMDb website. It has very distinct communities and clusters. Only actors with more then 100 movies have been considered. Click [here](https://lukefleed.xyz/graph/imdb-graph.html) to see the visualization."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Properties of the networks\n",
"\n",
"In order to effectively visualize the outcomes of our analysis, we will construct a dataframe that encapsulates all the information retrieved from the networks under examination. It should be noted that the full networks, despite having undergone filtering, are still substantial in size, which results in prolonged execution times for the functions utilized. To mitigate this issue, we will take sub-samples. It is important to keep in mind that the accuracy of the results is proportional to the size of the sample utilized.\n",
"\n",
"In light of these considerations, I recommend conducting an initial review of the notebook with higher values of the sampling rate to expedite the display of the results and gain an understanding of the functionality of the implemented functions. At the end of this section I provided a link to my GitHub repository, where the results obtained through lower sampling rates can be downloaded. This approach allows for a preliminary assessment of the functionality of the functions with mock-networks, before proceeding with the analysis using the more precise results that necessitate longer computation times."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"checkins_graphs = [G_brighkite_checkins, G_gowalla_checkins, G_foursquare_checkins]\n",
"friendships_graph = [G_brighkite_friends, G_gowalla_friends, G_foursquare_friends]\n",
"\n",
"graphs_all = checkins_graphs + friendships_graph"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"analysis_results = pd.DataFrame(columns=['Graph', 'Number of Nodes', 'Number of Edges', 'Average Degree', 'Average Clustering Coefficient', 'log N', 'Average Shortest Path Length', 'betweenness centrality', 'omega-coefficient'], index=None)\n",
"\n",
"for graph in graphs_all:\n",
" analysis_results = analysis_results.append(\n",
" {'Graph': graph.name, \n",
" 'Number of Nodes': graph.number_of_nodes(), \n",
" 'log N': np.log(graph.number_of_nodes()),\n",
" 'Number of Edges': graph.number_of_edges()}, \n",
" ignore_index=True)\n",
"\n",
"analysis_results"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Average Degree\n",
"\n",
"The concept of degree refers to the number of links that are connected to a particular node. While the average degree is a basic measure, it is not deemed to be of significant utility for our upcoming analysis, thus it will not be given extensive consideration. In contrast, the degree distribution, represented by $P(k)$, which represents the proportion of nodes that have a degree of $k$, is deemed to be a more meaningful metric. The literature on network analysis indicates that real-world networks often do not adhere to the Poisson degree distribution that is predicted by the ER model. Instead, many networks exhibit a degree distribution with a long-tailed, power-law distribution, such that $P(k) \\sim k^{-\\gamma}$, with a value of $\\gamma$ typically ranging from $2$ to $3$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for G in graphs_all:\n",
" avg_deg = np.mean([d for n, d in G.degree()])\n",
" analysis_results.loc[analysis_results['Graph'] == G.name, 'Average Degree'] = avg_deg\n",
"\n",
"analysis_results[['Graph', 'Average Degree']]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clustering coefficient\n",
"\n",
"The Clustering Coefficient `[2]` refers to the concept of communities represented by local structures in a network. This notion is generally related to the number of triangles present in the network and considered high when two nodes sharing a common neighbor exhibit a high probability of being connected. There are two commonly accepted definitions of clustering: the global definition and the local definition.\n",
"\n",
"The global definition of clustering is expressed mathematically as follows:\n",
"\n",
"$$ C = \\frac{3 \\times \\text{the number of triangles in the network}}{\\text{the number of connected triples of vertices}}$$\n",
"\n",
"where a “connected triple” represents a vertex with edges connecting to an unordered pair of other vertices.\n",
"\n",
"The local definition of clustering, on the other hand, is based on the average of the clustering coefficient of individual nodes. The clustering coefficient for a single node is defined as the fraction of pairs of its linked neighbors to the total number of pairs of its neighbors. This relationship can be mathematically represented as:\n",
"\n",
"$$ C_i = \\frac{\\text{the number of triangles connected to vertex }i}{\\text{the number of triples centered on vertex } i} $$\n",
"\n",
"For vertices with degree $0$ or $1$, the numerator and denominator of the equation are both equal to zero, and in such cases, $C_i = 0$. The clustering coefficient for the whole network is then obtained as the average of $C_i$ as expressed in the equation below:\n",
"\n",
"$$ C = \\frac{1}{n} \\sum_{i} C_i $$\n",
"\n",
"It is important to note that the clustering coefficient is always in the range of $0 \\leq C \\leq 1$. In random graph models such as the ER model and the configuration model, the clustering coefficient is low and decreases to zero as the network size increases. This is also observed in many growing network models. However, many real-world networks exhibit a high clustering coefficient that remains constant even for large network sizes.\n",
"\n",
"> This phenomenon led to the introduction of the small-world model `[3]`, which combines the properties of a regular lattice with high clustering and a random graph.\n",
"\n",
"---\n",
"\n",
"The library `networkx` provides a function to compute the clustering coefficient of a graph."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for graph in graphs_all:\n",
" print(\"\\nComputing average clustering coefficient for the {}...\".format(graph.name))\n",
" start = time.time()\n",
" avg_clustering = nx.average_clustering(graph)\n",
" end = time.time()\n",
" print(\"\\tAverage clustering coefficient: {}\".format(avg_clustering))\n",
" print(\"\\tCPU time: \" + str(round(end-start,1)) + \" seconds\")\n",
" analysis_results.loc[analysis_results['Graph'] == graph.name, 'Average Clustering Coefficient'] = avg_clustering\n",
"\n",
"analysis_results[['Graph', 'Average Clustering Coefficient']]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Average Path Length\n",
"\n",
"In the context of our network analysis, it is important to note that networks are not embedded in physical space and thus, the geometrical distance between nodes becomes irrelevant. Instead, the most pertinent measure of distance in such networks is the minimal number of hops, also known as the chemical distance. This distance between two nodes is defined as the number of edges in the shortest path connecting the nodes.\n",
"\n",
"--- \n",
"\n",
"The `networkx` library offers the `average_shortest_path_length` function to calculate the average shortest path length of a graph. However, there are certain limitations to this function. It is only applicable to connected graphs and there is a chance that the subsample of the dataset used may not be connected. Additionally, this operation can be computationally expensive. The average shortest path length is calculated using the formula:\n",
"\n",
"$$ a = \\sum_{s \\in V} \\sum_{t \\in V} \\frac{d(s,t)}{n(n-1)} $$\n",
"\n",
"where $V$ represents the set of nodes in the graph, $n$ represents the number of nodes, and $d(s,t)$ is the shortest path length between nodes $s$ and $t$. The default algorithm used to calculate the shortest path length is the Dijkstra algorithm.\n",
"\n",
"Given the size of the datasets, computing the average shortest path length for the entire graph is not feasible. To overcome this, we can use the `average_shortest_path` function from the utils module to compute the average shortest path length of a random subsample of the graph. This function requires the input of the networkx graph object and an optional parameter `k` which represents the percentage of nodes to remove from the graph. If `k` is set to None, the average shortest path length of each connected component is calculated using all the nodes of the component. The function returns the average shortest path length of the graph.\n",
"\n",
"The implementation involves first removing a random subsample of nodes from the graph, creating a list of connected components with at least 10 nodes, and then using the `average_shortest_path_length` function to calculate the average shortest path length. The choice of 10 nodes is arbitrary and based on empirical observations, as small communities with low average shortest path lengths can skew results. The value of `k` can be adjusted based on the available computing resources and time, with lower values providing more precise results but taking longer to compute and vice versa. However, in this case, the computation time is not overly excessive, so if we are willing to wait a few minutes, we can use the default value of `k` which is `None`.\n",
"\n",
"\n",
"<!-- We have seen how we can characterize the clustering in a small world network. Now we can see the second important property of small-world networks is their small diameter, i.e., the small distance between nodes in the network. The distance in the underlying lattice behaves as the linear length of the lattice, L. Since $N \\sim L^d$ where $d$ is the lattice dimension, it follows that the distance between nodes behaves as:\n",
"\n",
"\\begin{equation}\n",
" l \\sim L \\sim N^{1/d}\n",
"\\end{equation}\n",
"\n",
"Therefore, the underlying lattice has a finite dimension, and the distances on it behave as a power law of the number of nodes, i.e., the distance between nodes is large. However, when adding even a small fraction of shortcuts to the network, this behavior changes dramatically. \n",
"\n",
"Let's try to deduce the behavior of the average distance between nodes. Consider a small-world network, with dimension d and connecting distance $k$ (i.e., every node is connected to any other node whose distance from it in every linear dimension is at most $k$). Now, consider the nodes reachable from a source node with at most $r$ steps. When $r$ is small, these are just the \\emph{r-th} nearest neighbors of the source in the underlying lattice. We term the set of these neighbors a “patch”. the radius of which is $kr$ , and the number of nodes it contains is approximately $n(r) = (2kr)d$. \n",
"\n",
"We now want to find the distance r for which such a patch will contain about one shortcut. This will allow us to consider this patch as if it was a single node in a randomly connected network. Assume that the probability for a single node to have a shortcut is $\\Phi$. To find the length for which approximately one shortcut is encountered, we need to solve for $r$ the following equation: $(2kr)^d \\Phi = 1$. The correlation length $\\xi$ defined as the distance (or linear size of a patch) for which a shortcut will be encountered with high probability is therefore,\n",
"\n",
"\\begin{equation}\n",
" \\xi = \\frac{1}{k \\Phi^{1/d}}\n",
"\\end{equation}\n",
"\n",
"Note that we have omitted the factor 2, since we are interested in the order of magnitude. Let us denote by $V(r)$ the total number of nodes reachable from a node by at most $r$ steps, and by $a(r)$, the number of nodes added to a patch in the \\emph{r-th} step. That is, $a(r) = n(r) - n(r-1)$. Thus,\n",
"\n",
"\\begin{equation}\n",
" a(r) \\sim \\frac{\\text{d} n(r)}{\\text{d} r} = 2kd(2kr)^{d-1}\n",
"\\end{equation}\n",
"\n",
"When a shortcut is encountered at the r step from a node, it leads to a new patch \\footnote{It may actually lead to an already encountered patch, and two patches may also merge after some steps, but this occurs with negligible probability when $N \\to \\infty$ until most of the network is reachable}. This new patch occurs after $r'$ steps, and therefore the number of nodes reachable from its origin is $V (r - r')$. Thus, we obtain the recursive relation\n",
"\n",
"\\begin{equation} \n",
" V(r) = \\sum_{r'=0}^r a(r') [1 + \\xi^{-d}V(r-r')]\n",
"\\end{equation}\n",
"\n",
"where the first term stands for the size of the original patch, and the second term is derived from the probability of hitting a shortcut, which is approximately $\\xi -d $ for every new node encountered. To simplify the solution of \\ref{eq:recursion}, it can be approximated by a differential equation. The sum can be approximated by an integral, and then the equation can be differentiated with respect to $r$ . For simplicity, we will concentrate here on the solution for the one-dimensional case, with $k = 1$, where $a(r) = 2$. Thus, one obtains\n",
"\n",
"\\begin{equation}\n",
" \\frac{\\text{d} V(r)}{\\text{d} r} = 2 [1 + V(r)/\\xi]\n",
"\\end{equation}\n",
"\n",
"the solution of which is:\n",
"\n",
"\\begin{equation} \n",
" V(r) = \\xi \\left(e^{2r/\\xi} -1\\right)\n",
"\\end{equation}\n",
"\n",
"For $r \\ll \\xi$, the exponent can be expanded in a power series, and one obtains $V(r) \\sim 2r = n(r)$, as expected, since usually no shortcut is encountered. For $r \\ gg \\xi$, $V(r)$. An approximation for the average distance between nodes can be obtained by equating $V(r)$ from \\ref*{eq:V(r)} to the total number of nodes, $V(r) = N$. This results in\n",
"\n",
"\\begin{equation} \n",
" r \\sim \\frac{\\xi}{2} \\ln \\frac{N}{\\xi} \n",
"\\end{equation}\n",
" -->\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# if you want just to test it out, leave k = 0.6, it will only take a few seconds. More accurate results will be available to download after\n",
"\n",
"for graph in graphs_all:\n",
" print(\"\\nComputing average shortest path length for graph: \", graph.name)\n",
"\n",
" start = time.time()\n",
" average_shortest_path_length = average_shortest_path(graph, k = 0.6)\n",
" end = time.time()\n",
"\n",
" print(\"\\tAverage shortest path length: {}\".format(round(average_shortest_path_length,2)))\n",
" print(\"\\tCPU time: \" + str(round(end-start,1)) + \" seconds\")\n",
" \n",
" analysis_results.loc[analysis_results['Graph'] == graph.name, 'Average Shortest Path Length'] = average_shortest_path_length\n",
"\n",
"analysis_results[['Graph', 'Average Shortest Path Length']]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Betweenness Centrality\n",
"\n",
"In a network, the significance of a node is dependent on a multitude of factors. The importance of a website could stem from its content while that of a router could stem from its capacity. However, these properties are contingent upon the type of network being studied and may have limited correlation with the graph structure of the network. We are interested on the importance of a node or link in terms of its topological function within the network. It is reasonable to infer that the topology of a network may intrinsically dictate the significance of different nodes. One measure of centrality is the degree of a node, where the higher the degree, the greater the node's connectivity, and thus, its centrality in the network. Nevertheless, the degree is not the sole determinant of a node's significance.\n",
"\n",
"A commonly accepted definition of centrality is based on path counting. For each node, `i`, in the network, the number of routing paths to all other nodes that traverse `i` is counted, and this number determines node `i`'s centrality. The most conventional approach is to consider only the shortest paths as routing paths, resulting in the following definition: the _betweenness centrality_ of node `i`, represented by $g(i)$, is equal to the number of shortest paths between all node pairs in the network that traverse it, as expressed in the equation below:\n",
"\n",
"$$ g(i) = \\sum_{{ j,k }} g_i (j,k) $$\n",
"\n",
"where the notation ${j, k}$ represents the summation of each pair once, ignoring the order, and $g_i(j, k)$ equals $1$ if the shortest path between nodes `j` and `k` passes through node `i` and $0$ otherwise. In networks with no weight, i.e., networks where all edges have the same length, there might be more than one shortest path. In such cases, it is common practice to take $g_i(j, k) = C_i(j,k)/C(j,k)$, where $C(j,k)$ is the number of shortest paths between `j` and `k` and $C_i(j,k)$ is the number of those passing through `i`.\n",
"\n",
"> There are several variations of this scheme, particularly focusing on the counting of distinct shortest paths, if multiple shortest paths share some edges. However, these differences tend to have minimal statistical impact in random complex networks, where the number of short loops is limited. Thus, this project will concentrate on the above definition. Another consideration is whether the source and destination are considered part of the shortest path.\n",
"\n",
"--- \n",
"\n",
"The networkx library, which is a commonly used library for network analysis, includes a function for computing the betweenness centrality of all nodes in a network. This function is based on the algorithm proposed by Ulrik Brandes in `[16]`, which involves the calculation of shortest paths between all pairs of nodes in the network and counting the number of shortest paths that pass through each node.\n",
"\n",
"However, the computation of this algorithm on large networks may not be feasible within a reasonable time frame due to the computational cost. To mitigate this issue, a sampling approach can be employed, which provides approximate results. Nevertheless, even with heavy sampling, the computation time remains prohibitively high. To avoid further sampling, which would introduce bias, we will use a parallelization approach to speed up the computation.\n",
"\n",
"In the `utils` module, I have implemented a function called `betweenness_centrality_parallel` that uses this approach. The function takes as input a networkx graph object, the number of processes to use for computation (default is 1, which uses the standard betweenness algorithm), and the percentage of nodes to remove from the graph (default is `None`, which uses all nodes of the connected component to compute the average shortest path length). The function divides the network into _chunks_ of nodes and computes their contribution to the betweenness centrality of the whole network in parallel, ultimately returning a dictionary of the betweenness centrality of each node.\n",
"\n",
"In the `utils` module I implemented a function called `betweenness_centrality_parallel`. The function takes as input\n",
"\n",
"Please note that for large graphs, it is advisable to not use more than 6 processes to avoid memory constraints. The number of processes to use can be determined based on the available time and the machine being used. For small graphs, more processes may be used. As for the percentage of nodes to remove, lower values provide more precise results but take longer to compute, while higher values result in less precise results but are faster to compute. It is suggested to start with `k=0.6` for a quick test and use `k=0.2` for a more precise result. For more information, refer to the function code in the `utils` module."
]
},
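{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a condensed sketch of the idea behind `betweenness_centrality_parallel` (a sketch of the approach, not the exact code in the `utils` module): split the nodes into chunks, let each worker compute the betweenness contribution of its chunk with `nx.betweenness_centrality_subset`, and sum the partial dictionaries."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch of chunk-based parallel betweenness (not the exact utils implementation)\n",
"import itertools\n",
"from multiprocessing import Pool\n",
"import networkx as nx\n",
"\n",
"def node_chunks(nodes, n):\n",
"    # yield tuples of at most n nodes\n",
"    it = iter(nodes)\n",
"    while True:\n",
"        chunk = tuple(itertools.islice(it, n))\n",
"        if not chunk:\n",
"            return\n",
"        yield chunk\n",
"\n",
"def betweenness_parallel_sketch(G, processes=4):\n",
"    chunks = list(node_chunks(G.nodes(), max(1, G.order() // (processes * 4))))\n",
"    with Pool(processes=processes) as pool:\n",
"        # each worker counts the shortest paths sourced at its chunk of nodes\n",
"        partial = pool.starmap(\n",
"            nx.betweenness_centrality_subset,\n",
"            [(G, chunk, list(G), True) for chunk in chunks],\n",
"        )\n",
"    # reduce: sum the per-chunk contributions node by node\n",
"    bt = partial[0]\n",
"    for d in partial[1:]:\n",
"        for node in d:\n",
"            bt[node] += d[node]\n",
"    return bt"
]
},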
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for graph in graphs_all:\n",
" print(\"\\nComputing the approximate betweenness centrality for the {}...\".format(graph.name))\n",
" start = time.time()\n",
" betweenness_centrality = np.mean(list(betweenness_centrality_parallel(graph, 6, k = 0.5).values()))\n",
" end = time.time()\n",
" print(\"\\tBetweenness centrality: {} \".format(betweenness_centrality))\n",
" print(\"\\tCPU time: \" + str(round(end-start,1)) + \" seconds\")\n",
"\n",
" analysis_results.loc[analysis_results['Graph'] == graph.name, 'betweenness centrality'] = betweenness_centrality\n",
"\n",
"analysis_results[['Graph', 'betweenness centrality']]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download the dataframe with accurate results\n",
"\n",
"All the results from the previous section are available in the following dataframe. Each function as been executed using as less sampling as possible, some of them took hours to complete. The dataframe is available in the block below"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if not os.path.exists(os.path.join('server_results', 'analysis_results.pkl')):\n",
" print(\"Downloading the analysis results file...\")\n",
" wget.download('https://github.com/lukefleed/small-worlds/raw/main/server_results/analysis_results.pkl', out='server_results')\n",
"\n",
"analysis_results = pd.read_pickle('analysis_results.pkl')\n",
"analysis_results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Analysis of the results"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Distribution of Degree\n",
"\n",
"In the preceding section, we established that a scale-free network exhibits a skewed distribution of node degrees, resulting from a few nodes possessing a significantly higher number of connections compared to the majority of nodes. Such networks contain \"hubs\" or high-degree nodes that play a disproportionate role in the structure and function of the network. Conversely, a random network showcases a more uniform distribution of node degrees with nodes possessing approximately the same number of connections.\n",
"\n",
"---\n",
"\n",
"We shall now determine if our networks are scale-free or not. To this end, we utilize the `degree_distribution` function from the `utils` module to plot the degree distribution of a graph. The function accepts a networkx graph object as input and returns a plot of the degree distribution. We anticipate observing a power-law distribution, rather than a Poissonian distribution."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for G in checkins_graphs:\n",
" degree_distribution(G, log = True) # I suggest to use log = True, it is more readable"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for graph in friendships_graph:\n",
" degree_distribution(graph, log = False)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"From the graphs, we observe that the degree distribution of the networks is not Poissonian, but instead exhibits a scale-free pattern. This implies that the networks are not random, but rather exhibit the characteristics of small-world networks.\n",
"\n",
"To verify the results, we plot the degree distribution of a random Watts-Strogatz graph, created with the same number of nodes and with a probability of edge formation equal to the number of edges in the network divided by the total number of possible edges. We expect to observe a Poissonian distribution in this case.\n",
"\n",
"> Note that this approach is only time-saving and not rigorous. For a rigorous analysis, we must follow the algorithm proposed by Maslov and Sneppen and implement it using the `random_reference` function in the NetworkX library."
]
},
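{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"For completeness, here is a minimal sketch of the rigorous route on a small synthetic graph (running it on the full check-in graphs would be far too slow): `nx.random_reference` performs Maslov-Sneppen rewiring, which preserves the degree sequence exactly and only randomizes the higher-order structure, such as the clustering."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Maslov-Sneppen rewiring on a toy graph: the degree sequence is preserved\n",
"import networkx as nx\n",
"\n",
"G_demo = nx.watts_strogatz_graph(100, 4, 0.2, seed=1)\n",
"G_rand = nx.random_reference(G_demo, niter=1, seed=1)\n",
"\n",
"same_degrees = sorted(d for _, d in G_demo.degree()) == sorted(d for _, d in G_rand.degree())\n",
"print(\"degree sequence preserved:\", same_degrees)\n",
"print(\"C(original) = {:.3f}\".format(nx.average_clustering(G_demo)))\n",
"print(\"C(rewired)  = {:.3f}\".format(nx.average_clustering(G_rand)))"
]
},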
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for graph in checkins_graphs:\n",
"\n",
" p = G.number_of_edges() / (G.number_of_nodes())\n",
" avg_degree = int(np.mean([d for n, d in G.degree()]))\n",
" G = nx.watts_strogatz_graph(G.number_of_nodes(), avg_degree, p)\n",
" G.name = graph.name + \" - Watts-Strogatz similarity\"\n",
"\n",
" print(G.name)\n",
" print(\"Number of nodes: \", G.number_of_nodes())\n",
" print(\"Number of edges: \", G.number_of_edges())\n",
" degree_distribution(G, log=False)\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for graph in friendships_graph:\n",
"\n",
" p = G.number_of_edges() / (G.number_of_nodes())\n",
" avg_degree = int(np.mean([d for n, d in G.degree()]))\n",
" G = nx.watts_strogatz_graph(G.number_of_nodes(), avg_degree, p)\n",
" G.name = graph.name + \" - Watts-Strogatz similarity\"\n",
"\n",
"\n",
" print(G.name)\n",
" print(\"Number of nodes: \", G.number_of_nodes())\n",
" print(\"Number of edges: \", G.number_of_edges())\n",
" degree_distribution(G, log=False)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a Poissonian distribution, as expected."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Small-World Model\n",
"\n",
"It is imperative to note that real networks are not random, but rather the result of a multitude of processes and influences including natural limitations, human considerations, and economic factors among others. The degree to which random models accurately depict real-world networks remains a topic of debate. Nonetheless, this section focuses on random network models and investigates if their properties may still be applicable to the study of real-world networks.\n",
"\n",
"The ER model fails to explain several properties of real-world networks, such as the high clustering. To address this issue, Watts and Strogatz proposed an alternative model, referred to as the “small-world” model `[3]`. The model aims to combine the characteristics of ordered lattices with those of random graphs. According to Watts and Strogatz, quoting their words:\n",
"\n",
"> small-world networks exhibit high clustering like regular lattices and small characteristic path lengths like random graphs. \n",
"\n",
"The model begins with an ordered lattice, such as the $k$-ring or the two-dimensional lattice, and rewires links with probability $\\varphi$. The result is a network with specialized nodes or regions and shared or distributed processing across all communicating nodes.\n",
"\n",
"![small-world](https://i.imgur.com/gX4Eutx.png)\n",
"![small-world](https://i.imgur.com/B95uS5O.png)\n",
"\n",
"_Pictures from `[2]`_"
]
},
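{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the rewiring picture concrete, here is a quick sketch on a small ring (the sizes and probabilities below are arbitrary choices for illustration): as $\\varphi$ grows, the average path length collapses long before the clustering does."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Clustering and path length of WS graphs as the rewiring probability grows\n",
"import networkx as nx\n",
"\n",
"n, k = 500, 6\n",
"for phi in [0.0, 0.01, 0.1, 1.0]:\n",
"    G_ws = nx.connected_watts_strogatz_graph(n, k, phi, seed=0)\n",
"    C = nx.average_clustering(G_ws)\n",
"    L = nx.average_shortest_path_length(G_ws)\n",
"    print(\"phi = {:<4}  C = {:.3f}  L = {:.2f}\".format(phi, C, L))"
]
},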
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Small-Worldness"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We need to ascertain whether the small-world properties of networks, characterized by their high clustering and low path length, are a universal property of naturally occurring networks or restricted to specific networks. The broad definition of small-worldness may overlook the specific advantages of these networks, resulting in the misidentification of networks more similar to regular lattices and random networks as small-world. A commonly accepted definition of small-world networks is that they exhibit clustering coefficients comparable to a regular lattice and path lengths comparable to a random network. However, this definition can lead to networks with low clustering coefficients being incorrectly classified as small-world. Therefore, a more stringent method is necessary to differentiate true small-world networks from those resembling random or lattice structures, which, although interesting in their own right, do not exhibit the characteristics of small-world networks.`[6]`\n",
"\n",
"## Identifying small-world networks\n",
"\n",
"Small-world networks are distinguished from other networks by two specific properties, the first being high clustering ($C$) among nodes. High clustering supports specialization as local collections of strongly interconnected nodes readily share information or resources. Conceptually, clustering is quite straightforward to comprehend. In a real-world analogy, clustering represents the probability that ones friends are also friends of each other. Small-world networks also have short path lengths ($L$) as is commonly observed in random networks. The path length is a measure of the distance between nodes in the network, calculated as the mean of the shortest geodesic distances between all possible node pairs. Small values of $L$ ensure that information or resources easily spreads throughout the network. This property makes distributed information processing possible on technological networks and supports the six degrees of separation often reported in social networks. `[6]`\n",
"\n",
"The WS model `[3]` demonstrates that random rewiring of a small percentage of the edges in a lattice results in a precipitous decrease in the path length, but only trivial reductions in the clustering. Across this rewiring probability, there is a range where the discrepancy between clustering and path length is very large, and it is in this area that the benefits of small-world networks are realized.\n",
"\n",
"### A first approach: the $\\sigma$ coefficient\n",
"\n",
"In 2006, Humphries et al. `[8][9]` proposed the small-world coefficient $\\sigma$ as a quantitative metric for network analysis. This metric quantifies the relationship between network clustering, represented by $C$, and path length, represented by $L$, in comparison to their respective random network equivalents, $C_{rand}$ and $L_{rand}$. To calculate $\\sigma$, the authors computed the ratios $c = C/C_{rand}$ and $k = L/L_{rand}$ and arrived at the following equation:\n",
"\n",
"$$ \\sigma = \\frac{C/C_{rand}}{L/L_{rand}} = \\frac{c}{k} $$\n",
"\n",
"A network is classified as small-world if $C \\gg C_{rand}$ and $L \\approx L_{rand}$, resulting in $\\sigma > 1$. However, a limitation of this metric is its susceptibility to the clustering coefficient of the equivalent random network. As clustering in random networks is typically low `[3]`, small changes in $C_{rand}$ can significantly impact the value of $\\sigma$.\n",
"\n",
"### A more solid approach: the $\\omega$ coefficient\n",
"\n",
"The small-world measurement, $\\omega$, quantifies the structural properties of a graph with characteristic path length, $L$, and clustering, $C$. The metric calculates the difference between the ratio of the graph's path length to that of an equivalent random network, $L_{rand}$, and the ratio of the graph's clustering to that of an equivalent lattice network, $C_{latt}$; as expressed in the following equation:\n",
"\n",
"$$ \\omega = \\frac{L_{rand}}{L} - \\frac{C}{C_{latt}} $$\n",
"\n",
"The utilization of the clustering of an equivalent lattice network, rather than a random network, renders $\\omega$ less susceptible to fluctuations in $C_{rand}$. Furthermore, values of $\\omega$ are restricted to the interval $-1$ to $1$ regardless of network size.\n",
"\n",
"Graphs with $\\omega$ values close to zero are considered small world, while positive values denote more random-like characteristics and negative values denote more regular, or lattice-like, characteristics..\n",
"\n",
"#### Lattice network construction\n",
"\n",
"The paper `[11]` presents the generation of a lattice network through the application of a modified version of the \"latticization\" algorithm `[10]` as reported in the brain connectivity toolbox by Rubinov and Sporns (2010). This procedure, based on a Markov-chain algorithm, preserves node degree while swapping edges with uniform probability, under the condition that the resulting matrix has entries closer to the main diagonal.\n",
"\n",
"In order to optimize the clustering coefficient of the lattice network, the latticization process undergoes several repetitions until clustering is maximized. The algorithm involves storing the initial adjacency matrix and its clustering coefficient. The latticization procedure is then performed on the matrix. If the clustering coefficient of the resulting matrix is higher, it replaces the initial adjacency matrix. If it is lower, the latticization process repeats on the initial matrix.\n",
"\n",
"This process results in a highly clustered network with long path length, approximating a lattice topology. To reduce processing time in larger networks, the authors developed a \"sliding window\" procedure. The procedure involves sampling smaller sections of the matrix along the main diagonal, performing the latticization process, and reinserting the result into the larger matrix in a step-wise manner.\n",
"\n",
"\n",
"#### Limitations\n",
"\n",
"The latticization procedure, as described by Sporns and Zwi in 2004, exhibits limitations in its applicability to large networks such as the Internet. Due to the computational demands, the latticization of such networks may take several days to generate and optimize.\n",
"\n",
"Additionally, the latticization algorithm is limited by networks that possess low clustering and lack the capacity for appreciable improvement. Such networks include those with 'super hubs' or hierarchical structures. Hierarchical networks often feature nodes that are configured in branches with minimal clustering, while networks with 'super hubs' contain a node with a degree magnitude significantly greater than that of the next most connected node. These configurations restrict the options for increasing the clustering of the network. Furthermore, a targeted attack on these networks can easily destroy `[17]` its topology, indicating a potential lack of small-world properties."
]
},
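{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, networkx ships reference implementations of both coefficients, `nx.sigma` and `nx.omega` (in `networkx.algorithms.smallworld`). Here is a minimal sketch of their behaviour on a toy Watts-Strogatz graph; the tiny `niter`/`nrand` values keep it fast at the price of noisy estimates."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy demonstration of sigma and omega; tiny niter/nrand => fast but noisy\n",
"import networkx as nx\n",
"\n",
"G_toy = nx.connected_watts_strogatz_graph(200, 6, 0.1, seed=42)\n",
"\n",
"sigma_toy = nx.sigma(G_toy, niter=2, nrand=2, seed=42)  # sigma > 1 suggests small-world\n",
"omega_toy = nx.omega(G_toy, niter=2, nrand=2, seed=42)  # omega ~ 0 suggests small-world\n",
"print(\"sigma = {:.3f}, omega = {:.3f}\".format(sigma_toy, omega_toy))"
]
},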
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Omega coefficient computation: standard procedure\n",
"\n",
"The computation of the Omega Coefficient necessitates a time-consuming process. To accurately assess the clustering coefficient and the shortest path length, one must construct both the lattice reference network and the random reference network several times. The following algorithm outlines the computation of the Omega Coefficient:\n",
"\n",
"1. Generate a random sample of the network.\n",
"2. Execute a specified number of rewiring operations per edge to compute the equivalent random graph.\n",
"3. Calculate the average clustering coefficient and average shortest path length for a specified number of random graphs and then average the results.\n",
"4. Compute the Omega Coefficient for the random sample using the standard formula\n",
"\n",
"Despite the described technique, the computation of the Omega Coefficient remains computationally intensive. To mitigate over-sampling and potential bias, the computation was performed on a subset of the network with cardinality $\\frac{|N|}{2}$. Additionally, both the number of rewiring operations per edge and the number of random graphs were set to $3$.\n",
"\n",
"Even with these optimizations, the computation of the Omega Coefficient required several days to complete. The computation was executed on a remote server, and the results are accessible in the form of a pandas dataframe (as described in the subsequent section)."
]
},
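{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"For concreteness, here is a condensed sketch of the procedure above (it mirrors the idea behind the server scripts, not their exact code; the seeding and the restriction to the largest connected component are assumptions of the sketch):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: omega on a random node sample; k is the fraction of nodes removed\n",
"import random\n",
"import networkx as nx\n",
"\n",
"def omega_sampled_sketch(G, k=0.5, niter=3, nrand=3, seed=42):\n",
"    rng = random.Random(seed)\n",
"    n_keep = int((1 - k) * G.number_of_nodes())\n",
"    sample = rng.sample(list(G.nodes()), n_keep)\n",
"    # omega is defined on a connected graph: keep the largest component\n",
"    H = G.subgraph(sample)\n",
"    H = H.subgraph(max(nx.connected_components(H), key=len)).copy()\n",
"    return nx.omega(H, niter=niter, nrand=nrand, seed=seed)"
]
},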
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"In the repository there is a python program `omega_sampled_server.py` that can be used to compute the omega coefficient for a network as described above. You can run it as follows:\n",
"\n",
"```bash\n",
"./omega_sampled_server.py graph k niter nrand\n",
"\n",
"# Example:\n",
"./omega_sampled_server.py checkins-brightkite 0.5 --nrand 3 --niter 3\n",
"```\n",
"\n",
"Where: \n",
"\n",
"- `graph` is the name of the graph\n",
"- `k` Percentage of nodes to be remove\n",
"- `niter` Number of rewiring operations per edge\n",
"- `nrand` Number of random graphs to be generated\n",
"\n",
"For further details run `./omega_sampled_server.py --help`\n",
"\n",
 **NOTE:**">
"> **NOTE:** These are slow operations; do not try to run them with higher values of k, niter or nrand. The computation for these networks with k=0.5, niter=3 and nrand=3 requires from 3 to 10 days to complete. If you want to test it out, you can use the `checkins-brightkite` graph with k=0.1, niter=1 and nrand=1.\n",
"\n",
"The advantage of using an external script rather then a block in the notebook is the ease of parallelization. You can run more scripts in parallel for different datasets. This can easily be automated with a bash script. I won't report the code since it's note relevant to the topic of this project.\n",
"\n",
"In the next section, we will see the results obtained in detail, trying to understand what they mean.\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Omega coefficient computation: parallelization approach\n",
"\n",
"The algorithm described above can be easly parallelized. Since we want to compute `nrand` times the random reference and the lattice, we can use different processes to compute them in parallel. This can be done with the `multiprocessing` module in python. \n",
"\n",
"It the repository there is a python program `omega_parallel_server.py` that can be used to compute the omega coefficient for a network as described above. You can run it as follows:\n",
"\n",
"```bash\n",
"./omega_sampled_parallel.py graph --k --niter --nrand --n_processes --seed\n",
"```\n",
"\n",
"Where the only difference with the previous script is the `--n_processes` argument that specifies the number of processes to be used. I suggest to use the default option that uses all the available threads. If we use a number of `nrand` that is less then or equal to the number of threads, the time needed to compute the omega coefficient will be the same as choosing `nrand=1` with the previous script. \n"
]
},
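{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a minimal sketch of the parallelization idea (again, not the exact code of `omega_parallel_server.py`): each worker computes omega against a single pair of reference graphs with its own seed, and the `nrand` single-reference estimates are then averaged."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: average nrand single-reference omega estimates computed in parallel\n",
"from multiprocessing import Pool\n",
"import numpy as np\n",
"import networkx as nx\n",
"\n",
"def _single_omega(args):\n",
"    H, niter, seed = args\n",
"    # nrand=1: one random and one lattice reference per worker\n",
"    return nx.omega(H, niter=niter, nrand=1, seed=seed)\n",
"\n",
"def omega_parallel_sketch(H, niter=3, nrand=3, processes=3):\n",
"    with Pool(processes=processes) as pool:\n",
"        omegas = pool.map(_single_omega, [(H, niter, s) for s in range(nrand)])\n",
"    return float(np.mean(omegas))"
]
},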
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Are our networks small-world?\n",
"\n",
"There are multiple factors to take into consideration. Let's try to recap what we know about the networks we are working with:\n",
"\n",
"- Degree distribution\n",
"- Average clustering coefficient\n",
"- Average shortest path length\n",
"- Betweenness centrality\n",
"- Omega coefficient\n",
"\n",
"## Degree distribution\n",
"\n",
"The degree distribution of a real-world network can characterize the small-world property by showing a balance between the number of highly connected nodes (high degree) and the number of less connected nodes (low degree). A network with a small-world property will have a few highly connected nodes (hubs) and a large number of nodes with a relatively low number of connections. This creates a balance between the number of highly connected nodes and the number of less connected nodes, which allows for efficient information flow and rapid spreading of information throughout the network. Additionally, the degree distribution of a small-world network will typically follow a power-law distribution, with a few highly connected nodes and a large number of less connected nodes, further emphasizing the small-world property.\n",
"\n",
"As we have seen from the sections before, the distribution presented is far form Poissonian, and very close to a power law. However, the degree distribution alone is not enough to state that a real-world network is a small-world network because it does not take into account the specific relationships and interactions between the nodes in the network. A random network can also have a similar degree distribution, but the relationships between the nodes would be different from those in a small-world network.\n",
"\n",
"For example, a random network could be generated by randomly connecting nodes together without considering any specific relationships between them. In this case, the degree distribution may be similar to that of a social network, but the relationships between the nodes would be different.\n",
"\n",
"Additionally, to recreate this degree distribution with a random network, we can use the Barabasi-Albert model. This model generates a random network with a power-law degree distribution, which is similar to the degree distribution found in many real-world networks, including small-world networks. This model simulates the growth process of a network, where new nodes are added to the network and they preferentially connect to the existing nodes that have a high degree, this leads to a power-law degree distribution which is similar to the degree distribution of many small-world networks.\n",
"\n",
"## Betweenness centrality\n",
"\n",
"The betweenness centrality of a node in a network measures the number of times that node acts as a bridge or intermediary between other nodes in the network. In a small-world network, nodes have a high betweenness centrality because they often act as intermediaries between distant nodes, allowing for short paths and efficient communication between distant parts of the network. Therefore, a high degree of betweenness centrality in a network can be used to characterize its small-world propriety.\n",
"\n",
"To determine if the average betweenness centrality of a network is high or not we can compare it with the theoretical values of random networks. As the betweenness centrality is a measure of how much a node is used as a bridge between other nodes, random networks tend to have a low value of betweenness centrality. If the average betweenness centrality of our network is higher than the theoretical values of a random network, it can be considered a high value and therefore the network is more likely to be a small-world network.\n",
"\n",
"Let's test it out with our networks:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# As said before, for a quick testing I suggest to use k=0.6 and at least k=0.4 for accurate results\n",
"\n",
"# uncomment the model that you want to use for the random graphs\n",
"# model_name = 'watts_strogatz'\n",
"model_name = 'erdos_renyi'\n",
"\n",
"random_graphs = {}\n",
"for graph in graphs_all:\n",
" G = create_random_graphs(graph, model=model_name, save = False)\n",
" print(\"Random graph created for \", graph.name, \"\\nStarting computation of betweenness centrality...\")\n",
" betweenness_centrality = np.mean(list(betweenness_centrality_parallel(G, 6, k = 0.4).values()))\n",
" print(\"\\tBetweenness centrality for Erdos-Renyi random graph: \", betweenness_centrality)\n",
" random_graphs[graph.name] = betweenness_centrality\n",
" print(\"\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"fig, ax = plt.subplots(figsize=(15, 10))\n",
"index = np.arange(len(random_graphs))\n",
"bar_width = 0.35\n",
"opacity = 0.8\n",
"\n",
"rects1 = plt.bar(index, analysis_results['betweenness centrality'], bar_width,\n",
"alpha=opacity,\n",
"color='b',\n",
"label='Original Graph')\n",
"\n",
"rects2 = plt.bar(index + bar_width, random_graphs.values(), bar_width,\n",
"alpha=opacity,\n",
"color='g',\n",
"label='Random Graph')\n",
"\n",
"plt.xlabel('Graph')\n",
"plt.ylabel('Betweenness Centrality')\n",
"plt.title('Betweenness Centrality of the original graph and the random graph')\n",
"plt.xticks(index + bar_width, random_graphs.keys())\n",
"plt.legend()\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, there is a clear difference between the betweenness centrality of the networks generated from the checkins and the networks generated from the friendships. Since the values of the betweenness centrality of the networks generated from the checkins are higher than the theoretical values of a random network, we can conclude that the networks generated from the checkins are more likely to be a small-world network. On the other hand, the networks generated from the friendships have a lower value of betweenness centrality than the theoretical values of a random network, therefore we can conclude that the networks generated from the friendships are less likely to be a small-world network.\n",
"\n",
"This propriety appears both with the erdos-renyi and the watts-strogatz models."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clustering coefficient\n",
"\n",
"The simplest way `[5]` to treat clustering analytically in a small-world network is to use the link addition, rather than the rewiring model. In the limit of large network size, $N \\to \\infty$, and for a fixed fraction of shortcuts $\\phi$, it is clear that the probability of forming triangle vanishes as we approach $1/N$, so the contribution of the shortcuts to the clustering is negligible. Therefore, the clustering of a small-world network is determined by its underlying ordered lattice. For example, consider a ring where each node is connected to its $k$ closest neighbors from each side. A node's number of neighbors is therefore $2k$, and thus it has $2k(2k - 1)/2 = k(2k - 1)$ pairs of neighbors. Consider a node, $i$. All of the $k$ nearest nodes on $i$'s left are connected to each other, and the same is true for the nodes on $i$'s right. This amounts to $2k(k - 1)/2 = k(k - 1)$ pairs. Now consider a node located $d$ places to the left of $k$. It is also connected to its $k$ nearest neighbors from each side. Therefore, it will be connected to $k - d$ neighbors on $i$'s right side. The total number of connected neighbor pairs is\n",
"\n",
"\\begin{equation}\n",
" k(k-1) + \\sum_{d=1}^k (k-d) = k(k-1) + \\frac{k(k-1)}{2} = \\frac{3}{2} k (k-1)\n",
"\\end{equation}\n",
"\n",
"and the clustering coefficient is:\n",
"\n",
"\\begin{equation}\n",
" C = \\frac{\\frac{3}{2}k(k-1)}{k(2k-1)} =\\frac{3 (k-1)}{2(2k-1)}\n",
"\\end{equation}\n",
"\n",
"For every $k > 1$, this results in a constant larger than $0$, indicating that the clustering of a small-world network does not vanish for large networks. For large values of $k$, the clustering coefficient approaches $3/4$, that is, the clustering is very high. Note that for a regular two-dimensional grid, the clustering by definition is zero, since no triangles exist. However, it is clear that the grid has a neighborhood structure. `[2]`\n",
"\n",
"\n",
"--- \n",
"\n",
"We can compare the results of the clustering coefficient that we obtained with the standard formula, and the one that we obtained with the formula above. We can do that with the function `generalized_average_clustering_coefficient` in the `utils.py` file. The function takes as input a networkx graph object and returns a float: the average clustering coefficient of the graph."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"generalized_cc = {}\n",
"for graph in graphs_all:\n",
" generalized_cc[graph.name] = generalized_average_clustering_coefficient(graph)"
]
},
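{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check on the closed form derived above, we can compare it with networkx's measured clustering on a pure ring lattice, i.e. a Watts-Strogatz graph with rewiring probability $0$ and degree $2k$ ($k$ neighbors per side):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Ring lattice: measured clustering vs the analytic 3(k-1) / (2(2k-1))\n",
"import networkx as nx\n",
"\n",
"for k in [2, 3, 5, 10]:\n",
"    ring = nx.watts_strogatz_graph(1000, 2 * k, 0)  # p = 0: no rewiring\n",
"    analytic = 3 * (k - 1) / (2 * (2 * k - 1))\n",
"    print(\"k = {:>2}: measured C = {:.4f}, analytic C = {:.4f}\".format(\n",
"        k, nx.average_clustering(ring), analytic))"
]
},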
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots(figsize=(15, 10))\n",
"index = np.arange(len(generalized_cc))\n",
"bar_width = 0.35\n",
"opacity = 0.8\n",
"\n",
"rects1 = plt.bar(index, analysis_results['Average Clustering Coefficient'], bar_width,\n",
"alpha=opacity,\n",
"color='b',\n",
"label='Standard Clustering')\n",
"\n",
"rects2 = plt.bar(index + bar_width, generalized_cc.values(), bar_width,\n",
"alpha=opacity,\n",
"color='g',\n",
"label='Generalized Clustering')\n",
"\n",
"plt.xlabel('Graph')\n",
"plt.ylabel('Average Clustering Coefficient')\n",
"plt.title('Average Clustering Coefficient of the original graph and the generalized graph')\n",
"plt.xticks(index + bar_width, generalized_cc.keys())\n",
"plt.legend()\n",
"\n",
"plt.tight_layout()\n",
"plt.show()\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, for the graphs generated from the checkins, the two values are very similar. However, for the graphs generated from the friendships, the values are very different. This is another suggestion that the checkins graphs are more likely to be a small-world network than the friendships graphs. \n",
"\n",
"But this is not enough to jump to conclusions"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion: Omega coefficient\n",
"\n",
"We have already discussed a lot in the previous sections about this measure, let's see the results that we obtained after days of computations on the server:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"analysis_results[['Graph', 'omega-coefficient']]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To give you a better idea of how time consuming is this computation, I will report below the time that it took to compute the omega coefficient for the networks generated from all this networks:\n",
"\n",
"<!-- create a table -->\n",
"\n",
"| Network | Time |\n",
"|:-------:|:----:|\n",
"| Brightkite Checkins | 9d 11h 25m |\n",
"| Gowalla Checkins | 3d 2h 55m |\n",
"| FourSquare Checkins | 6d 14h 13m |\n",
"| Brightkite Friendships | 17h 55m |\n",
"| Gowalla Friendships | 2h 22m |\n",
"| FourSquare Friendships | 2h 9m |\n",
"\n",
"Note that due to the small size of the friendships graphs, I have been able to compute the omega coefficent for the whole networks. However, for the checkins graphs, I had to take a 50% sample of the nodes. In both cases, I used `niter` and `nrand` equal to 3."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"This results are a bit of a surprise. The small-world coefficient (omega) measures how much a network is like a lattice or a random graph. Negative values mean the graph is similar to a lattice whereas positive values mean the graph is more random-like. Values close to 0 instead, should represent small-world characteristics.\n",
"\n",
"Based only on this metric, we may conclude that all the networks are small-worlds. In fact, all the values of the omega coefficient are ~$0.2$ (with the exception of the foursquare checkins graph, whose value is very close to $0$). However, I don't think this is the case. \n",
"\n",
"We have seen in the previous section that the $\\omega$ coefficient can be tricked by networks that have a very low clustering coefficient, and in my opinion this is exactly what is happening here. The networks generated from the friendships have a very low clustering coefficient, and therefore they are biasing the $\\omega$ coefficient. This conclusion is supported by the fact the measures like the betweenness centrality and the clustering coefficient that we have shown before, suggest that the networks generated from the friendships are not small-world networks. \n",
"\n",
"Furthermore, on a more heuristic level, those graphs represent a social network with data taken in 2010, a time when social networks were not as popular as they are today. Therefore, I would not be surprised if those networks are not small-worlds. \n",
"\n",
"On the other hand, on a more technical level, I think that using `niter` and `nrand` equal to $3$ is not enough to reach a definitive conclusion. However, choosing bigger values would have exponentially increased the time needed to compute the $\\omega$ coefficient and reducing the number of nodes in the sample would have reduced the accuracy of the results. \n",
"\n",
"---\n",
"\n",
"To summarize the work done: this study evidences why the characterization of the small-world propriety of a real-world network is still subject of debate. Even if we have used the most reliable techniques that the literature has to offer, we still have not been able to reach a definitive conclusion and specific observations on the single networks were necessary. For real networks, we still have not reached the completeness (in a metaphorical way, not topological) of the theoretical models firstly proposed in the 60s by Erdős and Rényi."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# References\n",
"\n",
"> _In no particular order_\n",
"\n",
"`[1]` On the evolution of random graphs, P. Erdős, A. Rényi, _Publ. Math. Inst. Hungar. Acad. Sci._, 5, 17-61 (1960).\n",
"\n",
"`[2]` Complex Networks: Structure, Robustness, and Function, R. Cohen, S. Havlin, D. ben-Avraham, H. E. Stanley, _Cambridge University Press, 2009_.\n",
"\n",
"`[3]` Collective dynamics of 'small-world' networks, D. J. Watts and S. H. Strogatz, _Nature_, 393, 440-442, 1998.\n",
"\n",
"`[4]` On random graphs I, P. Erdős and A. Rényi, _Publ. Math. Inst. Hungar. Acad. Sci._, 5, 290-297, 1960.\n",
"\n",
"`[5]` Generalizations of the clustering coefficient to weighted complex networks, M. E. J. Newman, _Physical Review E_, 74, 036104, 2006.\n",
"\n",
"`[6]` The ubiquity of small-world networks. Telesford QK, Joyce KE, Hayasaka S, Burdette JH, Laurienti PJ. _Brain Connect_. 2011;1(5):367-75\n",
"\n",
"`[8]` Humphries and Gurney (2008). “Network Small-World-Ness: A Quantitative Method for Determining Canonical Network Equivalence”. PLoS One. 3 (4)\n",
"\n",
"`[9]` The brainstem reticular formation is a small-world, not scale-free, network M. D. Humphries, K. Gurney and T. J. Prescott, Proc. Roy. Soc. B 2006 273, 503-511,\n",
"\n",
"`[10]` Sporns, Olaf, and Jonathan D. Zwi. “The small world of the cerebral cortex.” Neuroinformatics 2.2 (2004): 145-162.\n",
"\n",
"`[11]` Maslov, Sergei, and Kim Sneppen. “Specificity and stability in topology of protein networks.” Science 296.5569 (2002): 910-913.\n",
"\n",
"`[13]` B. Bollob ́as, Random Graphs, 1985. London: Academic Press\n",
"\n",
"`[14]` R. Cohen, K. Erez, D. ben-Avraham, and S. Havlin, Resilience of the Internet to\n",
"random breakdown, Physical Review Letters 85 (2000), 46264628 \n",
"\n",
"`[15]` Dingqi Yang, Bingqing Qu, Jie Yang, Philippe Cudre-Mauroux, Revisiting User Mobility and Social Relationships in LBSNs: A Hypergraph Embedding Approach, In Proc. of The Web Conference (WWW'19). May. 2019, San Francisco, USA.\n",
"\n",
"`[16]` Ulrik Brandes, A Faster Algorithm for Betweenness Centrality, Journal of Mathematical Sociology, 25(2):163-177, 2001._\n",
"\n",
"`[17]` Error and attack tolerance of complex networks, R. Albert, Nature volume 406, pages378382 (2000) \n",
"\n",
"\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.10.8 64-bit",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "e7370f93d1d0cde622a1f8e1c04877d8463912d04d973331ad4851f04de6915a"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}