"# update the pandas datafram with the new values\n",
"# update the pandas dataframe with the new values\n",
"df_brighkite = gdf_brightkite\n",
"print(\"Number of unique users in Europe: \", len(df_brighkite['user id'].unique()))\n",
"\n",
@ -508,11 +508,11 @@
"\n",
"[Foursquare](https://foursquare.com/) is a location-based social networking website where users share their locations by checking-in. This dataset includes long-term (about 22 months from Apr. 2012 to Jan. 2014) global-scale check-in data collected from Foursquare, and also two snapshots of user social networks before and after the check-in data collection period (see more details in our paper). We will work with three different datasets:\n",
"\n",
"- `data/foursquare/foursquare_checkins.txt`: a tsv file with 4 columns: `User ID`, `Venue ID`, `UTC time`, `Timezone offset in minutes` \n",
"- `foursquare_checkins.txt`: a tsv file with 4 columns: `User ID`, `Venue ID`, `UTC time`, `Timezone offset in minutes` \n",
"\n",
"- `data/foursquare/foursquare_friends_edges.txt`: the friendship network, a tsv file with 2 columns of users ids. This is in the form of a graph edge list. \n",
"- `foursquare_friends_edges.txt`: the friendship network, a tsv file with 2 columns of user IDs. This is in the form of a graph edge list. \n",
"\n",
"- `data/foursquare/raw_POIs.txt`: the POIS, a tsv file with 5 columns: `Venue ID`, `Latitude`, `Longitude`, `Venue category name`, `Country code (ISO)`.\n",
"- `raw_POIs.txt`: the POIs, a tsv file with 5 columns: `Venue ID`, `Latitude`, `Longitude`, `Venue category name`, `Country code (ISO)`.\n",
"\n",
"--- \n",
"\n",
@ -10385,12 +10385,13 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can clearly see from the graphs obtained, the degree distribution of the networks is not Poissonian, but rather scale-free. This is a good indication that the networks are not random, but rather small-world.\n",
"\n",
"Let's try to plot the distribution degree of a random Erdos-Renyi graph with the same number of nodes and a probability of edge creation equal to the number of edges of the network divided by the number of possible edges. We expect to see a Poissonian distribution.\n",
"Let's try to plot the degree distribution of a random Watts-Strogatz graph with the same number of nodes and a probability of edge creation equal to the number of edges of the network divided by the number of possible edges. We expect to see a Poissonian distribution.\n",
"\n",
"> This is a time-saving approach, NOT a rigorous one. To be rigorous, we should follow the algorithm proposed by Maslov and Sneppen, implemented in the networkx function `random_reference`."
]
@ -13368,17 +13369,13 @@
}
],
"source": [
"# for each network, create a Watts-Strogatz model of the original. If you want to test it with the Erdos-Renyi model, uncomment the two commented lines below and comment out the Watts-Strogatz lines above them\n",
"\n",
"for graph in checkins_graphs:\n",
"\n",
"    p = nx.density(graph)  # number of edges / number of possible edges\n",
"    avg_degree = int(np.mean([d for n, d in graph.degree()]))\n",
"    G = nx.watts_strogatz_graph(graph.number_of_nodes(), avg_degree, p)\n",
"    G.name = graph.name + \" Watts-Strogatz\"\n",
" # G = nx.erdos_renyi_graph(graph.number_of_nodes(), nx.density(graph))\n",
" # G.name = graph.name + \" Erdos-Renyi\"\n",
" print(G.name)\n",
" print(\"Number of nodes: \", G.number_of_nodes())\n",
" print(\"Number of edges: \", G.number_of_edges())\n",
@ -16363,17 +16360,13 @@
}
],
"source": [
"# for each network, create a Watts-Strogatz model of the original graph. If you want to test it with the Erdos-Renyi model, uncomment the two commented lines below and comment out the Watts-Strogatz lines above them\n",
"\n",
"for graph in friendships_graph:\n",
"\n",
"    p = nx.density(graph)  # number of edges / number of possible edges\n",
"    avg_degree = int(np.mean([d for n, d in graph.degree()]))\n",
"    G = nx.watts_strogatz_graph(graph.number_of_nodes(), avg_degree, p)\n",
"    G.name = graph.name + \" Watts-Strogatz\"\n",
" # G = nx.erdos_renyi_graph(graph.number_of_nodes(), nx.density(graph))\n",
" # G.name = graph.name + \" Erdos-Renyi\" \n",
"\n",
" print(G.name)\n",
" print(\"Number of nodes: \", G.number_of_nodes())\n",
@ -16397,7 +16390,7 @@
"source": [
"## The Small-World Model\n",
"\n",
"It should be clarified that real networks are not random. Their formation and development are dictated by a combination of many different processes and influences. These influencing conditions include natural limitations and processes, human considerations such as optimal performance and robustness, economic considerations, natural selection and many others. Controversies still exist regarding the measure to which random models represent real-world networks. However, in this section we will focus on random network models and attempt to show if their properties may still be used to study properties of our real-world networks. \n",
"Let's start by clarifying that real networks are not random. Their formation and development are dictated by a combination of many different processes and influences. These influencing conditions include natural limitations and processes, human considerations such as optimal performance and robustness, economic considerations, natural selection and many others. Controversies still exist regarding the extent to which random models represent real-world networks. However, in this section we will focus on random network models and attempt to show whether their properties may still be used to study properties of our real-world networks. \n",
"\n",
"Many real-world networks have many properties that cannot be explained by the ER model. One such property is the high clustering observed in many real-world networks. This led Watts and Strogatz to develop an alternative model, called the “small-world” model. Quoting their paper:\n",
"\n",
@ -16427,7 +16420,7 @@
"\n",
"## Identifying small-world networks\n",
"\n",
"Small-world networks are distinguished from other networks by two specific properties, the first being high clustering (C) among nodes. High clustering supports specialization as local collections of strongly interconnected nodes readily share information or resources. Conceptually, clustering is quite straightforward to comprehend. In a real-world analogy, clustering represents the probability that one’s friends are also friends of each other. Small-world networks also have short path lengths (L) as is commonly observed in random networks. Path length is a measure of the distance between nodes in the network, calculated as the mean of the shortest geodesic distances between all possible node pairs. Small values of $L$ ensure that information or resources easily spreads throughout the network. This property makes distributed information processing possible on technological networks and supports the six degrees of separation often reported in social networks.\n",
"Small-world networks are distinguished from other networks by two specific properties, the first being high clustering ($C$) among nodes. High clustering supports specialization as local collections of strongly interconnected nodes readily share information or resources. Conceptually, clustering is quite straightforward to comprehend. In a real-world analogy, clustering represents the probability that one’s friends are also friends of each other. Small-world networks also have short path lengths ($L$), as is commonly observed in random networks. The path length is a measure of the distance between nodes in the network, calculated as the mean of the shortest geodesic distances between all possible node pairs. Small values of $L$ ensure that information or resources spread easily throughout the network. This property makes distributed information processing possible on technological networks and supports the six degrees of separation often reported in social networks.\n",
"\n",
"Watts and Strogatz developed a network model (WS model) that resulted in the first-ever networks with clustering close to that of a lattice and path lengths similar to those of random networks. The WS model demonstrates that random rewiring of a small percentage of the edges in a lattice results in a precipitous decrease in the path length, but only trivial reductions in the clustering. Across this rewiring probability, there is a range where the discrepancy between clustering and path length is very large, and it is in this area that the benefits of small-world networks are realized.\n",
"\n",
@ -16452,69 +16445,10 @@
"\n",
"#### Limitations\n",
"\n",
"The length of time it takes to generate lattice networks, particularly for large networks.Although\n",
"latticization is fast in smaller networks, large networks such as functional brain networks and the Internet can take several\n",
"hours to generate and optimize. The latticization procedure described here uses an algorithm developed by Sporns and\n",
"One limitation is the length of time it takes to generate lattice networks, particularly for large networks. Although latticization is fast in smaller networks, large networks such as functional brain networks and the Internet can take several days to generate and optimize. The latticization procedure described here uses an algorithm developed by Sporns and\n",
"Zwi in 2004, but the algorithm was used on much smaller datasets. \n",
"\n",
"Furthermore, $\\omega$ is limited by networks that have very low clustering that cannot be appreciably increased, such as networks with ‘‘super hubs’’ or hierarchical networks. In hierarchical networks, the nodes are often configured in branches\n",
"that contain little to no clustering. In networks with ‘‘super hubs,’’ the network may contain a hub that has a node with\n",
"a degree that is several times in magnitude greater than the next most connected hub. In both these networks, there are\n",
"fewer configurations to increase the clustering of the network. Moreover, in a targeted assault of these networks, the topology is easily destroyed (Albert et al., 2000). Such vulnerability to attack signifies a network that may not be small-world."
"Furthermore, $\\omega$ is limited by networks that have very low clustering that cannot be appreciably increased, such as networks with 'super hubs' or hierarchical networks. In hierarchical networks, the nodes are often configured in branches that contain little to no clustering. In networks with 'super hubs', the network may contain a hub whose degree is several times greater than that of the next most connected hub. In both kinds of network, there are fewer configurations that increase the clustering of the network. Moreover, in a targeted assault of these networks, the topology is easily destroyed (Albert et al., 2000). Such vulnerability to attack signifies a network that may not be small-world."
" if not os.path.exists(os.path.join(\"dataTMP\", folder)):\n",
" os.mkdir(os.path.join(\"dataTMP\", folder))\n",
"\n",
" # Download every url in their respective folder. For the last one, we have to use gdown, because it's a google drive link. If the file is already downloaded, skip the download\n",
"\n",
" for i in range(len(urls)):\n",
" for url in urls[i]:\n",
" if not os.path.exists(os.path.join(\"dataTMP\", folders[i], url.split(\"/\")[-1])):\n",
"    # # Now, from the _totalCheckins.txt files, we want to keep only the first and last columns, which are the user ID and the venue ID. We also want to remove the header of the file.\n",
"\n",
" # for file in os.listdir(os.path.join(\"dataTMP\", \"brightkite\")):\n",
The datasets are downloaded in the "data" folder. If the folder doesn't exist, it will be created. If the dataset is already downloaded, it will be skipped. The files are renamed to make them more readable.
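The download behaviour described above (create the folder if needed, skip files that are already present, optionally rename) can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper name `download_if_missing` and its `filename` and `fetch` parameters are hypothetical, and it only covers plain HTTP links, not the Google Drive link that the notebook handles with gdown.

```python
import os
import urllib.request


def download_if_missing(url, folder, filename=None, fetch=urllib.request.urlretrieve):
    """Download `url` into `folder` unless the target file already exists.

    Returns the local path and a flag telling whether a download happened.
    `fetch` is any callable taking (url, target_path); it is a parameter only
    so that the download step can be swapped out (hypothetical design choice).
    """
    # Create the destination folder (and any missing parents).
    os.makedirs(folder, exist_ok=True)

    # Use the requested, more readable name, or fall back to the last URL component.
    target = os.path.join(folder, filename or url.split("/")[-1])

    # Already downloaded: skip.
    if os.path.exists(target):
        return target, False

    fetch(url, target)
    return target, True
```

Calling the helper twice with the same arguments should trigger only one download; the second call returns the existing path with the flag set to `False`.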