{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2\n", "\n", "import os\n", "import wget\n", "import zipfile\n", "import numpy as np\n", "import pandas as pd\n", "import networkx as nx\n", "import plotly.graph_objects as go\n", "from utils import *\n", "from collections import Counter\n", "from tqdm import tqdm\n", "import time\n", "\n", "# ignore warnings\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Discovering the datasets" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To perform our analysis, we will use the following datasets:\n", "\n", "- **Brightkite**\n", "- **Gowalla**\n", "- **Foursquare**\n", "\n", "We can download the datasets using the function `download_dataset` from the `utils` module. It will download the datasets in the `data` folder, organized in sub-folders in the following way:\n", "\n", "```\n", "data\n", "├── brightkite\n", "│ ├── brightkite_checkins.txt\n", "│ └── brightkite_friends_edges.txt\n", "├── foursquare\n", "│ ├── foursquare_checkins.txt\n", "│ ├── foursquare_friends_edges.txt\n", "│ └── raw_POIs.txt\n", "└── gowalla\n", " ├── gowalla_checkins.txt\n", " └── gowalla_friends_edges.txt\n", "```\n", "\n", "If any of the datasets is already downloaded, it will not be downloaded again. For further details about the function below, please refer to the `utils` module.\n", "\n", "> NOTE: the Stanford servers tends to be slow, so it may take a while to download the datasets. It's gonna take about 5 minutes to download all the datasets.\n", "\n", "---\n", "\n", "### A deeper look at the datasets\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "download_datasets()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a deeper look at them.\n", "\n", "## Brightkite\n", "\n", "[Brightkite](http://www.brightkite.com/) was once a location-based social networking service provider where users shared their locations by checking-in. The friendship network was collected using their public API. We will work with two different datasets. This is how they look like after being filtered by the `download_dataset` function:\n", "\n", "- `data/brightkite/brightkite_friends_edges.txt`: the friendship network, a tsv file with 2 columns of users ids. This file it's untouched by the function, it's in the form of a graph edge list.\n", "\n", "\n", "- `data/brightkite/brightkite_checkins.txt`: the checkins, a tsv file with 2 columns of user id and location. This is not in the form of a graph edge list, in the next section we will see how to convert it into a graph. Originally there were other columns, but we will not use them." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Gowalla\n", "\n", "Gowalla is a location-based social networking website where users share their locations by checking-in. The friendship network is undirected and was collected using their public API. As for Brightkite, we will work with two different datasets. This is how they look like after being filtered by the `download_dataset` function:\n", "\n", "- `data/gowalla/gowalla_checkins.txt`: the checkins, a tsv file with 2 columns of user id and location. This is not in the form of a graph edge list. Originally there were other columns, such as the time of the checkins. During the filtering, we used this information to extract only the checkins from 2009 and then deleted it. This is why the number of checkins is smaller than the original dataset. \n", "\n", "- `data/gowalla/gowalla_friends_edges.txt`: the friendship network, a tsv file with 2 columns of users ids. This file it's untouched by the function, it's in the form of a graph edge list. In the next section when we will build the friendship network, we will only consider the users that have at least one check-in in 2009 to avoid having biases in the analysis." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Foursquare\n", "\n", "[Foursquare](https://foursquare.com/) is a location-based social networking website where users share their locations by checking-in. This dataset includes long-term (about 22 months from Apr. 2012 to Jan. 2014) global-scale check-in data collected from Foursquare, and also two snapshots of user social networks before and after the check-in data collection period (see more details in our paper). We will work with three different datasets:\n", "\n", "- `data/foursquare/foursquare_checkins.txt`: a tsv file with 2 columns of user id and location. This is not in the form of a graph edge list. This fill will remain untouched by the function but due to its size, in the next sections we will focus on the EU sub-sample and the IT sub-sample. The friendship edge list will be modified accordingly.\n", "\n", "- `data/foursquare/foursquare_friends_edges.txt`: the friendship network, a tsv file with 2 columns of users ids. This is in the form of a graph edge list. \n", "\n", "- `data/foursquare/raw_POIs.txt`: the POIS, a tsv file with 2 columns of location and country ISO. We are going to use this file to create the sub-samples of the dataset.\n", "\n", "> **NOTE:** In this case I preferred not to take sub-samples based on time. The reason is that there may be a period of time where the social network was not very popular in some countries, so the analysis may be biased. Instead, I decided to take sub-samples based on the country. In this way I have a more homogeneous dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Building the networks" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We are asked to construct the networks for the three datasets as an undirected graph $M = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. The nodes represent the users and the edges indicates that two individuals visited the same location at least once.\n", "\n", "The check-ins files of the three datasets are not in the form of a graph edge list, so we need to manipulate them. Let's have a look at the number of lines of each file (note that gowalla is already filtered, only 2009 data are present)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def count_lines_and_unique_elements(file):\n", " df = pd.read_csv(file, sep='\\t', header=None)\n", " print('Number of lines: ', len(df))\n", " print('Number of unique elements: ', len(df[0].unique()))\n", "\n", "gowalla_path = os.path.join('data', 'gowalla', 'gowalla_checkins.txt')\n", "brightkite_path = os.path.join('data', 'brightkite', 'brightkite_checkins.txt')\n", "foursquare_path = os.path.join('data', 'foursquare', 'foursquare_checkins.txt')\n", "\n", "_ = [gowalla_path, brightkite_path, foursquare_path]\n", "\n", "for path in _:\n", " print(path.split(os.sep)[-2])\n", " count_lines_and_unique_elements(path)\n", " print()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We would like to build a graph starting from an edge list. To do that, we are going to check, for each venue, all the users that visited it. Then, we will create an edge between each pair of users that visited the same venue (avoid repetitions). This can be easily done in python, but it's going to be a bit slow (this is why we are considering sub-samples of the datasets). Let's see how to do it.\n", "\n", "```python\n", "# let df be the dataframe [\"user_id\", \"venue_id\"] of the checkins\n", "\n", "venues_users = df.groupby(\"venue_id\")[\"user_id\"].apply(set)\n", "\n", " for users in venues_users:\n", " for user1, user2 in combinations(users, 2):\n", " G.add_edge(user1, user2)\n", "```\n", "\n", "It the `utilis.py` module, we have a function that does exactly this called `create_graph_from_checkins`. It takes as input the name of the dataset and returns a networkx graph object. By default it will also write the edge list to a file in the respective dataset folder. The options are\n", "\n", "- `brightkite`\n", "- `gowalla`\n", "- `foursquareEU`\n", "- `foursquareIT`\n", "\n", "Let's see how it works:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# It takes about 4 minutes to create the all the 4 graphs on a i7-8750H CPU\n", "\n", "G_brighkite_checkins = create_graph_from_checkins('brightkite')\n", "G_brighkite_checkins.name = 'Brightkite Checkins Graph'\n", "\n", "G_gowalla_checkins = create_graph_from_checkins('gowalla')\n", "G_gowalla_checkins.name = 'Gowalla Checkins Graph'\n", "\n", "G_foursquareEU_checkins = create_graph_from_checkins('foursquareEU')\n", "G_foursquareEU_checkins.name = 'Foursquare EU Checkins Graph'\n", "\n", "G_foursquareIT_checkins = create_graph_from_checkins('foursquareIT')\n", "G_foursquareIT_checkins.name = 'Foursquare IT Checkins Graph'" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Friendship network\n", "\n", "Now we want to create a graph where two users are connected if they are friends in the social network. We are intending the concept of friendship in a \"facebook way\", not a \"twitter way\". Less empirically, the graphs is not going to be directed and the edges are not going to be weighted. A user can't be friend with himself, and can't be friend with a user without the user being friend with him.\n", "\n", "Since we filtered the checkins for foursquare and gowalla, we are considering only the users that are also present in the check-ins graph. We can build this graph with the function `create_friendships_graph` in the `utils.py` module. It takes as input the name of the dataset and returns a networkx graph object. By default it will also write the edge list to a file in the respective dataset folder. The options are\n", "\n", "- `brightkite`\n", "- `gowalla`\n", "- `foursquareEU`\n", "- `foursquareIT`\n", "\n", "> **NOTE:** This functions is implemented without the necessity of the checkins graphs being loaded in memory, it uses the edge list file. This choice was made since someone may want to perform some analysis only on the friendship network and so there is no need to load the checkins graph and waste memory. Furthermore, networkx is tremendously slow when loading a graph from an edge list file (since it's written in pure python), so this choice is also motivated by the speed of the function.\n", "\n", "Let's see how it works:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G_brighkite_friends = create_friendships_graph('brightkite')\n", "print(\"Computation done for Brightkite friendship graph\")\n", "G_brighkite_friends.name = 'Brightkite Friendship Graph'\n", "\n", "G_gowalla_friends = create_friendships_graph('gowalla')\n", "print(\"Computation done for (filtered) Gowalla friendship graph\")\n", "G_gowalla_friends.name = '(Filtered) Gowalla Friendship Graph'\n", "\n", "G_foursquareIT_friends = create_friendships_graph('foursquareIT')\n", "print(\"Computation done for Foursquare IT friendship graph\")\n", "G_foursquareIT_friends.name = 'Foursquare IT Friendship Graph'\n", "\n", "G_foursquareEU_friends = create_friendships_graph('foursquareEU')\n", "print(\"Computation done for Foursquare EU friendship graph\")\n", "G_foursquareEU_friends.name = 'Foursquare EU Friendship Graph'\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have our graphs, let's have a look at some basic information about them" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for G in [G_brighkite_friends, G_gowalla_friends, G_foursquareIT_friends, G_foursquareEU_friends]:\n", " print(G.name)\n", " print('Number of nodes: ', G.number_of_nodes())\n", " print('Number of edges: ', G.number_of_edges())\n", " print()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Analysis of the structure of the networks\n", "" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Degree distribution\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "checkins_graphs = [G_brighkite_checkins, G_gowalla_checkins, G_foursquareEU_checkins, G_foursquareIT_checkins]\n", "\n", "for graph in checkins_graphs:\n", " degree_distribution(graph, log=True)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "--- \n", "\n", "Let's see how does it changes for the friendship networks" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "friendships_graph = [G_brighkite_friends, G_gowalla_friends, G_foursquareIT_friends, G_foursquareEU_friends]\n", "\n", "for graph in friendships_graph:\n", " degree_distribution(graph, log=True)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We may be curious to see if the whole friendship network has a different degree distribution than the filtered one. Let's see if there are any differences" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G1 = nx.read_edgelist('data/brightkite/brightkite_friends_edges.txt', nodetype=int)\n", "G1.name = 'Brightkite Friendship Graph'\n", "G2 = nx.read_edgelist('data/gowalla/gowalla_friends_edges.txt', nodetype=int)\n", "G2.name = 'Gowalla Friendship Graph'\n", "G3 = nx.read_edgelist('data/foursquare/foursquare_friends_edges.txt', nodetype=int)\n", "G3.name = 'Foursquare Friendship Graph'\n", "\n", "degree_distribution(G1, log=True)\n", "degree_distribution(G2, log=True)\n", "degree_distribution(G3, log=True)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, there are no notable differences, and this is not surprising. We where only taking into considerations some edge cases. Maybe in siberia this was a very popular social network, but since it's a very harsh environment, being friends on the social network was it's not synonymous of visiting the same places together (where do you go in siberia?). " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "--- \n", "\n", "Now, we can compute the average degree for each checkins graph and for the friendship graph" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create a dataframe with the average degree for each graph\n", "average_degree = pd.DataFrame(columns=['Graph', 'Average Degree'], index=None)\n", "\n", "for graph in tqdm(checkins_graphs):\n", " average_degree = average_degree.append({'Graph': graph.name, 'Average Degree': np.mean(list(dict(graph.degree()).values()))}, ignore_index=True)\n", "\n", "for graph in tqdm(friendships_graph):\n", " average_degree = average_degree.append({'Graph': graph.name, 'Average Degree': np.mean(list(dict(graph.degree()).values()))}, ignore_index=True)\n", "\n", "print(average_degree) " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Clustering coefficient\n", "\n", "\n", "--- \n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "analysis_results = pd.DataFrame(columns=['Graph', 'Number of Nodes', 'Number of Edges', 'Average Degree', 'Average Clustering Coefficient', 'log N', 'Average Shortest Path Length', 'betweenness centrality'], index=None)\n", "\n", "graphs_all = checkins_graphs + friendships_graph" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# It's going to take a while (about 10 minutes). The time complexity is O(n^2) since we are iterating over all the nodes and their neighbors. \n", "\n", "clustering_results = pd.DataFrame(columns=['Graph', 'Average Clustering Coefficient'], index=None)\n", "\n", "for graph in friendships_graph:\n", " print(graph.name)\n", " clustering_results = clustering_results.append(\n", " {'Graph': graph.name, \n", " 'Number of Nodes': graph.number_of_nodes(),\n", " 'Number of Edges': graph.number_of_edges(),\n", " 'Average Clustering Coefficient': nx.average_clustering(graph),\n", " 'log N': np.log(graph.number_of_nodes()),\n", " 'Average Shortest Path Length': mean_shortest_path(graph), \n", " 'betweenness centrality': nx.betweenness_centrality(G)}, \n", " ignore_index=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(clustering_results)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now we can use our formula to compute the clustering coefficient in a small world network" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Average Path Length\n", "\n", "" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Betweenness Centrality\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "betweenness_results = pd.DataFrame(columns=['Graph', 'Betweenness Centrality'])\n", "\n", "for graph in checkins_graphs:\n", " betweenness_results = betweenness_results.append(\n", " {'Graph': graph.name,\n", " 'Betweenness Centrality': np.mean(list(nx.betweenness_centrality(graph).values()))}, \n", " ignore_index=True)\n", "\n", "betweenness_results" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def small_world_clustering(graph: nx.Graph):\n", " tmp = 0\n", " for node in tqdm(graph.nodes()):\n", " k = len(list(graph.neighbors(node)))\n", " if k >=1:\n", " tmp += (3*(k-1))/(2*(2*k-1))\n", " return tmp/graph.number_of_nodes()\n", "\n", "print(\"Clustering coefficient for the Watts-Strogatz graph: \", small_world_clustering(G_ws))\n", "\n", "print(\"Clustering coefficient for the Brightkite checkins graph: \", small_world_clustering(G_brighkite_checkins))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.10.6 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1" } } }, "nbformat": 4, "nbformat_minor": 2 }