{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%reload_ext autoreload\n",
"\n",
"import os\n",
"import zipfile\n",
"import wget\n",
"import networkx as nx\n",
"from main import *\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Discovering the datasets"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To perform our analysis, we will use the following datasets:\n",
"\n",
"- **Brightkite**\n",
"- **Gowalla**\n",
"- **Foursquare**\n",
"\n",
"We can download the datasets using the function `download_datasets` from the `utils` module. It downloads the datasets into the `data` folder, organized in sub-folders as follows:\n",
"\n",
"```\n",
"├── brightkite\n",
"│   ├── brightkite_checkins.txt\n",
"│   └── brightkite_friends_edges.txt\n",
"├── foursquare\n",
"│   ├── foursquare_checkins_NYC.txt\n",
"│   └── foursquare_checkins_TKY.txt\n",
"└── gowalla\n",
"    ├── gowalla_checkins.txt\n",
"    └── gowalla_friends_edges.txt\n",
"```\n",
"\n",
"If any of the datasets has already been downloaded, it will not be downloaded again. For further details about the function below, please refer to the `utils` module.\n",
"\n",
"> NOTE: the Stanford servers tend to be slow, so downloading all the datasets may take about 2 to 3 minutes."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"download_datasets()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a closer look at them.\n",
"\n",
"## Brightkite\n",
"\n",
"[Brightkite](http://www.brightkite.com/) was once a location-based social networking service provider where users shared their locations by checking in. The friendship network was collected using their public API. We will work with two different files:\n",
"\n",
"- `data/brightkite/brightkite_friends_edges.txt`: the friendship network, a TSV file with two columns of user IDs\n",
"- `data/brightkite/brightkite_checkins.txt`: the check-ins, a TSV file with two columns: user ID and location. This is not in the form of a graph edge list; in the next section we will see how to convert it into a graph."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Gowalla\n",
"\n",
"Gowalla is a location-based social networking website where users share their locations by checking in. The friendship network is undirected and was collected using their public API. As with Brightkite, we will work with two different files:\n",
"\n",
"- `data/gowalla/gowalla_friends_edges.txt`: the friendship network, a TSV file with two columns of user IDs\n",
"- `data/gowalla/gowalla_checkins.txt`: the check-ins, a TSV file with two columns: user ID and location. This is not in the form of a graph edge list; in the next section we will see how to convert it into a graph."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Foursquare\n",
"\n",
"[Foursquare](https://foursquare.com/) is a location-based social networking website where users share their locations by checking in. This dataset includes long-term (about 10 months) check-in data in New York City and Tokyo, collected from Foursquare from 12 April 2012 to 16 February 2013. It contains two files in TSV format, each with two columns:\n",
"\n",
"1. User ID (anonymized)\n",
"2. Venue ID (Foursquare)\n",
"\n",
"In this case, we don't have any information about the friendship network, so we will only work with the check-ins."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Building the networks"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We are asked to construct the networks for the three datasets as an undirected graph $M = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. The nodes represent the users, and an edge indicates that two individuals visited the same location at least once.\n",
"\n",
"And this is where the fun begins! The check-in files of the three datasets are not in the form of a graph edge list, so we need to manipulate them. But those datasets are huge! Let's have a look at the number of lines of each file."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"gowalla\n",
"Number of lines: 6442892\n",
"Number of unique elements: 107092\n",
"\n",
"brightkite\n",
"Number of lines: 4747287\n",
"Number of unique elements: 51406\n",
"\n",
"foursquare\n",
"Number of lines: 227428\n",
"Number of unique elements: 1083\n",
"\n",
"foursquare\n",
"Number of lines: 573703\n",
"Number of unique elements: 2293\n",
"\n"
]
}
],
"source": [
"def count_lines_and_unique_elements(file):\n",
"    # Load the tab-separated check-ins; column 0 holds the user id\n",
"    df = pd.read_csv(file, sep='\\t', header=None)\n",
"    print('Number of lines: ', len(df))\n",
"    print('Number of unique elements: ', len(df[0].unique()))\n",
"\n",
"gowalla_path = os.path.join('data', 'gowalla', 'gowalla_checkins.txt')\n",
"brightkite_path = os.path.join('data', 'brightkite', 'brightkite_checkins.txt')\n",
"foursquareNYC_path = os.path.join('data', 'foursquare', 'foursquare_checkins_NYC.txt')\n",
"foursquareTKY_path = os.path.join('data', 'foursquare', 'foursquare_checkins_TKY.txt')\n",
"\n",
"paths = [gowalla_path, brightkite_path, foursquareNYC_path, foursquareTKY_path]\n",
"\n",
"for path in paths:\n",
"    # The parent folder name identifies the dataset\n",
"    print(path.split(os.sep)[-2])\n",
"    count_lines_and_unique_elements(path)\n",
"    print()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We would like to build the graph starting from an edge list. The basic idea is to create a dictionary where the keys are the unique users and the values are the locations they visited. Then we can iterate over the dictionary and create an edge for every pair of users who share at least one location.\n",
"\n",
"But even if we avoid repetitions, the time complexity is $O(n^2)$, where $n$ is the number of users. The larger datasets have up to about $10^5$ users and millions of check-ins, so doing this in Python with nested for loops is a no-go. We need to find a faster way to do this.\n",
"\n",
"In the `utils` module I nevertheless provide a function that does exactly this, but I do not recommend using it unless you have countless hours to spare. It is called `create_checkins_graph_SLOW`; it takes a dataset name as input and returns a networkx graph object.\n",
"\n",
"TODO: write something about the C++ function.\n",
"\n",
"The function will output a new `.tsv` file in the form of an edge list, in the `data` folder. Since the C++ program needs to be compiled, I have already created the edge lists for the four datasets, so you can skip this step if you want.\n",
"\n",
"Once we have our edge list, we can build the graph using the function `checkins_graph_from_edges` from the `utils` module. It takes the name of the dataset as input and returns a networkx graph object. The options are:\n",
"\n",
"- `brightkite`\n",
"- `gowalla`\n",
"- `foursquareNYC`\n",
"- `foursquareTKY`\n",
"\n",
"Let's see how it works:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"G_brighkite_checkins = checkins_graph_from_edges('brightkite')\n",
"G_brighkite_checkins.name = 'Brightkite Checkins Graph'\n",
"\n",
"G_gowalla_checkins = checkins_graph_from_edges('gowalla')\n",
"G_gowalla_checkins.name = 'Gowalla Checkins Graph'\n",
"\n",
"G_foursquareNYC_checkins = checkins_graph_from_edges('foursquareNYC')\n",
"G_foursquareNYC_checkins.name = 'Foursquare NYC Checkins Graph'\n",
"\n",
"G_foursquareTKY_checkins = checkins_graph_from_edges('foursquareTKY')\n",
"G_foursquareTKY_checkins.name = 'Foursquare TKY Checkins Graph'"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have our graphs, let's have a look at some basic information about them."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Brightkite Checkins Graph\n",
"Number of nodes: 44058\n",
"Number of edges: 106699\n",
"\n",
"Gowalla Checkins Graph\n",
"Number of nodes: 44058\n",
"Number of edges: 106699\n",
"\n",
"Foursquare NYC Checkins Graph\n",
"Number of nodes: 2293\n",
"Number of edges: 31261\n",
"\n",
"Foursquare TKY Checkins Graph\n",
"Number of nodes: 1078\n",
"Number of edges: 7273\n",
"\n"
]
}
],
"source": [
"for G in [G_brighkite_checkins, G_gowalla_checkins, G_foursquareNYC_checkins, G_foursquareTKY_checkins]:\n",
"    print(G.name)\n",
"    print('Number of nodes: ', G.number_of_nodes())\n",
"    print('Number of edges: ', G.number_of_edges())\n",
"    print()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Friendship network\n",
"\n",
"If we want to build the friendship network, fortunately we already have the edge list for the Gowalla and Brightkite datasets, so we could just use the `read_edgelist` function from networkx. For the Foursquare dataset, we don't have any information about the friendships between users, so we will only work with the check-in graph.\n",
"\n",
"To build the friendship network of the first two datasets, we can use the `friendships_graph` function from the `utils` module. It takes a dataset name as input and returns a networkx graph object. The implementation is pretty straightforward: it just uses the `from_pandas_edgelist` function from networkx."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"G_brighkite_friends = friendships_graph('brightkite')\n",
"G_brighkite_friends.name = 'Brightkite Friendship Graph'\n",
"\n",
"G_gowalla_friends = friendships_graph('gowalla')\n",
"G_gowalla_friends.name = 'Gowalla Friendship Graph'"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have our graphs, let's have a look at some basic information about them."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Brightkite Friendship Graph\n",
"Number of nodes: 58228\n",
"Number of edges: 214078\n",
"\n",
"Gowalla Friendship Graph\n",
"Number of nodes: 196591\n",
"Number of edges: 950327\n",
"\n"
]
}
],
"source": [
"for G in [G_brighkite_friends, G_gowalla_friends]:\n",
"    print(G.name)\n",
"    print('Number of nodes: ', G.number_of_nodes())\n",
"    print('Number of edges: ', G.number_of_edges())\n",
"    print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Analysis of the structure of the networks"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.10.6 64-bit",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}