{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%reload_ext autoreload\n",
"\n",
"import os\n",
"import zipfile\n",
"import wget\n",
"import networkx as nx\n",
"from main import *\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Discovering the datasets"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To perform our analysis, we will use the following datasets:\n",
"\n",
"- **Brightkite**\n",
"- **Gowalla**\n",
"- **Foursquare**\n",
"\n",
"We can download the datasets using the function `download_dataset` from the `utils` module. It will download the datasets in the `data` folder, organized in sub-folders in the following way:\n",
"\n",
"```\n",
"├── brightkite\n",
"│   ├── brightkite_checkins.txt\n",
"│   └── brightkite_friends_edges.txt\n",
"├── foursquare\n",
"│   ├── foursquare_checkins_NYC.txt\n",
"│   ├── foursquare_checkins_TKY.txt\n",
"└── gowalla\n",
" ├── gowalla_checkins.txt\n",
" └── gowalla_friends_edges.txt\n",
"```\n",
"\n",
"If any of the datasets is already downloaded, it will not be downloaded again. For further details about the function below, please refer to the `utils` module.\n",
"\n",
 NOTE">
"> NOTE: the Stanford servers tend to be slow, so it may take a while to download the datasets (about 2 to 3 minutes in total)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"download_datasets()"
]
},
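{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, we can list the files inside the `data` folder and compare them with the layout described above. This is optional and only uses the standard library, so nothing here depends on the `utils` module."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# walk the data folder and print every downloaded file\n",
"for root, dirs, files in os.walk('data'):\n",
"    for file in sorted(files):\n",
"        print(os.path.join(root, file))"
]
},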
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's have a deeper look at them.\n",
"\n",
"## Brightkite\n",
"\n",
"[Brightkite](http://www.brightkite.com/) was once a location-based social networking service provider where users shared their locations by checking-in. The friendship network was collected using their public API. We will work with two different datasets:\n",
"\n",
"- `data/brightkite/brightkite_friends_edges.txt`: the friendship network, a tsv file with 2 columns of users ids\n",
"- `data/brightkite/brightkite_checkins.txt`: the checkins, a tsv file with 2 columns of user id and location. This is not in the form of a graph edge list, in the next section we will see how to convert it into a graph."
]
},
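{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a feel for the raw data, we can peek at the first few rows of the two Brightkite files with pandas. This is only a quick preview: the files have no header row, so the column names used below are just illustrative labels, not part of the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# preview the raw Brightkite files (no header row, tab-separated)\n",
"brightkite_dir = os.path.join('data', 'brightkite')\n",
"\n",
"friends_preview = pd.read_csv(os.path.join(brightkite_dir, 'brightkite_friends_edges.txt'),\n",
"                              sep='\\t', header=None, names=['user_a', 'user_b'], nrows=5)\n",
"checkins_preview = pd.read_csv(os.path.join(brightkite_dir, 'brightkite_checkins.txt'),\n",
"                               sep='\\t', header=None, names=['user', 'location'], nrows=5)\n",
"\n",
"print(friends_preview)\n",
"print(checkins_preview)"
]
},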
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Gowalla\n",
"\n",
"Gowalla is a location-based social networking website where users share their locations by checking-in. The friendship network is undirected and was collected using their public API. As for Brightkite, we will work with two different datasets:\n",
"\n",
"- `data/gowalla/gowalla_friends_edges.txt`: the friendship network, a tsv file with 2 columns of users ids\n",
"- `data/gowalla/gowalla_checkins.txt`: the checkins, a tsv file with 2 columns of user id and location. This is not in the form of a graph edge list, in the next section we will see how to convert it into a graph."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Foursquare\n",
"\n",
"[Foursquare](https://foursquare.com/) is a location-based social networking website where users share their locations by checking-in. This dataset includes long-term (about 10 months) check-in data in New York city and Tokyo collected from Foursquare from 12 April 2012 to 16 February 2013. It contains two files in tsv format. Each file contains 2 columns, which are:\n",
"\n",
"1. User ID (anonymized)\n",
"2. Venue ID (Foursquare)\n",
"\n",
"In this case, we don't have any information about the friendship network, so we will only work with the checkins."
]
},
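{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As with Brightkite, we can peek at the first few rows of one of the Foursquare files to see the two-column format. Again, the column names below are just illustrative labels, since the file has no header row."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# preview the Foursquare NYC check-ins (no header row, tab-separated)\n",
"foursquare_nyc = os.path.join('data', 'foursquare', 'foursquare_checkins_NYC.txt')\n",
"print(pd.read_csv(foursquare_nyc, sep='\\t', header=None, names=['user', 'venue'], nrows=5))"
]
},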
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Building the networks"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We are asked to construct the networks for the three datasets as un undirected graph $M = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. The nodes represent the users and the edges indicates that two individuals visited the same location at least once.\n",
"\n",
"And this is were the fun begins! The check-ins files of the three datasets are not in the form of a graph edge list, so we need to manipulate them. But those datasets are huge! Let's have a look at the number of lines of each file."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"gowalla\n",
"Number of lines: 6442892\n",
"Number of unique elements: 107092\n",
"\n",
"brightkite\n",
"Number of lines: 4747287\n",
"Number of unique elements: 51406\n",
"\n",
"foursquare\n",
"Number of lines: 227428\n",
"Number of unique elements: 1083\n",
"\n",
"foursquare\n",
"Number of lines: 573703\n",
"Number of unique elements: 2293\n",
"\n"
]
}
],
"source": [
"def count_lines_and_unique_elements(file):\n",
" df = pd.read_csv(file, sep='\\t', header=None)\n",
" print('Number of lines: ', len(df))\n",
" print('Number of unique elements: ', len(df[0].unique()))\n",
"\n",
"gowalla_path = os.path.join('data', 'gowalla', 'gowalla_checkins.txt')\n",
"brightkite_path = os.path.join('data', 'brightkite', 'brightkite_checkins.txt')\n",
"foursquareNYC_path = os.path.join('data', 'foursquare', 'foursquare_checkins_NYC.txt')\n",
"foursquareTKY_path = os.path.join('data', 'foursquare', 'foursquare_checkins_TKY.txt')\n",
"\n",
"_ = [gowalla_path, brightkite_path, foursquareNYC_path, foursquareTKY_path]\n",
"\n",
"for path in _:\n",
" print(path.split(os.sep)[-2])\n",
" count_lines_and_unique_elements(path)\n",
" print()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We would like to build a graph starting from an edge list. So the basic idea is to create a dictionary where the keys are the unique users and the values are the locations that they visited. Then, we can iterate over the dictionary and create the edges.\n",
"\n",
"But, even if we avoids repetitions, the time complexity will be $O(n^2)$, where $n$ is the number of users. And since $n$ is in the order of millions, doing this in python, where we have to build nested for loops, it's a no-go. We need to find a faster way to do this.\n",
"\n",
"In the `utils` module I provided anyway a function that does exactly this, but I do not raccomend to use it unless you have countless hours of time spare. It's called `create_checkicreate_checkins_graph_SLOW` and it takes a dataset name as input and returns a networkx graph object. \n",
"\n",
"SCRIVERE QUALCOSA RIGUARDO LA FUNZIONE IN C++\n",
"\n",
"The function will output a new .tsv file in the form of an edge list, in the `data` folder. Since the C++ program needs to be compiled, I have already created the edge lists for the four datasets, so you can skip this step if you want.\n",
"\n",
"Once that we have our edge list, we can build the graph using the function `checkins_graph_from_edges` from the `utils` module. It takes as input the name of the dataset and returns a networkx graph object. The options are\n",
"\n",
"- `brightkite`\n",
"- `gowalla`\n",
"- `foursquareNYC`\n",
"- `foursquareTKY`\n",
"\n",
"Let's see how it works:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"G_brighkite_checkins = checkins_graph_from_edges('brightkite')\n",
"G_brighkite_checkins.name = 'Brightkite Checkins Graph'\n",
"\n",
"G_gowalla_checkins = checkins_graph_from_edges('gowalla')\n",
"G_gowalla_checkins.name = 'Gowalla Checkins Graph'\n",
"\n",
"G_foursquareNYC_checkins = checkins_graph_from_edges('foursquareNYC')\n",
"G_foursquareNYC_checkins.name = 'Foursquare NYC Checkins Graph'\n",
"\n",
"G_foursquareTKY_checkins = checkins_graph_from_edges('foursquareTKY')\n",
"G_foursquareTKY_checkins.name = 'Foursquare TKY Checkins Graph'"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have our graphs, let's have a look at some basic information about them"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Brightkite Checkins Graph\n",
"Number of nodes: 44058\n",
"Number of edges: 106699\n",
"\n",
"Gowalla Checkins Graph\n",
"Number of nodes: 44058\n",
"Number of edges: 106699\n",
"\n",
"Foursquare NYC Checkins Graph\n",
"Number of nodes: 2293\n",
"Number of edges: 31261\n",
"\n",
"Foursquare TKY Checkins Graph\n",
"Number of nodes: 1078\n",
"Number of edges: 7273\n",
"\n"
]
}
],
"source": [
"for G in [G_brighkite_checkins, G_gowalla_checkins, G_foursquareNYC_checkins, G_foursquareTKY_checkins]:\n",
" print(G.name)\n",
" print('Number of nodes: ', G.number_of_nodes())\n",
" print('Number of edges: ', G.number_of_edges())\n",
" print()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Friendship network\n",
"\n",
"If we want to build the friendship network, fortunately for the gowalla and brightkite datasets we have the edge list, so we can just use the `read_edgelist` function from networkx. For the foursquare dataset, we don't have any information about the friendship of the users, so we will just create a graph with the checkins.\n",
"\n",
"To build the friendship network of the first two datasets, we can use the `create_friends_graph` function from the `utils` module. It takes a dataset name as input and returns a networkx graph object. The implementation is pretty straightforward, we just use the `from_pandas_edgelist` function from networkx."
]
},
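{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, here is a minimal sketch of how such a function could look, assuming the friendship files are two-column tsv edge lists as described above. The actual `friendships_graph` implementation in the `utils` module may differ in details, and the names used here are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# illustrative sketch only: the real implementation lives in the utils module\n",
"def friendships_graph_sketch(dataset):\n",
"    # dataset is either 'brightkite' or 'gowalla'\n",
"    path = os.path.join('data', dataset, dataset + '_friends_edges.txt')\n",
"    df = pd.read_csv(path, sep='\\t', header=None, names=['source', 'target'])\n",
"    return nx.from_pandas_edgelist(df, 'source', 'target', create_using=nx.Graph())\n",
"\n",
"# e.g. friendships_graph_sketch('brightkite').number_of_nodes()"
]
},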
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"G_brighkite_friends = friendships_graph('brightkite')\n",
"G_brighkite_friends.name = 'Brightkite Friendship Graph'\n",
"\n",
"G_gowalla_friends = friendships_graph('gowalla')\n",
"G_gowalla_friends.name = 'Gowalla Friendship Graph'"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have our graphs, let's have a look at some basic information about them"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Brightkite Friendship Graph\n",
"Number of nodes: 58228\n",
"Number of edges: 214078\n",
"\n",
"Gowalla Friendship Graph\n",
"Number of nodes: 196591\n",
"Number of edges: 950327\n",
"\n"
]
}
],
"source": [
"for G in [G_brighkite_friends, G_gowalla_friends]:\n",
" print(G.name)\n",
" print('Number of nodes: ', G.number_of_nodes())\n",
" print('Number of edges: ', G.number_of_edges())\n",
" print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Analysis of the structure of the networks"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.10.6 64-bit",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}