In [1]:
%reload_ext autoreload

import os
import zipfile
import wget
import networkx as nx
from main import *
import pandas as pd

# Discovering the datasets

To perform our analysis, we will use the following datasets:

- **Brightkite**
- **Gowalla**
- **Foursquare**

We can download the datasets using the function `download_datasets` from the `utils` module. It downloads them into the `data` folder, organized in sub-folders as follows:

```
├── brightkite
│   ├── brightkite_checkins.txt
│   └── brightkite_friends_edges.txt
├── foursquare
│   ├── foursquare_checkins_NYC.txt
│   └── foursquare_checkins_TKY.txt
└── gowalla
    ├── gowalla_checkins.txt
    └── gowalla_friends_edges.txt
```

If any of the datasets is already downloaded, it will not be downloaded again. For further details about the function below, please refer to the `utils` module.
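
As a rough illustration of the "skip if already downloaded" behaviour, here is a minimal sketch. The helper name `fetch_if_missing` and its parameters are hypothetical and are not taken from the `utils` module; the actual `download_datasets` may work differently (for instance, it may also unzip archives).

```python
import os
import urllib.request


def fetch_if_missing(url, dest, fetch=urllib.request.urlretrieve):
    """Download `url` to `dest` unless `dest` already exists.

    Hypothetical helper: returns True if a download happened,
    False if the file was already there and the download was skipped.
    """
    if os.path.exists(dest):
        return False
    # Create the parent folder (e.g. data/brightkite) if it does not exist yet.
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    fetch(url, dest)
    return True
```

Calling the helper twice with the same destination only triggers one download; the second call returns `False` immediately.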

> NOTE: the Stanford servers tend to be slow, so downloading all the datasets may take a while, typically about 2 to 3 minutes.

In [2]:
download_datasets()

Let's have a deeper look at them.

## Brightkite

[Brightkite](http://www.brightkite.com/) was once a location-based social networking service provider where users shared their locations by checking-in. The friendship network was collected using their public API. We will work with two different datasets:

- `data/brightkite/brightkite_friends_edges.txt`: the friendship network, a TSV file with 2 columns of user IDs
- `data/brightkite/brightkite_checkins.txt`: the check-ins, a TSV file with 2 columns, user ID and location. This is not in the form of a graph edge list; in the next section we will see how to convert it into a graph.

## Gowalla

Gowalla is a location-based social networking website where users share their locations by checking-in. The friendship network is undirected and was collected using their public API. As for Brightkite, we will work with two different datasets:

- `data/gowalla/gowalla_friends_edges.txt`: the friendship network, a TSV file with 2 columns of user IDs
- `data/gowalla/gowalla_checkins.txt`: the check-ins, a TSV file with 2 columns, user ID and location. This is not in the form of a graph edge list; in the next section we will see how to convert it into a graph.

## Foursquare

[Foursquare](https://foursquare.com/) is a location-based social networking website where users share their locations by checking in. This dataset includes long-term (about 10 months) check-in data in New York City and Tokyo, collected from Foursquare from 12 April 2012 to 16 February 2013. It contains two files in TSV format, one per city. Each file contains 2 columns, which are:

1. User ID (anonymized)
2. Venue ID (Foursquare)

In this case, we don't have any information about the friendship network, so we will only work with the checkins.
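
Since all the check-in files share the same two-column layout, they can be inspected in the same way. The sketch below uses a tiny in-memory stand-in instead of a real file; the column names `user_id` and `venue_id` are my own labels (the files have no header row), so treat them as assumptions.

```python
import io

import pandas as pd

# Tiny stand-in for a check-in file: user id <TAB> venue id, no header row.
sample = io.StringIO("0\tA\n0\tB\n1\tA\n2\tC\n")

# The column names are illustrative labels, not part of the original files.
checkins = pd.read_csv(sample, sep="\t", header=None, names=["user_id", "venue_id"])

n_users = checkins["user_id"].nunique()
n_venues = checkins["venue_id"].nunique()
print(n_users, n_venues)
```

To try it on a real dataset, replace `sample` with a path such as `data/foursquare/foursquare_checkins_NYC.txt`.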

# Building the networks

We are asked to construct the networks for the three datasets as an undirected graph $M = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. The nodes represent the users, and an edge indicates that two individuals visited the same location at least once.

And this is where the fun begins! The check-in files of the three datasets are not in the form of a graph edge list, so we need to manipulate them. But those datasets are huge! Let's have a look at the number of lines in each file.

In [16]:
def count_lines_and_unique_elements(file):
    # The check-in files are tab-separated with no header row;
    # column 0 holds the user id.
    df = pd.read_csv(file, sep='\t', header=None)
    print('Number of lines: ', len(df))
    print('Number of unique elements: ', len(df[0].unique()))

gowalla_path = os.path.join('data', 'gowalla', 'gowalla_checkins.txt')
brightkite_path = os.path.join('data', 'brightkite', 'brightkite_checkins.txt')
foursquareNYC_path = os.path.join('data', 'foursquare', 'foursquare_checkins_NYC.txt')
foursquareTKY_path = os.path.join('data', 'foursquare', 'foursquare_checkins_TKY.txt')

checkin_paths = [gowalla_path, brightkite_path, foursquareNYC_path, foursquareTKY_path]

for path in checkin_paths:
    # The parent folder name identifies the dataset.
    print(path.split(os.sep)[-2])
    count_lines_and_unique_elements(path)
    print()

gowalla
Number of lines:  6442892
Number of unique elements:  107092

brightkite
Number of lines:  4747287
Number of unique elements:  51406

foursquare
Number of lines:  227428
Number of unique elements:  1083

foursquare
Number of lines:  573703
Number of unique elements:  2293



We would like to build a graph starting from an edge list. So the basic idea is to create a dictionary where the keys are the unique users and the values are the locations that they visited. Then, we can iterate over the dictionary and create the edges.
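
To make the idea concrete, here is a minimal sketch of the dictionary-based approach on a toy list of check-ins. The function name `checkins_graph_naive` is made up for this example and is not the function shipped in `utils`; keeping users without any shared location as isolated nodes is also my own choice.

```python
from itertools import combinations

import networkx as nx


def checkins_graph_naive(checkins):
    """Build the graph from an iterable of (user, location) pairs."""
    # user -> set of locations visited by that user
    visited = {}
    for user, location in checkins:
        visited.setdefault(user, set()).add(location)

    G = nx.Graph()
    # Illustrative choice: keep every user as a node, even if isolated.
    G.add_nodes_from(visited)
    # Compare every pair of users once and link them if they share a location.
    for u, v in combinations(visited, 2):
        if visited[u] & visited[v]:
            G.add_edge(u, v)
    return G


toy = [(0, "A"), (0, "B"), (1, "A"), (2, "B"), (3, "C")]
G_toy = checkins_graph_naive(toy)
print(G_toy.number_of_nodes(), G_toy.number_of_edges())
```

In the toy data, user 0 shares location A with user 1 and location B with user 2, while user 3 shares nothing with anyone.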

But even if we avoid repetitions, the time complexity is $O(n^2)$, where $n$ is the number of users. With more than 100,000 users and millions of check-ins in the largest dataset, doing this in Python with nested for loops is a no-go. We need to find a faster way to do this.

In the `utils` module I nevertheless provide a function that implements this naive approach, but I do not recommend using it unless you have countless hours to spare. It is called `create_checkins_graph_SLOW`; it takes a dataset name as input and returns a networkx graph object.
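
One common way to cut down the work, sketched here in Python, is to group users by location and only pair up users who appear at the same place. This is an illustration of the general idea, not a description of the C++ program used in this project, and the function name `checkins_graph_by_location` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

import networkx as nx


def checkins_graph_by_location(checkins):
    """Build the graph from an iterable of (user, location) pairs."""
    # location -> set of users who checked in there
    visitors = defaultdict(set)
    for user, location in checkins:
        visitors[location].add(user)

    G = nx.Graph()
    for users in visitors.values():
        # Only users sharing this location are paired; nx.Graph ignores
        # duplicate edges when two users share several locations.
        G.add_edges_from(combinations(users, 2))
    return G


toy = [(0, "A"), (0, "B"), (1, "A"), (2, "B"), (3, "C")]
G_toy = checkins_graph_by_location(toy)
print(G_toy.number_of_nodes(), G_toy.number_of_edges())
```

Note that this variant never creates nodes for users who share no location with anyone, and a single very popular location can still produce a quadratic number of pairs.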

TODO: describe the C++ function that generates the edge lists.

The function will output a new .tsv file in the form of an edge list, in the `data` folder. Since the C++ program needs to be compiled, I have already created the edge lists for the four datasets, so you can skip this step if you want.

Once we have our edge list, we can build the graph using the function `checkins_graph_from_edges` from the `utils` module. It takes the name of the dataset as input and returns a networkx graph object. The options are:

- `brightkite`
- `gowalla`
- `foursquareNYC`
- `foursquareTKY`

Let's see how it works:

In [11]:
G_brighkite_checkins = checkins_graph_from_edges('brightkite')
G_brighkite_checkins.name = 'Brightkite Checkins Graph'

G_gowalla_checkins = checkins_graph_from_edges('gowalla')
G_gowalla_checkins.name = 'Gowalla Checkins Graph'

G_foursquareNYC_checkins = checkins_graph_from_edges('foursquareNYC')
G_foursquareNYC_checkins.name = 'Foursquare NYC Checkins Graph'

G_foursquareTKY_checkins = checkins_graph_from_edges('foursquareTKY')
G_foursquareTKY_checkins.name = 'Foursquare TKY Checkins Graph'
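
For readers curious about what loading an edge list into networkx involves, here is a minimal sketch on a toy file. It uses `nx.read_edgelist` and assumes a tab-separated file with integer node ids; the file name `toy_edges.tsv` is invented, and the real `checkins_graph_from_edges` may be implemented differently.

```python
import os
import tempfile

import networkx as nx

with tempfile.TemporaryDirectory() as tmp:
    # Write a tiny tab-separated edge list to a temporary file.
    path = os.path.join(tmp, "toy_edges.tsv")
    with open(path, "w") as f:
        f.write("0\t1\n0\t2\n1\t2\n2\t3\n")

    # Each line becomes one undirected edge; node labels are parsed as ints.
    G_toy = nx.read_edgelist(path, delimiter="\t", nodetype=int)
    G_toy.name = "Toy Checkins Graph"

print(G_toy.name, G_toy.number_of_nodes(), G_toy.number_of_edges())
```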

Now that we have our graphs, let's have a look at some basic information about them:

In [10]:
for G in [G_brighkite_checkins, G_gowalla_checkins, G_foursquareNYC_checkins, G_foursquareTKY_checkins]:
    print(G.name)
    print('Number of nodes: ', G.number_of_nodes())
    print('Number of edges: ', G.number_of_edges())
    print()

Brightkite Checkins Graph
Number of nodes:  44058
Number of edges:  106699

Gowalla Checkins Graph
Number of nodes:  44058
Number of edges:  106699

Foursquare NYC Checkins Graph
Number of nodes:  2293
Number of edges:  31261

Foursquare TKY Checkins Graph
Number of nodes:  1078
Number of edges:  7273



## Friendship network

For the Gowalla and Brightkite datasets, the friendship network is fortunately already available as an edge list, so it can be loaded directly, for example with the `read_edgelist` function from networkx. For the Foursquare dataset, we don't have any information about friendships between users, so we will only build the check-in graph.

To build the friendship network of the first two datasets, we can use the `friendships_graph` function from the `utils` module. It takes a dataset name as input and returns a networkx graph object. The implementation is straightforward: it just uses the `from_pandas_edgelist` function from networkx.
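
As a rough sketch of that approach, the snippet below builds a friendship graph from a toy edge list with `from_pandas_edgelist`. The column names `user_a` and `user_b` are assumptions made for the example, and the actual `friendships_graph` code in `utils` may differ in its details.

```python
import io

import networkx as nx
import pandas as pd

# Stand-in for a friendship file: two tab-separated user ids per line.
toy_friends = io.StringIO("0\t1\n1\t0\n0\t2\n")

# Illustrative column names; the original files have no header row.
df = pd.read_csv(toy_friends, sep="\t", header=None, names=["user_a", "user_b"])
G_toy_friends = nx.from_pandas_edgelist(df, source="user_a", target="user_b")

# nx.Graph is undirected, so (0, 1) and (1, 0) collapse into a single edge.
print(G_toy_friends.number_of_nodes(), G_toy_friends.number_of_edges())
```

Because the default graph type is undirected, a file that lists a friendship in both directions still yields one edge per pair of friends.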

In [7]:
G_brighkite_friends = friendships_graph('brightkite')
G_brighkite_friends.name = 'Brightkite Friendship Graph'

G_gowalla_friends = friendships_graph('gowalla')
G_gowalla_friends.name = 'Gowalla Friendship Graph'

Now that we have our graphs, let's have a look at some basic information about them:

In [8]:
for G in [G_brighkite_friends, G_gowalla_friends]:
    print(G.name)
    print('Number of nodes: ', G.number_of_nodes())
    print('Number of edges: ', G.number_of_edges())
    print()

Brightkite Friendship Graph
Number of nodes:  58228
Number of edges:  214078

Gowalla Friendship Graph
Number of nodes:  196591
Number of edges:  950327



# Analysis of the structure of the networks