In [1]:
%reload_ext autoreload

import os
import zipfile
import wget
import networkx as nx
from main import *
import pandas as pd

# Discovering the datasets

To perform our analysis, we will use the following datasets:

- **Brightkite**
- **Gowalla**
- **Foursquare**

We can download the datasets using the function `download_datasets` from the `utils` module. It downloads the datasets into the `data` folder, organized in sub-folders as follows:

```
data/
├── brightkite
│   ├── loc-brightkite_edges.txt.gz
│   └── loc-brightkite_totalCheckins.txt.gz
├── foursquare
│   ├── dataset_ubicomp2013_checkins.txt
│   ├── dataset_ubicomp2013_tags.txt
│   └── dataset_ubicomp2013_tips.txt
└── gowalla
    ├── loc-gowalla_edges.txt.gz
    └── loc-gowalla_totalCheckins.txt.gz
```

If a dataset has already been downloaded, it will not be downloaded again. For further details about the function below, please refer to the `utils` module.

In [2]:
download_datasets()

The brightkite dataset is already downloaded and extracted as .txt file, if you want to download again the .gz file with this function, delete the .txt files in the folder
The gowalla dataset is already downloaded and extracted as .txt file, if you want to download again the .gz file with this function, delete the .txt files in the folder
Downloading foursquare dataset...
Download completed of foursquare dataset
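
The skip-if-present behaviour reported in the output above can be sketched as follows. This is not the code of `download_datasets`: `ensure_downloaded`, its `fetch` parameter and `fake_fetch` are hypothetical names, and the fetcher is injected so that the sketch runs without network access.

```python
import os
import tempfile


def ensure_downloaded(url, dest_path, fetch):
    """Download `url` to `dest_path` unless the file already exists.

    `fetch(url, dest_path)` is a hypothetical callable that performs the
    actual transfer (for example a thin wrapper around `wget.download`).
    Returns True if a download happened, False if it was skipped.
    """
    if os.path.exists(dest_path):
        return False
    # Create the dataset sub-folder (e.g. data/brightkite) if needed.
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    fetch(url, dest_path)
    return True


def fake_fetch(url, dest_path):
    # Stand-in for a real download: writes a placeholder file.
    with open(dest_path, "w") as handle:
        handle.write("placeholder for " + url)


with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "data", "brightkite", "edges.txt")
    first = ensure_downloaded("https://example.org/edges.txt", target, fake_fetch)
    second = ensure_downloaded("https://example.org/edges.txt", target, fake_fetch)
    print(first, second)  # True False
```

The second call returns `False` because the file written by the first call is found on disk, which mirrors the "already downloaded" messages above.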


Let's have a deeper look at them.

## Brightkite

[Brightkite](http://www.brightkite.com/) was once a location-based social networking service where users shared their locations by checking in. The friendship network was collected using the service's public API. The network was originally directed, but the authors of the dataset constructed an undirected network by keeping an edge only when the friendship holds in both directions. They also collected a total of `4491143` check-ins of these users over the period Apr. 2008 - Oct. 2010.
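
The directed-to-undirected conversion described above (keep an edge only when the friendship exists in both directions) can be illustrated with a minimal sketch. The function name `reciprocal_edges` and the toy edge list are assumptions made for illustration; they are not part of the dataset or of the `utils` module.

```python
def reciprocal_edges(directed_edges):
    """Return the undirected edges (u, v), u < v, present in both directions."""
    directed = set(directed_edges)
    return {
        (min(u, v), max(u, v))
        for (u, v) in directed
        if u != v and (v, u) in directed
    }


# Made-up toy friendship links, not taken from the Brightkite data.
toy_edges = [(0, 1), (1, 0), (1, 2), (2, 3), (3, 2), (0, 3)]
undirected = reciprocal_edges(toy_edges)
print(sorted(undirected))  # [(0, 1), (2, 3)]
```

With networkx, `DiGraph.to_undirected(reciprocal=True)` applies the same rule to a directed graph object.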

Here is an example of the check-in information:

In [None]:
# Load the tab-separated check-in file; it has no header row, so we name the columns.
Brightkite_df = pd.read_csv(
    "data/brightkite/loc-brightkite_totalCheckins.txt.gz",
    sep="\t",
    header=None,
    compression="gzip",
    names=["user", "check-in time", "latitude", "longitude", "location_id"],
)

Brightkite_df.head()

## Gowalla

Gowalla is a location-based social networking website where users share their locations by checking-in. The friendship network is undirected and was collected using their public API. The authors have collected a total of `6442890` check-ins of these users over the period of Feb. 2009 - Oct. 2010.

Here is an example of the check-in information:

In [None]:
# Load the tab-separated check-in file; it has no header row, so we name the columns.
Gowalla_df = pd.read_csv(
    "data/gowalla/loc-gowalla_totalCheckins.txt.gz",
    sep="\t",
    header=None,
    compression="gzip",
    names=["user", "check-in time", "latitude", "longitude", "location_id"],
)

Gowalla_df.head() 

## Foursquare

TO BE REWRITTEN

In [None]:
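# Free memory: these DataFrames were created only for display in the notebook.
# Foursquare_checks_df is not defined in the cells above, so missing names are
# skipped instead of raising a NameError.
for name in ("Brightkite_df", "Gowalla_df", "Foursquare_checks_df"):
    globals().pop(name, None)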

# Building the networks

We are asked to construct the network for each of the three datasets as an undirected graph $M = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. The nodes represent the users, and an edge indicates that two individuals visited the same location at least once.
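
To make the definition of $E$ concrete, here is a minimal sketch that derives co-visit edges from `(user, location_id)` pairs. It is an illustration under stated assumptions, not the implementation used in this notebook: `co_visit_edges` and the toy check-ins are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def co_visit_edges(checkins):
    """Build the edge set E from (user, location_id) pairs.

    Two users are linked if they share at least one location.
    """
    visitors = defaultdict(set)
    for user, location in checkins:
        visitors[location].add(user)

    edges = set()
    for users in visitors.values():
        # Every pair of distinct visitors of the same location gets one edge.
        for u, v in combinations(sorted(users), 2):
            edges.add((u, v))
    return edges


# Made-up toy check-ins, not taken from any of the three datasets.
toy_checkins = [
    ("alice", "cafe"),
    ("bob", "cafe"),
    ("bob", "park"),
    ("carol", "park"),
    ("alice", "cafe"),   # repeated visit, must not create a duplicate edge
    ("dave", "museum"),  # no co-visitor, so dave stays isolated
]
edges = co_visit_edges(toy_checkins)
nodes = {user for user, _ in toy_checkins}
print(sorted(edges))  # [('alice', 'bob'), ('bob', 'carol')]
```

The node and edge sets can then be loaded into a `networkx.Graph` with `add_nodes_from` and `add_edges_from`. Note that the pairwise step grows quadratically with the number of visitors of a location, which matters for very popular places.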

We can use the function `create_graph` from the `utils` module to create the networks. It builds the network from the dataset's edge list file and returns a networkx graph object. For further details about the function below, please refer to the `utils` module.

In [3]:
Brightkite_G = create_graph("brightkite")
Gowalla_G = create_graph("gowalla")
Foursquare_G = create_foursquare_graph("NYC")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3: unexpected end of data

Now we can have a look at the number of nodes and edges in each network.

In [None]:
print("Brightkite graph has {} nodes and {} edges".format(Brightkite_G.number_of_nodes(), Brightkite_G.number_of_edges()))

print("Gowalla graph has {} nodes and {} edges".format(Gowalla_G.number_of_nodes(), Gowalla_G.number_of_edges()))