In [2]:
%reload_ext autoreload

import os
import zipfile
import wget
import networkx as nx
from main import *
import pandas as pd

# Discovering the datasets

To perform our analysis, we will use the following datasets:

- **Brightkite**
- **Gowalla**
- **Foursquare**

We can download the datasets using the function `download_datasets` from the `utils` module. It downloads the datasets into the `data` folder, organized in sub-folders as follows:

```
data/
├── brightkite
│   ├── loc-brightkite_edges.txt.gz
│   └── loc-brightkite_totalCheckins.txt.gz
├── foursquare
│   ├── dataset_ubicomp2013_checkins.txt
│   ├── dataset_ubicomp2013_tags.txt
│   └── dataset_ubicomp2013_tips.txt
└── gowalla
    ├── loc-gowalla_edges.txt.gz
    └── loc-gowalla_totalCheckins.txt.gz
```

If a dataset has already been downloaded, it will not be downloaded again. For further details about the function below, please refer to the `utils` module.

In [2]:
download_datasets()

The brightkite dataset is already downloaded and extracted as .txt file, if you want to download again the .gz file with this function, delete the .txt files in the folder
The gowalla dataset is already downloaded and extracted as .txt file, if you want to download again the .gz file with this function, delete the .txt files in the folder
The foursquare dataset is already downloaded and extracted as .txt file, if you want to download again the .gz file with this function, delete the .txt files in the folder
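
The skip-if-present behaviour described above could be sketched as follows. This is only a minimal illustration, not the actual code of `download_datasets`: the helper name `extract_if_missing` is hypothetical, and it assumes the archive is a plain `.gz` file that should be extracted next to itself.

```python
import gzip
import os
import shutil


def extract_if_missing(gz_path):
    """Extract a .gz archive next to itself unless the extracted file exists.

    Hypothetical helper (not part of the `utils` module).
    Returns True if the archive was extracted, False if it was skipped.
    """
    if not gz_path.endswith(".gz"):
        raise ValueError("expected a path ending in .gz")

    # "loc-brightkite_edges.txt.gz" -> "loc-brightkite_edges.txt"
    txt_path = gz_path[: -len(".gz")]

    # Skip the work when the extracted file is already there
    if os.path.exists(txt_path):
        return False

    with gzip.open(gz_path, "rb") as src, open(txt_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return True
```

Calling the helper twice on the same archive extracts it the first time and does nothing the second time, which mirrors the messages printed above.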


Let's have a deeper look at them.

## Brightkite

[Brightkite](http://www.brightkite.com/) was once a location-based social networking service provider where users shared their locations by checking-in. The friendship network was collected using their public API. The network was originally directed, but the authors of the dataset constructed an undirected network, keeping an edge only when the friendship holds in both directions. They have also collected a total of `4491143` check-ins of these users over the period of Apr. 2008 - Oct. 2010.
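
The reciprocity rule described above can be illustrated with a short sketch. It is not the dataset authors' code: the function name `reciprocal_edges` and the toy edge list are assumptions made for this example.

```python
def reciprocal_edges(directed_edges):
    """Return the undirected edges {u, v} for which both u->v and v->u exist.

    Each undirected edge is reported once, as a tuple (u, v) with u < v.
    """
    directed = set(directed_edges)
    return {(u, v) for (u, v) in directed if u < v and (v, u) in directed}


# Toy directed friendship list: 0<->1 and 2<->3 are mutual, 1->2 is one-way
toy_edges = [(0, 1), (1, 0), (1, 2), (2, 3), (3, 2)]
mutual = reciprocal_edges(toy_edges)
```

With networkx, a comparable result can be obtained from a `DiGraph` by calling `to_undirected(reciprocal=True)`.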

Here is an example of check-in information

In [3]:
brightkite_path = os.path.join("data", "brightkite", "loc-brightkite_totalCheckins.txt")
Brightkite_df = pd.read_csv(brightkite_path, sep="\t", header=None, names=["user", "check-in time", "latitude", "longitude", "location_id"])

Brightkite_df.head()

| | user | check-in time | latitude | longitude | location_id |
|---|---|---|---|---|---|
| 0 | 0 | 2010-10-17T01:48:53Z | 39.747652 | -104.99251 | 88c46bf20db295831bd2d1718ad7e6f5 |
| 1 | 0 | 2010-10-16T06:02:04Z | 39.891383 | -105.070814 | 7a0f88982aa015062b95e3b4843f9ca2 |
| 2 | 0 | 2010-10-16T03:48:54Z | 39.891077 | -105.068532 | dd7cd3d264c2d063832db506fba8bf79 |
| 3 | 0 | 2010-10-14T18:25:51Z | 39.750469 | -104.999073 | 9848afcc62e500a01cf6fbf24b797732f8963683 |
| 4 | 0 | 2010-10-14T00:21:47Z | 39.752713 | -104.996337 | 2ef143e12038c870038df53e0478cefc |


## Gowalla

Gowalla is a location-based social networking website where users share their locations by checking-in. The friendship network is undirected and was collected using their public API. The authors have collected a total of `6442890` check-ins of these users over the period of Feb. 2009 - Oct. 2010.

Here is an example of check-in information

In [4]:
gowalla_path = os.path.join("data", "gowalla", "loc-gowalla_totalCheckins.txt")

Gowalla_df = pd.read_csv(gowalla_path, sep="\t", header=None, names=["user", "check-in time", "latitude", "longitude", "location_id"])

Gowalla_df.head() 

| | user | check-in time | latitude | longitude | location_id |
|---|---|---|---|---|---|
| 0 | 0 | 2010-10-19T23:55:27Z | 30.235909 | -97.79514 | 22847 |
| 1 | 0 | 2010-10-18T22:17:43Z | 30.269103 | -97.749395 | 420315 |
| 2 | 0 | 2010-10-17T23:42:03Z | 30.255731 | -97.763386 | 316637 |
| 3 | 0 | 2010-10-17T19:26:05Z | 30.263418 | -97.757597 | 16516 |
| 4 | 0 | 2010-10-16T18:50:42Z | 30.274292 | -97.740523 | 5535878 |


## Foursquare

[Foursquare](https://foursquare.com/) is a location-based social networking website where users share their locations by checking-in. This dataset includes long-term (about 10 months) check-in data for New York City and Tokyo, collected from Foursquare from 12 April 2012 to 16 February 2013. It contains two files in TSV format, one per city. Each file contains 8 columns:

1. User ID (anonymized)
2. Venue ID (Foursquare)
3. Venue category ID (Foursquare)
4. Venue category name (Foursquare)
5. Latitude
6. Longitude
7. Timezone offset in minutes (The offset in minutes between when this check-in occurred and the same time in UTC)
8. UTC time

Here is an example of check-in information from the New York dataset:

In [5]:
foursquare_NYC_path = os.path.join("data", "foursquare", "dataset_TSMC2014_NYC.txt")
foursquare_TKY_path = os.path.join("data", "foursquare", "dataset_TSMC2014_TKY.txt")

foursquare_NYC_df = pd.read_csv(foursquare_NYC_path, sep="\t", header=None, names=["UserID", "VenueID", "CategoryID", "CategoryName", "Latitude", "Longitude", "Timezone offset in minutes", "UTC time"], encoding="utf-8", encoding_errors="ignore")

foursquare_NYC_df.head()

| | UserID | VenueID | CategoryID | CategoryName | Latitude | Longitude | Timezone offset in minutes | UTC time |
|---|---|---|---|---|---|---|---|---|
| 0 | 470 | 49bbd6c0f964a520f4531fe3 | 4bf58dd8d48988d127951735 | Arts & Crafts Store | 40.71981 | -74.002581 | -240 | Tue Apr 03 18:00:09 +0000 2012 |
| 1 | 979 | 4a43c0aef964a520c6a61fe3 | 4bf58dd8d48988d1df941735 | Bridge | 40.6068 | -74.04417 | -240 | Tue Apr 03 18:00:25 +0000 2012 |
| 2 | 69 | 4c5cc7b485a1e21e00d35711 | 4bf58dd8d48988d103941735 | Home (private) | 40.716162 | -73.88307 | -240 | Tue Apr 03 18:02:24 +0000 2012 |
| 3 | 395 | 4bc7086715a7ef3bef9878da | 4bf58dd8d48988d104941735 | Medical Center | 40.745164 | -73.982519 | -240 | Tue Apr 03 18:02:41 +0000 2012 |
| 4 | 87 | 4cf2c5321d18a143951b5cec | 4bf58dd8d48988d1cb941735 | Food Truck | 40.740104 | -73.989658 | -240 | Tue Apr 03 18:03:00 +0000 2012 |
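
As a small worked example of how the last two columns relate, the local time of a check-in can be recovered by adding the timezone offset to the UTC time. The sketch below uses only the standard library. The function name `local_checkin_time` is hypothetical, and the format string is an assumption based on the sample rows shown above.

```python
from datetime import datetime, timedelta


def local_checkin_time(utc_time, offset_minutes):
    """Convert a 'UTC time' string plus an offset in minutes to local wall-clock time."""
    # Example input: "Tue Apr 03 18:00:09 +0000 2012"
    utc = datetime.strptime(utc_time, "%a %b %d %H:%M:%S %z %Y")
    # Drop the UTC marker, then shift by the offset to get a naive local time
    return utc.replace(tzinfo=None) + timedelta(minutes=offset_minutes)


# First row of the New York sample: an offset of -240 minutes is UTC-4
local = local_checkin_time("Tue Apr 03 18:00:09 +0000 2012", -240)
```

For the first row of the sample, this places the check-in at 14:00:09 local time on 3 April 2012.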


In [6]:
# Free memory: these DataFrames were only loaded to preview the data in the notebook

del Brightkite_df
del Gowalla_df
del foursquare_NYC_df

# Building the networks

We are asked to construct the networks for the three datasets as an undirected graph $M = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. The nodes represent the users, and an edge indicates that two individuals visited the same location at least once.

We can use the function `create_checkins_graph` from the `utils` module to create the networks. It takes the name of a dataset as input and returns a networkx graph object. For further details about the function below, please refer to the `utils` module.
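
To make the edge definition concrete, here is a minimal sketch of how such co-location edges could be derived from `(user, location_id)` pairs. It is not the implementation of the function in the `utils` module: the name `colocation_edges` and the toy check-ins are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def colocation_edges(checkins):
    """Return the set of user pairs that share at least one visited location.

    `checkins` is an iterable of (user, location_id) pairs.
    Each pair of users appears once, as a tuple (u, v) with u < v.
    """
    # Group the distinct visitors of every location
    visitors = defaultdict(set)
    for user, location in checkins:
        visitors[location].add(user)

    # Connect every pair of users seen at the same location
    edges = set()
    for users in visitors.values():
        edges.update(combinations(sorted(users), 2))
    return edges


# Toy data: users 0 and 1 share location "a", users 1 and 2 share location "b"
toy_checkins = [(0, "a"), (1, "a"), (0, "a"), (1, "b"), (2, "b"), (3, "c")]
toy_edges = colocation_edges(toy_checkins)
```

The resulting pairs could then be passed to `nx.Graph.add_edges_from`. Note that a location with $k$ distinct visitors contributes $k(k-1)/2$ candidate pairs, so very popular locations dominate the running time.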

In [4]:
Brightkite_G = create_checkins_graph("brightkite")
Gowalla_G = create_checkins_graph("gowalla")
Foursquare_G = create_checkins_graph("foursquareNYC")

Number of nodes added to the graph brightkite: 51406


KeyboardInterrupt: 

Now we can have a look at the number of nodes and edges in each network.

In [8]:
dataset = ["brightkite", "gowalla", "foursquare"]
nodes = [len(Brightkite_G.nodes()), len(Gowalla_G.nodes()), len(Foursquare_G.nodes())]
edges = [len(Brightkite_G.edges()), len(Gowalla_G.edges()), len(Foursquare_G.edges())]

df = pd.DataFrame({"dataset": dataset, "nodes": nodes, "edges": edges})
df

Unnamed: 0,dataset,nodes,edges
0,brightkite,58228,214078
1,gowalla,196591,950327
2,foursquare,1083,282405


As we can see, the Foursquare dataset has a very small number of nodes. Even though it has 227428 check-ins, there are only 1083 unique users (the nodes). The Tokyo dataset is about twice as large, with 537703 check-ins and 2294 nodes. Since the two are of the same order of magnitude, we will focus on the New York dataset, in the style of a classic Hollywood movie about alien invasions.

# Analysis of the structure of the networks

In [3]:
path = "data/brightkite/loc-brightkite_totalCheckins.txt"

def modify_file(path):
    """Keep only the first and last columns (user, location_id) and write them to test.txt."""
    df = pd.read_csv(path, sep="\t", header=None, names=["user", "check-in time", "latitude", "longitude", "location_id"])
    # Select the first and last columns by position
    df = df.iloc[:, [0, 4]]
    # Write a tab-separated file without header row or index column
    df.to_csv("test.txt", sep="\t", header=False, index=False)

modify_file(path)