small-worlds/testing.ipynb

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "%reload_ext autoreload\n",
    "\n",
    "import os\n",
    "import zipfile\n",
    "import wget\n",
    "import networkx as nx\n",
    "from main import *\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Discovering the datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To perform our analysis, we will use the following datasets:\n",
    "\n",
    "- **Brightkite**\n",
    "- **Gowalla**\n",
    "- **Foursquare**\n",
    "\n",
    "We can download the datasets using the function `download_datasets` from the `main` module. It stores them in the `data` folder, organized in sub-folders as follows:\n",
    "\n",
    "```\n",
    "data/\n",
    "├── brightkite\n",
    "│   ├── loc-brightkite_edges.txt.gz\n",
    "│   └── loc-brightkite_totalCheckins.txt.gz\n",
    "├── foursquare\n",
    "│   ├── dataset_ubicomp2013_checkins.txt\n",
    "│   ├── dataset_ubicomp2013_tags.txt\n",
    "│   └── dataset_ubicomp2013_tips.txt\n",
    "└── gowalla\n",
    "    ├── loc-gowalla_edges.txt.gz\n",
    "    └── loc-gowalla_totalCheckins.txt.gz\n",
    "```\n",
    "\n",
    "If a dataset has already been downloaded, it will not be downloaded again. For further details about the function below, please refer to the `main` module."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The brightkite dataset is already downloaded and extracted as .txt file, if you want to download again the .gz file with this function, delete the .txt files in the folder\n",
      "The gowalla dataset is already downloaded and extracted as .txt file, if you want to download again the .gz file with this function, delete the .txt files in the folder\n",
      "The foursquare dataset is already downloaded and extracted as .txt file, if you want to download again the .gz file with this function, delete the .txt files in the folder\n"
     ]
    }
   ],
   "source": [
    "download_datasets()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's have a deeper look at them.\n",
    "\n",
    "## Brightkite\n",
    "\n",
    "[Brightkite](http://www.brightkite.com/) was once a location-based social networking service where users shared their locations by checking in. The friendship network was collected using the service's public API. The network was originally directed, but the authors of the dataset built an undirected network by keeping an edge only when the friendship holds in both directions. They also collected a total of `4491143` check-ins from these users over the period Apr. 2008 - Oct. 2010.\n",
    "\n",
    "Here is an example of the check-in information:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user</th>\n",
       "      <th>check-in time</th>\n",
       "      <th>latitude</th>\n",
       "      <th>longitude</th>\n",
       "      <th>location_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>2010-10-17T01:48:53Z</td>\n",
       "      <td>39.747652</td>\n",
       "      <td>-104.992510</td>\n",
       "      <td>88c46bf20db295831bd2d1718ad7e6f5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>2010-10-16T06:02:04Z</td>\n",
       "      <td>39.891383</td>\n",
       "      <td>-105.070814</td>\n",
       "      <td>7a0f88982aa015062b95e3b4843f9ca2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>2010-10-16T03:48:54Z</td>\n",
       "      <td>39.891077</td>\n",
       "      <td>-105.068532</td>\n",
       "      <td>dd7cd3d264c2d063832db506fba8bf79</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>2010-10-14T18:25:51Z</td>\n",
       "      <td>39.750469</td>\n",
       "      <td>-104.999073</td>\n",
       "      <td>9848afcc62e500a01cf6fbf24b797732f8963683</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>2010-10-14T00:21:47Z</td>\n",
       "      <td>39.752713</td>\n",
       "      <td>-104.996337</td>\n",
       "      <td>2ef143e12038c870038df53e0478cefc</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   user         check-in time   latitude   longitude  \\\n",
       "0     0  2010-10-17T01:48:53Z  39.747652 -104.992510   \n",
       "1     0  2010-10-16T06:02:04Z  39.891383 -105.070814   \n",
       "2     0  2010-10-16T03:48:54Z  39.891077 -105.068532   \n",
       "3     0  2010-10-14T18:25:51Z  39.750469 -104.999073   \n",
       "4     0  2010-10-14T00:21:47Z  39.752713 -104.996337   \n",
       "\n",
       "                                location_id  \n",
       "0          88c46bf20db295831bd2d1718ad7e6f5  \n",
       "1          7a0f88982aa015062b95e3b4843f9ca2  \n",
       "2          dd7cd3d264c2d063832db506fba8bf79  \n",
       "3  9848afcc62e500a01cf6fbf24b797732f8963683  \n",
       "4          2ef143e12038c870038df53e0478cefc  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "brightkite_path = os.path.join(\"data\", \"brightkite\", \"loc-brightkite_totalCheckins.txt\")\n",
    "Brightkite_df = pd.read_csv(brightkite_path, sep=\"\\t\", header=None, names=[\"user\", \"check-in time\", \"latitude\", \"longitude\", \"location_id\"])\n",
    "\n",
    "Brightkite_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Gowalla\n",
    "\n",
    "Gowalla is a location-based social networking website where users share their locations by checking in. The friendship network is undirected and was collected using the website's public API. The authors collected a total of `6442890` check-ins from these users over the period Feb. 2009 - Oct. 2010.\n",
    "\n",
    "Here is an example of the check-in information:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user</th>\n",
       "      <th>check-in time</th>\n",
       "      <th>latitude</th>\n",
       "      <th>longitude</th>\n",
       "      <th>location_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>2010-10-19T23:55:27Z</td>\n",
       "      <td>30.235909</td>\n",
       "      <td>-97.795140</td>\n",
       "      <td>22847</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>2010-10-18T22:17:43Z</td>\n",
       "      <td>30.269103</td>\n",
       "      <td>-97.749395</td>\n",
       "      <td>420315</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>2010-10-17T23:42:03Z</td>\n",
       "      <td>30.255731</td>\n",
       "      <td>-97.763386</td>\n",
       "      <td>316637</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>2010-10-17T19:26:05Z</td>\n",
       "      <td>30.263418</td>\n",
       "      <td>-97.757597</td>\n",
       "      <td>16516</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>2010-10-16T18:50:42Z</td>\n",
       "      <td>30.274292</td>\n",
       "      <td>-97.740523</td>\n",
       "      <td>5535878</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   user         check-in time   latitude  longitude  location_id\n",
       "0     0  2010-10-19T23:55:27Z  30.235909 -97.795140        22847\n",
       "1     0  2010-10-18T22:17:43Z  30.269103 -97.749395       420315\n",
       "2     0  2010-10-17T23:42:03Z  30.255731 -97.763386       316637\n",
       "3     0  2010-10-17T19:26:05Z  30.263418 -97.757597        16516\n",
       "4     0  2010-10-16T18:50:42Z  30.274292 -97.740523      5535878"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gowalla_path = os.path.join(\"data\", \"gowalla\", \"loc-gowalla_totalCheckins.txt\")\n",
    "\n",
    "Gowalla_df = pd.read_csv(gowalla_path, sep=\"\\t\", header=None, names=[\"user\", \"check-in time\", \"latitude\", \"longitude\", \"location_id\"])\n",
    "\n",
    "Gowalla_df.head() "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Foursquare\n",
    "\n",
    "[Foursquare](https://foursquare.com/) is a location-based social networking website where users share their locations by checking in. This dataset includes long-term (about 10 months) check-in data from New York City and Tokyo, collected from Foursquare between 12 April 2012 and 16 February 2013. It consists of two files in TSV format, one per city. Each file contains 8 columns:\n",
    "\n",
    "1. User ID (anonymized)\n",
    "2. Venue ID (Foursquare)\n",
    "3. Venue category ID (Foursquare)\n",
    "4. Venue category name (Foursquare)\n",
    "5. Latitude\n",
    "6. Longitude\n",
    "7. Timezone offset in minutes (The offset in minutes between when this check-in occurred and the same time in UTC)\n",
    "8. UTC time\n",
    "\n",
    "Here is an example of check-in information from the New York dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>UserID</th>\n",
       "      <th>VenueID</th>\n",
       "      <th>CategoryID</th>\n",
       "      <th>CategoryName</th>\n",
       "      <th>Latitude</th>\n",
       "      <th>Longitude</th>\n",
       "      <th>Timezone offset in minutes</th>\n",
       "      <th>UTC time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>470</td>\n",
       "      <td>49bbd6c0f964a520f4531fe3</td>\n",
       "      <td>4bf58dd8d48988d127951735</td>\n",
       "      <td>Arts &amp; Crafts Store</td>\n",
       "      <td>40.719810</td>\n",
       "      <td>-74.002581</td>\n",
       "      <td>-240</td>\n",
       "      <td>Tue Apr 03 18:00:09 +0000 2012</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>979</td>\n",
       "      <td>4a43c0aef964a520c6a61fe3</td>\n",
       "      <td>4bf58dd8d48988d1df941735</td>\n",
       "      <td>Bridge</td>\n",
       "      <td>40.606800</td>\n",
       "      <td>-74.044170</td>\n",
       "      <td>-240</td>\n",
       "      <td>Tue Apr 03 18:00:25 +0000 2012</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>69</td>\n",
       "      <td>4c5cc7b485a1e21e00d35711</td>\n",
       "      <td>4bf58dd8d48988d103941735</td>\n",
       "      <td>Home (private)</td>\n",
       "      <td>40.716162</td>\n",
       "      <td>-73.883070</td>\n",
       "      <td>-240</td>\n",
       "      <td>Tue Apr 03 18:02:24 +0000 2012</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>395</td>\n",
       "      <td>4bc7086715a7ef3bef9878da</td>\n",
       "      <td>4bf58dd8d48988d104941735</td>\n",
       "      <td>Medical Center</td>\n",
       "      <td>40.745164</td>\n",
       "      <td>-73.982519</td>\n",
       "      <td>-240</td>\n",
       "      <td>Tue Apr 03 18:02:41 +0000 2012</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>87</td>\n",
       "      <td>4cf2c5321d18a143951b5cec</td>\n",
       "      <td>4bf58dd8d48988d1cb941735</td>\n",
       "      <td>Food Truck</td>\n",
       "      <td>40.740104</td>\n",
       "      <td>-73.989658</td>\n",
       "      <td>-240</td>\n",
       "      <td>Tue Apr 03 18:03:00 +0000 2012</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   UserID                   VenueID                CategoryID  \\\n",
       "0     470  49bbd6c0f964a520f4531fe3  4bf58dd8d48988d127951735   \n",
       "1     979  4a43c0aef964a520c6a61fe3  4bf58dd8d48988d1df941735   \n",
       "2      69  4c5cc7b485a1e21e00d35711  4bf58dd8d48988d103941735   \n",
       "3     395  4bc7086715a7ef3bef9878da  4bf58dd8d48988d104941735   \n",
       "4      87  4cf2c5321d18a143951b5cec  4bf58dd8d48988d1cb941735   \n",
       "\n",
       "          CategoryName   Latitude  Longitude  Timezone offset in minutes  \\\n",
       "0  Arts & Crafts Store  40.719810 -74.002581                        -240   \n",
       "1               Bridge  40.606800 -74.044170                        -240   \n",
       "2       Home (private)  40.716162 -73.883070                        -240   \n",
       "3       Medical Center  40.745164 -73.982519                        -240   \n",
       "4           Food Truck  40.740104 -73.989658                        -240   \n",
       "\n",
       "                         UTC time  \n",
       "0  Tue Apr 03 18:00:09 +0000 2012  \n",
       "1  Tue Apr 03 18:00:25 +0000 2012  \n",
       "2  Tue Apr 03 18:02:24 +0000 2012  \n",
       "3  Tue Apr 03 18:02:41 +0000 2012  \n",
       "4  Tue Apr 03 18:03:00 +0000 2012  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "foursquare_NYC_path = os.path.join(\"data\", \"foursquare\", \"dataset_TSMC2014_NYC.txt\")\n",
    "foursquare_TKY_path = os.path.join(\"data\", \"foursquare\", \"dataset_TSMC2014_TKY.txt\")\n",
    "\n",
    "foursquare_NYC_df = pd.read_csv(foursquare_NYC_path, sep=\"\\t\", header=None, names=[\"UserID\", \"VenueID\", \"CategoryID\", \"CategoryName\", \"Latitude\", \"Longitude\", \"Timezone offset in minutes\", \"UTC time\"], encoding=\"utf-8\", encoding_errors=\"ignore\")\n",
    "\n",
    "foursquare_NYC_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# free memory: these DataFrames were only created to preview the data in the notebook\n",
    "\n",
    "del Brightkite_df\n",
    "del Gowalla_df\n",
    "del foursquare_NYC_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Building the networks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are asked to construct the network for each of the three datasets as an undirected graph $M = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. The nodes represent the users, and an edge indicates that two individuals visited the same location at least once.\n",
    "\n",
    "We can use the function `create_checkins_graph` from the `main` module to create the networks. It takes as input the name of a dataset and returns a networkx graph object. For further details about the function below, please refer to the `main` module."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of nodes added to the graph brightkite: 51406\n"
     ]
    },
    {
     "ename": "KeyboardInterrupt",
     "evalue": "",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mKeyboardInterrupt\u001b[0m                         Traceback (most recent call last)",
      "\u001b[0;32m/tmp/ipykernel_51637/2618808480.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mBrightkite_G\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcreate_checkins_graph\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"brightkite\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      2\u001b[0m \u001b[0mGowalla_G\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcreate_checkins_graph\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"gowalla\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      3\u001b[0m \u001b[0mFoursquare_G\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcreate_checkins_graph\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"foursquareNYC\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/github/small-worlds/main.py\u001b[0m in \u001b[0;36mcreate_checkins_graph\u001b[0;34m(dataset)\u001b[0m\n\u001b[1;32m    132\u001b[0m     \u001b[0;31m# now add the edges, try to use pandas to speed up the process\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    133\u001b[0m     \u001b[0;32mfor\u001b[0m \u001b[0muser1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0muser2\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mcombinations\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0musers\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m2\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 134\u001b[0;31m         \u001b[0mintersection\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0musers_venues\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0muser1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m&\u001b[0m \u001b[0mset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0musers_venues\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0muser2\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    135\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mintersection\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    136\u001b[0m             \u001b[0mG\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0madd_edge\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0muser1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0muser2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mweight\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mintersection\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mKeyboardInterrupt\u001b[0m: "
     ]
    }
   ],
   "source": [
    "Brightkite_G = create_checkins_graph(\"brightkite\")\n",
    "Gowalla_G = create_checkins_graph(\"gowalla\")\n",
    "Foursquare_G = create_checkins_graph(\"foursquareNYC\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can have a look at the number of nodes and edges in each network."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>dataset</th>\n",
       "      <th>nodes</th>\n",
       "      <th>edges</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>brightkite</td>\n",
       "      <td>58228</td>\n",
       "      <td>214078</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>gowalla</td>\n",
       "      <td>196591</td>\n",
       "      <td>950327</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>foursquare</td>\n",
       "      <td>1083</td>\n",
       "      <td>282405</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      dataset   nodes   edges\n",
       "0  brightkite   58228  214078\n",
       "1     gowalla  196591  950327\n",
       "2  foursquare    1083  282405"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset = [\"brightkite\", \"gowalla\", \"foursquare\"]\n",
    "nodes = [G.number_of_nodes() for G in (Brightkite_G, Gowalla_G, Foursquare_G)]\n",
    "edges = [G.number_of_edges() for G in (Brightkite_G, Gowalla_G, Foursquare_G)]\n",
    "\n",
    "df = pd.DataFrame({\"dataset\": dataset, \"nodes\": nodes, \"edges\": edges})\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, the Foursquare network has a very small number of nodes. Even though the New York dataset has 227428 check-ins, it has only 1083 unique users (the nodes). The Tokyo dataset is about 2 times bigger, with 537703 check-ins and 2294 nodes. Since the two are of the same order of magnitude, we will focus on the New York dataset, in the style of a classic Hollywood movie about alien invasions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Analysis of the structure of the networks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = \"data/brightkite/loc-brightkite_totalCheckins.txt\"\n",
    "\n",
    "def modify_file(path):\n",
    "    \"\"\"Keep only the user and location_id columns and write them to test.txt.\"\"\"\n",
    "    df = pd.read_csv(path, sep=\"\\t\", header=None, names=[\"user\", \"check-in time\", \"latitude\", \"longitude\", \"location_id\"])\n",
    "    # select the first and last columns by name\n",
    "    df = df[[\"user\", \"location_id\"]]\n",
    "    # write a tab-separated file without header row or index column\n",
    "    df.to_csv(\"test.txt\", sep=\"\\t\", header=False, index=False)\n",
    "\n",
    "modify_file(path)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.10.6 64-bit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}