// It receives the file name as a string and returns a dictionary with the keys being the UserID and the values being a vector of VenueID associated with that UserID.
- Why do I use os.path.join and not "/"? Because it's more portable: it works on every OS, while "/" works only on Linux and Mac. If you want to use it on Windows, you have to change all the "/" to "\". With os.path.join you don't have to worry about it and, as always, f*** Microsoft.
# Download every url in urls[0] into the brightkite folder, every url in urls[1] into the gowalla folder, and every url in urls[2] into the foursquare folder. At each iteration, check whether the file already exists: if it does, skip the download and print a message; otherwise, download the file and print a message.
for i in range(len(urls)):
    for url in urls[i]:
# check if there are .txt files inside the folder; if yes, skip the download
print("The {} dataset is already downloaded and extracted as a .txt file. If you want to download the .gz file again with this function, delete the .txt files in the folder".format(folders[i]))
break
# check if there are .gz files inside the folder; if yes, skip the download
print("The {} dataset is already downloaded as a .gz file. If you want to download it again with this function, delete the .gz files in the folder".format(folders[i]))
break
# if there are no .txt or .gz files, download the file
# if there are no .txt files inside the brightkite folder, unzip the .gz files
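Taken together, the fragments above amount to something like the following sketch (assuming, as in the comments, that urls[i] holds the download URLs for folders[i]; the real implementation in utils may differ):

import os
import gzip
import shutil
import urllib.request

def download_datasets(urls, folders, base="data"):
    for i in range(len(urls)):
        folder = os.path.join(base, folders[i])
        os.makedirs(folder, exist_ok=True)
        files = os.listdir(folder)
        # skip the dataset if it is already downloaded (and possibly extracted)
        if any(f.endswith(".txt") for f in files):
            print("The {} dataset is already downloaded and extracted as a .txt file".format(folders[i]))
            continue
        if any(f.endswith(".gz") for f in files):
            print("The {} dataset is already downloaded as a .gz file".format(folders[i]))
            continue
        for url in urls[i]:
            name = os.path.join(folder, url.split("/")[-1])
            print("Downloading", url)
            urllib.request.urlretrieve(url, name)
        # extract each downloaded .gz archive next to it as a plain file
        for f in os.listdir(folder):
            if f.endswith(".gz"):
                with gzip.open(os.path.join(folder, f), "rb") as src, \
                     open(os.path.join(folder, f[:-3]), "wb") as dst:
                    shutil.copyfileobj(src, dst)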
# Now we want to clean our data. For both brightkite and gowalla, we want to rename the _edges files to "brightkite_friends_edges.txt" and "gowalla_friends_edges.txt"
# Now, from the _totalCheckins.txt files we want to keep only the first and last column, which are the user ID and the venue ID. We also want to remove the header of the file. Use pandas to do that. Then rename the files to "brightkite_checkins_edges.txt" and "gowalla_checkins_edges.txt"
# Now for foursquare we want to keep only the first and second column, which are the user ID and the venue ID. We also want to remove the header of the file. Use pandas to do that. Do that for both the _NYC.txt and _TKY.txt files. Then rename the files to "foursquare_checkins_edges_NYC.txt" and "foursquare_checkins_edges_TKY.txt"
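A possible pandas sketch of this cleaning step. The input paths and the exact column indices are assumptions: the _totalCheckins.txt files are taken to have 5 tab-separated columns (user, check-in time, latitude, longitude, location id) and the foursquare files 8, with user ID first.

import os
import pandas as pd

def keep_user_and_venue(path_in, path_out, cols):
    # read the tab-separated file, drop the header row, keep only `cols`
    df = pd.read_csv(path_in, sep="\t", skiprows=1, header=None, usecols=cols)
    df.to_csv(path_out, sep="\t", header=False, index=False)

# brightkite/gowalla _totalCheckins.txt: first and last of the 5 columns
keep_user_and_venue(
    os.path.join("data", "brightkite", "loc-brightkite_totalCheckins.txt"),
    os.path.join("data", "brightkite", "brightkite_checkins_edges.txt"),
    cols=[0, 4])
# for the foursquare _NYC.txt / _TKY.txt files, use cols=[0, 1] instead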
This function takes in input a tsv file with two columns, where each line in the file is an edge. The function returns an undirected networkx graph object. It uses pandas to read the file since it's faster than the standard python open() function. If we want to use the standard python open() function instead, the following code works as well:
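(Reconstructed sketch; the original open()-based snippet is not preserved in this fragment.)

import networkx as nx

def graph_from_tsv_edges(path):
    # build an undirected graph from a two-column tsv edge list
    G = nx.Graph()
    with open(path) as f:
        for line in f:
            u, v = line.strip().split("\t")
            G.add_edge(u, v)
    return G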
"We can download the datasets using the function `download_datasets` from the `utils` module. It will download the datasets in the `data` folder, organized in sub-folders in the following way:\n",
"\n",
"```\n",
"data/\n",
"├── brightkite\n",
"│ ├── brightkite_checkins.txt\n",
"│ └── brightkite_friends_edges.txt\n",
"├── foursquare\n",
"│ ├── foursquare_checkins_NYC.txt\n",
"│ └── foursquare_checkins_TKY.txt\n",
"└── gowalla\n",
" ├── gowalla_checkins.txt\n",
" └── gowalla_friends_edges.txt\n",
"```\n",
"\n",
"If any of the datasets is already downloaded, it will not be downloaded again. For further details about the function below, please refer to the `utils` module.\n",
"\n",
"> NOTE: the Stanford servers tend to be slow, so it may take about 2 to 3 minutes to download all the datasets."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"download_datasets()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Brightkite\n",
"\n",
"[Brightkite](http://www.brightkite.com/) was once a location-based social networking service provider where users shared their locations by checking-in. The friendship network was collected using their public API. We will work with two different datasets:\n",
"\n",
"- `data/brightkite/brightkite_friends_edges.txt`: the friendship network, a tsv file with 2 columns of user ids\n",
"- `data/brightkite/brightkite_checkins.txt`: the checkins, a tsv file with 2 columns of user id and location. This is not in the form of a graph edge list; in the next section we will see how to convert it into a graph."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Gowalla\n",
"\n",
"Gowalla is a location-based social networking website where users share their locations by checking-in. The friendship network is undirected and was collected using their public API. As for Brightkite, we will work with two different datasets:\n",
"\n",
"- `data/gowalla/gowalla_friends_edges.txt`: the friendship network, a tsv file with 2 columns of user ids\n",
"- `data/gowalla/gowalla_checkins.txt`: the checkins, a tsv file with 2 columns of user id and location. This is not in the form of a graph edge list; in the next section we will see how to convert it into a graph."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Foursquare\n",
"\n",
"[Foursquare](https://foursquare.com/) is a location-based social networking website where users share their locations by checking-in. This dataset includes long-term (about 10 months) check-in data in New York city and Tokyo collected from Foursquare from 12 April 2012 to 16 February 2013. It contains two files in tsv format. Each file contains 2 columns, which are:\n",
"\n",
"1. User ID (anonymized)\n",
"2. Venue ID (Foursquare)\n",
"\n",
"In this case, we don't have any information about the friendship network, so we will only work with the checkins."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We are asked to construct the networks for the three datasets as an undirected graph $M = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. The nodes represent the users and the edges indicate that two individuals visited the same location at least once.\n",
"\n",
"And this is where the fun begins! The check-ins files of the three datasets are not in the form of a graph edge list, so we need to manipulate them. But those datasets are huge! Let's have a look at the number of lines of each file."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We would like to build a graph starting from an edge list. So the basic idea is to create a dictionary where the keys are the unique users and the values are the locations that they visited. Then, we can iterate over the dictionary and create the edges.\n",
"\n",
"But, even if we avoid repetitions, the time complexity will be $O(n^2)$, where $n$ is the number of users. And since $n$ is in the order of millions, doing this in python, where we have to build nested for loops, is a no-go. We need to find a faster way to do this.\n",
"\n",
"The `utils` module anyway provides a function that does exactly this, but I do not recommend using it unless you have countless hours of spare time. It's called `create_checkins_graph_SLOW` and it takes a dataset name as input and returns a networkx graph object; a sketch of the idea is shown in the cell below.\n",
"\n",
"To make this computation feasible, the same logic is also implemented in a C++ program, which, being compiled, is much faster than the pure python version. The program will output a new .tsv file in the form of an edge list, in the `data` folder. Since the C++ program needs to be compiled, I have already created the edge lists for the four datasets, so you can skip this step if you want.\n",
"\n",
"Once we have our edge list, we can build the graph using the function `checkins_graph_from_edges` from the `utils` module. It takes as input the name of the dataset and returns a networkx graph object. The options are the four datasets for which the edge lists were created."
]
},
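{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch of the quadratic approach described above, not the exact\n",
"# implementation in utils: users_venues is assumed to map each user id\n",
"# to the list of venues they checked in at.\n",
"from itertools import combinations\n",
"import networkx as nx\n",
"\n",
"def checkins_graph_slow_sketch(users_venues):\n",
"    G = nx.Graph()\n",
"    G.add_nodes_from(users_venues)\n",
"    # O(n^2) loop over all pairs of users: this is the bottleneck\n",
"    for user1, user2 in combinations(users_venues, 2):\n",
"        intersection = set(users_venues[user1]) & set(users_venues[user2])\n",
"        if len(intersection) > 0:\n",
"            G.add_edge(user1, user2, weight=len(intersection))\n",
"    return G"
]
},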
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have our graphs, let's have a look at some basic information about them"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"for G in [G_brighkite_checkins, G_gowalla_checkins, G_foursquareNYC_checkins, G_foursquareTKY_checkins]:\n",
" print(G.name)\n",
" print('Number of nodes: ', G.number_of_nodes())\n",
" print('Number of edges: ', G.number_of_edges())\n",
" print()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, the foursquare dataset has a very small number of nodes. Even though it has 227428 check-ins, the unique users (the nodes) are only 1083. The Tokyo dataset is about 2 times bigger, with 537703 check-ins and 2294 nodes. Since we are in the same order of magnitude, we will focus on the New York dataset, in the style of a classic Hollywood movie about alien invasions.\n",
"\n",
"### Friendship network\n",
"\n",
"If we want to build the friendship network, fortunately for the gowalla and brightkite datasets we have the edge list, so we can just use the `read_edgelist` function from networkx. For the foursquare dataset, we don't have any information about the friendship of the users, so we will just create a graph with the checkins.\n",
"\n",
"To build the friendship network of the first two datasets, we can use the `create_friends_graph` function from the `utils` module. It takes a dataset name as input and returns a networkx graph object. The implementation is pretty straightforward: we just use the `from_pandas_edgelist` function from networkx. A sketch of the idea is shown in the cell below."
]
},
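{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch of the idea behind create_friends_graph, not the exact\n",
"# implementation in utils; the column names here are arbitrary labels.\n",
"import os\n",
"import pandas as pd\n",
"import networkx as nx\n",
"\n",
"def friends_graph_sketch(dataset):\n",
"    path = os.path.join('data', dataset, dataset + '_friends_edges.txt')\n",
"    df = pd.read_csv(path, sep='\\t', header=None, names=['u', 'v'])\n",
"    G = nx.from_pandas_edgelist(df, source='u', target='v')\n",
"    G.name = dataset.capitalize() + ' Friendship Graph'\n",
"    return G"
]
},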
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have our graphs, let's have a look at some basic information about them"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Brightkite Friendship Graph\n",
"Number of nodes: 58228\n",
"Number of edges: 214078\n",
"\n",
"Gowalla Friendship Graph\n",
"Number of nodes: 196591\n",
"Number of edges: 950327\n",
"\n"
]
}
],
"source": [
"for G in [G_brighkite_friends, G_gowalla_friends]:\n",
" print(G.name)\n",
" print('Number of nodes: ', G.number_of_nodes())\n",
" print('Number of edges: ', G.number_of_edges())\n",