- This file is not meant to be run; it's just a collection of functions that are used in the other files. It's just a way to keep the code clean and organized.
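- Why do I use os.path.join and not the "/"? Because it's more portable: it picks the right separator on every OS. "/" is the native separator on Linux and Mac, while Windows natively uses "\". Most Python file functions on Windows also accept "/", but hard-coded separators are still fragile when paths get compared, printed or passed to other tools. With os.path.join you don't have to worry about it and, as always, f*** Microsoft.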
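A minimal sketch of the point above, assuming the three dataset folders live under a `data` directory. The `folders` list and the `paths` variable are only illustrative names.

```python
import os

# the three dataset folders, listed here only for illustration
folders = ["brightkite", "gowalla", "foursquare"]

# os.path.join inserts the separator of the OS the code is running on,
# so nothing in the path is hard-coded
paths = [os.path.join("data", folder) for folder in folders]

for path in paths:
    print(path)
```

On Linux and Mac each path is printed with "/", on Windows with "\", without touching the code.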
# if they don't exist, create 3 subfolders in data called brightkite, gowalla and foursquare
for folder in folders:
    if not os.path.exists(os.path.join("data", folder)):
        os.mkdir(os.path.join("data", folder))
# download every url in urls[0] into the brightkite folder, every url in urls[1] into the gowalla folder, and every url in urls[2] into the foursquare folder. At each iteration, check whether the file already exists: if yes, skip the download and print a message; if no, download the file and print a message.
for i in range(len(urls)):
    for url in urls[i]:
# check if there are .txt files inside folder, if yes, skip the download
print("The {} dataset is already downloaded and extracted as a .txt file. To download the .gz file again with this function, delete the .txt files in the folder".format(folders[i]))
break
# check if there are .gz files inside folder, if yes, skip the download
print("The {} dataset is already downloaded as a .gz file. To download the .gz file again with this function, delete the .gz files in the folder".format(folders[i]))
break
# if there are no .txt or .gz files, download the file
print("Download completed of {} dataset".format(folders[i]))
# extract the foursquare data in a nice way, checking all the edge cases like a maniac. More details below
"""
The code below is very ugly to read, but it's effective. Basically, whatever messy state the files inside the foursquare folder are in (maybe after testing), it will fix them and bring them into the layout the program expects.
First, it checks whether the foursquare folder contains a folder called dataset_tsmc2014. If true, it checks whether there are 3 files inside the foursquare folder: if yes, it skips the process (everything is in order). If false, it moves all the files inside the dataset_tsmc2014 folder to the foursquare folder and deletes the dataset_tsmc2014 folder (we don't want a nested folder).
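The move-and-delete step described above can be sketched as follows. This is only an illustration under assumptions, not the actual code of this file: `flatten_nested_folder` is a hypothetical helper, the file names are placeholders, and the demo runs in a temporary directory so it touches no real data.

```python
import os
import shutil
import tempfile


def flatten_nested_folder(parent, nested_name):
    # hypothetical helper: move every file from parent/nested_name up into
    # parent, then delete the (now empty) nested folder
    nested = os.path.join(parent, nested_name)
    if not os.path.isdir(nested):
        return False  # nothing to fix
    for name in os.listdir(nested):
        shutil.move(os.path.join(nested, name), os.path.join(parent, name))
    os.rmdir(nested)  # only succeeds because the folder is empty at this point
    return True


# demo in a throwaway directory with placeholder file names
with tempfile.TemporaryDirectory() as tmp:
    parent = os.path.join(tmp, "foursquare")
    nested = os.path.join(parent, "dataset_tsmc2014")
    os.makedirs(nested)
    for name in ("a.txt", "b.txt", "c.txt"):
        with open(os.path.join(nested, name), "w") as f:
            f.write("demo")
    moved = flatten_nested_folder(parent, "dataset_tsmc2014")
    remaining = sorted(os.listdir(parent))

print(moved, remaining)
```

Calling the helper a second time on the same folder would simply return False, which is the "everything is in order" case.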
This function takes as input a tsv file with two columns, where each line in the file is an edge. The function returns an undirected networkx graph object. It uses pandas to read the file since it's faster than the standard python open() function. If we don't want to use pandas, the same graph can be built by reading the file with the standard python open() function.
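A minimal sketch of the open() route, assuming a tab-separated file with one edge per line. `read_edges` is a hypothetical helper and the sample file is made up for the demo; it only collects the edges as pairs and does not build the graph itself.

```python
import os
import tempfile


def read_edges(path):
    # hypothetical helper: return the edges as a list of (node1, node2) pairs
    edges = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            edges.append((parts[0], parts[1]))
    return edges


# demo on a tiny made-up edge list written to a temporary file
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "edges.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("0\t1\n0\t2\n\n1\t2\n")
    edges = read_edges(sample_path)

print(edges)
```

The resulting pairs could then be handed to networkx with `nx.Graph().add_edges_from(edges)` to get the undirected graph.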
Unlike with the function create_graph used for the brightkite and gowalla datasets, here we are not given a list of edges, so we can't use the function nx.from_pandas_edgelist. We have to create the graph manually.
First, we retrieve the unique user IDs using the set() data structure: these are the nodes of our graph. Since we don't want to work with adjacency matrices due to their O(mn) space complexity (even though we could store them in a compressed way thanks to their sparsity property), we use an adjacency list representation of the graph. We create a dictionary with the user IDs as keys and the venue IDs as values. Two users are connected if they have visited the same venue at least once. The weight of the edge is the number of common venues.
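The idea can be sketched on a toy list of check-ins. This is a rough illustration under assumptions, not the code of this file: the `checkins` data and all variable names are made up, and the venue-to-users inversion is just one possible way to count the common venues.

```python
from collections import Counter, defaultdict
from itertools import combinations

# made-up (user ID, venue ID) check-ins; repeated visits are allowed
checkins = [
    ("u1", "v1"), ("u2", "v1"),
    ("u1", "v2"), ("u2", "v2"), ("u3", "v2"),
    ("u1", "v1"),
]

# dictionary with the user IDs as keys and the set of visited venue IDs as values
venues_by_user = defaultdict(set)
for user, venue in checkins:
    venues_by_user[user].add(venue)

# invert it: for every venue, the set of users that visited it
users_by_venue = defaultdict(set)
for user, venues in venues_by_user.items():
    for venue in venues:
        users_by_venue[venue].add(user)

# every pair of users sharing a venue gets +1 on its edge weight
weights = Counter()
for users in users_by_venue.values():
    for a, b in combinations(sorted(users), 2):
        weights[(a, b)] += 1

print(dict(weights))
```

Since the venues are stored in sets, visiting the same venue twice still counts as one common venue. The weighted edges could then be loaded into networkx, for example with `add_weighted_edges_from`, passing triples `(a, b, w)`.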
df = pd.read_csv(file, sep="\t", header=None, names=["UserID", "VenueID", "CategoryID", "CategoryName", "Latitude", "Longitude", "Timezone offset in minutes", "UTC time"], encoding="utf-8", encoding_errors="ignore")