- This file is not meant to be run; it's just a collection of functions that are used in the other files. It's just a way to keep the code clean and organized.
- Why do I use os.path.join and not "/"? Because it's more portable: it builds the path with the native separator of whatever OS the code runs on. "/" is the native separator only on Linux and Mac, while on Windows it is "\", so hard-coded separators may need to be changed by hand. With os.path.join you don't have to worry about it and, as always, f*** Microsoft.
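A minimal sketch of the point above, using the "data" and "brightkite" folder names that appear later in this file. The variable name `path` is only illustrative.

```python
import os

# Build the path from its parts instead of hard-coding a separator.
path = os.path.join("data", "brightkite")

# os.sep is "/" on Linux and Mac and the backslash on Windows,
# so this comparison holds on every platform.
assert path == "data" + os.sep + "brightkite"
print(path)
```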
# if they don't exist, create 3 subfolders in data called brightkite, gowalla and foursquare
for folder in folders:
    if not os.path.exists(os.path.join("data", folder)):
        os.mkdir(os.path.join("data", folder))
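As a possible alternative to the check-then-create pattern above (not what this file does), `os.makedirs` with `exist_ok=True` creates missing parent folders and does nothing if the folder is already there. The helper name `ensure_folders` is hypothetical.

```python
import os


def ensure_folders(base, names):
    # Hypothetical helper: create base/<name> for every name and return the paths.
    paths = []
    for name in names:
        path = os.path.join(base, name)
        # exist_ok=True makes the call safe to repeat on an existing folder.
        os.makedirs(path, exist_ok=True)
        paths.append(path)
    return paths
```

Calling `ensure_folders("data", ["brightkite", "gowalla", "foursquare"])` would also create the "data" folder itself if it were missing.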
# download every url in urls[0] into the brightkite folder, every url in urls[1] into the gowalla folder, and every url in urls[2] into the foursquare folder. At each iteration, it checks if the file already exists: if yes, it skips the download and prints a message; if no, it downloads the file and prints a message.
for i in range(len(urls)):
    for url in urls[i]:
# check if there are .txt files inside folder, if yes, skip the download
print("The {} dataset is already downloaded and extracted as .txt file, if you want to download again the .gz file with this function, delete the .txt files in the folder".format(folders[i]))
break
# check if there are .gz files inside folder, if yes, skip the download
print("The {} dataset is already downloaded as .gz file, if you want to download again the .gz file with this function, delete the .gz files in the folder".format(folders[i]))
break
# if there are no .txt or .gz files, download the file
print("Download completed of {} dataset".format(folders[i]))
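The conditions behind the three messages above are not shown in this section. One way the "already extracted / already downloaded / missing" decision could be made is sketched below, assuming it depends only on the file extensions found in the dataset folder. The helper name `download_status` and its return values are hypothetical.

```python
import os


def download_status(folder):
    # Hypothetical helper: report what is already present in `folder`.
    names = os.listdir(folder)
    if any(name.endswith(".txt") for name in names):
        return "extracted"   # .txt files present: skip the download
    if any(name.endswith(".gz") for name in names):
        return "compressed"  # .gz files present: skip the download
    return "missing"         # neither found: the file should be downloaded
```

The .txt check comes first on purpose, so a folder holding both the archive and the extracted file is reported as already extracted.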
# extract the data of foursquare in a nice way, checking all the edge cases as a maniac. More details below
"""
The code below is very ugly to read, but it's effective. Basically, in every possible messy situation in which we may find the files inside the foursquare folder (maybe after testing), it will fix them and bring them to the state the program expects.
First, it checks whether the foursquare folder contains a folder called dataset_tsmc2014. If true, it checks whether there are 3 files inside the foursquare folder: if yes, it skips the process (everything is in order). If not, it moves all the files inside the dataset_tsmc2014 folder to the foursquare folder and deletes the dataset_tsmc2014 folder (we don't want a nested folder).
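The steps described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual code of this file: the helper name `flatten_nested_folder` and its parameters are hypothetical, and "3 files" is read as exactly three regular files sitting directly in the parent folder.

```python
import os
import shutil


def flatten_nested_folder(parent, nested_name="dataset_tsmc2014", expected_files=3):
    # Hypothetical helper: returns True if files were moved, False otherwise.
    nested = os.path.join(parent, nested_name)
    if not os.path.isdir(nested):
        return False  # no nested folder, nothing to fix

    # Count the regular files already sitting directly in the parent folder.
    top_level_files = [
        name for name in os.listdir(parent)
        if os.path.isfile(os.path.join(parent, name))
    ]
    if len(top_level_files) == expected_files:
        return False  # everything is in order, skip the process

    # Move every entry out of the nested folder, then delete the empty folder.
    for name in os.listdir(nested):
        shutil.move(os.path.join(nested, name), os.path.join(parent, name))
    os.rmdir(nested)
    return True
```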
This function takes as input a tsv file with two columns; each line in the file is an edge. The function returns an undirected networkx graph object. It uses pandas to read the file since it's faster than the standard Python open() function. If we want to use the standard Python open() function instead, the following code works as well:
G = nx.Graph()
with open(file, "r") as f:
    for line in f:
        # strip the trailing newline so it doesn't end up in the second node label
        node1, node2 = line.strip().split("\t")
        G.add_edge(node1, node2)
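The pandas-based loading mentioned above is not shown in this section. A minimal sketch of how it might look, assuming the file has no header row: the helper name `graph_from_tsv` and the column names "node1" and "node2" are hypothetical.

```python
import networkx as nx
import pandas as pd


def graph_from_tsv(path):
    # Hypothetical helper: read a two-column, tab-separated edge list.
    # dtype=str keeps node labels as strings, as in the open()-based version.
    edges = pd.read_csv(path, sep="\t", header=None,
                        names=["node1", "node2"], dtype=str)
    # from_pandas_edgelist builds an undirected nx.Graph by default.
    return nx.from_pandas_edgelist(edges, source="node1", target="node2")
```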
"""
if dataset not in ["brightkite", "gowalla"]:
    raise ValueError("The dataset must be brightkite or gowalla")