An exact and fast algorithm for computing top-k closeness centrality
The explanation of this algorithm and its full analysis can be found in the paper (paper.pdf)
Documentation
First things first: we need to clone the repository
git clone https://github.com/lukefleed/imdb-graph
Once done, move into it
cd imdb-graph
Downloading and filtering the data
All the necessary files are inside the folder filters
cd filters
We have two options. If we want to build the graph where the actors are the nodes, we have to run
./actors_graph_filter.py --min-movies 42
min-movies has to be an integer; 42 is just an example. It represents the minimum number of movies that an actor/actress must have appeared in to be included in our graph.
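As a rough illustration of what this threshold does, the sketch below keeps only the actors credited in at least min_movies distinct titles. The function name filter_actors, the in-memory (actor, movie) pairs, and the sample data are assumptions made for this example; they are not taken from actors_graph_filter.py, which works on the IMDb data files.

```python
from collections import defaultdict


def filter_actors(credits, min_movies):
    """Return the set of actors credited in at least `min_movies` distinct movies.

    `credits` is an iterable of (actor, movie) pairs. This layout is a
    hypothetical stand-in for the real IMDb data.
    """
    movies_per_actor = defaultdict(set)
    for actor, movie in credits:
        movies_per_actor[actor].add(movie)
    return {a for a, movies in movies_per_actor.items() if len(movies) >= min_movies}


# Made-up sample data, only for illustration
credits = [
    ("A", "m1"), ("A", "m2"), ("A", "m3"),
    ("B", "m1"),
    ("C", "m2"), ("C", "m3"),
]
kept = filter_actors(credits, min_movies=2)
print(sorted(kept))
```

A higher threshold keeps fewer actors, so the resulting graph is smaller.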
If we want to build the graph where the movies are the nodes, we have to run
./movie_graph_filter.py --votes 500
votes has to be an integer; 500 is just an example. It represents the minimum number of votes that a movie needs to have on the IMDb database to be included in our graph.
All the filtered data will be saved in a new folder called data
Running the program
Let's move into the folder scripts. If we want to run the program on the actors graph, use
./actors_graph top_actors_42
where top_actors_42 is the output file name. Anything can be used.
IMPORTANT: The algorithm is multi-threaded. The number of threads defaults to 12; to change it, edit the .cpp file and set this value to suit the CPU.
If we want to run the program on the movies graph, use
./movie_graph top_movies_42
where top_movies_42 is the output file name. Anything can be used.
IMPORTANT: The algorithm is multi-threaded. The number of threads defaults to 12; to change it, edit the .cpp file and set this value to suit the CPU.
These programs generate two .txt files (one for the harmonic and one for the closeness centrality). Each file contains the top-100 elements for the respective centrality. To get a different number of elements, change the variable k in the .cpp files.
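For readers unfamiliar with the two measures, here is a naive baseline that computes both with one BFS per node and then takes the top k. It is only a sketch for small unweighted graphs: it uses the textbook definitions restricted to reachable nodes (the normalization used in the paper may differ), it is not the fast algorithm implemented in the .cpp files, and the names bfs_distances, centralities, and top_k as well as the sample graph are assumptions made for this example.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path distances (in edges) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def centralities(adj):
    """Return (closeness, harmonic) dictionaries for an unweighted graph.

    closeness[v] = (number of reachable nodes) / (sum of distances from v)
    harmonic[v]  = sum of 1 / d(v, u) over reachable nodes u != v
    """
    closeness, harmonic = {}, {}
    for v in adj:
        dist = bfs_distances(adj, v)
        total = sum(dist.values())
        reachable = len(dist) - 1  # exclude v itself
        closeness[v] = reachable / total if total > 0 else 0.0
        harmonic[v] = sum(1.0 / d for d in dist.values() if d > 0)
    return closeness, harmonic


def top_k(scores, k):
    """The k nodes with the highest score; ties are broken by node name."""
    return sorted(scores, key=lambda v: (-scores[v], v))[:k]


# Made-up path graph a - b - c - d
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
closeness, harmonic = centralities(adj)
print(top_k(closeness, 2))
print(top_k(harmonic, 2))
```

This baseline runs a full BFS from every node, which is the cost an exact top-k algorithm tries to avoid on large graphs.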
Automatic script for different variables of filtering
We are in the folder scripts. Inside both the folders actor-graph and movie-graph there is a file called bench_me.sh. This file runs everything automatically in a loop for different values of the filtering variables. To change those values, edit the file. To run it
./bench_me.sh
This will also save the logs in a folder called time. They can be useful for analyzing the performance of the program.
Inside the folders closeness centrality (one for each graph), there is a Python script, analysis.py. Put all the generated _c.txt files in the folder and run it. It will return a matrix showing the discrepancy between the results as the filtering variable varies.
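To give an idea of what such a matrix can look like, the sketch below compares top-k lists obtained with different filtering values and reports, for each pair, the fraction of elements that the two lists do not share. The metric (one minus the Jaccard similarity), the function name discrepancy_matrix, and the sample rankings are all assumptions made for this example; analysis.py may measure the discrepancy differently.

```python
def discrepancy_matrix(rankings):
    """Pairwise discrepancy between top-k lists.

    `rankings` maps a filtering value to the list of top-k elements obtained
    with it. Entry [i][j] of the result is 1 - |A & B| / |A | B|, so 0.0 means
    identical sets and 1.0 means no element in common.
    """
    labels = sorted(rankings)
    sets = [set(rankings[label]) for label in labels]
    matrix = []
    for a in sets:
        row = []
        for b in sets:
            union = a | b
            row.append(0.0 if not union else 1.0 - len(a & b) / len(union))
        matrix.append(row)
    return labels, matrix


# Made-up top-3 lists for three hypothetical filtering values
rankings = {
    20: ["x", "y", "z"],
    42: ["x", "y", "w"],
    70: ["x", "y", "z"],
}
labels, matrix = discrepancy_matrix(rankings)
for label, row in zip(labels, matrix):
    print(label, row)
```

The matrix is symmetric with zeros on the diagonal, so only the entries above the diagonal carry information.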
Generating the interactive graphs
First, let's move into the folder visualization
cd visualization
As before, we will find two folders, one for each type of graph. Choose the one that we want to work with and move into that folder. Inside it we need to create a folder called data
mkdir data
Then copy the following files into it
Attori.txt
FilmFiltrati.txt
Relazioni.txt
Attention! If we are visualizing the actors graph, it's important to copy the files generated for it. Ideal values of min-movies and votes during the filtering are 70 and 100000 respectively. These values generate graphs with about 1000 nodes, a manageable size given that the graph has to be rendered in a web page. We don't recommend trying larger graphs.
To Do
- Organize all the code using OOP
- Normalize the harmonic centrality and its bound
- Give k as input parameter