An exact and fast algorithm for computing top-k closeness centrality, tested on the IMDb database
Luca Lombardo ec3026938b last sections revision 3 years ago


An exact and fast algorithm for computing top-k closeness centrality

The explanation of this algorithm and all its analysis can be found in the PDF paper

Paper

Documentation

First things first: we need to clone the repository

git clone https://github.com/lukefleed/imdb-graph

Once done, move into it

cd imdb-graph

Downloading and filtering the data

All the necessary files are inside the folder filters

cd filters

We have two options. If we want to build the graph where the actors are the nodes, we have to run

./actors_graph_filter.py --min-movies 42

min-movies has to be an integer; 42 is just an example. It is the minimum number of movies that an actor/actress needs to have appeared in to be included in our graph.
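To make the effect of this threshold concrete, here is a minimal Python sketch of such a filter. It is not the code of actors_graph_filter.py: the function name `filter_actors` and the in-memory `(actor_id, movie_id)` pairs are assumptions made only for illustration.

```python
from collections import defaultdict

def filter_actors(credits, min_movies):
    """Keep the actors that appear in at least `min_movies` distinct movies.

    `credits` is an iterable of (actor_id, movie_id) pairs (hypothetical format).
    """
    movies_by_actor = defaultdict(set)
    for actor_id, movie_id in credits:
        movies_by_actor[actor_id].add(movie_id)
    return {a for a, movies in movies_by_actor.items() if len(movies) >= min_movies}

# Toy data: a1 has 2 movies, a2 has 1, a3 has 3
credits = [("a1", "m1"), ("a1", "m2"), ("a2", "m1"),
           ("a3", "m1"), ("a3", "m2"), ("a3", "m3")]
print(sorted(filter_actors(credits, min_movies=2)))  # ['a1', 'a3']
```

Raising the threshold shrinks the set of nodes, and with it the graph that the algorithm later has to process.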

If we want to build the graph where the movies are the nodes, we have to run

./movie_graph_filter.py --votes 500

votes has to be an integer; 500 is just an example. It is the minimum number of votes that a movie needs to have on the IMDb database to be included in our graph.
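As an illustration of why the value must be an integer, the following sketch parses a `--votes` option with Python's argparse and applies it as a threshold. It is a hypothetical stand-in, not the actual movie_graph_filter.py: `build_parser`, `filter_movies`, the `required=True` setting and the toy vote counts are all assumptions.

```python
import argparse

def build_parser():
    """Build a parser for a hypothetical movie filter CLI."""
    parser = argparse.ArgumentParser(description="Hypothetical movie filter")
    # type=int makes argparse reject non-integer values with a usage error
    parser.add_argument("--votes", type=int, required=True,
                        help="minimum number of IMDb votes for a movie to be kept")
    return parser

def filter_movies(votes_by_movie, min_votes):
    """Return the movies with at least `min_votes` votes."""
    return {movie for movie, votes in votes_by_movie.items() if votes >= min_votes}

# Simulate the command line `--votes 500`
args = build_parser().parse_args(["--votes", "500"])
votes_by_movie = {"m1": 1200, "m2": 499, "m3": 500}  # toy data
print(sorted(filter_movies(votes_by_movie, args.votes)))  # ['m1', 'm3']
```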

All the filtered data will be saved in a new folder called data

Running the program

Let's move into the folder scripts. If we want to run the program on the actors graph, use

./actors_graph top_actors_42

where top_actors_42 is the output file name. Any name can be used.

IMPORTANT: The algorithm is multi-threaded. The number of threads defaults to 12; to change it, edit this value in the .cpp file according to the CPU in use.
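To choose a thread count for the note above, one reasonable starting point is the number of logical CPUs on the machine. The sketch below only reports that number; the helper name `suggested_threads` is an assumption, and the value still has to be set by hand in the .cpp file.

```python
import os

def suggested_threads(fallback=1):
    """Return the number of logical CPUs, or `fallback` if it cannot be determined."""
    return os.cpu_count() or fallback

print(suggested_threads())
```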


If we want to run the program on the movies graph, use

./movie_graph top_movies_42

where top_movies_42 is the output file name. Any name can be used.

IMPORTANT: The algorithm is multi-threaded. The number of threads defaults to 12; to change it, edit this value in the .cpp file according to the CPU in use.


These programs will generate two .txt files, one for the harmonic centrality and one for the closeness centrality. Each file contains the top-100 elements for the corresponding centrality. To get a different number of elements, change the variable k in the .cpp files
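For reference, the two centralities can be computed on a small graph with one BFS per node. This is the textbook baseline, not the faster top-k algorithm described in the paper. The sketch assumes an unweighted, connected graph and the common definition of closeness as (n - 1) divided by the sum of distances; the adjacency-list format and all function names are illustrative.

```python
from collections import deque
import heapq

def bfs_distances(adj, source):
    """Return the hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def centralities(adj):
    """Compute closeness and harmonic centrality with one BFS per node."""
    n = len(adj)
    closeness, harmonic = {}, {}
    for v in adj:
        dist = bfs_distances(adj, v)
        total = sum(dist.values())
        # Assumed definition: (n - 1) / sum of distances (connected graph)
        closeness[v] = (n - 1) / total if total > 0 else 0.0
        harmonic[v] = sum(1.0 / d for d in dist.values() if d > 0)
    return closeness, harmonic

def top_k(scores, k):
    """Return the k (node, score) pairs with the highest score."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Path graph a - b - c - d - e: the middle node c is the most central
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
closeness, harmonic = centralities(adj)
print(top_k(closeness, 1))
print(top_k(harmonic, 1))
```

Running a full BFS from every node costs time proportional to nodes times edges, which is why an algorithm that avoids completing most of these visits matters on graphs of this size.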

Automatic script for different variables of filtering

We are in the folder scripts. Inside both the folders actor-graph and movie-graph there is a file called bench_me.sh. It runs everything automatically in a loop over different values of the filtering variables. To change those values, edit the file directly. To run it

./bench_me.sh

This will also save the logs in a folder called time, which can be useful for analyzing the performance of the program.


Inside the closeness centrality folders (one for each graph) there is a Python script, analysis.py. Put all the generated _c.txt files in that folder and run it. It will return a matrix showing how much the results differ as the filtering variable changes
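As a rough idea of what such a matrix can look like, the sketch below compares top-k lists obtained with different filter values. It does not reproduce analysis.py: the measure used here (the fraction of elements not shared by two lists, i.e. the Jaccard distance), the function names and the toy rankings are all assumptions.

```python
def discrepancy(list_a, list_b):
    """Fraction of elements not shared by two top-k lists (assumed measure)."""
    set_a, set_b = set(list_a), set(list_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)

def discrepancy_matrix(rankings):
    """Map {filter_value: top-k list} to (sorted filter values, square matrix)."""
    keys = sorted(rankings)
    matrix = [[discrepancy(rankings[a], rankings[b]) for b in keys] for a in keys]
    return keys, matrix

# Toy top-3 lists for three hypothetical filter values
rankings = {10: ["x", "y", "z"], 20: ["x", "y", "w"], 30: ["x", "y", "z"]}
keys, matrix = discrepancy_matrix(rankings)
for key, row in zip(keys, matrix):
    print(key, row)
```

A value of 0 means the two filter values give the same top-k set, and a value of 1 means they share no element.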

Generating the interactive graphs

First, let's move into the folder visualization

cd visualization

As before, we will find two folders, one for each type of graph. Choose the one that we want to work with and move into that folder. Inside it we need to create a folder called data

mkdir data

Then copy the following files into it

  • Attori.txt
  • FilmFiltrati.txt
  • Relazioni.txt

Attention! If we are visualizing the actors graph, it's important to copy the files generated for it. The ideal values of min-movies and votes during the filtering are 70 and 100000 respectively. Since the graph has to be rendered in a web page, these values keep it to about 1000 nodes. I don't recommend trying bigger graphs
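Before opening the visualization, it can help to check that the three files listed above were actually copied. The helper below is a small sketch for that; the function name `missing_files` is an assumption, and it only checks that the files exist, not that they belong to the right graph.

```python
from pathlib import Path

REQUIRED_FILES = ("Attori.txt", "FilmFiltrati.txt", "Relazioni.txt")

def missing_files(data_dir):
    """Return the required file names that are not present in `data_dir`."""
    data_dir = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (data_dir / name).is_file()]

# An empty list means all three files are in place
print(missing_files("data"))
```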

To Do

  • Organize all the code using OOP
  • Normalize the harmonic centrality and its bound
  • Test with other collaboration networks
  • Give k as input parameter