You cannot select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
118 lines
3.8 KiB
Markdown
118 lines
3.8 KiB
Markdown
# An exact and fast algorithm for computing top-k closeness centrality
|
|
|
|
The explanation of this algorithm and all it's analysis can be found in the pdf paper
|
|
|
|
> [Paper](https://github.com/lukefleed/imdb-graph/blob/main/tex/src/main.pdf)
|
|
|
|
|
|
## Documentation
|
|
|
|
First thing first, we need to clone the repository
|
|
|
|
```bash
|
|
git clone https://github.com/lukefleed/imdb-graph
|
|
```
|
|
|
|
Once done, move in it
|
|
|
|
```bash
|
|
cd imdb-graph
|
|
```
|
|
|
|
### Downloading and filtering the data
|
|
|
|
All the necessary file are inside the folder `filters`
|
|
|
|
```bash
|
|
cd filters
|
|
```
|
|
We have two options. If we want to build the graph where the actors are the node, we have to run
|
|
|
|
```bash
|
|
./actors_graph_filter.py --min-movies 42
|
|
```
|
|
|
|
`min-movies` has to ben an integer, `42` is just an example. It represents the minimum number of movies that an actor/actress needs to have done to be considered in our graph.
|
|
|
|
|
|
|
|
If we want to build the graph where the movies are the nodes, we have to run
|
|
|
|
```bash
|
|
./movie_graph_filter.py --votes 500
|
|
```
|
|
|
|
`votes` has to ben an integer, `500` is just an example. It represents the minimum number of votes that a movie needs to have on the IMDb database to be considered in our graph.
|
|
|
|
All the data filtered will be saved in a new folder called `data`
|
|
|
|
### Running the program
|
|
|
|
Let's move into the folder `scripts`. If we want to run the program on the actors graph, use
|
|
|
|
```bash
|
|
./actors_graph top_actors_42
|
|
```
|
|
> IMPORTANT: The algorithm is multi-threaded. It's set with a default number of 12, modify the file .cpp and change this value depending on the CPU.
|
|
|
|
where `top_actors_42` is the output file name. Anything can be used.
|
|
|
|
---
|
|
|
|
If we want to run the program on the movies graph, use
|
|
|
|
```bash
|
|
./movie_graph top_movies_42
|
|
```
|
|
|
|
> IMPORTANT: The algorithm is multi-threaded. It's set with a default number of 12, modify the file .cpp and change this value depending on the CPU.
|
|
|
|
where `top_movies_42` is the output file name. Anything can be used
|
|
|
|
---
|
|
|
|
Those scripts will generate two files .txt (one for the harmonic and one for the closeness centrality). Those files will have the top-100 elements for the relative centrality. If we want a different value, just change the variable `k` in the .cpp files
|
|
|
|
### Automatic script for different variables of filtering
|
|
|
|
We are in the folder `scripts`. Inside both the folders `actor-graph` and `movie-graph` there is a file called `bench_me.sh`. This file will run everything automatically in loop for different values of the filtering variables. To modify this file we need to edit the file. To run it
|
|
|
|
```bash
|
|
./bench_me.sh
|
|
```
|
|
|
|
This will also save the logs in a folder called `time`. It can be usefull to analyze the performance of the program.
|
|
|
|
---
|
|
|
|
Inside the folders `closeness centrality` (for both graph), there is a python script `analysis.py`. Put all the generated `_c.txt` files in the folder and run it. It will return a matrix showing the discrepancy of the results while varying the variable
|
|
|
|
|
|
### Generating the interactive graphs
|
|
|
|
First, let's move into the folder `visualization`
|
|
|
|
```bash
|
|
cd visualization
|
|
```
|
|
As before, we will find two folders, one for each type of graph. Choose the one that we want to with and move into that folder. Inside it we need to create a folder called `data`
|
|
|
|
```bash
|
|
mkdir data
|
|
```
|
|
|
|
And copy inside it the files
|
|
|
|
- `Attori.txt`
|
|
- `FilmFiltrati.txt`
|
|
- `Relazioni.txt`
|
|
|
|
Attention! If we are visualizing the actors graph, it's important to copy the file generated for it. Ideal values of `min-actors` and `votes` during the filtering are respectively `70` and `100000`. Since it has to be rendered in a web page, this values will generate graphs with about 1000 nodes. I won't suggest to try with bigger graphs
|
|
|
|
## To Do
|
|
|
|
- [ ] Organize all the code using `OOP`
|
|
- [ ] Normalize the harmonic centrality and it's bound
|
|
- [ ] Test with other collaboration networks
|
|
- [ ] Give `k` as input parameter
|