# An exact and fast algorithm for computing top-k closeness centrality
The explanation of this algorithm and all of its analysis can be found in the paper (PDF)
> [Paper](https://github.com/lukefleed/imdb-graph/blob/main/tex/src/main.pdf)
## Documentation
First things first: we need to clone the repository
```bash
git clone https://github.com/lukefleed/imdb-graph
```
Once done, move into it
```bash
cd imdb-graph
```
### Downloading and filtering the data
All the necessary files are inside the folder `filters`
```bash
cd filters
```
We have two options. If we want to build the graph where the actors are the nodes, we have to run
```bash
./actors_graph_filter.py --min-movies 42
```
`min-movies` has to be an integer; `42` is just an example. It is the minimum number of movies that an actor/actress must have appeared in to be included in our graph.
If we want to build the graph where the movies are the nodes, we have to run
```bash
./movie_graph_filter.py --votes 500
```
`votes` has to be an integer; `500` is just an example. It is the minimum number of votes that a movie must have on the IMDb database to be included in our graph.
All the filtered data will be saved in a new folder called `data`
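
To make the two thresholds concrete, here is a minimal sketch of what such filters do. It is not the code of the filter scripts: the function names and the in-memory input formats (a list of `(actor_id, movie_id)` pairs and a `movie_id -> votes` dictionary) are assumptions made for illustration.

```python
from collections import defaultdict


def filter_actors(credits, min_movies):
    """Keep actors that appear in at least `min_movies` distinct movies.

    `credits` is an iterable of (actor_id, movie_id) pairs (hypothetical format).
    """
    movies_by_actor = defaultdict(set)
    for actor_id, movie_id in credits:
        movies_by_actor[actor_id].add(movie_id)
    return {a for a, movies in movies_by_actor.items() if len(movies) >= min_movies}


def filter_movies(votes_by_movie, min_votes):
    """Keep movies with at least `min_votes` votes (hypothetical format)."""
    return {m for m, votes in votes_by_movie.items() if votes >= min_votes}


credits = [("a1", "m1"), ("a1", "m2"), ("a1", "m3"),
           ("a2", "m1"), ("a3", "m2"), ("a3", "m3")]
print(filter_actors(credits, min_movies=2))
print(filter_movies({"m1": 900, "m2": 120, "m3": 500}, min_votes=500))
```

Raising either threshold shrinks the resulting graph, which is why these values also matter later for the visualization.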
### Running the program
Let's move into the folder `scripts`. If we want to run the program on the actors graph, use
```bash
./actors_graph top_actors_42
```
where `top_actors_42` is the output file name. Anything can be used.

> IMPORTANT: The algorithm is multi-threaded. The number of threads defaults to 12; to change it, edit this value in the `.cpp` file according to your CPU.
---
If we want to run the program on the movies graph, use
```bash
./movie_graph top_movies_42
```
where `top_movies_42` is the output file name. Anything can be used.

> IMPORTANT: The algorithm is multi-threaded. The number of threads defaults to 12; to change it, edit this value in the `.cpp` file according to your CPU.
---
Those scripts will generate two `.txt` files (one for the harmonic centrality and one for the closeness centrality). Each file contains the top 100 elements for the respective centrality. To get a different number of elements, change the variable `k` in the `.cpp` files.
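
For readers who want a reference point for the two measures, the sketch below computes them naively with one BFS per node on an unweighted graph. It uses the textbook definitions (closeness as reachable nodes divided by the sum of their distances, harmonic as the sum of inverse distances), which may be normalized differently in the paper, and it does not implement the paper's pruned top-k algorithm. The adjacency-dictionary format and all function names are assumptions.

```python
from collections import deque
import heapq


def bfs_distances(graph, source):
    """Return hop distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def centralities(graph, v):
    """Return (closeness, harmonic) of node `v` using textbook definitions."""
    dist = bfs_distances(graph, v)
    others = [d for node, d in dist.items() if node != v]
    total = sum(others)
    closeness = len(others) / total if total > 0 else 0.0
    harmonic = sum(1.0 / d for d in others)
    return closeness, harmonic


def top_k(graph, k):
    """Naive top-k: score every node, then keep the k largest per measure."""
    scores = {v: centralities(graph, v) for v in graph}
    by_closeness = heapq.nlargest(k, graph, key=lambda v: scores[v][0])
    by_harmonic = heapq.nlargest(k, graph, key=lambda v: scores[v][1])
    return by_closeness, by_harmonic


# Path graph a - b - c - d: the two inner nodes are the most central.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(centralities(graph, "b"))
print(top_k(graph, 2))
```

This brute-force version runs a full BFS from every node; the point of the exact top-k algorithm is to avoid most of that work while returning the same ranking.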
### Automatic script for different variables of filtering
We are in the folder `scripts`. Inside both the folders `actor-graph` and `movie-graph` there is a file called `bench_me.sh`. This file runs everything automatically in a loop over different values of the filtering variables. To change the values it loops over, edit the script. To run it
```bash
./bench_me.sh
```
This will also save the logs in a folder called `time`, which can be useful for analyzing the performance of the program.

---
Inside the folders `closeness centrality` (for both graphs), there is a Python script `analysis.py`. Put all the generated `_c.txt` files in that folder and run the script. It will return a matrix showing how much the results differ as the filtering variable changes.
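
As an illustration of what such a matrix can look like, the sketch below compares top-k lists pairwise and reports the fraction of elements of one list that are missing from the other. This is one possible discrepancy measure chosen for the example; the measure actually used by `analysis.py`, and the layout of the `_c.txt` files, are not described here, so the function name and input format are assumptions.

```python
def discrepancy_matrix(rankings):
    """Entry [i][j] is the fraction of list i's elements absent from list j.

    `rankings` is a list of top-k lists of node ids (hypothetical format),
    one list per value of the filtering variable.
    """
    sets = [set(r) for r in rankings]
    matrix = []
    for a in sets:
        row = [len(a - b) / len(a) if a else 0.0 for b in sets]
        matrix.append(row)
    return matrix


rankings = [
    ["x", "y", "z", "w"],  # e.g. results for one filtering value
    ["x", "y", "z", "q"],  # a slightly different filtering value
    ["x", "p", "q", "r"],  # a very different filtering value
]
for row in discrepancy_matrix(rankings):
    print(row)
```

Values close to 0 mean the two filtering settings produce nearly the same top-k elements, while values close to 1 mean the rankings have little in common.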
### Generating the interactive graphs
First, let's move into the folder `visualization`
```bash
cd visualization
```
As before, we will find two folders, one for each type of graph. Choose the one that we want to work with and move into that folder. Inside it we need to create a folder called `data`
```bash
mkdir data
```
Then copy the following files into it
- `Attori.txt`
- `FilmFiltrati.txt`
- `Relazioni.txt`
Attention! If we are visualizing the actors graph, it's important to copy the files generated for it. Ideal values of `min-movies` and `votes` during the filtering are `70` and `100000` respectively. Since the graph has to be rendered in a web page, these values generate graphs with about 1000 nodes. We don't recommend trying bigger graphs.
## To Do
- [ ] Organize all the code using `OOP`
- [ ] Normalize the harmonic centrality and its bound
- [ ] Test with other collaboration networks
- [ ] Give `k` as input parameter