imdb-graph/README.md

# An exact and fast algorithm for computing top-k closeness centrality

The explanation of this algorithm and all it's analysis can be found in the pdf paper

> [Paper](https://github.com/lukefleed/imdb-graph/blob/main/tex/src/main.pdf)


## Documentation

First thing first, we need to clone the repository

```bash
git clone https://github.com/lukefleed/imdb-graph
```

Once done, move in it

```bash
cd imdb-graph
```

### Downloading and filtering the data

All the necessary file are inside the folder `filters`

```bash
cd filters
```
We have two options. If we want to build the graph where the actors are the node, we have to run

```bash
./actors_graph_filter.py --min-movies 42
```

`min-movies` has to ben an integer, `42` is just an example. It represents the minimum number of movies that an actor/actress needs to have done to be considered in our graph.


If we want to build the graph where the movies are the nodes, we have to run

```bash
./movie_graph_filter.py --votes 500
```

`votes` has to ben an integer, `500` is just an example. It represents the minimum number of votes that a movie needs to have on the IMDb database to be considered in our graph.

All the data filtered will be saved in a new folder called `data`

### Running the program

Let's move into the folder `scripts`. If we want to run the program on the actors graph, use

```bash
./actors_graph top_actors_42
```
> IMPORTANT: The algorithm is multi-threaded. It's set with a default number of 12, modify the file .cpp and change this value depending on the CPU.

where `top_actors_42` is the output file name. Anything can be used.

---

If we want to run the program on the movies graph, use

```bash
./movie_graph top_movies_42
```

> IMPORTANT: The algorithm is multi-threaded. It's set with a default number of 12, modify the file .cpp and change this value depending on the CPU.

where `top_movies_42` is the output file name. Anything can be used

---

Those scripts will generate two files .txt (one for the harmonic and one for the closeness centrality). Those files will have the top-100 elements for the relative centrality. If we want a different value, just change the variable `k` in the .cpp files

### Automatic script for different variables of filtering

We are in the folder `scripts`. Inside both the folders `actor-graph` and  `movie-graph` there is a file called `bench_me.sh`. This file will run everything automatically in loop for different values of the filtering variables. To modify this file we need to edit the file. To run it

```bash
./bench_me.sh
```

This will also save the logs in a folder called `time`. It can be usefull to analyze the performance of the program.

---

Inside the folders `closeness centrality` (for both graph), there is a python script `analysis.py`. Put all the generated `_c.txt` files in the folder and run it. It will return a matrix showing the discrepancy of the results while varying the variable


### Generating the interactive graphs

First, let's move into the folder `visualization`

```bash
cd visualization
```
As before, we will find two folders, one for each type of graph. Choose the one that we want to with and move into that folder. Inside it we need to create a folder called `data`

```bash
mkdir data
```

And copy inside it the files

- `Attori.txt`
- `FilmFiltrati.txt`
- `Relazioni.txt`

Attention! If we are visualizing the actors graph, it's important to copy the file generated for it. Ideal values of `min-actors` and `votes` during the filtering are respectively `70` and `100000`. Since it has to be rendered in a web page, this values will generate graphs with about 1000 nodes. I won't suggest to try with bigger graphs

## To Do

- [ ] Organize all the code using `OOP`
- [ ] Normalize the harmonic centrality and it's bound
- [ ] Test with other collaboration networks
- [ ] Give `k` as input parameter