# An exact and fast algorithm for computing top-k closeness centrality

The algorithm and its full analysis are explained in the PDF paper:

> [Paper](https://github.com/lukefleed/imdb-graph/blob/main/tex/src/main.pdf)


## Documentation

First things first, we need to clone the repository:

```bash
git clone https://github.com/lukefleed/imdb-graph
```

Once done, move into it:

```bash
cd imdb-graph
```

### Downloading and filtering the data

All the necessary files are inside the folder `filters`:

```bash
cd filters
```
We have two options. If we want to build the graph where the actors are the nodes, we have to run

```bash
./actors_graph_filter.py --min-movies 42
```

`min-movies` has to be an integer; `42` is just an example. It is the minimum number of movies that an actor or actress must have appeared in to be included in the graph.
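
As a rough illustration of what this threshold does, the sketch below keeps only the actors credited in at least `min_movies` distinct movies. The function name `filter_actors`, the `(actor, movie)` pair layout, and the toy data are assumptions made for the example; they are not taken from `actors_graph_filter.py`.

```python
from collections import defaultdict


def filter_actors(credits, min_movies):
    """Return the set of actors credited in at least `min_movies` distinct movies.

    `credits` is an iterable of (actor_id, movie_id) pairs (hypothetical layout).
    """
    movies_per_actor = defaultdict(set)
    for actor, movie in credits:
        movies_per_actor[actor].add(movie)
    return {actor for actor, movies in movies_per_actor.items()
            if len(movies) >= min_movies}


# Toy data, invented for this example only.
credits = [
    ("a1", "m1"), ("a1", "m2"), ("a1", "m3"),
    ("a2", "m1"),
    ("a3", "m2"), ("a3", "m3"),
]

print(sorted(filter_actors(credits, min_movies=2)))  # ['a1', 'a3']
```

A higher threshold gives a smaller graph, which is why this value has a direct effect on the running time of the later steps.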


If we want to build the graph where the movies are the nodes, we have to run

```bash
./movie_graph_filter.py --votes 500
```

`votes` has to be an integer; `500` is just an example. It is the minimum number of votes that a movie must have on the IMDb database to be included in the graph.

All the filtered data will be saved in a new folder called `data`.

### Running the program

Let's move into the folder `scripts`. If we want to run the program on the actors graph, use

```bash
./actors_graph top_actors_42
```
where `top_actors_42` is the output file name. Any name can be used.

> IMPORTANT: The algorithm is multi-threaded. The number of threads defaults to 12; edit the `.cpp` file and change this value to suit the CPU.

---

If we want to run the program on the movies graph, use

```bash
./movie_graph top_movies_42
```

where `top_movies_42` is the output file name. Any name can be used.

> IMPORTANT: The algorithm is multi-threaded. The number of threads defaults to 12; edit the `.cpp` file and change this value to suit the CPU.

---

These programs generate two `.txt` files, one for the harmonic centrality and one for the closeness centrality. Each file lists the top 100 elements for the corresponding centrality. If we want a different number of results, just change the variable `k` in the `.cpp` files.
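
To make the two measures concrete, here is a minimal Python sketch that computes both centralities with one plain BFS per node and then picks the top `k`. It is a naive baseline, not the faster exact algorithm described in the paper. It assumes a connected, unweighted, undirected graph, uses the textbook definitions (closeness as `(n - 1)` divided by the sum of distances, harmonic as the unnormalized sum of inverse distances), and all function names and the toy graph are made up for the example.

```python
import heapq
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def centralities(adj):
    """Closeness and (unnormalized) harmonic centrality of every node.

    Assumes the graph is connected, so every BFS reaches all n nodes.
    """
    n = len(adj)
    closeness, harmonic = {}, {}
    for v in adj:
        dist = bfs_distances(adj, v)
        total = sum(dist.values())
        closeness[v] = (n - 1) / total if total > 0 else 0.0
        harmonic[v] = sum(1.0 / d for d in dist.values() if d > 0)
    return closeness, harmonic


def top_k(scores, k):
    """The k (node, score) pairs with the highest score."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Toy path graph 0 - 1 - 2 - 3 - 4, invented for this example.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
closeness, harmonic = centralities(adj)

print([node for node, _ in top_k(closeness, 1)])  # [2]
print([node for node, _ in top_k(harmonic, 1)])   # [2]
```

Running one BFS per node costs O(n · m) overall, which is the cost a top-k algorithm tries to avoid on large graphs.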

### Automatic script for different variables of filtering

We are in the folder `scripts`. Inside both the folders `actor-graph` and `movie-graph` there is a file called `bench_me.sh`. It runs everything automatically in a loop for different values of the filtering variables. To change those values, edit the file directly. To run it

```bash
./bench_me.sh
```

This will also save the logs in a folder called `time`, which can be useful for analyzing the performance of the program.

---

Inside the folders `closeness centrality` (one for each graph) there is a Python script, `analysis.py`. Put all the generated `_c.txt` files in the folder and run it. It returns a matrix showing how much the results differ as the filtering variable changes.
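
One possible way to build such a matrix is sketched below. The metric used by `analysis.py` is not described here, so this example assumes a simple one: the Jaccard distance between the sets of ids in two top-k lists (0 means the same elements, 1 means no element in common). The function names and the toy rankings are hypothetical.

```python
def discrepancy(ranking_a, ranking_b):
    """Jaccard distance between the id sets of two rankings (assumed metric)."""
    set_a, set_b = set(ranking_a), set(ranking_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1 - len(set_a & set_b) / len(union)


def discrepancy_matrix(rankings):
    """Square matrix whose entry [i][j] compares ranking i with ranking j."""
    return [[discrepancy(a, b) for b in rankings] for a in rankings]


# Toy top-3 lists, one per filtering value; invented for this example.
rankings = [
    ["x", "y", "z"],
    ["x", "y", "w"],
    ["u", "v", "w"],
]

for row in discrepancy_matrix(rankings):
    print(["%.2f" % value for value in row])
```

The matrix is symmetric with zeros on the diagonal, so only the upper triangle needs to be read. A rank-aware measure would be needed to also capture changes in order rather than only in membership.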


### Generating the interactive graphs

First, let's move into the folder `visualization`

```bash
cd visualization
```
As before, we will find two folders, one for each type of graph. Choose the one we want to work with and move into that folder. Inside it we need to create a folder called `data`

```bash
mkdir data
```

and copy these files into it:

- `Attori.txt`
- `FilmFiltrati.txt`
- `Relazioni.txt`

Attention! If we are visualizing the actors graph, it's important to copy the files generated for it. Ideal values of `min-movies` and `votes` during the filtering are `70` and `100000` respectively. Since the graph has to be rendered in a web page, these values generate graphs with about 1000 nodes. Trying bigger graphs is not recommended.

## To Do

- [ ] Organize all the code using `OOP`
- [ ] Normalize the harmonic centrality and its bound
- [ ] Test with other collaboration networks
- [ ] Give `k` as input parameter