analysis almost complete, visualization complete

main
Luca Lombardo 3 years ago
parent 466cbb70c9
commit 08efac9f77

@@ -1 +1 @@
tex/src/main.pdf

tex/code.tex vendored

@@ -1,35 +0,0 @@
\section{An overview of the code}
The algorithm implemented is multi-threaded and written in C\texttt{++}.
\subsection{Data structures}
In this case we are working with two simple \texttt{struct} for the classes \emph{Film} and \emph{Actor}
\lstinputlisting[language=c++]{code/struct.cpp}
\s
\nd Then we need two dictionaries built like this
\lstinputlisting[language=c++]{code/map.cpp}
\s
\nd We are considering the files \texttt{Attori.txt} and \texttt{FilmFiltrati.txt}; we don't need the relations file for now. Once we have read these two files, we loop over each one, filling the two dictionaries created before. If a line is empty, we skip it. We are using a try-and-catch approach. Even if the good practice is to use it only for a specific error, since we are outputting everything to the terminal it makes sense to \emph{catch} any error.
\lstinputlisting[language=c++]{code/data.cpp}
\s
Now we can use the file \texttt{Relazioni.txt}. As before, we loop on all the elements of this file, creating the variables
\begin{itemize}
\item \texttt{id\textunderscore film}: index key of each movie
\item \texttt{id\textunderscore attore}: index key of each actor
\end{itemize}
\nd If they both exist, we update the list of indices of the movies that the actor/actress played in. In the same way, we update the list of indices of the actors/actresses that played in the movie with that id.
\lstinputlisting[language=c++]{code/graph.cpp}
\s
Now that we have defined how to build this graph, we have to implement the algorithm that will return the top-$k$ central elements. \s
\nd The code can be found here: \url{https://github.com/lukefleed/imdb-graph}
\s
\begin{center}
\qrcode[height=1in]{https://github.com/lukefleed/imdb-graph}
\end{center}


@@ -1,6 +1,6 @@
\section{The algorithm}
In a connected graph, given a node $v \in V$, we can define its farness as
\begin{equation}
f(v) = \frac{1}{c(v)} = \frac{1}{n-1} \displaystyle \sum_{w \in V} d(v,w)
@@ -18,21 +18,21 @@ where $r(v) = |R(v)|$ is the cardinality of the set of reachable nodes from $v$.
With the convention that in the case of $\frac{0}{0}$ we set the closeness of $v$ to $0$.
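\nd As a quick sanity check (a toy example, not taken from the dataset), consider the path graph with nodes $a,b,c$ and edges $\{a,b\},\{b,c\}$: then $d(a,b)=d(b,c)=1$ and $d(a,c)=2$, so
\[
c(b) = \frac{2}{1+1} = 1, \qquad c(a) = c(c) = \frac{2}{1+2} = \frac{2}{3}, \qquad f(a) = \frac{1}{c(a)} = \frac{3}{2}
\]
\nd and the middle node is, as expected, the most central one.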
\subsection{The lower bound technique}
During the computation of the farness, for each node, we have to compute the distance from that node to all the other ones reachable from it. Since we are dealing with millions of nodes, this is not possible in a reasonable time. In order to compute only the top-$k$ most central nodes we need to find a way to avoid computing the BFS for nodes that won't be in the top-$k$. \s
\noindent The idea is to keep track of a lower bound on the farness for each node that we will compute. If the lower bound tells us that the node will not be in the top-$k$, this will allow us to kill the BFS operation before it reaches the end. More precisely:
\begin{itemize}
\item The algorithm will compute the farness of the first $k$ nodes, saving them in a vector \texttt{top}. From now on, this vector will be full.
\item Then, for all the next nodes, it defines a lower bound
\begin{equation}\label{lower-bound}
\frac{n-1}{(n-1)^2} (\sigma_{d-1} + n_d \cdot d)
\end{equation}
where $\sigma_d$ is the partial sum in \eqref{farness} at the level of exploration $d$. The lower bound \eqref{lower-bound} is updated every time we change level of exploration during the BFS. In this way, if at a change of level the lower bound of the vertex that we are considering is bigger than the $k$-th element of \texttt{top}, we can kill the BFS. The reason behind that is very simple: the vector \texttt{top} is populated with the top-$k$ nodes in order, and the farness is inversely proportional to the closeness centrality. So if at that level $d$ the lower bound is already bigger than the last element of the vector, there is no need to compute the other levels of the BFS, since the node will not be added to \texttt{top} anyway. \s
The bound \eqref{lower-bound} is a worst-case scenario, and that makes it perfect for a lower bound. If we are at level $d$ of exploration, we have already computed the sum in \eqref{farness} up to level $d-1$. Then we need to consider in our computation of the sum the current level of exploration: the worst case gives us that the node is linked to all the nodes at distance $d$. We also put $r(v)=n$, as in the case where our graph is strongly connected and all nodes are reachable from $v$.
\end{itemize}
\textsc{Write pseudocode}
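\nd In the meantime, here is a minimal C\texttt{++} sketch of the cutoff inside a single BFS. All the names are illustrative and the code is a sequential simplification, not the actual multi-threaded implementation of the repository (it assumes \texttt{<vector>}, \texttt{<queue>} and \texttt{<limits>}).
\begin{lstlisting}[language=c++]
// Hypothetical sketch: farness of v with the lower-bound cutoff.
// `kth` is the farness of the current k-th element of `top`.
double bfs_farness_cutoff(const std::vector<std::vector<int>>& graph,
                          int v, double kth) {
    const long long n = graph.size();
    std::vector<int> dist(n, -1);
    std::queue<int> q;
    q.push(v); dist[v] = 0;
    long long sigma = 0; // sigma_{d-1}: sum of distances up to level d-1
    long long r = 1;     // r(v): nodes reached so far
    int d = 0;           // current level of exploration
    while (!q.empty()) {
        ++d;
        long long n_d = 0; // nodes discovered at distance d
        for (auto sz = q.size(); sz > 0; --sz) {
            int u = q.front(); q.pop();
            for (int w : graph[u])
                if (dist[w] == -1) { dist[w] = d; ++n_d; q.push(w); }
        }
        // the lower bound above, with r(v) = n as worst case
        double lb = (n - 1.0) / ((n - 1.0) * (n - 1.0)) * (sigma + n_d * d);
        if (lb > kth) return std::numeric_limits<double>::infinity(); // kill the BFS
        sigma += n_d * d; // sigma_d
        r += n_d;
    }
    if (r <= 1) return std::numeric_limits<double>::infinity(); // convention: closeness 0
    return (n - 1.0) / ((r - 1.0) * (r - 1.0)) * sigma; // generalized farness
}
\end{lstlisting}
\nd A node whose BFS is killed simply cannot enter \texttt{top}, so returning an infinite farness is enough.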

@@ -11,40 +11,69 @@ The first one will tell us how much more efficient the algorithm is in terms of
\subsection{Actors graph} \label{actors-graph}
Let's take into analysis the graph where each actor is a node and two nodes are linked if they played in a movie together. In this case, during the filtering, we created the variable \texttt{MIN\textunderscore ACTORS}. This variable is the minimum number of movies that an actor/actress has to have made to be considered in the computation.
Varying this variable obviously affects the algorithm, in different ways. The higher this variable is, the fewer actors we are taking into consideration. So, with a smaller graph, we are expecting better results in terms of execution time. On the other hand, we can also expect to have less accurate results. What we are going to discuss is how much changing \texttt{MIN\textunderscore ACTORS} affects these two factors.
\subsubsection{Time of execution} \label{time-actors}
In this section we are going to analyze the performance of the algorithm as a function of different values of \texttt{MIN\textunderscore ACTORS}. Low values of this variable will lead to an exponential growth of the cardinality of the node and edge sets.
\newpage
\begin{figure}[h!]
\centering
\includegraphics[width=12cm]{actors_time.png}
\caption{\emph{CPU time} in relation to the \texttt{MIN\textunderscore ACTORS} variable}
\end{figure}
\nd In the analysis only the \emph{CPU time} (divided by the number of threads) is taken into consideration. However, the \emph{system time} is in the order of a few minutes in the worst case.
\subsubsection{Discrepancy of the results}
We want to analyze how truthful our results are while varying \texttt{MIN\textunderscore ACTORS}. The methodology is simple: for each pair of result lists we take their intersection. This will return the number of elements in common. Knowing the length of the lists, we can find the number of elements not in common. \s
\nd A way to see these results is with a square matrix $n \times n, ~ A = (a_{ij})$, where $n$ is the number of different values that we gave to \texttt{MIN\textunderscore ACTORS} during the testing. In this way the $(i,j)$ position is the percentage of discrepancy between the results with \texttt{MIN\textunderscore ACTORS} set as $i$ and $j$. \s
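\nd For instance, if $L_i$ denotes the top-$k$ list computed with the $i$-th tested value of \texttt{MIN\textunderscore ACTORS} (notation introduced here just for illustration), each entry can be written as
\[
a_{ij} = 100 \cdot \frac{k - |L_i \cap L_j|}{k}
\]
\nd i.e.\ the percentage of elements of the two lists that are not in common. \s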
\newpage
\nd This analysis is implemented in python using the \texttt{pandas} and \texttt{numpy} libraries.
\lstinputlisting[language=python]{code/closeness_analysis.py}
\nd Visualizing it we obtain this
\begin{figure}[h!] \label{matrix-a}
\centering
\includegraphics[width=11.5cm]{Figure_1.png}
\caption{Discrepancy of the results on the actors graph as a function of the minimum number of movies required to be considered as a node}
\end{figure}
\nd As expected, the matrix is symmetric and the elements on the diagonal are all equal to zero. We can clearly see that with a lower value of \texttt{MIN\textunderscore ACTORS} the results are more precise. The discrepancy with \texttt{MIN\textunderscore ACTORS=10} is 14\%, while it is 39\% when \texttt{MIN\textunderscore ACTORS=70}. \s
\nd This is what we obtain comparing the top-$k$ results when $k=100$. It's interesting to see how much the discrepancy changes with different values of $k$. However, choosing a lower value for $k$ would not be useful for this type of analysis: since we are looking at the elements not in common between two lists, with a small length we would get results biased by statistical fluctuations. \s
\textsc{To do: test with k=500 and k=1000}
\s
\newpage
\subsection{Movies Graph}
In this section we are taking into consideration the graph built over the movies and their common actors/actresses. Due to the elevated number of nodes, to optimize the performance, in section \ref{filtering} we introduced the variable \texttt{VOTES}. It represents the minimum number of votes (whether positive or negative) that a movie needs to have on the IMDb database to be considered as a node in our graph.
As seen during the analysis of the actors graph in \ref{actors-graph}, varying this kind of variable affects the results in many ways.
\subsubsection{Time of execution}
As seen in \ref{time-actors}, we are going to analyze the performance of the algorithm as a function of different values of \texttt{VOTES}. Low values of this variable will lead to an exponential growth of the cardinality of the node and edge sets. And as we know, with a bigger graph there are more operations to do.
\begin{figure}[h!]
\centering
\includegraphics[width=12cm]{movies_time.png}
\caption{\emph{CPU time} in relation to the \texttt{VOTES} variable}
\end{figure}
\newpage
\subsubsection{Discrepancy of the results}
All the observations made before are still valid for this case; I won't repeat them for brevity. As done before (\ref{matrix-a}), we are going to use a matrix to visualize and analyze the results
\s
% \lstinputlisting[language=python]{code/closeness_analysis_2.py}
@@ -58,4 +87,4 @@ As seen during the analysis of the actors graph in \ref{actors-graph}, varying t
\newpage
\lstinputlisting[language=python]{code/closeness_analysis_2.py}
\s \nd In this graph there is much more discrepancy in the results with a lower cardinality of the node set. Even if the lowest and biggest values of \texttt{VOTES} give us a graph with the same order of nodes as the previous one, the percentage difference in accuracy is completely different. The reason for that is that the two graphs taken as examples are very different. If we want a higher accuracy on the movies graph, we have to lose some performance and use lower values of \texttt{VOTES}.

tex/src/code.tex vendored

@@ -0,0 +1,88 @@
\section{An overview of the code}
The algorithm implemented is multi-threaded and written in C\texttt{++}. To avoid redundancy, we'll examine only the \emph{Actors Graph} case.
\subsection{Data structures}
In this case we are working with two simple \texttt{struct} for the classes \emph{Film} and \emph{Actor}
\lstinputlisting[language=c++]{code/struct.cpp}
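\nd The listing is included from \texttt{code/struct.cpp}; as a reference, this is a minimal sketch of what the two structures may look like (the field names here are illustrative, not necessarily the ones used in the repository).
\begin{lstlisting}[language=c++]
// Hypothetical sketch of the two structures.
struct Film {
    std::string name;
    std::vector<int> actors; // indices of the actors in the cast
};
struct Actor {
    std::string name;
    std::vector<int> films;  // indices of the movies they played in
};
\end{lstlisting}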
\s
\nd Then we need two dictionaries built like this
\lstinputlisting[language=c++]{code/map.cpp}
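\nd As a sketch, assuming the IMDb numeric identifiers are used as keys (again, the names are illustrative):
\begin{lstlisting}[language=c++]
// Hypothetical sketch: the two dictionaries, keyed by numeric id.
std::map<int, Actor> A; // id_attore -> Actor
std::map<int, Film>  F; // id_film   -> Film
\end{lstlisting}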
\s
\nd We are considering the files \texttt{Attori.txt} and \texttt{FilmFiltrati.txt}; we don't need the relations file for now. Once we have read these two files, we loop over each one, filling the two dictionaries created before. If a line is empty, we skip it. We are using a try-and-catch approach. Even if the good practice is to use it only for a specific error, since we are outputting everything to the terminal it makes sense to \emph{catch} any error.
\lstinputlisting[language=c++]{code/data.cpp}
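\nd A minimal sketch of this reading loop for \texttt{Attori.txt}, assuming tab-separated \texttt{id} and \texttt{name} columns (the file layout and the names are assumptions; the same pattern applies to \texttt{FilmFiltrati.txt}):
\begin{lstlisting}[language=c++]
// Hypothetical sketch, assuming lines of the form "id<TAB>name".
std::ifstream file("data/Attori.txt");
std::string line;
while (std::getline(file, line)) {
    if (line.empty()) continue;   // skip empty lines
    try {
        size_t tab = line.find('\t');
        int id = std::stoi(line.substr(0, tab));
        A[id] = Actor{line.substr(tab + 1), {}};
    } catch (...) {               // catch any malformed line
        std::cerr << "Bad line: " << line << "\n";
    }
}
\end{lstlisting}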
\s
Now we can use the file \texttt{Relazioni.txt}. As before, we loop on all the elements of this file, creating the variables
\begin{itemize}
\item \texttt{id\textunderscore film}: index key of each movie
\item \texttt{id\textunderscore attore}: index key of each actor
\end{itemize}
\nd If they both exist, we update the list of indices of the movies that the actor/actress played in. In the same way, we update the list of indices of the actors/actresses that played in the movie with that id.
\lstinputlisting[language=c++]{code/graph.cpp}
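\nd A sketch of this step, with the same illustrative names as above (the file is assumed to hold one \texttt{id\textunderscore film} \texttt{id\textunderscore attore} pair per line):
\begin{lstlisting}[language=c++]
// Hypothetical sketch of the loop over Relazioni.txt.
std::ifstream rel("data/Relazioni.txt");
int id_film, id_attore;
while (rel >> id_film >> id_attore) {
    auto f = F.find(id_film);
    auto a = A.find(id_attore);
    if (f != F.end() && a != A.end()) { // both ids must exist
        a->second.films.push_back(id_film);
        f->second.actors.push_back(id_attore);
    }
}
\end{lstlisting}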
\s
Now that we have defined how to build this graph, we have to implement the algorithm that will return the top-$k$ central elements. \s
\nd The code can be found here: \url{https://github.com/lukefleed/imdb-graph}
\s
\begin{center}
\qrcode{https://github.com/lukefleed/imdb-graph}
\end{center}
\subsection{Results - Actors Graph}
Here are the top-10 actors for closeness centrality obtained with the variable \texttt{MIN\textunderscore ACTORS=5} (as we'll see in the next section, it's the most accurate)
\begin{table}[h!]
\centering
\begin{tabular}{||c c||}
\hline
Node & Closeness centrality \\ [0.5ex]
\hline\hline
Eric Roberts & 0.324895 \\
Christopher Lee & 0.319873 \\
Franco Nero & 0.31946 \\
John Savage & 0.316258 \\
Michael Madsen & 0.314451 \\
Udo Kier & 0.31357 \\
Geraldine Chaplin & 0.313141 \\
Malcolm McDowell & 0.313014 \\
David Carradine & 0.312648 \\
Christopher Plummer & 0.311859 \\ [1ex]
\hline
\end{tabular}
\end{table}
\nd All the other results are available in the GitHub repository for all the values of \texttt{MIN\textunderscore ACTORS} and for $k=100$.
\newpage
\subsection{Results - Movies Graph}
Here are the top-10 movies for closeness centrality obtained with the variable \texttt{VOTES=500} (as we'll see in the next section, it's the most accurate)
\begin{table}[h!]
\centering
\begin{tabular}{||c c||}
\hline
Node & Closeness centrality \\ [0.5ex]
\hline\hline
Merlin & 0.290731 \\
The Odyssey & 0.290314 \\
The Color of Magic & 0.285208 \\
The Godfather Saga & 0.284932 \\
Jack and the Beanstalk: The Real Story & 0.283522 \\
In the Beginning & 0.28347 \\
RED 2 & 0.283362 \\
Lonesome Dove & 0.283353 \\
Moses & 0.282953 \\
Species & 0.282642 \\ [1ex]
\hline
\end{tabular}
\end{table}
\nd All the other results are available in the GitHub repository for all the values of \texttt{VOTES} and for $k=100$.

@@ -62,7 +62,7 @@ Let's have a closer look at these 4 files:
This is a crucial section for the algorithm in this particular case study. The raw data contains a huge amount of useless information that would just have a negative impact on the performance during the computation. We are going to see in detail all the modifications made for each file. All these operations have been implemented using \texttt{python} and the \texttt{pandas} library. \s
\nd Since we want to build two different graphs, some considerations will have to be made for each specific case. If nothing is said, it means that the filtering of that file is the same for both graphs.
\subsubsection{name.basics.tsv}
@@ -98,7 +98,7 @@ Since all the movies start with the string \texttt{t0} we can remove it to clea
\item \texttt{tvMovie}
\item \texttt{tvMiniSeries}
\end{itemize}
The reason to only consider these categories is purely to optimize the performance during the computation. On IMDb each episode is listed as a single element: to remove them without losing the most important relations, we only consider the category \texttt{tvSeries}. This category lists a TV-Series as a single element, not divided into multiple episodes. In this way we will lose some of the relations with minor actors that may appear in just a few episodes. But we will have preserved the relations between the protagonists of the show. \s
\noindent Then we can generate the final filtered file \texttt{FilmFiltrati.txt} that has only two columns: \texttt{tconst} and \texttt{primaryTitle}
@@ -121,7 +121,7 @@ This file is needed for the analysis of both graphs, but there some different ob
\textsc{Movies Graph} \s
\noindent For this graph we don't need any optimization on this file. We just clean the output and leave the rest as it is. \s
\nd At the end, for both graphs, we can finally generate the file \texttt{Relazioni.txt} containing the columns \texttt{tconst} and \texttt{nconst}
@@ -134,6 +134,6 @@ This file is necessary just in the analysis of the movie graph, it won't be even
\item \texttt{numVotes}
\end{itemize}
\nd The idea behind the optimization made in this file is the same that we have used before with the \texttt{MINMOVIES} technique. We want to avoid computing movies that, with high probability, are not central. To do that we consider the number of votes that each movie has received on the IMDb website: we introduce the constant \texttt{VOTES}, considering only the movies with a higher number of votes. During the analysis we will change this value to see how it affects the list of the top-$k$ most central movies. \s
\nd In this case we don't have to generate a new file; we can apply this condition to \texttt{FilmFiltrati.txt}

@@ -1,23 +1,23 @@
\section{Introduction}
A graph $G=(V,E)$ is a pair of sets, where $V = \{v_1,...,v_n\}$ is the set of \emph{nodes} and $E \subseteq V \times V, ~ E = \{(v_i,v_j),...\}$ is the set of \emph{edges} (with $|E| = m \leq n^2$). \s
\nd In this paper we discuss the problem of identifying the most central nodes in a network using the measure of \emph{closeness centrality}. Given a connected graph, the closeness centrality of a node $v \in V$ is defined as the reciprocal of the sum of the lengths of the shortest paths between the node and all other nodes in the graph. Normalizing, we obtain the following formula:
\begin{equation}\label{closeness}
c(v) = \frac{n-1}{\displaystyle \sum_{w \in V} d(v,w)}
\end{equation}
\nd where $n$ is the cardinality of $V$ and $d(v,w)$ is the distance between $v,w \in V$. This is a very powerful tool for the analysis of a network: it ranks each node, telling us the most efficient ones in spreading information through all the other nodes in the graph. As mentioned before, the denominator of this definition is the sum of the lengths of the shortest paths from $v$ to all the other nodes. This means that for a node to be central, the average number of links needed to reach another node has to be low. The goal of this paper is to compute the $k$ vertices with the highest closeness centrality. \s
\noindent As a case study we are using the collaboration network in the \emph{Internet Movie Database} (IMDb). We are going to consider two different graphs. For the first one we define an undirected graph $G=(V,E)$ where
\begin{itemize}
\item The vertices $V$ are the actors and the actresses
\item The non-oriented edges in $E$ link the actors and the actresses if they played together in a movie.
\end{itemize}
For the second one we do the opposite thing: we define an undirected graph $G=(V,E)$ where
\begin{itemize}
\item The vertices $V$ are the movies
\item The non-oriented edges in $E$ link two movies if they have an actor or actress in common.
\end{itemize}
\subsection{The Problem}


@@ -14,6 +14,7 @@
\usepackage{hyperref}
\usepackage{textcomp}
\usepackage{qrcode}
\graphicspath{ {../figures/} }
\newcommand{\N}{\mathbb{N}}
\newcommand{\Z}{\mathbb{Z}}
@@ -30,6 +31,7 @@
\newcommand{\Zmn}{\Z/mn\Z}
\newcommand{\s}{\vspace*{0.4 cm}}
\newcommand{\nd}{\noindent}
% \newcommand{\mactors}{\texttt{MIN\textunderscore ACTORS}}
\definecolor{codegreen}{rgb}{0,0.6,0}

@@ -0,0 +1,50 @@
\section{Visualization of the graphs}
Graphs are fascinating structures; visualizing them can give us a deeper understanding of their properties. Since we are dealing with millions of nodes, displaying them all would be impossible, especially on a web page. \s
\nd For each case we need to find a small (in the order of 1000) subset of nodes $S \subset V$ that we want to display. It's important to take into consideration, as far as we can, nodes that are ``important'' in the graph. \s
\nd All of this section is implemented in python using the \texttt{pyvis} library. The goal of this library is to provide a python-based approach to constructing and visualizing network graphs. A pyvis network can be customized on a per-node or per-edge basis: nodes can be given colors, sizes, labels, and other metadata. Each graph can be interacted with, allowing the dragging, hovering, and selection of nodes and edges. Each graph's layout algorithm can be tweaked as well, to allow experimentation with the rendering of larger graphs. It is designed as a wrapper around the popular JavaScript \texttt{visJS} library.
\subsection{Actors Graph} \label{actors-graph-vis}
For the actors graph, we take the subset $S$ as the actors and actresses with at least 100 movies made in their career. We can immediately deduce that this subset will be characterized by actors and actresses of a certain age: it takes time to make 100 movies. But as we have seen, having a high number of movies made is a good estimator for the closeness centrality. It's important to keep in mind that the graph will only show the relations within this subset. This means that even if an actor has made 100 movies in his career, in this graph the relative node may have just a few relations. We can see this graph as a collaboration network only between the most popular actors and actresses. \s
\nd An interactive version can be found at this web page. It will take a few seconds to render; it's better to use a computer and not a smartphone. \s
\textsc{Interactive version}: \url{https://lukefleed.xyz/imdb-graph.html}
\begin{center}
\s \nd \qrcode{https://lukefleed.xyz/imdb-graph.html}
\end{center}
\begin{figure}[H] \label{imdb-a-network}
\centering
\includegraphics[width=13cm]{Screenshot.png}
\caption{\emph{The collaboration network of the actors and actresses with more than 100 movies on the IMDb network}}
\end{figure}
The result obtained is extremely interesting. We can clearly see how this graph is characterized by different (and sometimes isolated) communities. The nodes in them are all actors and actresses of the same nationality. There are some very big clusters, such as the \emph{Bollywood} one, that are almost isolated. Due to cultural and linguistic differences, those actors never collaborated with anyone outside their country. \s
A visual analysis of this graph can reflect some of the properties that we saw during the analysis of the results. Let's take the biggest cluster, the Bollywood one. Even if it's very dense and its nodes have a lot of links, none of them ever appeared in our top-$k$ results during the testing. This happens due to the properties of closeness centrality, the measure that we are taking into consideration. It can be seen as the ability of a node to spread information efficiently through the graph. But the Bollywood nodes are efficient in spreading information only within their community, since they don't collaborate with nodes of other clusters. \s
A simple and heuristic way to see this phenomenon is by selecting, in the interactive graph, a node with a higher centrality and dragging it around. It will move and influence almost every community. If we repeat the same action with a Bollywood node, it will only move the nodes of its community, leaving almost unmoved all the other nodes.
\subsection{Movies Graph}
The methodology used for this graph is basically the same as in \ref{actors-graph-vis}; however, the results are slightly different. \s
\textsc{Interactive version}: \url{https://lukefleed.xyz/imdb-movie-graph.html}
\begin{center}
\s \nd \qrcode{https://lukefleed.xyz/imdb-movie-graph.html}
\end{center}
\s
\begin{figure}[H] \label{imdb-m-network}
\centering
\includegraphics[width=13cm]{movie-graph.png}
\caption{\emph{The network of the movies with more than 500 votes on the IMDb database}}
\end{figure}
Even if at first sight it may seem completely different from the previous one, it is not. As we can see, there are no evident communities, but some areas are denser than others. If we zoom in on one of those areas, we can see that the movies are often related. If there is a saga of popular movies, they will be very close in this graph. It's easy to find some big neighborhoods, such as the MCU (Marvel Cinematic Universe) one. \s
\nd Since we are considering roughly the top thousand most popular nodes, those movies are mostly from the Hollywood scene. So it makes sense that there are no isolated clusters.

@ -1,29 +0,0 @@
\section{Visualization of the graphs}
Graphs are fascinating structures; visualizing them can give us a deeper understanding of their properties. To do that we need to make some sacrifices. We are dealing with millions of nodes; displaying them all would be impossible, especially on a web page as I did. \s
\nd For each case we need to find a small (in the order of 1000) subset of nodes $S \subset V$ that we want to display. It's important to take into consideration, as far as we can, nodes that are ``important'' in the graph. \s
\nd All of this section is implemented in python using the \texttt{pyvis} library. The goal of this library is to provide a python-based approach to constructing and visualizing network graphs. A pyvis network can be customized on a per-node or per-edge basis: nodes can be given colors, sizes, labels, and other metadata. Each graph can be interacted with, allowing the dragging, hovering, and selection of nodes and edges. Each graph's layout algorithm can be tweaked as well, to allow experimentation with the rendering of larger graphs. It is designed as a wrapper around the popular JavaScript \texttt{visJS} library.
\subsection{Actors Graph}
For the actors graph we choose the subset $S$ as the actors with at least 100 movies made in their career. We can immediately deduce that this subset will be characterized by actors and actresses of a certain age. But as we have seen, having a high number of movies made is a good estimator for the closeness centrality. It's important to keep in mind that the graph will only show the relations between nodes in this subset. This means that even if an actor has made 100 movies in his career, in this graph his node may have just a few relations. We can see this graph as a collaboration network between the most popular actors and actresses. \s
\nd An interactive version can be found at this web page. It will take a few seconds to render; it's better to use a computer and not a smartphone. \s
\textsc{Interactive version}: \url{https://lukefleed.xyz/imdb-graph.html}
\begin{center}
\s \nd \qrcode{https://lukefleed.xyz/imdb-graph.html}
\end{center}
\begin{figure}[H] \label{imdb-a-network}
\centering
\includegraphics[width=13cm]{Screenshot.png}
\caption{\emph{The collaboration network of the actors and actresses with more than 100 movies on the IMDb network}}
\end{figure}
The result obtained is extremely interesting. We can clearly see how this graph is characterized by different (and sometimes isolated) communities. The nodes in them are all actors and actresses of the same nationality. There are some very big clusters, such as the \emph{Bollywood} one, that are almost isolated. Due to cultural and linguistic differences, those actors never collaborated with anyone outside their country. \s
A visual analysis of this graph can reflect some of the properties that we saw during the analysis of the results. Let's take the biggest cluster, the Bollywood one. Even if it's very dense and its nodes have a lot of links, none of them ever appeared in our top-$k$ results during the testing. This happens due to the properties of closeness centrality, the measure that we are taking into consideration. It can be seen as the ability of a node to transport information efficiently into the graph. But the Bollywood nodes are efficient in transporting information only in their communities. \s
A simple and heuristic way to see this phenomenon is by grabbing, in the interactive graph, a node with a higher centrality and dragging it around. We'll see that it will drag with it every community. If we repeat the same action with a Bollywood node, it will only move the nodes of its community, leaving almost unmoved all the other nodes