analysis almost complete, visualization complete
@ -0,0 +1,88 @@
|
||||
\section{An overview of the code}
|
||||
The algorithm is implemented in C\texttt{++} and is multi-threaded. To avoid redundancy, we examine only the \emph{Actors Graph} case.
|
||||
|
||||
\subsection{Data structures}
|
||||
In this case we work with two simple \texttt{struct}s, one for each of the entities \emph{Film} and \emph{Actor}:
|
||||
|
||||
\lstinputlisting[language=c++]{code/struct.cpp}
|
||||
\s
|
||||
\nd Then we need two dictionaries, built as follows:
|
||||
|
||||
\lstinputlisting[language=c++]{code/map.cpp}
|
||||
\s
|
||||
\nd We consider the files \texttt{Attori.txt} and \texttt{FilmFiltrati.txt}; the relations file is not needed yet. Once these two files have been opened, we loop over the lines of each one, filling the two dictionaries created before. If a line is empty, we skip it. We use a try/catch approach: although good practice is to catch only specific exceptions, since every error is printed to the terminal it makes sense to \emph{catch} any error.
|
||||
|
||||
\lstinputlisting[language=c++]{code/data.cpp}
|
||||
\s
|
||||
|
||||
Now we can use the file \texttt{Relazioni.txt}. As before, we loop over all the lines of this file, creating the variables
|
||||
|
||||
\begin{itemize}
|
||||
\item \texttt{id\textunderscore film}: index key of each movie
|
||||
\item \texttt{id\textunderscore attore}: index key of each actor
|
||||
\end{itemize}
|
||||
|
||||
\nd If they both exist, we update the list of indices of the movies that the actor or actress played in. In the same way, we update the list of indices of the actors and actresses that played in the movie with that id.
|
||||
|
||||
\lstinputlisting[language=c++]{code/graph.cpp}
|
||||
\s
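\nd The update rule described above can be sketched as follows. The field names \texttt{actor\textunderscore indices} and \texttt{film\textunderscore indices}, and the function \texttt{add\textunderscore relations}, are hypothetical stand-ins chosen for this example; the actual definitions are those in \texttt{code/struct.cpp} and \texttt{code/graph.cpp}.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the structs in code/struct.cpp.
struct Film {
    std::string name;
    std::vector<int> actor_indices;  // actors that played in this movie
};
struct Actor {
    std::string name;
    std::vector<int> film_indices;  // movies this actor played in
};

// Applies each (id_film, id_attore) pair, ignoring pairs in which one of the
// two ids is unknown. Returns the number of relations actually stored.
int add_relations(const std::vector<std::pair<int, int>>& relations,
                  std::map<int, Film>& films, std::map<int, Actor>& actors) {
    int stored = 0;
    for (const auto& relation : relations) {
        const int id_film = relation.first;
        const int id_attore = relation.second;
        auto film = films.find(id_film);
        auto actor = actors.find(id_attore);
        if (film == films.end() || actor == actors.end()) continue;  // an id is missing
        film->second.actor_indices.push_back(id_attore);
        actor->second.film_indices.push_back(id_film);
        ++stored;
    }
    return stored;
}
```

\nd Using \texttt{find} instead of \texttt{operator[]} matters here: \texttt{operator[]} would silently insert an empty record for an unknown id, while the text requires both ids to exist already.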
|
||||
Now that we have defined how to build this graph, we have to implement the algorithm that will return the top-$k$ central elements. \s
|
||||
|
||||
\nd The code can be found here: \url{https://github.com/lukefleed/imdb-graph}
|
||||
\s
|
||||
\begin{center}
|
||||
\qrcode{https://github.com/lukefleed/imdb-graph}
|
||||
\end{center}
|
||||
|
||||
\subsection{Results - Actors Graph}
|
||||
|
||||
Here are the top-10 actors by closeness centrality, obtained with the variable \texttt{MIN\textunderscore ACTORS=5} (as we'll see in the next section, it is the most accurate):
|
||||
|
||||
\begin{table}[h!]
|
||||
\centering
|
||||
\begin{tabular}{||c c||}
|
||||
\hline
|
||||
Node & Closeness centrality \\ [0.5ex]
|
||||
\hline\hline
|
||||
Eric Roberts & 0.324895 \\
|
||||
Christopher Lee & 0.319873 \\
|
||||
Franco Nero & 0.31946 \\
|
||||
John Savage & 0.316258 \\
|
||||
Michael Madsen & 0.314451 \\
|
||||
Udo Kier & 0.31357 \\
|
||||
Geraldine Chaplin & 0.313141 \\
|
||||
Malcolm McDowell & 0.313014 \\
|
||||
David Carradine & 0.312648 \\
|
||||
Christopher Plummer & 0.311859 \\ [1ex]
|
||||
\hline
|
||||
\end{tabular}
|
||||
\end{table}
|
||||
|
||||
\nd All the other results are available in the GitHub repository, for all the values of \texttt{MIN\textunderscore ACTORS} and for $k=100$.
|
||||
|
||||
\newpage
|
||||
\subsection{Results - Movies Graph}
|
||||
|
||||
Here are the top-10 movies by closeness centrality, obtained with the variable \texttt{VOTES=500} (as we'll see in the next section, it is the most accurate):
|
||||
|
||||
\begin{table}[h!]
|
||||
\centering
|
||||
\begin{tabular}{||c c||}
|
||||
\hline
|
||||
Node & Closeness centrality \\ [0.5ex]
|
||||
\hline\hline
|
||||
Merlin & 0.290731 \\
|
||||
The Odyssey & 0.290314 \\
|
||||
The Color of Magic & 0.285208 \\
|
||||
The Godfather Saga & 0.284932 \\
|
||||
Jack and the Beanstalk: The Real Story & 0.283522 \\
|
||||
In the Beginning & 0.28347 \\
|
||||
RED 2 & 0.283362 \\
|
||||
Lonesome Dove & 0.283353 \\
|
||||
Moses & 0.282953 \\
|
||||
Species & 0.282642 \\ [1ex]
|
||||
\hline
|
||||
\end{tabular}
|
||||
\end{table}
|
||||
|
||||
\nd All the other results are available in the GitHub repository, for all the values of \texttt{VOTES} and for $k=100$.
|
@ -1,23 +1,23 @@
|
||||
\section{Introduction}
|
||||
A graph $G= (V,E)$ is a pair of sets, where $V = \{v_1,\dots,v_n\}$ is the set of \emph{nodes} and $E \subseteq V \times V$, $E = \{(v_i,v_j),\dots\}$, is the set of \emph{edges} (with $|E| = m \leq n^2$). \s
|
||||
|
||||
\nd In this paper we discuss the problem of identifying the most central nodes in a network using the measure of \emph{closeness centrality}. Given a connected graph, the closeness centrality of a node $v \in V$ is defined as the reciprocal of the sum of the lengths of the shortest paths between the node and all other nodes in the graph. Normalizing, we obtain the following formula:
|
||||
|
||||
\begin{equation}\label{closeness}
|
||||
c(v) = \frac{n-1}{\displaystyle \sum_{w \in V} d(v,w)}
|
||||
\end{equation}
|
||||
|
||||
\nd where $n$ is the cardinality of $V$ and $d(v,w)$ is the distance between $v,w \in V$. This is a very powerful tool for the analysis of a network: it ranks the nodes, telling us which ones are the most efficient at spreading information to all the other nodes in the graph. As mentioned before, the denominator of this definition is the sum of the lengths of the shortest paths from $v$ to all the other nodes. This means that for a node to be central, the average number of links needed to reach another node has to be low. The goal of this paper is to compute the $k$ vertices with the highest closeness centrality. \s
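\nd As a small worked example (ours, not taken from the IMDb data), consider the path graph on three nodes $v_1 - v_2 - v_3$, so that $n = 3$. Applying (\ref{closeness}) to the middle node and to an endpoint gives

\[
c(v_2) = \frac{3-1}{d(v_2,v_1) + d(v_2,v_3)} = \frac{2}{1+1} = 1,
\qquad
c(v_1) = \frac{3-1}{d(v_1,v_2) + d(v_1,v_3)} = \frac{2}{1+2} = \frac{2}{3}.
\]

\nd The middle node reaches every other node in one step and attains the maximum value $1$, while the endpoints, by symmetry, both score $2/3$. \s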
|
||||
|
||||
\noindent As a case study we use the collaboration network of the \emph{Internet Movie Database} (IMDb). We consider two different graphs. For the first one we define an undirected graph $G=(V,E)$ where
|
||||
\begin{itemize}
|
||||
    \item The vertices in $V$ are the actors and the actresses.
    \item The undirected edges in $E$ link two actors or actresses if they played together in a movie.
|
||||
\end{itemize}
|
||||
For the second one we do the opposite thing: we define an undirected graph $G=(V,E)$ where
|
||||
\begin{itemize}
|
||||
    \item The vertices in $V$ are the movies.
    \item The undirected edges in $E$ link two movies if they have an actor or actress in common.
|
||||
\end{itemize}
|
||||
|
||||
\subsection{The Problem}
|
@ -0,0 +1,50 @@
|
||||
\section{Visualization of the graphs}
|
||||
Graphs are fascinating structures; visualizing them can give us a deeper understanding of their properties. Since we are dealing with millions of nodes, displaying them all would be impossible, especially on a web page. \s
|
||||
|
||||
\nd For each case we need to find a small subset of nodes $S \subset V$ (on the order of 1000) that we want to display. It is important to include, as far as we can, the nodes that are ``important'' in the graph. \s
|
||||
|
||||
\nd This whole section is implemented in Python using the library \texttt{pyvis}. The goal of this library is to provide a Python-based approach to constructing and visualizing network graphs in the same space. A pyvis network can be customized on a per-node or per-edge basis: nodes can be given colors, sizes, labels, and other metadata. Each graph can be interacted with, allowing the dragging, hovering, and selection of nodes and edges. Each graph's layout algorithm can also be tweaked, to allow experimentation with the rendering of larger graphs. The library is designed as a wrapper around the popular JavaScript \texttt{visJS} library.
|
||||
|
||||
\subsection{Actors Graph} \label{actors-graph-vis}
|
||||
For the actors graph, we take as the subset $S$ the actors and actresses who made at least 100 movies in their career. We can immediately deduce that this subset will be characterized by actors and actresses of a certain age, since it takes time to make 100 movies. But, as we have seen, a high number of movies is a good estimator of closeness centrality. It is important to keep in mind that the graph only shows the relations within this subset. This means that even if an actor made 100 movies in their career, the corresponding node may have just a few relations in this graph. We can see this graph as a collaboration network restricted to the most popular actors and actresses. \s
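\nd The selection just described amounts to taking the subgraph induced by $S$. The sketch below shows the idea in C\texttt{++}, for consistency with the rest of the code, although the visualization itself is written in Python. The input layout \texttt{Filmography} and the function \texttt{induced\textunderscore edges} are assumptions made for this example.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <utility>
#include <vector>

// actor id -> ids of the movies the actor played in (hypothetical input layout).
using Filmography = std::map<int, std::vector<int>>;

// Keeps the actors with at least `min_movies` movies and returns the edges
// between kept actors that share at least one movie, as pairs (a, b) with a < b.
std::set<std::pair<int, int>> induced_edges(const Filmography& movies_of,
                                            std::size_t min_movies) {
    // movie id -> kept actors that played in it, in ascending id order
    std::map<int, std::vector<int>> cast;
    for (const auto& entry : movies_of) {
        if (entry.second.size() < min_movies) continue;  // actor is not in S
        for (int movie : entry.second) cast[movie].push_back(entry.first);
    }
    std::set<std::pair<int, int>> edges;
    for (const auto& entry : cast) {
        const std::vector<int>& actors = entry.second;
        for (std::size_t i = 0; i < actors.size(); ++i) {
            for (std::size_t j = i + 1; j < actors.size(); ++j) {
                // Skip self-pairs caused by a movie listed twice for one actor.
                if (actors[i] != actors[j]) edges.emplace(actors[i], actors[j]);
            }
        }
    }
    return edges;
}
```

\nd Movies shared with actors outside $S$ produce no edge, which is why a node of $S$ may show only a few relations even when the actor has a long filmography.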
|
||||
|
||||
\nd An interactive version can be found at the web page below. It takes a few seconds to render, and it is better viewed on a computer than on a smartphone. \s
|
||||
|
||||
\textsc{Interactive version}: \url{https://lukefleed.xyz/imdb-graph.html}
|
||||
|
||||
\begin{center}
|
||||
\s \nd \qrcode{https://lukefleed.xyz/imdb-graph.html}
|
||||
\end{center}
|
||||
|
||||
\begin{figure}[H]
    \centering
    \includegraphics[width=13cm]{Screenshot.png}
    \caption{\emph{The collaboration network of the actors and actresses with at least 100 movies on IMDb}}
    \label{imdb-a-network}
\end{figure}
|
||||
|
||||
The result obtained is extremely interesting. We can clearly see that this graph is characterized by different (and sometimes isolated) communities. The nodes in each of them are all actors and actresses of the same nationality. There are some very big clusters, such as the \emph{Bollywood} one, that are almost isolated. Due to cultural and linguistic differences, those actors almost never collaborated with anyone outside their country. \s
|
||||
|
||||
A visual analysis of this graph reflects some of the properties that we saw during the analysis of the results. Let's take the biggest cluster, the Bollywood one. Even though it is very dense and its nodes have a lot of links, none of them ever appeared in our top-$k$ results during testing. This is due to the nature of closeness centrality, the measure we are considering: it can be seen as the ability of a node to spread information efficiently through the whole graph. The Bollywood nodes, however, spread information efficiently only within their own community, since they don't collaborate with nodes of other clusters. \s
|
||||
|
||||
A simple, heuristic way to see this phenomenon is to select a node with a high centrality in the interactive graph and drag it around: it will move and influence almost every community. If we repeat the same action with a Bollywood node, it will only move the nodes of its own community, leaving all the other nodes almost unmoved.
|
||||
|
||||
\subsection{Movies Graph}
|
||||
|
||||
The methodology used for this graph is basically the same as in Section~\ref{actors-graph-vis}; however, the results are slightly different. \s
|
||||
|
||||
\textsc{Interactive version}: \url{https://lukefleed.xyz/imdb-movie-graph.html}
|
||||
\begin{center}
|
||||
\s \nd \qrcode{https://lukefleed.xyz/imdb-movie-graph.html}
|
||||
\end{center}
|
||||
|
||||
\s
|
||||
|
||||
\begin{figure}[H]
    \centering
    \includegraphics[width=13cm]{movie-graph.png}
    \caption{\emph{The network of the movies with more than 500 votes on IMDb}}
    \label{imdb-m-network}
\end{figure}
|
||||
|
||||
Even if at first sight it may seem completely different from the previous one, it is not. As we can see, there are no evident communities, but some areas are denser than others. If we zoom in on one of those areas, we can see that the movies are often related. If there is a saga of popular movies, they will be very close in this graph. It is easy to find some big neighborhoods, such as the MCU (Marvel Cinematic Universe) one. \s
|
||||
|
||||
\nd Since we are considering roughly the thousand most popular nodes, those movies are mostly from the Hollywood scene. So it makes sense that there are no isolated clusters.
|