main
Luca Lombardo 3 years ago
parent 3e88899b22
commit c326c851ad

@ -1,4 +1,4 @@
Understanding and investigating social structures is essential in the modern world. Through the use of networks and graph theory we can find the most central elements in a community. In particular, given a connected graph $G=(V,E)$, the closeness centrality of a vertex $v$ is defined as $ \frac{n-1}{\sum_{w \in V} d(v,w)}$. This measure can be seen as the efficiency of a node in passing information to all the other nodes in the graph. In this paper we discuss an algorithm, and its results, for finding the top-$k$ most central elements in web-scale graphs. As a case study, we use the IMDB collaboration network, building two completely different graphs and analyzing their properties.
% Given a connected graph $G=(V,E)$, the closeness centrality of a vertex $v$ is defined as $ \frac{n-1}{\sum_{w \in V} d(v,w)}$. This measure is widely used in the analysis of real-world complex networks, and the problem of selecting the $k$ most central vertices has been deeply analysed in the last decade. However, this problem is computationally not easy, especially for large networks. I propose an algorithm for selecting the $k$ most central nodes in a graph: I experimentally show that this algorithm improves significantly both the textbook algorithm, which is based on computing the distance between all pairs of vertices, and the state of the art. Finally, as a case study, I compute the $10$ most central actors in the IMDB collaboration network, where two actors are linked if they played together in a movie.

@ -1,4 +1,5 @@
\section{Analysis of the results}
In this section we are going to discuss the results of the top-$k$ algorithm applied to the IMDb graphs. We are particularly interested in two factors:
\begin{itemize}
\item The time needed for the execution as a function of different filtering values.

@ -7,26 +7,26 @@ A graph $G=(V,E)$ is a pair of sets, where $V = \{v_1,...,v_n\}$ is the set o
c(v) = \frac{n-1}{\displaystyle \sum_{w \in V} d(v,w)}
\end{equation}
\nd where $n$ is the cardinality of $V$ and $d(v,w)$ is the distance between $v,w \in V$. This is a very powerful tool in the analysis of a network: it ranks the nodes, telling us which ones are the most efficient in spreading information to all the other nodes in the graph. As mentioned before, the denominator of this definition is the sum of the lengths of the shortest paths from $v$ to all the other nodes. This means that for a node to be central, the average number of links needed to reach any other node has to be low. The goal of this paper is to compute the $k$ nodes with the highest closeness centrality. \s
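\noindent To make the definition concrete, the snippet below computes \eqref{closeness} for a single node with a breadth-first search. It is only an illustrative sketch: it assumes a hypothetical adjacency-list dictionary \texttt{adj} mapping each node to the set of its neighbours in a connected, unweighted graph.
\begin{verbatim}
from collections import deque

def closeness(adj, v):
    # BFS from v: dist[w] = d(v, w) for every reachable node w.
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    # c(v) = (n - 1) / sum of distances (graph assumed connected).
    return (len(adj) - 1) / sum(dist.values())
\end{verbatim}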
\noindent As a case study we will use the collaboration network of the \emph{Internet Movie Database} (IMDB). We will consider two different graphs. For the first one we define an undirected graph $G=(V,E)$ where:
\begin{itemize}
\item the nodes $V$ are the actors and the actresses.
\item the undirected edges in $E$ link two nodes if the corresponding actors played together in a movie.
\end{itemize}
For the second one we do the opposite: we define an undirected graph $G=(V,E)$ where:
\begin{itemize}
\item the nodes $V$ are the movies.
\item the undirected edges in $E$ link two movies if they have an actor or actress in common.
\end{itemize}
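\noindent A minimal sketch of how both graphs can be built, assuming a hypothetical dictionary \texttt{movies} mapping each title to its cast (the real dataset layout and names may differ):
\begin{verbatim}
from itertools import combinations

# Hypothetical input: each movie mapped to its cast.
movies = {
    "Movie A": ["Actor 1", "Actor 2"],
    "Movie B": ["Actor 2", "Actor 3"],
}

# Graph 1: actors/actresses as nodes, an edge per shared movie.
actor_adj = {}
for cast in movies.values():
    for a, b in combinations(cast, 2):
        actor_adj.setdefault(a, set()).add(b)
        actor_adj.setdefault(b, set()).add(a)

# Graph 2: movies as nodes, an edge per shared cast member.
appears_in = {}
for title, cast in movies.items():
    for actor in cast:
        appears_in.setdefault(actor, set()).add(title)
movie_adj = {title: set() for title in movies}
for titles in appears_in.values():
    for s, t in combinations(titles, 2):
        movie_adj[s].add(t)
        movie_adj[t].add(s)
\end{verbatim}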
\clearpage
\subsection{The Problem}
Since we are dealing with a web-scale network, any brute force algorithm would require years to terminate. The main difficulty is the computation of the distances $d(v,w)$ in \eqref{closeness}. This is a well-known problem, called the \emph{All Pairs Shortest Paths} (or \emph{APSP}) problem. \s
\noindent We can solve the APSP problem either with fast matrix multiplication or, as done in this paper, with a breadth-first search (BFS) method. There are several reasons to prefer the second approach over the first one for this type of problem. \s
\noindent A graph is a data structure and we can represent it in different ways \cite{skienna08}. Choosing one over another can have an enormous impact on performance. In this case, we need to remember the type of graph that we are dealing with: a very big and sparse one. The fast matrix multiplication approach represents the graph as an $n\times n$ matrix whose entry $(i,j)$ is zero if the nodes $i,j$ are not linked, and 1 (or the edge weight, in the weighted case) otherwise. This representation requires $O(n^2)$ space in memory, which is an enormous quantity for a web-scale graph. Furthermore, the time complexity is $O(n^{2.373} \log n)$ \cite{10.1145/567112.567114}. \s
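\noindent The space gap between the two representations is easy to see in code. The sketch below converts the adjacency-list dictionary used in the earlier snippets into a dense matrix, allocating $n^2$ entries no matter how sparse the graph is (names are illustrative):
\begin{verbatim}
def as_matrix(adj):
    # Dense representation: n*n entries even for a sparse graph.
    index = {v: i for i, v in enumerate(adj)}
    n = len(adj)
    M = [[0] * n for _ in range(n)]        # O(n^2) memory
    for u, neighbours in adj.items():
        for w in neighbours:
            M[index[u]][index[w]] = 1
    return M

# By contrast, the adjacency-list dict `adj` itself stores one entry
# per node plus one per edge endpoint: O(n + m) memory.
\end{verbatim}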
\noindent Using the BFS method the space complexity is $O(n+m)$, a much lower value than the previous method. In terms of time, the complexity is $O(nm)$. Unfortunately, this is still not enough to compute all the distances in a reasonable time, and it has been proven that this method cannot be improved. In this paper I propose an exact algorithm to compute efficiently only the top-$k$ nodes with the highest closeness centrality.
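\noindent For reference, the textbook baseline that the proposed algorithm improves on simply runs one BFS per node and keeps the $k$ best scores in a min-heap. This is only a sketch, reusing the illustrative \texttt{closeness} function defined earlier; it pays the full $O(nm)$ cost.
\begin{verbatim}
import heapq

def top_k_closeness(adj, k):
    # Textbook baseline: one BFS per node, O(nm) time overall.
    heap = []                      # min-heap of (centrality, node)
    for v in adj:
        c = closeness(adj, v)      # sketch defined earlier
        if len(heap) < k:
            heapq.heappush(heap, (c, v))
        else:
            heapq.heappushpop(heap, (c, v))
    return sorted(heap, reverse=True)
\end{verbatim}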
