where $r(v) = |R(v)|$ is the cardinality of the set of reachable nodes from $v$, with the convention that in the case of $\frac{0}{0}$ we set the closeness of $v$ to $0$.
\subsection{The lower bound technique}
During the computation of the farness, for each node we have to compute the distance from that node to all the other nodes reachable from it. Since we are dealing with millions of nodes, this is not possible in a reasonable time. In order to compute only the top-$k$ most central nodes, we need a way to avoid computing a BFS for nodes that will not be in the top-$k$ list. \s
\noindent The idea is to keep track of a lower bound on the farness of each node that we will compute. If at some point the lower bound tells us that the node will not be in the top-$k$, we can kill the BFS before it reaches the end. More precisely:
\begin{itemize}
\item The algorithm computes the farness of the first $k$ nodes, saving them in a vector \texttt{Top}. From now on, this vector will be full.
\item Then, for each of the following nodes, it computes the lower bound
\begin{equation}\label{lower-bound}
\frac{n-1}{(n-1)^2} (\sigma_{d-1} + n_d \cdot d)
\end{equation}
Equation \eqref{lower-bound} describes a worst-case scenario, and that makes it perfect for a lower bound. If we are at level $d$ of the exploration, we have already computed the sum in \eqref{farness} up to level $d-1$. Then we need to account for the current level of exploration: in the worst case the node is linked to all the nodes at distance $d$. We also set $r(v)=n$, i.e.\ the case in which the graph is strongly connected and every node is reachable from $v$.
\State\texttt{Skip = True}; \Comment{Stop the BFS}
\Else
\State Compute the farness; \Comment{BFS reached the end}
\State\texttt{Top.pop\textunderscore back}; \Comment{Remove the last element}
\State Add the new node, in order of farness;
\nd In Algorithm \ref*{alg:lowerbound-technique} we use a list \texttt{Top} containing the nodes analyzed so far, in increasing order of farness. We also need a vector of booleans \texttt{enqueued} to keep track of which nodes have been put in the queue during the BFS. During the BFS we need a FIFO queue \texttt{Q}. All the technical details can be found in the GitHub repository.
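\nd The cutoff described above can be sketched in Python as follows. This is a minimal, illustrative sketch assuming an adjacency-list representation (node to list of neighbours); the names and details do not necessarily match the actual implementation in the repository:

```python
def bfs_farness_with_cutoff(graph, source, n, kth_farness):
    """BFS from `source`, killed as soon as the lower bound on the
    farness exceeds the k-th smallest farness found so far.
    Returns the farness of `source`, or None if stopped early."""
    enqueued = {source}
    frontier = [source]
    sigma = 0          # sum of distances up to the previous level (sigma_{d-1})
    d = 0
    while frontier:
        d += 1
        next_frontier = []
        for u in frontier:
            for w in graph[u]:
                if w not in enqueued:
                    enqueued.add(w)
                    next_frontier.append(w)
        n_d = len(next_frontier)
        if n_d:
            # worst case: r(v) = n and all remaining distances equal d
            lower_bound = (n - 1) * (sigma + n_d * d) / ((n - 1) ** 2)
            if lower_bound > kth_farness:
                return None    # kill the BFS: v cannot enter the top-k
            sigma += n_d * d
        frontier = next_frontier
    r = len(enqueued)          # r(v): nodes reachable from source
    if r <= 1:
        return float("inf")    # convention for the 0/0 case
    return (n - 1) * sigma / ((r - 1) ** 2)

def top_k_closeness(graph, k):
    """Exact top-k nodes by increasing farness."""
    n = len(graph)
    top = []                   # (farness, node) pairs, kept sorted
    for v in graph:
        kth = top[k - 1][0] if len(top) >= k else float("inf")
        f = bfs_farness_with_cutoff(graph, v, n, kth)
        if f is not None:
            top.append((f, v))
            top.sort()
            del top[k:]
    return top
```

The bound is checked once per BFS level, so a hopeless BFS is abandoned as soon as the partial sums prove it hopeless, instead of running to completion.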
\nd The platform for the tests is \emph{a laptop}, so the measurements cannot be considered precise due to factors such as thermal throttling. The CPU is an Intel(R) Core™ i7-8750H (6 cores, 12 threads), and the machine is equipped with 16GB of DDR4 RAM @ 2666 MHz.
\subsection{Actors graph}\label{actors-graph}
Let's take into analysis the graph where each actor/actress is a node and two nodes are linked if they played in a movie together. In this case, during the filtering, we created the variable \texttt{MIN\textunderscore ACTORS}. This variable is the minimum number of movies that an actor/actress needs to have made to be considered in the computation. \s
\nd Varying this variable obviously affects the algorithm, in different ways. The higher it is, the fewer actors we take into consideration. So, with a smaller graph, we expect better execution times. On the other hand, we can also expect less accurate results. What we are going to discuss is how much changing \texttt{MIN\textunderscore ACTORS} affects these two factors.
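\nd As an illustration, this filtering step can be sketched in Python as follows; the data layout and identifiers here are hypothetical and only meant to convey the idea, not the actual code:

```python
MIN_ACTORS = 5   # hypothetical threshold, for illustration only

def filter_actors(movies_per_actor, min_actors=MIN_ACTORS):
    """Keep only the actors/actresses appearing in at least `min_actors`
    movies. movies_per_actor: actor id -> set of movie ids."""
    return {actor: movies
            for actor, movies in movies_per_actor.items()
            if len(movies) >= min_actors}
```

With this filter applied before building the graph, an actor appearing in fewer movies than the threshold never becomes a node, so no BFS is ever started from it.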
\subsubsection{Time of execution}
As in \ref{time-actors}, we are going to analyze the performance of the algorithm as a function of different values of \texttt{VOTES}. Low values of this variable lead to an exponential growth of the cardinalities of the node and edge sets. And as we know, with a bigger graph there are more operations to do. The results are shown in Figure \ref{fig:moves_time}.
\begin{figure}[h!]
\centering
\subsubsection{Discrepancy of the results}
All the observations made before are still valid for this case. As done before (\ref{fig:matrix-a}), we are going to use a matrix to visualize and analyze the results.
In this paper we discussed the results of an exact algorithm for the computation of the $k$ most central nodes in a graph, according to closeness centrality. We saw that with the introduction of a lower bound, the real-world performance is far better than that of a brute-force algorithm that computes a \texttt{BFS} from every node. \s
\nd Since there was no server with dozens of threads and hundreds of gigabytes of RAM, every idea has been implemented knowing that it needed to run on a regular laptop. This condition led to interesting implementations for the filtering of the raw data. \s
\nd We have seen two different case studies, both based on the IMDb network. For each of them we had to find a way to filter the data without losing accuracy in the results. We saw that with harder filtering we gain a lot of performance, but the results show an increasing discrepancy from reality. Analyzing those tests, we have been able to find, for both graphs, a balance that gives accuracy and performance at the same time.
\s\nd This work is heavily based on \cite{DBLP:journals/corr/BergaminiBCMM17}. Even if that article uses a more complex and complete approach, the results on the IMDb case study are almost identical. They worked with snapshots, analyzing single time periods, so there are some inevitable discrepancies. Despite that, most of the top-$k$ actors are the same and the closeness centrality values are very similar. We can use this comparison to attest the correctness and efficiency of the algorithm presented in this paper.
h(v) = \sum_{w \neq v}\frac{1}{d(v,w)}
\end{equation}
\nd The main difference here is that we don't have a farness, so we won't need a lower bound either. Since the bigger the value, the higher the centrality, we have to adapt the algorithm. Instead of a lower bound, we need an upper bound $U(v)$ such that
\begin{equation}
h(v) \leq U(v) \leq h(w)
\end{equation}
\nd A possible upper bound can be obtained by considering the extreme case that could happen at each step. When we are at level $d$ of our exploration, we already know the partial sum $\sigma_{d-1}$. The extreme case at this level is that $v$ is as close as possible to all the remaining nodes. To account for this possibility we add the terms $\frac{n_d}{d}+\frac{n - r - n_d}{d+1}$.
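\nd The adapted cutoff can be sketched in Python as follows. This is an illustrative sketch under the same adjacency-list assumption as before, where \texttt{h} holds the partial harmonic sum $\sigma_{d-1}$ and \texttt{r} counts the nodes reached so far:

```python
def harmonic_with_cutoff(graph, source, n, kth_harmonic):
    """BFS from `source`, killed when the upper bound on the harmonic
    centrality falls below the k-th largest value found so far.
    Returns h(source), or None if the BFS was stopped early."""
    enqueued = {source}
    frontier = [source]
    h = 0.0            # partial sum sigma_{d-1} of 1/d(source, w)
    d = 0
    r = 1              # nodes reached so far, including source
    while frontier:
        d += 1
        next_frontier = []
        for u in frontier:
            for w in graph[u]:
                if w not in enqueued:
                    enqueued.add(w)
                    next_frontier.append(w)
        n_d = len(next_frontier)
        if n_d:
            # extreme case: every still-unseen node sits at distance d + 1
            upper = h + n_d / d + (n - r - n_d) / (d + 1)
            if upper < kth_harmonic:
                return None    # source cannot enter the top-k
            h += n_d / d
            r += n_d
        frontier = next_frontier
    return h
```

Note that the comparison is reversed with respect to the farness version: the BFS is killed when the upper bound drops \emph{below} the current $k$-th value, since for harmonic centrality bigger is better.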
\nd where $n$ is the cardinality of $V$ and $d(v,w)$ is the distance between $v,w \in V$. This is a very powerful tool in the analysis of a network: it ranks each node, telling us the most efficient ones in spreading information to all the other nodes in the graph. As mentioned before, the denominator of this definition gives us the length of the shortest path between two nodes. This means that for a node to be central, the average number of links needed to reach another node has to be low. The goal of this paper is to compute the $k$ nodes with the highest closeness centrality. \s
\noindent As a case study we will use the collaboration network in the \emph{Internet Movie Database} (IMDb). We will consider two different graphs. For the first one we define an undirected graph $G=(V,E)$ where
\begin{itemize}
\item The nodes $V$ are the actors and the actresses
\item The undirected edges in $E$ link two nodes if they played together in a movie.
\end{itemize}
For the second one we will do the opposite. We define an undirected graph $G=(V,E)$ where:
\clearpage
\subsection{The Problem}
Since we are dealing with a web-scale network, any brute-force algorithm would require years to finish. The main difficulty here is caused by the computation of the distance $d(v,w)$ in \eqref{closeness}. This is a well-known problem, called \emph{All Pairs Shortest Paths} (or \emph{APSP problem}). \s
\noindent We can solve the APSP problem either using fast matrix multiplication or, as in this paper, implementing a breadth-first search (BFS) method. There are several reasons to prefer the second approach over the first for this type of problem. \s
\noindent A graph is a data structure and we can describe it in different ways \cite{skienna08}. Choosing one representation over another can have an enormous impact on performance. In this case, we need to remember the type of graph that we are dealing with: a very big and sparse one. The fast matrix multiplication approach represents the graph as an $n\times n$ matrix where the entry $(i,j)$ is zero if the nodes $i,j$ are not linked, and 1 (or a specific weight) otherwise. This method requires $O(n^2)$ space in memory, which is an enormous quantity on a web-scale graph. Furthermore, the time complexity is $O(n^{2.373}\log n)$ \cite{10.1145/567112.567114}. \s
\noindent Using the BFS method, the space complexity is $O(n+m)$, a much lower value than with the previous method. In terms of time, the complexity is $O(nm)$. Unfortunately, this is not enough to compute all the distances in a reasonable time, and it has also been proven that this method cannot be improved. In this paper we propose an exact algorithm to efficiently compute only the $k$ nodes with the highest closeness centrality.
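\nd For reference, the single-source building block that the $O(nm)$ approach repeats $n$ times can be sketched in Python as follows (an illustrative sketch, again assuming an adjacency-list representation):

```python
from collections import deque

def bfs_distances(graph, source):
    """Distances from `source` to every reachable node, in O(n + m)
    time and space. graph: node -> list of neighbours."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist
```

One such BFS gives the whole column of distances $d(v,\cdot)$ needed by \eqref{closeness} for a single node $v$; the contribution of this paper is avoiding most of these $n$ BFS runs.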
\end{figure}
The result obtained is extremely interesting, as shown in \ref{fig:imdb-a-network}. We can clearly see how this graph is characterized by different (and sometimes isolated) communities. The nodes in each of them are actors and actresses of the same nationality. There are some very big clusters, such as the \emph{Bollywood} one, that are almost isolated. Due to cultural and linguistic differences, those actors have never collaborated with anyone outside their country. \s
A visual analysis of this graph reflects some of the properties that we saw during the analysis of the results. Let's take the biggest cluster, the Bollywood one. Even if it's very dense and its nodes have a lot of links, none of them ever appeared in our top-$k$ results during the testing. This happens due to the nature of closeness centrality, the measure that we are taking into consideration. It can be seen as the ability of a node to spread information efficiently through the graph. But the Bollywood nodes are efficient in spreading information only within their own community, since they don't collaborate with nodes of other clusters. \s