
\section{The algorithm}

In a connected graph, given a node $v \in V$, we can define its farness as

\begin{equation}
    f(v) = \frac{1}{c(v)} = \frac{1}{n-1} \displaystyle \sum_{w \in V} d(v,w)
\end{equation}
where $c(v)$ is the closeness centrality defined in \eqref{closeness}. Since we are working with a disconnected graph, a natural generalization of this formula is

\begin{equation}\label{wrong-farness}
    f(v) = \frac{1}{c(v)} = \frac{1}{r(v)-1} \displaystyle \sum_{w \in R(v)} d(v,w)
\end{equation}
where $r(v) = |R(v)|$ is the cardinality of the set $R(v)$ of nodes reachable from $v$. This formula still needs to be modified, because it can give misleading results. Assume that the node $v$ that we are considering has a single outgoing link, to a node $w$ with \emph{out-degree} 0. With the formula \eqref{wrong-farness} we get a very small farness, so $v$ would appear to be very central even though it is clearly peripheral. To avoid this problem, we normalize the formula \eqref{wrong-farness} as suggested in \texttt{[Lin 1976; Wasserman and Faust 1994; Boldi and Vigna 2013; 2014; Olsen et al. 2014]}

\begin{equation}\label{farness}
    f(v) = \frac{n-1}{(r(v)-1)^2} \sum_{w \in R(v)} d(v,w)
\end{equation}
with the convention that, in the case $\frac{0}{0}$, we set the closeness of $v$ to $0$.
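
As a quick check of why the extra normalization matters, consider the peripheral node described above, under the assumption that $R(v)$ contains $v$ itself, so that $R(v) = \{v, w\}$, $r(v) = 2$ and the sum of the distances from $v$ to its reachable nodes is $d(v,w) = 1$. The formula \eqref{wrong-farness} then gives
\begin{equation*}
    f(v) = \frac{1}{2-1} \cdot 1 = 1
\end{equation*}
which is the smallest value that an average distance can take, while the formula \eqref{farness} gives
\begin{equation*}
    f(v) = \frac{n-1}{(2-1)^2} \cdot 1 = n-1
\end{equation*}
For comparison, a node that reaches all the other $n-1$ nodes at distance $1$ has farness $\frac{n-1}{(n-1)^2} (n-1) = 1$ under \eqref{farness}, so the peripheral node is now correctly ranked far below it.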

\subsection{The lower bound technique}
During the computation of the farness we have to compute, for each node, the distance between that node and all the other nodes reachable from it. Since we are dealing with millions of nodes, this is not feasible in a reasonable time. In order to compute only the top-$k$ most central nodes, we need a way to avoid running a complete BFS for nodes that will not be in the top-$k$. \s

\noindent The idea is to keep track of a lower bound on the farness of each node that we process. This allows us to stop the BFS before it reaches the end if the lower bound tells us that the node will not be in the top-$k$. More precisely:

\begin{itemize}
    \item The algorithm will compute the farness of the first $k$ nodes, saving them in a vector \texttt{top-actors}. From now on, this vector will be full.

    \item Then, for all the next vertices, it defines a lower bound
    \begin{equation}\label{lower-bound}
        \frac{n-1}{(n-1)^2} (\sigma_{d-1} + n_d \cdot d)
    \end{equation}

    where $\sigma_d$ is the partial sum in \eqref{farness} at the level of exploration $d$. The lower bound \eqref{lower-bound} is updated every time the BFS moves to a new level of exploration. If, at a change of level, the lower bound of the vertex that we are considering is bigger than the farness of the $k$-th element of \texttt{top-actors}, we can stop the BFS. The reason is simple: the vector \texttt{top-actors} is populated with the top-$k$ nodes in order, and the farness is inversely proportional to the closeness centrality. So if at that level the lower bound is already bigger than the farness of the last element of the vector, there is no need to explore the remaining levels of the BFS, since the vertex will not be added to \texttt{top-actors} anyway. \s

    The bound \eqref{lower-bound} describes an extreme scenario, in which the farness of $v$ is as small as it can possibly be, and this is what makes it a valid lower bound. If we are at level $d$ of the exploration, we have already computed the sum in \eqref{farness} up to level $d-1$. We then need to account for the current level of exploration: in the extreme case, all the nodes still to be visited are at distance exactly $d$ from $v$, which gives the term $n_d \cdot d$. We also set $r(v)=n$, which is the case when the graph is strongly connected and all vertices are reachable from $v$.
\end{itemize}
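
\noindent As an illustration of \eqref{lower-bound}, here is a small worked example with made-up numbers, under two assumptions that are not stated explicitly above: $\sigma_{d-1}$ is the sum of the distances from $v$ to the nodes visited before level $d$, and $n_d$ is the number of nodes not yet visited when level $d$ begins. Take $n = 11$ and suppose that $v$ has $3$ neighbours, so that at level $d = 2$ we have $\sigma_1 = 3$ and $n_2 = 11 - 1 - 3 = 7$. The bound is then
\begin{equation*}
    \frac{n-1}{(n-1)^2} (\sigma_{1} + n_2 \cdot 2) = \frac{1}{10} (3 + 7 \cdot 2) = 1.7
\end{equation*}
If the $k$-th element of \texttt{top-actors} has farness $1.5$, the BFS from $v$ can be stopped at this point: under these assumptions the farness of $v$ is at least $1.7$, so $v$ cannot enter the top-$k$.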

\textsc{Write pseudocode}
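
\noindent A possible outline of the procedure described in this section, given only as a sketch. The update of \texttt{top-actors} in the last step is an assumption, since the text above does not spell it out.
\begin{enumerate}
    \item Run a complete BFS from each of the first $k$ nodes, compute their farness with \eqref{farness}, and store them in \texttt{top-actors}, sorted by increasing farness.
    \item For each remaining node $v$, start a BFS from $v$. Every time the BFS moves to a new level $d$, evaluate the lower bound \eqref{lower-bound}.
    \item If the lower bound is bigger than the farness of the $k$-th element of \texttt{top-actors}, stop the BFS and discard $v$.
    \item Otherwise, when the BFS terminates, compute the farness of $v$ with \eqref{farness}. If it is smaller than the farness of the $k$-th element, insert $v$ in \texttt{top-actors} and drop the last element (assumed step).
\end{enumerate}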


% \begin{algorithmic}[H] \caption{How to write algorithms}
%     \KwIn{A graph $G = (V,E)$}
%     \KwOut{Top-$k$ nodes with higher closeness centrality and their value} \
%     \

%     \While{not at end of this document}{
%      read current\;
%      \eIf{understand}{
%       go to next section\;
%       current section becomes this one\;
%       }{
%       go back to the beginning of current section\;
%      }
%     }

%    \end{algorithmic}