starting to write the paper
parent
3e7c00a023
commit
515a7d19eb
@@ -0,0 +1,4 @@
TO DO!

% Given a connected graph $G=(V,E)$, the closeness centrality of a vertex $v$ is defined as $\frac{n-1}{\sum_{w \in V} d(v,w)}$. This measure is widely used in the analysis of real-world complex networks, and the problem of selecting the $k$ most central vertices has been deeply analysed in the last decade. However, the problem is computationally demanding, especially for large networks. I propose an algorithm for selecting the $k$ most central nodes in a graph, and I show experimentally that it significantly improves both on the textbook algorithm, which computes the distance between all pairs of vertices, and on the state of the art. Finally, as a case study, I compute the $10$ most central actors in the IMDB collaboration network, where two actors are linked if they played together in a movie.
% To be reworded; taken from the paper.
@@ -0,0 +1,58 @@
\section{The algorithm}

In a connected graph, given a node $v \in V$, we can define its farness as

\begin{equation}
f(v) = \frac{1}{c(v)} = \frac{1}{n-1} \displaystyle \sum_{w \in V} d(v,w)
\end{equation}

where $c(v)$ is the closeness centrality defined in \eqref{closeness}. Since the graph we work with is not necessarily connected, a natural generalization of this formula is

\begin{equation}\label{wrong-farness}
f(v) = \frac{1}{c(v)} = \frac{1}{r(v)-1} \displaystyle \sum_{w \in R(v)} d(v,w)
\end{equation}

where $R(v)$ is the set of nodes reachable from $v$ and $r(v) = |R(v)|$ is its cardinality. This formula still needs to be adjusted to avoid problems during the computation. Suppose that the node $v$ under consideration has a single edge, at distance $1$, to a node $w$ with \emph{out-degree} $0$. With formula \eqref{wrong-farness} we would get a misleading result: $v$ would appear to be very central, even though it is obviously peripheral. To avoid this problem, we generalize formula \eqref{wrong-farness} by normalizing as suggested in \texttt{[Lin 1976; Wasserman and Faust 1994; Boldi and Vigna 2013; 2014; Olsen et al. 2014]}:

\begin{equation}\label{farness}
f(v) = \frac{n-1}{(r(v)-1)^2} \sum_{w \in R(v)} d(v,w)
\end{equation}

with the convention that, in the degenerate case $\frac{0}{0}$ (i.e.\ when no other node is reachable from $v$), the closeness of $v$ is set to $0$.

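\noindent For concreteness, the following is a minimal Python sketch of how the farness \eqref{farness} and the corresponding closeness of a single node could be computed from its BFS distances, with the convention above; the function names and the way the distances are passed are illustrative choices, not part of the implementation described in this paper.

\begin{lstlisting}[language=Python]
def farness(distances, n):
    """Normalized farness of a node v, as in Eq. (farness).
    distances: BFS distances d(v, w) for every w reachable from v,
    excluding v itself; n: total number of nodes in the graph."""
    r = len(distances) + 1          # r(v): reachable nodes, counting v itself
    if r == 1:                      # degenerate 0/0 case: nothing is reachable
        return float("inf")         # closeness is set to 0, so farness is infinite
    return (n - 1) / (r - 1) ** 2 * sum(distances)

def closeness(distances, n):
    f = farness(distances, n)
    return 0.0 if f == float("inf") else 1.0 / f
\end{lstlisting}
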
\subsection{The lower bound technique}

To compute the farness of a node we have to compute the distances from that node to all the other nodes reachable from it. Since we are dealing with millions of nodes, doing this for every node is not possible in a reasonable time. In order to compute only the top-$k$ most central nodes, we need a way to avoid running a complete BFS for nodes that cannot end up in the top-$k$. \\

\noindent The idea is to maintain, for each node we process, a lower bound on its farness. This allows us to stop the BFS before it reaches the end whenever the lower bound tells us that the node cannot be in the top-$k$. More precisely:

\begin{itemize}
\item The algorithm computes the farness of the first $k$ nodes exactly, storing them in a vector \texttt{top-actors}. From this point on, this vector is always full.

\item Then, for each subsequent vertex, it maintains the lower bound
\begin{equation}\label{lower-bound}
\frac{n-1}{(n-1)^2} (\sigma_{d-1} + n_d \cdot d)
\end{equation}
where $\sigma_d$ denotes the partial sum in \eqref{farness} up to exploration level $d$, and $n_d$ is the number of nodes not yet visited when the BFS reaches level $d$. The lower bound \eqref{lower-bound} is updated every time the BFS moves to a new level of exploration. In this way, if at a change of level the lower bound of the vertex under consideration is larger than the farness of the $k$-th element of \texttt{top-actors}, we can stop the BFS. The reason is simple: \texttt{top-actors} holds the current top-$k$ nodes in order, and the farness is the reciprocal of the closeness centrality; so if at that level the lower bound already exceeds the last element of the vector, there is no need to explore the remaining levels of the BFS, since the node would not be added to \texttt{top-actors} anyway. \\

Formula \eqref{lower-bound} corresponds to the most optimistic scenario for $v$, which is exactly what makes it a valid lower bound on the farness. If we are at level $d$ of the exploration, we have already computed the sum in \eqref{farness} up to level $d-1$; we then have to account for the current level, and in the best case for $v$ all the remaining nodes lie at distance exactly $d$. We also set $r(v)=n$, i.e.\ the case in which the graph is strongly connected and every vertex is reachable from $v$ (a Python sketch of this pruned BFS is given right after this list).
\end{itemize}

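\noindent The following Python sketch illustrates the pruning described above. It is a simplified illustration under my own naming and data-structure choices (an adjacency list indexed by integer node ids), not the full implementation: the BFS from $v$ proceeds level by level and, whenever a new level $d$ is reached, the lower bound \eqref{lower-bound} is recomputed; if it already exceeds the farness of the $k$-th node found so far, the visit is stopped.

\begin{lstlisting}[language=Python]
def farness_with_cutoff(adj, v, n, kth_farness=None):
    """Level-by-level BFS from v on the adjacency list adj (n nodes).
    Returns the exact farness of v, or None if the lower bound of
    Eq. (lower-bound) exceeds kth_farness and the visit is cut short."""
    visited = [False] * n
    visited[v] = True
    frontier = [v]
    d = 0            # current BFS level
    partial = 0      # sigma: sum of the distances found so far
    reached = 1      # r(v) so far, counting v itself
    while frontier:
        d += 1
        # Lower bound before exploring level d: the n - reached nodes not yet
        # visited are optimistically assumed to lie at distance d, and r(v) = n.
        lower_bound = (partial + (n - reached) * d) / (n - 1)
        if kth_farness is not None and lower_bound > kth_farness:
            return None                      # v cannot enter the top-k: prune
        next_frontier = []
        for u in frontier:
            for w in adj[u]:
                if not visited[w]:
                    visited[w] = True
                    next_frontier.append(w)
                    partial += d
                    reached += 1
        frontier = next_frontier
    if reached == 1:                         # 0/0 convention: closeness 0
        return float("inf")
    return (n - 1) / (reached - 1) ** 2 * partial
\end{lstlisting}

Running this function over all vertices, while keeping the $k$ smallest farness values returned so far in \texttt{top-actors} (the $k$-th of which is passed as \texttt{kth\_farness}), gives a sketch of the whole procedure.
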
\textsc{To do: write the pseudocode}

% \begin{algorithmic}[H] \caption{How to write algorithms}
% \KwIn{A graph $G = (V,E)$}
% \KwOut{Top-$k$ nodes with higher closeness centrality and their value} \
% \

% \While{not at end of this document}{
% read current\;
% \eIf{understand}{
% go to next section\;
% current section becomes this one\;
% }{
% go back to the beginning of current section\;
% }
% }

% \end{algorithmic}
@@ -0,0 +1,26 @@
\section{Introduction}

A graph $G=(V,E)$ is a pair of sets, where $V = \{v_1,\dots,v_n\}$ is the set of \emph{nodes} and $E \subseteq V \times V$, $E = \{(v_i,v_j),\dots\}$, is the set of \emph{edges} (with $|E| = m \leq n^2$). \\

In this paper we discuss the problem of identifying the most central nodes in a network using the measure of \emph{closeness centrality}. Given a connected graph, the closeness centrality of a node $v \in V$ is defined as the reciprocal of the sum of the lengths of the shortest paths between $v$ and all the other nodes of the graph. Normalizing, we obtain the following formula:

\begin{equation}\label{closeness}
c(v) = \frac{n-1}{\displaystyle \sum_{w \in V} d(v,w)}
\end{equation}

where $n$ is the cardinality of $V$ and $d(v,w)$ is the distance between $v,w \in V$. This is a very powerful tool for the analysis of a network: it ranks the nodes, telling us which ones are the most efficient in spreading information to all the other nodes in the graph. As mentioned before, the denominator of this definition is the sum of the shortest-path distances from $v$ to every other node. This means that, for a node to be central, the average number of links needed to reach any other node has to be low. The goal of this paper is to compute the $k$ vertices with the highest closeness centrality. \\

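\noindent For instance, consider as a toy illustration (not part of the dataset studied here) the path graph on three nodes $a$, $b$, $c$ with edges $\{a,b\}$ and $\{b,c\}$. Here $n=3$ and
\begin{equation*}
c(b) = \frac{2}{1+1} = 1, \qquad c(a) = c(c) = \frac{2}{1+2} = \frac{2}{3},
\end{equation*}
so the middle node, which reaches every other node in fewer steps on average, is correctly ranked as the most central. \\
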
\noindent As a case study we use the collaboration graph of the actors in the \emph{Internet Movie Database} (IMDB). On this data we define an undirected graph $G=(V,E)$ where
\begin{itemize}
\item the vertices in $V$ are the actors and the actresses;
\item the undirected edges in $E$ link two actors or actresses if they played together in at least one movie (a construction sketch in Python follows the list).
\end{itemize}

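\noindent A minimal Python sketch of how such a collaboration graph could be built is the following; the input format (one cast list per movie) and all names are hypothetical, chosen only for illustration.

\begin{lstlisting}[language=Python]
from itertools import combinations

def build_collaboration_graph(movies):
    """movies: iterable of cast lists, e.g. [["Actor A", "Actor B"], ...].
    Returns an undirected graph as a dict: actor -> set of co-actors."""
    adj = {}
    for cast in movies:
        for a in set(cast):
            adj.setdefault(a, set())          # actors with no co-actor still become nodes
        for a, b in combinations(sorted(set(cast)), 2):
            adj[a].add(b)                     # undirected edge: add both directions
            adj[b].add(a)
    return adj
\end{lstlisting}
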
\subsection{The Problem}

We are dealing with a web-scale network: any brute-force algorithm would require years to finish. The main difficulty lies in the computation of the distances $d(v,w)$, a well-known problem: the \emph{All Pairs Shortest Paths (APSP) problem}. \\

\noindent We can solve the APSP problem either with fast matrix multiplication or, as I did, with a breadth-first search (BFS) approach. There are several reasons to prefer the second option for this type of problem. \\

\noindent A graph is a data structure that can be represented in different ways, and choosing one representation over another can have an enormous impact on performance. In this case, we have to keep in mind the kind of graph we are dealing with: a very big and sparse one. Fast matrix multiplication requires representing the graph as an $n \times n$ matrix whose entry $(i,j)$ is zero if the nodes $i$ and $j$ are not linked, and $1$ (or a generic weight, in the weighted case) otherwise. This representation requires $O(n^2)$ space in memory, an enormous amount for a web-scale graph. Furthermore, the time complexity is $O(n^{2.373} \log n)$ \texttt{[Zwick 2002; Williams 2012]}. \\

\noindent With the BFS approach the space complexity is $O(n+m)$, much lower than with the previous method. In terms of time, the complexity is $O(nm)$. Unfortunately, this is still not enough to compute all the distances in a reasonable time, and it has been proven that this bound cannot be improved in general. In this paper I propose an exact algorithm to compute the top-$k$ nodes with the highest closeness centrality. I will also discuss an interesting and original relation between the physical layout of the visualized graph and the centrality values of its nodes.
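
\noindent For reference, below is a minimal Python sketch of the textbook baseline discussed above: one full BFS per node, hence $O(nm)$ time overall. It uses the connected-graph definition \eqref{closeness} and assumes the dictionary-of-sets representation of the earlier sketch; all names are illustrative.

\begin{lstlisting}[language=Python]
from collections import deque
import heapq

def distance_sum(adj, v):
    """Single BFS from v: sum of d(v, w) over all nodes w reached from v."""
    dist = {v: 0}
    queue = deque([v])
    total = 0
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                total += dist[w]
                queue.append(w)
    return total

def topk_closeness_textbook(adj, k):
    """Baseline: one BFS per node, then rank by closeness, Eq. (closeness)."""
    n = len(adj)
    scores = []
    for v in adj:
        s = distance_sum(adj, v)
        scores.append(((n - 1) / s if s > 0 else 0.0, v))
    return heapq.nlargest(k, scores)          # k pairs (closeness, node)
\end{lstlisting}
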
Binary file not shown.
@@ -0,0 +1,75 @@
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage{amsthm}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{latexsym}
\usepackage{graphicx}
\usepackage{float}
\usepackage{listings}
\usepackage{xcolor}
\usepackage{imakeidx}
\usepackage{algpseudocode}
\usepackage{hyperref}

\newcommand{\N}{\mathbb{N}}
\newcommand{\Z}{\mathbb{Z}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\LL}{\mathcal{L}}
\newcommand{\PP}{\mathcal{P}}
\newcommand{\HH}{\mathcal{H}}
\newcommand{\KK}{\mathcal{K}}
\newcommand{\XX}{\mathcal{X}}
\newcommand{\Zm}{\Z/m\Z}
\newcommand{\Zn}{\Z/n\Z}
\newcommand{\Zp}{\Z_p}
\newcommand{\Zmn}{\Z/mn\Z}

\definecolor{codegreen}{rgb}{0,0.6,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.58,0,0.82}
\definecolor{backcolour}{rgb}{1,1,1}

\lstdefinestyle{mystyle}{
backgroundcolor=\color{backcolour},
commentstyle=\color{codegreen},
keywordstyle=\color{magenta},
numberstyle=\tiny\color{codegray},
stringstyle=\color{codepurple},
basicstyle=\ttfamily\footnotesize,
breakatwhitespace=false,
breaklines=true,
captionpos=b,
keepspaces=true,
numbers=left,
numbersep=5pt,
showspaces=false,
showstringspaces=false,
showtabs=false,
tabsize=2
}
\lstset{style=mystyle}

\title{Computing top-k Closeness Centrality Faster in Unweighted Graphs}
\author{Luca Lombardo}
\date{}

\begin{document}
\maketitle

\begin{abstract}
\input{abstract.tex}
\end{abstract}

\newpage
\tableofcontents{}
\include{introduction}
\include{algorithm}
\include{data}


\end{document}