\section{The algorithm}
In a connected graph, given a node $v \in V$, we can define its farness as:
\begin{equation}
f(v) = \frac{1}{c(v)} = \frac{1}{n-1} \displaystyle \sum_{w \in V} d(v,w)
\end{equation}
Since the graphs we are dealing with are not necessarily (strongly) connected, we generalize the definition as
\begin{equation}\label{farness}
f(v) = \frac{n-1}{(r(v)-1)^2} \displaystyle \sum_{w \in R(v)} d(v,w)
\end{equation}
where $R(v)$ is the set of nodes reachable from $v$ and $r(v) = |R(v)|$ is its cardinality, with the convention that in a case of $\frac{0}{0}$ we set the closeness of $v$ to $0$.
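\nd For instance, in a path graph with the three nodes $a,b,c$ and the two edges $\{a,b\}$ and $\{b,c\}$, we have $f(b) = \frac{1}{2}(1+1) = 1$ and $f(a) = f(c) = \frac{1}{2}(1+2) = \frac{3}{2}$: the central node $b$ has the lowest farness, hence the highest closeness.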
\subsection{The lower bound technique}
During the computation of the farness of each node, we have to compute the distances from that node to all the other nodes reachable from it. Since we are dealing with millions of nodes, this is not possible in a reasonable time. In order to compute only the top-$k$ most central nodes, we need to find a way to avoid computing a BFS for nodes that won't be in the top-$k$ list. \s
\noindent The idea is to keep track of a lower bound on the farness of each node that we will compute. If at some point the lower bound tells us that the node will not be in the top-$k$, this allows us to kill the BFS operation before it reaches the end. More precisely:
\begin{itemize}
\item The algorithm will compute the farness of the first $k$ nodes, saving them in a vector \texttt{top}. From now on, this vector will be full.
\item For every other node, we start the BFS keeping track of a lower bound on its farness: at the level of exploration $d$ it is given by
\begin{equation}\label{lower-bound}
\frac{n-1}{(n-1)^2} (\sigma_{d-1} + n_d \cdot d)
\end{equation}
where $\sigma_d$ is the partial sum in \eqref{farness} at the level of exploration $d$, and $n_d$ is the number of nodes found at distance $d$. The lower bound \eqref{lower-bound} is updated every time we change the level of exploration during the BFS. In this way, if at a change of level the lower bound of the vertex that we are considering is bigger than the $k$-th element of \texttt{top}, we can kill the BFS. The reason behind that is very simple: the vector \texttt{top} is populated with the top-$k$ nodes in order, and the farness is inversely proportional to the closeness centrality. So if at level $d$ the lower bound is already bigger than the last element of the vector, there is no need to compute the remaining levels of the BFS, since the node will not be added to \texttt{top} anyway. \s
The bound \eqref{lower-bound} describes a worst case scenario, and that is exactly what makes it a valid lower bound. If we are at level $d$ of the exploration, we have already computed the sum in \eqref{farness} up to level $d-1$. Then we need to account for the current level of exploration: in the worst case, the vertex is linked to all the nodes at distance $d$. We also set $r(v)=n$, as in the case where the graph is strongly connected and all nodes are reachable from $v$. A Python sketch of this pruned BFS is shown right after this list.
\end{itemize}
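\nd To make the procedure concrete, here is a minimal Python sketch of the pruned BFS described above. It is only an illustration: the function names (\texttt{top\_k\_closeness}, \texttt{bfs\_with\_cutoff}) and the adjacency-list representation are placeholders, not the actual implementation used for the tests in this paper.
\begin{lstlisting}[language=python]
from math import inf

def bfs_with_cutoff(adj, v, n, kth_farness):
    """Level-by-level BFS from v. Returns the farness of v, or None if the
    lower bound exceeds the farness of the k-th node found so far."""
    visited = {v}
    frontier = [v]
    sigma = 0                      # partial sum of the distances (sigma_{d-1})
    d = 0                          # current level of exploration
    while frontier:
        d += 1
        next_frontier = []
        for u in frontier:
            for w in adj[u]:
                if w not in visited:
                    visited.add(w)
                    next_frontier.append(w)
        n_d = len(next_frontier)   # nodes found at distance d
        # lower bound (n-1)/(n-1)^2 * (sigma_{d-1} + n_d * d): normalize with
        # r(v) = n and count only the nodes discovered up to this level
        if (n - 1) / (n - 1) ** 2 * (sigma + n_d * d) > kth_farness:
            return None            # kill the BFS: v cannot enter the top-k
        sigma += n_d * d
        frontier = next_frontier
    r = len(visited)               # r(v), nodes reachable from v
    if r <= 1:
        return inf                 # convention: closeness 0, farness infinite
    return (n - 1) / (r - 1) ** 2 * sigma

def top_k_closeness(adj, k):
    """adj: dict node -> list of neighbours. Returns the k nodes with the
    smallest farness (i.e. the highest closeness) as (farness, node) pairs."""
    n = len(adj)
    top = []                       # best k (farness, node) pairs found so far
    for v in adj:
        kth = top[k - 1][0] if len(top) >= k else inf
        farness = bfs_with_cutoff(adj, v, n, kth)
        if farness is not None:
            top.append((farness, v))
            top.sort()
            top = top[:k]
    return top
\end{lstlisting}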

Let's analyze how much this filtering affects our data while varying the variable \texttt{MIN\textunderscore ACTORS}.
\begin{table}[h]
% table body not reproduced here: number of nodes and edges of the actors graph for each tested value of MIN_ACTORS
\label{table:actors}
\end{table}
\nd In \ref{table:actors} we can see the exponential growth of both the nodes and edges cardinality with lower values of \texttt{MIN\textunderscore ACTORS}. This clearly explains the results obtained in \ref{time-actors}.
\subsubsection{Discrepancy of the results}
We want to analyze how truthful our results are while varying \texttt{MIN\textunderscore ACTORS}. The methodology is simple: for each pair of result lists we take their intersection, which gives the number of elements in common. Knowing the length of the lists, we can then find the number of elements not in common. \s
\nd An easy way to visualize these results is with a square matrix $n \times n, ~ A = (a_{ij})$, where $n$ is the number of different values that we gave to \texttt{MIN\textunderscore ACTORS} during the testing. In this way, the entry in position $(i,j)$ is the percentage of discrepancy between the results obtained with the $i$-th and the $j$-th tested values of \texttt{MIN\textunderscore ACTORS}. \s
\nd This analysis is implemented in Python using the \texttt{pandas} and \texttt{numpy} libraries.
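\nd A minimal sketch of this computation is shown below; the file names and the values of \texttt{MIN\textunderscore ACTORS} used here are placeholders, chosen only to illustrate the procedure, and the lists are assumed to have the same length $k$.
\begin{lstlisting}[language=python]
import numpy as np
import pandas as pd

# placeholder values of MIN_ACTORS and result files, one node id per line
min_actors = [5, 10, 20, 30, 40]
tops = {m: set(pd.read_csv(f"top_actors_{m}.txt", header=None)[0])
        for m in min_actors}

n = len(min_actors)
A = np.zeros((n, n))
for i, mi in enumerate(min_actors):
    for j, mj in enumerate(min_actors):
        common = len(tops[mi] & tops[mj])   # elements in common
        k = len(tops[mi])                   # length of the top-k list
        A[i, j] = 100 * (k - common) / k    # percentage of discrepancy

print(pd.DataFrame(A, index=min_actors, columns=min_actors))
\end{lstlisting}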
\subsubsection{Discrepancy of the results}
All the observations made before are still valid for this case, so we won't repeat them for brevity. As done before (\ref{fig:matrix-a}), we are going to use a matrix to visualize and analyze the results.
\s
% \lstinputlisting[language=c++]{code/closeness_analysis_2.py}

\section{Conclusions}
In this paper we discussed the results of an exact algorithm for the computation of the $k$ most central nodes in a graph, according to closeness centrality. We saw that, with the introduction of a lower bound, the real world performance is far better than that of a brute force algorithm that computes a complete \texttt{BFS} from every node. \s
\nd Since there was no server with dozens of threads and hundreds of gigabytes of RAM available, every idea has been implemented knowing that it needed to run fine on a regular laptop. This constraint led to interesting implementations for the filtering of the raw data. \s
\nd We have seen two different case studies, both based on the IMDb network. For each of them we had to find a way to filter the data without losing accuracy in the results. We saw that with a harder filtering we gain a lot of performance, but the results show an increasing discrepancy from reality. Analyzing those tests, we were able to find, for both graphs, a balance that gives accuracy and performance at the same time.
\s \nd This work is heavily based on \cite{DBLP:journals/corr/BergaminiBCMM17}. Even though that article uses a more complex and complete approach, the results on the IMDb case study are almost identical. They worked with snapshots, analyzing single time periods, so there are some inevitable discrepancies. Despite that, most of the top-$k$ actors are the same and the closeness centrality values are very similar. We can use this comparison to attest to the correctness and efficiency of the algorithm presented in this paper.


\subsection{Data Structure}
All the data used can be downloaded here: \url{https://datasets.imdbws.com/} \s
\noindent In particular, we're interested in 4 files:
\begin{itemize}
\item \texttt{title.basics.tsv}
\item \texttt{title.principals.tsv}
\item \texttt{name.basics.tsv}
\item \texttt{title.ratings.tsv}
\end{itemize}
\subsection{Filtering}
This is a crucial section for the algorithm in this particular case study. The raw data contains a huge amount of useless information that would just have a negative impact on the performance during the computation. We are going to see in detail all the modifications made for each file. All these operations have been implemented using \texttt{Python} and the \texttt{pandas} library. \s
\nd Since we want to build two different graphs, some considerations have to be made for each specific case. If nothing is specified, it means that the filtering of that file is the same for both graphs.
\subsubsection{name.basics.tsv}
\subsubsection{title.basics.tsv}
For this file we only need the following columns:
\begin{itemize}
\item \texttt{tconst}
\item \texttt{titleType}
\item \texttt{primaryTitle}
\item \texttt{isAdult}
\end{itemize}
Since all the movie identifiers start with the string \texttt{t0}, we can remove it to clean the output. In this case, we also want to remove all the adult movies. This step can be optional if we are interested only in the closeness and harmonic centrality. Even if the actors and actresses of the adult industry tend to make a lot of movies together, this won't alter the centrality results. As we know, a higher closeness centrality can be seen as the ability of a node to spread information efficiently through the network. Including the adult industry would lead to the creation of a very dense and isolated neighborhood, but none of those nodes would have a higher closeness centrality, because they only spread information within their own community. This phenomenon will be discussed more deeply in the analysis of the graph visualized in section \ref{Visualization of the graphs}. \s
\noindent We can also notice that there is a lot of \emph{junk} in IMDb. To avoid dealing with useless data, we can consider only the non-adult movies whose category is in this whitelist:
\begin{itemize}
\item \texttt{movie}
\item \texttt{tvSeries}
\item \texttt{tvMovie}
\item \texttt{tvMiniSeries}
\end{itemize}
The reason to consider only these categories is purely to optimize the performance during the computation. On IMDb each episode is listed as a single element: to remove them without losing the most important relations, we only consider the category \texttt{tvSeries}. This category lists a TV series as a single element, not divided into multiple episodes. In this way we will lose some of the relations with minor actors that may appear in just a few episodes, but we will preserve the relations between the protagonists of the show. \s
\noindent Then we can generate the final filtered file \texttt{FilmFiltrati.txt}, which has only two columns: \texttt{tconst} and \texttt{primaryTitle}.
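\nd A sketch of this filtering with \texttt{pandas} could look as follows; the whitelist and the column names are the ones discussed above, while the file paths and the output separator are arbitrary choices made for the example.
\begin{lstlisting}[language=python]
import pandas as pd

whitelist = ["movie", "tvSeries", "tvMovie", "tvMiniSeries"]

df = pd.read_csv("title.basics.tsv", sep="\t", dtype=str, na_values="\\N",
                 usecols=["tconst", "titleType", "primaryTitle", "isAdult"])

# keep only the non-adult titles whose category is in the whitelist
df = df[(df["isAdult"] == "0") & (df["titleType"].isin(whitelist))]

# clean the identifiers, removing the leading "t0" string
df["tconst"] = df["tconst"].str.replace(r"^t0", "", regex=True)

# FilmFiltrati.txt: only the two columns tconst and primaryTitle
df[["tconst", "primaryTitle"]].to_csv("FilmFiltrati.txt", sep="\t",
                                      index=False, header=False)
\end{lstlisting}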
\subsubsection{title.principals.tsv}
This file is needed for the analysis of both graphs, but some specific distinctions have to be made between the two cases. For both we only need the following columns:
\begin{itemize}
\item \texttt{tconst}
\item \texttt{nconst}
\item \texttt{category}
\end{itemize}
\noindent As done for the previous files, we clean the output removing the unnecessary strings in \texttt{tconst} and \texttt{nconst}. Let's now apply a different filtering for each case. \s
\textsc{Actors Graph}
\s
\nd For this graph we only keep the rows whose \texttt{category} refers to an actor or an actress. \s
\textsc{Movies Graph}
\s
\noindent For this graph we don't need any optimization on this file. We just clean the output and leave the rest as it is. \s
\nd At the end, for both graphs, we can finally generate the file \texttt{Relazioni.txt}, containing the columns \texttt{tconst} and \texttt{nconst}.
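\nd Again, a minimal \texttt{pandas} sketch of this step; the prefixes stripped from the identifiers and the commented filter on the \texttt{category} column for the actors graph are assumptions made only for the example.
\begin{lstlisting}[language=python]
import pandas as pd

df = pd.read_csv("title.principals.tsv", sep="\t", dtype=str, na_values="\\N",
                 usecols=["tconst", "nconst", "category"])

# clean the identifiers as done for the other files (prefixes assumed here)
df["tconst"] = df["tconst"].str.replace(r"^t0", "", regex=True)
df["nconst"] = df["nconst"].str.replace(r"^nm", "", regex=True)

# actors graph only: restrict to the categories of interest (assumption);
# for the movies graph every row is kept as it is
# df = df[df["category"].isin(["actor", "actress"])]

# Relazioni.txt: only the two columns tconst and nconst
df[["tconst", "nconst"]].to_csv("Relazioni.txt", sep="\t",
                                index=False, header=False)
\end{lstlisting}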
\subsubsection{title.ratings.tsv}
This file is necessary just for the analysis of the movie graph; it won't even be used for the actors graph. For this file we only need the following columns:
\begin{itemize}
\item \texttt{tconst}
\item \texttt{numVotes}
\end{itemize}
\nd The idea behind the optimization made on this file is similar to the one we used before with the \texttt{MIN\textunderscore ACTORS} technique. We want to avoid computing movies that, with high probability, are not central. To do that we consider the number of votes that each movie has received on the IMDb website, introducing the constant \texttt{VOTES}: it defines the minimum number of votes that a movie needs to have on the IMDb platform to be considered in the analysis. During the analysis we will change this value to see how it affects the list of the top-$k$ most central movies. \s
\nd In this case we don't have to generate a new file: we can apply this condition directly to \texttt{FilmFiltrati.txt}.
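\nd For instance, the condition can be applied as in the following sketch; the value of \texttt{VOTES} below is just a placeholder.
\begin{lstlisting}[language=python]
import pandas as pd

VOTES = 500   # placeholder: minimum number of votes required on IMDb

ratings = pd.read_csv("title.ratings.tsv", sep="\t",
                      usecols=["tconst", "numVotes"], dtype={"tconst": str})
ratings["tconst"] = ratings["tconst"].str.replace(r"^t0", "", regex=True)
popular = set(ratings.loc[ratings["numVotes"] >= VOTES, "tconst"])

# keep in FilmFiltrati.txt only the movies above the VOTES threshold
films = pd.read_csv("FilmFiltrati.txt", sep="\t", header=None, dtype=str,
                    names=["tconst", "primaryTitle"])
films = films[films["tconst"].isin(popular)]
films.to_csv("FilmFiltrati.txt", sep="\t", index=False, header=False)
\end{lstlisting}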

\section{Generalization to harmonic centrality}
The algorithm shown in this paper is very versatile: it can also be adapted to compute the top-$k$ nodes according to the harmonic centrality $h(v)$. In this case, instead of a lower bound on the farness, we need an upper bound $U_B(v)$ on the harmonic centrality such that
\begin{equation}
h(v) \leq U_B (v) \leq h(w)
\end{equation}
\nd A possible upper bound can be obtained by considering the worst case that could happen at each step:
\begin{equation}
U_B (v) = \sigma_{d-1} + \frac{n_d}{d} + \frac{n - r - n_d}{d+1}
\end{equation}
\nd When we are at level $d$ of our exploration, we already know the partial sum $\sigma_{d-1}$. The worst case at this level happens when the node $v$ is as close as possible to all the remaining nodes: the $n_d$ nodes of the current frontier are at distance $d$ and all the nodes not reached yet are at distance $d+1$. To account for this possibility we add the terms $\frac{n_d}{d} + \frac{n - r - n_d}{d+1}$.
\s \nd This method has been tested and works with excellent results. What needs to be adjusted is a formal normalization for the harmonic centrality and for the upper bound. In the GitHub repository, the script already gives the possibility to compute the top-$k$ harmonic centrality of both graphs.
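\nd As a sketch, the cutoff test for the harmonic case mirrors the one used for the closeness; the interpretation of $r$ as the number of nodes reached before the current level is an assumption based on the formula above.
\begin{lstlisting}[language=python]
def harmonic_upper_bound(sigma_prev, n, r, n_d, d):
    """Upper bound U_B(v) at level d: sigma_{d-1} plus the contribution of the
    n_d nodes of the current frontier at distance d, plus the n - r - n_d
    nodes not reached yet, assumed to be at distance d + 1."""
    return sigma_prev + n_d / d + (n - r - n_d) / (d + 1)

# inside the level-by-level BFS, at every change of level:
# if harmonic_upper_bound(sigma, n, r, n_d, d) <= kth_harmonic:
#     break   # kill the BFS: v cannot enter the top-k
\end{lstlisting}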
