tables of cardinalities

main
Luca Lombardo 3 years ago
parent 878ae76adb
commit c66853dca9

@ -32,7 +32,7 @@ During the computation of the farness, for each node, we have to compute the dis
where $\sigma_d$ is the partial sum in \eqref{farness} at the level of exploration $d$. The lower bound \eqref{lower-bound} is updated every time we change level of exploration during the BFS. In this way, if at a change of level the lower bound of the vertex that we are considering is bigger than the $k$-th element of \texttt{top}, we can kill the BFS. The reason behind that is very simple: the vector \texttt{top} is populated with the top-$k$ nodes in order, and the farness is inversely proportional to the closeness centrality. So if at level $d$ the lower bound is already bigger than the last element of the vector, there is no need to explore the remaining levels of the BFS, since the node will not be added to \texttt{top} anyway. \s
The bound \eqref{lower-bound} describes a worst-case scenario, and that is exactly what makes it a valid lower bound. If we are at level $d$ of the exploration, we have already computed the sum in \eqref{farness} up to level $d-1$. We then need to account for the current level of exploration: in the worst case, every node not yet visited is linked to the frontier, i.e. it sits at distance $d$. We also set $r(v)=n$, which corresponds to the case in which the graph is strongly connected and all nodes are reachable from $v$.
\end{itemize}
\textsc{Write pseudocode}
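\nd As a provisional stand-in for that pseudocode, the following Python sketch illustrates the pruned BFS just described. It is only an illustration: the adjacency-list representation and the function names are assumptions made here, the normalization factor of \eqref{farness} is omitted, and this is not the code actually used in the project.
\begin{lstlisting}[language=Python]
import heapq

def pruned_bfs(adj, s, n, kth_farness):
    # Level-by-level BFS from s. Returns the (un-normalized) farness of s,
    # or None if the lower bound exceeds kth_farness (the BFS is cut).
    dist = {s: 0}
    frontier, sigma, d = [s], 0, 0
    while frontier:
        d += 1
        # Worst case for the new level: every still-unvisited node
        # is linked to the frontier, i.e. it sits at distance d.
        if sigma + d * (n - len(dist)) > kth_farness:
            return None                     # s can never enter the top-k
        nxt = []
        for u in frontier:
            for w in adj[u]:
                if w not in dist:
                    dist[w] = d
                    sigma += d
                    nxt.append(w)
        frontier = nxt
    return sigma

def top_k_closeness(adj, k):
    # adj: node -> list of neighbours. Returns (farness, node) pairs for the
    # k nodes with the smallest farness, i.e. the highest closeness.
    n = len(adj)
    top = []                                # max-heap of (-farness, node)
    for v in adj:
        kth = -top[0][0] if len(top) == k else float("inf")
        f = pruned_bfs(adj, v, n, kth)
        if f is not None:
            heapq.heappush(top, (-f, v))
            if len(top) > k:
                heapq.heappop(top)
    return sorted((-f, v) for f, v in top)
\end{lstlisting}
\s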

@ -9,9 +9,9 @@ The first one will tell us how much more efficient the algorithm is in terms of
\nd The platform for the tests is \emph{a laptop}, so the measurements can not be considered precise due to factors such as thermal throttling. The CPU is an Intel(R) Core™ i7-8750H (6 cores, 12 threads), and the machine is equipped with 16GB of DDR4 RAM @ 2666 MHz.
\subsection{Actors graph} \label{actors-graph}
Let's take into analysis the graph where each actor is a node and two nodes are linked if they played in a movie together. In this case, during the filtering, we created the variable \texttt{MIN\textunderscore ACTORS}. This variable is the minimum number of movies that an actor/actress has to have made to be considered in the computation. \s
\nd Varying this variable obviously affects the algorithm, in different ways. The higher this variable is, the fewer actors we are taking into consideration. So, with a smaller graph, we expect better results in terms of execution time. On the other hand, we can also expect less accurate results. What we are going to discuss is how much changing \texttt{MIN\textunderscore ACTORS} affects these two factors.
\subsubsection{Time of execution} \label{time-actors}
@ -27,18 +27,39 @@ In this section we are going to analyze the performance of the algorithm in func
\nd In figure \ref{fig:actors_time} only the \emph{CPU time} (divided by the number of threads) is taken into consideration. The \emph{system time}, however, is in the order of a few minutes in the worst case.
\subsubsection{Variation of the nodes and edges cardinality}
Let's analyze how much this filtering affects our data. By varying the filtering variable, we change the condition for a node to be considered or not.
\begin{table}[h!]
\centering
\begin{tabular}{||c c c||}
\hline
\texttt{MIN\textunderscore ACTORS} & Number of nodes & Number of edges \\ [0.5ex]
\hline\hline
1 & 923109 & 3202679 \\
5 & 126771 & 1949325 \\
15 & 37955 & 1251717 \\
20 & 26337 & 1056544 \\
31 & 13632 & 748580 \\
42 & 7921 & 545848 \\ [1ex]
\hline
\end{tabular}
\caption{Number of nodes and edges of the actors graph for different values of \texttt{MIN\textunderscore ACTORS}}
\label{table:actors}
\end{table}
\nd In table \ref{table:actors} we can see how quickly both the node and the edge cardinality grow as \texttt{MIN\textunderscore ACTORS} decreases. This clearly explains the results obtained in section \ref{time-actors}.
\subsubsection{Discrepancy of the results}
We want to analyze how truthful our results are while varying \texttt{MIN\textunderscore ACTORS}. The methodology is simple: for each pair of result lists we take their intersection, which gives the number of elements in common. Knowing the length of the lists, we can then find the number of elements that are not in common. \s
\nd A way to see these results is with a square matrix $A = (a_{ij})$ of size $n \times n$, where $n$ is the number of different values that we gave to \texttt{MIN\textunderscore ACTORS} during the testing. In this way, the entry in position $(i,j)$ is the percentage of discrepancy between the results obtained with the $i$-th and the $j$-th tested values of \texttt{MIN\textunderscore ACTORS}. \s
\newpage
\nd This analysis is implemented in Python using the \texttt{pandas} and \texttt{numpy} libraries.
\lstinputlisting[language=Python]{code/closeness_analysis.py}
\nd Visualizing it we obtain the matrix in figure \ref{fig:matrix-a}. As expected, it is symmetrical and the elements on the diagonal are all equal to zero. We can clearly see that with a lower value of \texttt{MIN\textunderscore ACTORS} the results are more precise: the discrepancy with \texttt{MIN\textunderscore ACTORS=10} is 14\%, while it is 39\% when \texttt{MIN\textunderscore ACTORS=70}.
\begin{figure}[h!]
\centering
@ -46,12 +67,8 @@ We want to analyze how truthful our results are while varying \texttt{MIN\textun
\caption{Discrepancy of the results on the actors graph as a function of the minimum number of movies required to be considered as a node}
\label{fig:matrix-a}
\end{figure}
\nd This is what we obtain comparing the top-$k$ results when $k=100$. It could be interesting to see how much the discrepancy changes with different values of $k$. However, choosing a lower value for $k$ would not be useful for this type of analysis: since we are looking at the elements that the two lists do not have in common, with short lists we would get results biased by statistical fluctuations. \s
\textsc{Running the tests with other values of k requires a server. To be decided} \s
\newpage

tex/src/code.tex

@ -22,7 +22,7 @@ Now we can use the file \texttt{Relazioni.txt}. As before, we loop on all the el
\item \texttt{id\textunderscore attore}: index key of each actor
\end{itemize}
\nd If they both exist, we update the list of indices of the movies that the actor/actress played in. In the same way, we update the list of indices of the actors/actresses that played in the movie with that id.
\lstinputlisting[language=c++]{code/graph.cpp}
\s

tex/src/data.tex

@ -4,14 +4,14 @@ The algorithm shown before can be applied to any dataset on which is possibile t
\subsection{Data Structure}
All the data used can be downloaded here: \url{https://datasets.imdbws.com/} \s
\noindent In particular, we are interested in 4 files:
\begin{itemize}
\item \texttt{title.basics.tsv}
\item \texttt{title.principals.tsv}
\item \texttt{name.basics.tsv}
\item \texttt{title.ratings.tsv}
\end{itemize}
Let's have a closer look at these 4 files:
\subsubsection*{title.basics.tsv}
\emph{Contains the following information for titles:}
@ -62,7 +62,7 @@ Let's have a closer look to this 4 files:
This is a crucial section for the algorithm in this particular case study. The raw data contains a huge amount of useless information that would just have a negative impact on the performance of the computation. We are going to see in detail all the modifications made to each file. All these operations have been implemented using \texttt{Python} and the \texttt{pandas} library. \s
\nd Since we want to build two different graphs, some considerations have to be made for each specific case. If nothing is specified, the filtering of that file is the same for both graphs.
\subsubsection{name.basics.tsv}
@ -88,9 +88,9 @@ For this file we only need the following columns
\item \texttt{isAdult}
\item \texttt{titleType}
\end{itemize}
Since all the movies start with the string \texttt{t0}, we can remove it to clean the output. In this case, we also want to remove all the adult movies. This step is optional if we are interested only in the closeness and harmonic centrality: even though actors and actresses of the adult industry tend to make a lot of movies together, this does not alter the centrality results. As we know, a higher closeness centrality can be seen as the ability of a node to spread information efficiently through the network. Including the adult industry would lead to the creation of a very dense but isolated neighborhood; none of those nodes would have a high closeness centrality, because they only spread information within their own community. This phenomenon will be discussed more deeply in the analysis of the visualized graph in section \ref{Visualization of the graphs}. \s
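\nd A minimal \texttt{pandas} sketch of this step could look as follows; the file path, the exact set of loaded columns and the variable names are illustrative assumptions, not the project's actual code.
\begin{lstlisting}[language=Python]
import pandas as pd

# Load only the columns needed for this filtering step.
movies = pd.read_csv("title.basics.tsv", sep="\t",
                     usecols=["tconst", "titleType", "isAdult"],
                     dtype=str)

# Drop the adult titles (optional for closeness/harmonic centrality).
movies = movies[movies["isAdult"] == "0"]

# Strip the common prefix from the ids to clean the output
# (Series.str.removeprefix requires pandas >= 1.4).
movies["tconst"] = movies["tconst"].str.removeprefix("t0")
\end{lstlisting}
\s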
\noindent We can also notice that there is a lot of \emph{junk} in IMDb. To avoid dealing with useless data, we can consider only the non-adult movies whose type is in this whitelist:
\begin{itemize}
\item \texttt{movie}
@ -104,7 +104,7 @@ The reason to only consider this categories is purely to optimize the performanc
\subsubsection{title.principals.tsv}
This file is needed for the analysis of both graphs, but there are some differences in how it is handled for each of them. For both cases we only need the following columns:
\begin{itemize}
\item \texttt{tconst}
@ -112,12 +112,12 @@ This file is needed for the analysis of both graphs, but there some different ob
\item \texttt{category}
\end{itemize}
\noindent As done for the previous files, we clean the output by removing the unnecessary strings in \texttt{tconst} and \texttt{nconst}. \s
\textsc{Actors Graph}
\s
\noindent Using the data obtained before, we create an array of unique actor ids (\texttt{nconst}) and an array of how many times each of them appears (\texttt{counts}). This gives us the number of movies each actor appears in. Here comes the core of the optimization for this graph: let's define a constant \texttt{MIN\textunderscore ACTORS}. This integer is the minimum number of movies that an actor needs to have made in their career to be considered in this graph. The reason for doing this is purely computational: if an actor/actress has made fewer than a reasonable number of movies in their career, there is a low probability that they could have an important role in our graph during the computation of the centralities. \s
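\nd A minimal \texttt{numpy} sketch of this counting step; the sample ids and the threshold value are only illustrative.
\begin{lstlisting}[language=Python]
import numpy as np

MIN_ACTORS = 2                        # illustrative threshold

# One entry per (movie, actor) relation, e.g. parsed from title.principals.tsv
nconst = np.array(["nm0000001", "nm0000002", "nm0000001", "nm0000003"])

# ids: unique actor identifiers; counts: number of movies for each of them
ids, counts = np.unique(nconst, return_counts=True)

# Keep only the actors that appear in at least MIN_ACTORS movies.
kept = ids[counts >= MIN_ACTORS]      # -> ["nm0000001"]
\end{lstlisting}
\s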
\textsc{Movies Graph} \s
@ -134,6 +134,6 @@ This file is necessary just in the analysis of the movie graph, it won't be even
\item \texttt{numVotes}
\end{itemize}
\nd The idea behind the optimization made for this file is the same used before with the \texttt{MIN\textunderscore ACTORS} technique: we want to avoid computing movies that, with high probability, are not central. To do that we consider the number of votes that each movie has received on the IMDb website and introduce the constant \texttt{VOTES}, considering only the movies with a higher number of votes. During the analysis we will change this value to see how it affects the list of the top-$k$ most central movies. \s
\nd In this case we don't have to generate a new file: we can apply this condition directly to \texttt{FilmFiltrati.txt}.
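\nd A minimal \texttt{pandas} sketch of this filter; the threshold value and the file path are illustrative assumptions.
\begin{lstlisting}[language=Python]
import pandas as pd

VOTES = 500                           # illustrative threshold

ratings = pd.read_csv("title.ratings.tsv", sep="\t",
                      usecols=["tconst", "numVotes"])

# Ids of the movies that received at least VOTES votes on IMDb;
# FilmFiltrati.txt is then restricted to this set of ids.
popular = set(ratings.loc[ratings["numVotes"] >= VOTES, "tconst"])
\end{lstlisting}
\s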

@ -1,4 +1,4 @@
\section{Harmonic centrality}
The algorithm shown in this paper is very versatile. We have tested it with two different graphs and obtained excellent results. But there could be more.

@ -7,16 +7,16 @@ A graph $G= (V,E)$ is a pair of a sets. Where $V = \{v_1,...,v_n\}$ is the set o
c(v) = \frac{n-1}{\displaystyle \sum_{w \in V} d(v,w)}
\end{equation}
\nd where $n$ is the cardinality of $V$ and $d(v,w)$ is the distance between $v,w \in V$. This is a very powerful tool for the analysis of a network: it ranks each node, telling us which are the most efficient ones in spreading information to all the other nodes of the graph. As mentioned before, the denominator of this definition is the sum of the lengths of the shortest paths from $v$ to every other node. This means that for a node to be central, the average number of links needed to reach another node has to be low. The goal of this paper is to compute the $k$ nodes with the highest closeness centrality. \s
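\nd As a toy example (not taken from the dataset), consider the path graph on three nodes $v_1 - v_2 - v_3$. For the central node we have $\sum_{w} d(v_2,w) = 1+1 = 2$, so $c(v_2) = \frac{3-1}{2} = 1$, while for an endpoint $c(v_1) = \frac{2}{1+2} = \frac{2}{3}$: the node that reaches all the others with fewer links gets the higher score. \s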
\noindent As a case study we are using the collaboration network in the \emph{Internet Movie Database} (IMDb). We are going to consider two different graphs. For the first one we define an undirected graph $G=(V,E)$ where
\begin{itemize}
\item The nodes $V$ are the actors and the actresses
\item The non-oriented edges in $E$ link two nodes if they played together in a movie.
\end{itemize}
For the second one we swap the roles: we define an undirected graph $G=(V,E)$ where
\begin{itemize}
\item the nodes $V$ are the movies
\item the non-oriented edges in $E$ link two movies if they have an actor or actress in common.
\end{itemize}
@ -27,6 +27,6 @@ Since we are dealing with a web-scale network any brute force algorithm would re
\noindent We can solve the APSP problem either using the fast matrix multiplication or, as done in this paper, implementing a breadth-first-search (BFS) method. There are several reasons to prefer the second approach over the first one in this type of problem. \s
\noindent A graph is a data structure and we can describe it in different ways; choosing one over another can have an enormous impact on performance. In this case, we need to remember the type of graph that we are dealing with: a very big and sparse one. The fast matrix multiplication approach represents the graph as an $n\times n$ matrix where the entry in position $(i,j)$ is zero if the nodes $i,j$ are not linked, and 1 (or a generic weight) otherwise. This method requires $O(n^2)$ space in memory, which is an enormous quantity for a web-scale graph: with roughly $10^6$ nodes the matrix would already have about $10^{12}$ entries. Furthermore, the time complexity is $O(n^{2.373} \log n)$ \texttt{[Zwick 2002; Williams 2012]}. \s
\noindent Using the BFS method the space complexity is $O(n+m)$, which is a much lower value compared to the previous method. In terms of time, the complexity is $O(nm)$. Unfortunately, this is not enough to compute all the distances in a reasonable time, and it has also been proven that this method can not be improved. In this paper I propose an exact algorithm to compute the top-$k$ nodes with the highest closeness centrality.
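\nd As an illustration of the $O(n+m)$ representation that the BFS approach relies on, here is a short Python sketch; the function name and the edge-list input are assumptions made here, not the project's code.
\begin{lstlisting}[language=Python]
def adjacency_list(n, edges):
    # Adjacency lists of an undirected graph: one list per node plus two
    # entries per edge, so the memory grows as O(n + m) instead of O(n^2).
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj

# For reference, the full actors graph built later in this work has roughly
# 9.2e5 nodes and 3.2e6 edges: a few million list entries against the
# ~8.5e11 cells of a dense n x n matrix.
\end{lstlisting}
\s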


@ -1,4 +1,4 @@
\section{Visualization of the graphs} \label{Visualization of the graphs}
Graphs are fascinating structures, and visualizing them can give us a deeper understanding of their properties. Since we are dealing with millions of nodes, displaying them all would be impossible, especially on a web page. \s
\nd For each case we need to find a small (in the order of 1000) subset of nodes $S \subset V$ that we want to display. It's important to select, as far as we can, nodes that are ``important'' in the graph. \s
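\nd A minimal sketch of one possible selection criterion (keeping the highest-degree nodes and the edges among them); this is only an illustration and not necessarily the criterion used for the figures in this section.
\begin{lstlisting}[language=Python]
def display_subset(adj, size=1000):
    # adj: node -> list of neighbours. Keep the `size` nodes with the
    # highest degree and the edges induced among them.
    S = set(sorted(adj, key=lambda v: len(adj[v]), reverse=True)[:size])
    return {v: [w for w in adj[v] if w in S] for v in S}
\end{lstlisting}
\s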
