visualization section started

3 years ago · 7ad5e0a366
parent 329a1f84bc
commit 7ad5e0a366
8 changed files with 60 additions and 4 deletions
--- a/tex/Figure_2.png
+++ b/tex/Figure_2.png
--- a/tex/Screenshot.png
+++ b/tex/Screenshot.png
--- a/tex/analysis.tex
+++ b/tex/analysis.tex
@ -8,7 +8,7 @@ The first one will tell us how much more efficient the algorithm is in terms of

 \nd The platform for the tests is \emph{a laptop}, so the measurements cannot be considered precise due to factors such as thermal throttling. The CPU is an Intel(R) Core™ i7-8750H (6 cores, 12 threads), equipped with 16GB of DDR4 @2666 MHz RAM.

-\subsection{Actors graph}
+\subsection{Actors graph} \label{actors-graph}
 Let us analyze the graph where each actor is a node and two nodes are linked if they played in a movie together. In this case, during the filtering, we created the variable \texttt{MIN\textunderscore ACTORS}. This variable is the minimum number of movies that an actor or actress must have made to be considered in the computation.

 Varying this variable affects the algorithm in different ways. The higher this variable is, the fewer actors we take into consideration. With a smaller graph, we expect better results in terms of execution time. On the other hand, we can also expect less accurate results. What we are going to discuss is how much changing \texttt{MIN\textunderscore ACTORS} affects these two factors.
@ -28,8 +28,9 @@ We want to analyze how truthful our results are while varying \texttt{MIN\textun

 \nd Visualizing this analysis, we obtain the following matrix:

-\begin{figure}[h]
-    \includegraphics[width=13cm]{Figure_1.png}
+\begin{figure}[h]
+    \includegraphics[width=12cm]{Figure_1.png}
+    \caption{Discrepancy of the results on the actors graph as a function of the minimum number of movies required for an actor to be considered as a node} \label{matrix-a}
 \end{figure}

 \nd As expected, the matrix is symmetrical and the elements on the diagonal are all equal to zero. We can clearly see that with a lower value of \texttt{MIN\textunderscore ACTORS} the results are more precise. The discrepancy with \texttt{MIN\textunderscore ACTORS=10} is 14\%, while it is 39\% when \texttt{MIN\textunderscore ACTORS=70}. \s
@ -37,3 +38,24 @@ We want to analyze how truthful our results are while varying \texttt{MIN\textun
 \nd This is what we obtain by comparing the top-$k$ results when $k=100$. It is interesting to see how much the discrepancy changes with different values of $k$. However, choosing a lower value for $k$ would not be useful for this type of analysis. Since we are looking at the elements that two lists do not have in common, with short lists we would get results biased by statistical fluctuations. \s

 \textsc{To do: test with k=500 and k=1000}
+
+\s
+\newpage
+\subsection{Movies Graphs}
+In this section we consider the graph built over the movies and their common actors/actresses. Due to the large number of nodes, to optimize the performance during the execution we introduced the variable \texttt{VOTES} in section \ref{filtering}. It represents the minimum number of votes (regardless of whether they are positive or negative) that a movie needs to have on the IMDb database to be considered as a node in our graph.
+
+As seen during the analysis of the actors graph in \ref{actors-graph}, varying this kind of variable affects the results in many ways. All the observations made before are still valid in this case, so I won't repeat them for brevity. As done before (\ref{matrix-a}), we are going to use a matrix to visualize and analyze the results.
+\s
+
+% \lstinputlisting[language=Python]{code/closeness_analysis_2.py}
+
+\nd This gives us:
+\begin{figure}[H]
+    \centering
+    \includegraphics[width=13cm]{Figure_2.png}
+    \caption{Discrepancy of the results on the movie graph as a function of the minimum number of votes required for a movie to be considered as a node} \label{matrix-b}
+\end{figure}
+\newpage
+\lstinputlisting[language=Python]{code/closeness_analysis_2.py}
+
+\s \nd \emph{Say something about the analysis, but it should be redone because the values are not right}
--- a/tex/code/closeness_analysis_2.py
+++ b/tex/code/closeness_analysis_2.py
@ -0,0 +1,10 @@
+import numpy as np
+import pandas as pd
+
+# Top-k movies (column 1 of each tab-separated file) for every VOTES threshold
+dfs = {i: pd.read_csv(f"top_movies_{i:02d}_c.txt", sep='\t', usecols=[1], names=["movie"])
+       for i in [500, 1000, 5000, 10000, 25000, 50000, 75000, 100000]}
+sets = {i: set(df["movie"]) for i, df in dfs.items()}
+# diff[i][j]: number of movies in list i but not in list j, divided by the length of the first list
+diff = np.array([[len(sets[i] - sets[j]) for j in sets] for i in sets], dtype=float)
+diff /= len(next(iter(sets.values())))
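The discrepancy computation in closeness_analysis_2.py can be checked on a toy input. The following is only a minimal sketch: the function name `discrepancy_matrix` and the movie labels are hypothetical and not part of the repository, and it assumes that all the top-k lists have the same length k.

```python
import numpy as np


def discrepancy_matrix(sets):
    """Entry (i, j) is the share of list i that does not appear in list j.

    Assumes every set has the same size k (hypothetical helper, not in the repo).
    """
    keys = list(sets)
    k = len(sets[keys[0]])
    diff = np.array(
        [[len(sets[a] - sets[b]) for b in keys] for a in keys], dtype=float
    )
    return diff / k


# Made-up top-4 lists for three thresholds, used only for illustration
toy_sets = {
    500: {"A", "B", "C", "D"},
    1000: {"A", "B", "C", "E"},
    5000: {"A", "F", "G", "H"},
}

matrix = discrepancy_matrix(toy_sets)
print(matrix)
```

When all lists have the same length, the number of elements of list i missing from list j equals the number of elements of list j missing from list i, which is why the matrix comes out symmetric with a zero diagonal.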
--- a/tex/data.tex
+++ b/tex/data.tex
@ -58,7 +58,7 @@ Let's have a closer look to this 4 files:
 \end{itemize}

 \newpage
-\subsection{Filtering}
+\subsection{Filtering} \label{filtering}

 This is a crucial section for the algorithm in this particular case study. The raw data contains a huge amount of useless information that would just have a negative impact on the performance during the computation. We are going to see in detail all the modifications made to each file. All these operations have been implemented using \texttt{python} and the \texttt{pandas} library. \s

--- a/tex/main.pdf
+++ b/tex/main.pdf
--- a/tex/main.tex
+++ b/tex/main.tex
@ -77,5 +77,6 @@
 \include{data.tex}
 \include{code.tex}
 \include{analysis.tex}
+\include{visualization.tex}

 \end{document}
--- a/tex/visualization.tex
+++ b/tex/visualization.tex
@ -0,0 +1,23 @@
+\section{Visualization of the graphs}
+Graphs are fascinating structures, and visualizing them can give us a deeper understanding of their properties. To do that, we need to make some sacrifices. We are dealing with millions of nodes, and displaying them all would be impossible, especially on a web page as I did. \s
+
+\nd For each case we need to find a small (in the order of 1000) subset of nodes $S \subset V$ that we want to display. It is important to select, as far as we can, nodes that are ``important'' in the graph. \s
+
+\nd This whole section is implemented in python using the library \texttt{pyvis}. The goal of this library is to provide a python-based approach to constructing and visualizing network graphs in the same space. A pyvis network can be customized on a per-node or per-edge basis. Nodes can be given colors, sizes, labels, and other metadata. Each graph can be interacted with, allowing the dragging, hovering, and selection of nodes and edges. Each graph's layout algorithm can be tweaked as well, to allow experimentation with the rendering of larger graphs. It is designed as a wrapper around the popular JavaScript \texttt{visJS} library.
+
+\subsection{Actors Graph}
+For the actors graph we choose the subset $S$ as the actors with at least 100 movies made in their career. We can immediately deduce that this subset will be characterized by actors and actresses of a certain age. But as we have seen, having a high number of movies made is a good estimator for the closeness centrality. It is important to keep in mind that the graph will only show the relations between nodes in this subset. This means that even if an actor has 100 movies in their career, in this graph they may have just a few relations. We can see this graph as a collaboration network between the most popular actors and actresses. \s
+
+\nd An interactive version can be found at the web page below. It will take a few seconds to render, and it is better viewed on a computer than on a smartphone. \s
+
+\textsc{Interactive version}: \url{https://lukefleed.xyz/imdb-graph.html}
+
+\begin{center}
+    \s \nd \qrcode{https://lukefleed.xyz/imdb-graph.html}
+\end{center}
+
+\begin{figure}[H]
+    \centering
+    \includegraphics[width=13cm]{Screenshot.png}
+    \caption{The collaboration network of the actors and actresses with more than 100 movies on the IMDb network} \label{imdb-a-network}
+\end{figure}
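The subset selection described for the actors graph can be sketched in a few lines of Python. This is not the code used for the published page. The names `induced_collaboration_graph` and `to_pyvis`, the toy casts, and the output file name are all hypothetical. The pyvis part assumes the `Network`, `add_node`, `add_edge` and `save_graph` calls of that library, and it is defined but not executed here.

```python
from itertools import combinations


def induced_collaboration_graph(movie_casts, min_movies):
    """Keep actors with at least `min_movies` credits and only the edges among them."""
    counts = {}
    for cast in movie_casts.values():
        for actor in set(cast):
            counts[actor] = counts.get(actor, 0) + 1
    selected = {actor for actor, c in counts.items() if c >= min_movies}

    edges = set()
    for cast in movie_casts.values():
        kept = sorted(set(cast) & selected)  # sorted so each pair has one orientation
        edges.update(combinations(kept, 2))
    return selected, edges, counts


def to_pyvis(nodes, edges, path="actors.html"):
    """Write an interactive HTML page (requires pyvis; not called in this sketch)."""
    from pyvis.network import Network

    net = Network(height="750px", width="100%")
    for node in sorted(nodes):
        net.add_node(node, label=str(node))
    for source, target in sorted(edges):
        net.add_edge(source, target)
    net.save_graph(path)


# Made-up casts; the threshold 3 plays the role of the 100 movies used in the text
movie_casts = {
    "m1": ["ann", "bob", "cat"],
    "m2": ["ann", "bob"],
    "m3": ["ann", "dan"],
    "m4": ["bob", "dan"],
}
nodes, edges, counts = induced_collaboration_graph(movie_casts, 3)
print(nodes, edges)
```

In this toy input, `ann` has worked with three different people, but only her link with `bob` survives the filtering. This is the effect noted above: a node with many movies may keep just a few relations in the displayed graph.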