tex files for the analysis

main
Luca Lombardo 3 years ago
parent 31b0ed8117
commit d27d53e662


@ -0,0 +1,35 @@
\section{Analysis of the results}
In this section we discuss the results of the top-k algorithm applied to the IMDb graphs. We are particularly interested in two factors:
\begin{itemize}
\item The execution time as a function of different filtering values.
\item The discrepancy of the results as the filtering values vary.
\end{itemize}
The first tells us how much more efficient the algorithm becomes in terms of time, independently of the results. The second is the metric for understanding how accurate the filtered algorithm is. Clearly, even if we can run the algorithm 100 times faster, it is of no use if the results are completely different from the real ones.\s
\nd The platform for the tests is \emph{a laptop}, so the measurements cannot be considered precise due to factors such as thermal throttling. The CPU is an Intel(R) Core™ i7-8750H (6 cores, 12 threads), equipped with 16GB of DDR4 @2666 MHz RAM.
\subsection{Actors graph}
Let us take into analysis the graph where each actor is a node and two nodes are linked if the corresponding actors played in a movie together. In this case, during the filtering, we introduced the variable \texttt{MINMOVIES}: the minimum number of movies that an actor/actress must have made to be considered in the computation.
Varying this variable obviously affects the algorithm. The higher it is, the fewer actors we take into consideration; with a smaller graph, we expect better results in terms of execution time. On the other hand, we can also expect less accurate results. What we are going to discuss is how much changing \texttt{MINMOVIES} affects these two factors.
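The filtering step can be sketched as follows. This is an illustrative snippet with hypothetical data (the names \texttt{movie\_counts} and \texttt{kept} are not from the project's code), not the actual implementation:

```python
# Illustrative sketch of the MINMOVIES filter, with hypothetical data.
MINMOVIES = 10

# movie_counts maps each actor to the number of movies they played in.
movie_counts = {"Actor A": 25, "Actor B": 3, "Actor C": 12, "Actor D": 7}

# Keep only actors with at least MINMOVIES movies; the closeness
# computation then runs on the induced, smaller subgraph.
kept = {actor for actor, n in movie_counts.items() if n >= MINMOVIES}
```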
\subsubsection{Time of execution}
TO DO
\subsubsection{Discrepancy of the results}
We want to analyze how truthful our results are while varying \texttt{MINMOVIES}. The methodology is simple: for each pair of result lists we take their intersection, which gives the number of elements in common. Knowing the length of the lists, we can then derive the number of elements not in common. \s
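As a toy illustration of this methodology (the two top-5 lists below are hypothetical, not taken from the actual results):

```python
# Two hypothetical top-5 result lists obtained with different
# MINMOVIES values.
top_a = ["A", "B", "C", "D", "E"]
top_b = ["A", "C", "D", "F", "G"]

common = len(set(top_a) & set(top_b))  # elements in common
not_common = len(top_a) - common       # elements not in common
discrepancy = not_common / len(top_a)  # fraction of discrepancy
```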
\nd A way to present these results is a square matrix $n \times n, ~ A = (a_{ij})$, where $n$ is the number of different values that we gave to \texttt{MINMOVIES} during the testing. The $(i,j)$ entry is then the percentage of discrepancy between the results obtained with the $i$-th and the $j$-th tested values of \texttt{MINMOVIES}. \s
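\nd Under the assumption that every list has the same length $k$ (as in a top-$k$ ranking), and denoting by $S_i$ the set of actors obtained with the $i$-th tested value of \texttt{MINMOVIES}, each entry can be written as
\[
a_{ij} = \frac{k - |S_i \cap S_j|}{k},
\]
which is zero on the diagonal and symmetric. \s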
\nd This analysis is implemented in Python using the \texttt{pandas} and \texttt{numpy} libraries.
\lstinputlisting[language=Python]{code/closeness_analysis.py}
\nd Visualizing this analysis we obtain the following:
\begin{figure}[h]
\centering
\includegraphics[width=13cm]{Figure_1.png}
\end{figure}
\nd As expected, the matrix is symmetric and the elements on the diagonal are all zero. We can clearly see that lower values of \texttt{MINMOVIES} give more precise results: the discrepancy with \texttt{MINMOVIES=10} is 14\%, while it reaches 39\% when \texttt{MINMOVIES=70}.

@ -0,0 +1,11 @@
import pandas as pd
import numpy as np

# MINMOVIES values used during the tests
values = [5] + list(range(10, 71, 10))

# Load the top-k actors computed with each MINMOVIES value
dfs = {i: pd.read_csv(f"top_actors_{i:02d}_c.txt", sep='\t',
                      usecols=[1], names=["actor"])
       for i in values}
sets = {i: set(df["actor"]) for i, df in dfs.items()}

# diff[i][j] = number of actors in the i-th list but not in the j-th
diff = []
for i in sets.keys():
    diff.append([len(sets[i]) - len(sets[i] & sets[j]) for j in sets.keys()])
diff = np.array(diff, dtype=float)

# Normalize by the list length k to obtain fractions of discrepancy
diff /= len(next(iter(sets.values())))