small fixes to the code and the tex

main
Luca Lombardo 2 years ago
parent d91c796b57
commit 2df71072a6

@ -145,6 +145,8 @@
"\n",
" # reshape y to be a column vector\n",
" y = y.reshape(y.shape[0],1)\n",
"\n",
" # update x \n",
" x += V[:,0:y.shape[0]] @ y\n",
"\n",
" # compute the residual vector\n",
@ -160,7 +162,7 @@
" A_tmp = sp.sparse.hstack([H, z]) # stack H and z, as in the paper, to solve the linear system (?)\n",
" A_tmp = A_tmp.tocsc() # Convert A to CSC format for sparse solver\n",
"\n",
" # What should I put here? What does it mean in the paper the 14 of the pseudocode?\n",
" # What should I put here? What does it mean in the paper the line 14 of the pseudocode?\n",
" result = sp.sparse.linalg.spsolve(A_tmp, np.zeros(A_tmp.shape[0])) # if I solve this, I get a vector of zeros.\n",
" print(result)\n",
" \n",
@ -189,7 +191,6 @@
"metadata": {},
"outputs": [],
"source": [
"\n",
"n = 100\n",
"m = 110\n",
"maxit = 100\n",

@ -233,6 +233,20 @@ class Algorithms:
# TO DO
pass
class runners:
def ShiftedPowerMethod(tau):
dataset = Utilities.load_data()
max_mv = 100
G, n = Utilities.create_graph(dataset)
P = Utilities.create_matrix(G)
d = Utilities.dangling_nodes(P,n)
v = Utilities.probability_vector(n)
Pt = Utilities.transition_matrix(P, v, d)
a = Utilities.alpha()
mv, x, r, total_time = Algorithms.algo1(Pt, v, tau, max_mv, a)
print("total time = ", total_time)

Binary file not shown.

78
tex/main.tex vendored

@ -35,15 +35,16 @@ for solving PageRank with multiple damping factors}
\maketitle
\begin{abstract}
Starting from the seminal paper published by Brin and Page in 1998, the PageRank model has been extended to many fields far beyond search engine rankings, such as chemistry, biology, bioinformatics, social network analysis, to name a few. Due to the large dimension of PageRank problems, in the past decade or so, considerable research efforts have been devoted to their efficient solution especially for the difficult cases where the damping factors are close to 1. However, there exists few research work concerning about the solution of the case where several PageRank problems with the same network structure and various damping factors need to be solved. In this paper, we generalize the Power method to solving the PageRank problem with multiple damping factors. We demonstrate that the solution has almost the equative cost of solving the most difficult PageRank system of the sequence, and the residual vectors of the PageRank systems after running this method are collinear. Based upon these results, we develop a more efficient method that combines this Power method with the shifted GMRES method. For further accelerating the solving phase, we present a seed system choosing strategy combined with an extrapolation technique, and analyze their effect. Numerical experiments demonstrate the potential of the proposed iterative solver for accelerating realistic PageRank computations with multiple damping factors.
\noindent In the years following its publication in 1968, the PageRank model has been studied deeply to be extended in fields such as chemistry, biology and social network analysis. The aim of this project is the implementation of a modified version of the Power method to solve the PageRank problem with multiple damping factors. The proposed method is based on the combination of the Power method with the shifted \texttt{GMRES} method.
% During the years since it's first version, the PageRank model has been extended to many fields far beyond search engine rankings, such as chemistry, biology, bioinformatics, social network analysis, to name a few. Due to the large dimension of PageRank problems, in the past decade or so, considerable research efforts have been devoted to their efficient solution especially for the difficult cases where the damping factors are close to 1. However, there exists few research work concerning about the solution of the case where several PageRank problems with the same network structure and various damping factors need to be solved. In this paper, we generalize the Power method to solving the PageRank problem with multiple damping factors. We demonstrate that the solution has almost the equative cost of solving the most difficult PageRank system of the sequence, and the residual vectors of the PageRank systems after running this method are collinear. Based upon these results, we develop a more efficient method that combines this Power method with the shifted GMRES method. For further accelerating the solving phase, we present a seed system choosing strategy combined with an extrapolation technique, and analyze their effect. Numerical experiments demonstrate the potential of the proposed iterative solver for accelerating realistic PageRank computations with multiple damping factors.
\end{abstract}
\tableofcontents
\clearpage
\section{Introduction}
The PageRank model was proposed by Google in a series of papers to evaluate accurately the most important web-pages from the World Wide Web matching a set of keywords entered by a user. Nowadays, the model is routinely adopted for the analysis of many scientific problems far beyond Internet applications, for example in computational chemistry, biology, bioinformatics, social network analysis, bibliometrics, software debugging and many others. For search engine rankings, the importance of web-pages is computed from the stationary probability vector of the random
process of a web surfer who keeps visiting a large set of web-pages connected by hyperlinks. The link structure of the World Wide Web is represented by a directed graph, the so-called web link graph, and its corresponding adjacency matrix $G \in \N^{n \times n}$ where $n$ denotes the number of pages and $G_{ij}$ is nonzero (being 1) only if the \emph{jth} page has a hyperlink pointing to the \emph{ith} page. The transition probability matrix $P \in \R^{n \times n}$ of the random process has entries
The PageRank model was proposed by Google in a series of papers to evaluate accurately the most important web-pages from the World Wide Web matching a set of keywords entered by a user. For search engine rankings, the importance of web-pages is computed from the stationary probability vector of the random process of a web surfer who keeps visiting a large set of web-pages connected by hyperlinks. The link structure of the World Wide Web is represented by a directed graph, the so-called web link graph, and its corresponding adjacency matrix $G \in \N^{n \times n}$ where $n$ denotes the number of pages and $G_{ij}$ is nonzero (being 1) only if the \emph{jth} page has a hyperlink pointing to the \emph{ith} page. The transition probability matrix $P \in \R^{n \times n}$ of the random process has entries as described in \ref{eq:transition}.
\begin{equation}\label{eq:transition}
P(i,j) =
@ -52,11 +53,16 @@ process of a web surfer who keeps visiting a large set of web-pages connected by
0 & \text{otherwise}
\end{cases}
\end{equation}
To ensure that the random process has a unique stationary distribution and it will not stagnate, the transition matrix P is usually modified to be an irreducible stochastic matrix $A$ (called the Google matrix) as follows
\noindent The entire random process needs a unique stationary distribution. To ensure this propriety is satisfied , the transition matrix $P$ is usually modified to be an irreducible stochastic matrix $A$ (called the Google matrix) as follows:
% \noindent To ensure that the random process has a unique stationary distribution and it will not stagnate, the transition matrix P is usually modified to be an irreducible stochastic matrix $A$ (called the Google matrix) as follows
\begin{equation}\label{eq:google}
A = \alpha \tilde P + (1 - \alpha)v e^T
\end{equation}
In \ref{eq:google} we define $\tilde P = P + vd^T$ where $d \in N^{n \times 1}$ is a binary vector tracing the indices of the damping web pages with no hyperlinks, i.e., $d(i) = 1$ if the \emph{ith} page ha no hyperlink, $v \in \R^{n \times n}$ is a probability vector, $e = [1, 1, ... ,1]^T$ and $0<\alpha<1$, the so-called damping factor that represents the probability in the model that the surfer transfer by clicking a hyperlink rather than other ways. Mathematically, the PageRank model can be formulated as the problem of finding the positive unit eigenvector $x$ (the so-called PageRank vector) such that
\noindent In \ref{eq:google} we have defines a new matrix called $\tilde P = P + vd^T$ where $d \in N^{n \times 1}$ is a binary vector tracing the indices of the damping web pages with no hyperlinks, i.e., $d(i) = 1$ if the \emph{i-th} page ha no hyperlink, $v \in \R^{n \times n}$ is a probability vector, $e = [1, 1, ... ,1]^T$ and $0<\alpha<1$, the so-called damping factor that represents the probability in the model that the surfer transfer by clicking a hyperlink rather than other ways. Mathematically, the PageRank model can be formulated as the problem of finding the positive unit eigenvector $x$ (the so-called PageRank vector) such that
\begin{equation}\label{eq:pr}
Ax = x, \quad \lVert x \rVert = 1, \quad x > 0
\end{equation}
@ -65,19 +71,20 @@ or, equivalently, as the solution of the linear system
(I - \alpha \tilde P)x = (1 - \alpha)v
\end{equation}
\noindent In the past decade or so, considerable research attention has been devoted to the efficient solution of problems \ref{eq:pr} \ref{eq:pr2}, especially when $n$ is very large. For moderate values of the damping factor, e.g. for $\alpha = 0.85$ as initially suggested by Google for search engine rankings, solution strategies based on the simple Power method have proved to be very effective. However, when $\alpha$ approaches 1, as is required in some applications, the convergence rates of classical stationary iterative methods including the Power method tend to deteriorate sharply, and more robust algorithms need to be used. \vspace*{0.4cm}
\noindent The authors of the paper \cite{SHEN2022126799} emphasize how in the in the past decade or so, considerable research attention has been devoted to the efficient solution of problems \ref{eq:pr} \ref{eq:pr2}, especially when $n$ is very large. For moderate values of the damping factor, e.g. for $\alpha = 0.85$ as initially suggested by Google for search engine rankings, solution strategies based on the simple Power method have proved to be very effective. However, when $\alpha$ approaches 1, as is required in some applications, the convergence rates of classical stationary iterative methods including the Power method tend to deteriorate sharply, and more robust algorithms need to be used. \vspace*{0.4cm}
\noindent One area that is largely unexplored in PageRank computations is the efficient solution of problems with the same network structure but multiple damping factors. For example, in the Random Alpha PageRank model used in the design of anti-spam mechanism \cite{Constantine2009Random}, the rankings corresponding to many different damping factors close to 1 need to be computed simultaneously. This problem can be expressed mathematically as solving a sequence of linear systems
\noindent In the reference paper that we are using for this project, the authors focus their attention in the area of PageRank computations with the same network structure but multiple damping factors. For example, in the Random Alpha PageRank model used in the design of anti-spam mechanism \cite{Constantine2009Random}, the rankings corresponding to many different damping factors close to 1 need to be computed simultaneously. They explain that the problem can be expressed mathematically as solving a sequence of linear systems
\begin{equation}\label{eq:pr3}
(I - \alpha_i \tilde P)x_i = (1 - \alpha_i)v \quad \alpha_i \in (0, 1) \quad \forall i \in \{1, 2, ..., s\} S
(I - \alpha_i \tilde P)x_i = (1 - \alpha_i)v \quad \alpha_i \in (0, 1) \quad \forall i \in \{1, 2, ..., s\}
\end{equation}
Conventional PageRank algorithms applied to \ref{eq:pr3} would solve the $s$ linear systems independently. Although these solutions can be performed in parallel, the process would still demand large computational resources for high dimension problems.
This consideration motivates the search of novel methods with reduced algorithmic and memory complexity, to afford the solution of larger problems on moderate computing resources. We can write the PageRank problem with multiple damping factors given at once (5) as a sequence of shifted linear systems of the form:
As we know, standard PageRank algorithms applied to \ref{eq:pr3} would solve the $s$ linear systems independently. Although these solutions can be performed in parallel, the process would still demand large computational resources for high dimension problems.
This consideration motived the authors to search novel methods with reduced algorithmic and memory complexity, to afford the solution of larger problems on moderate computing resources. They suggest to write the PageRank problem with multiple damping factors given at once \ref{eq:pr3} as a sequence of shifted linear systems of the form:
\begin{equation}
(\frac{1}{\alpha_i}I - \tilde P)x^{(i)} = \frac{1 - \alpha_i}{\alpha_i}v \quad \forall i \in \{1, 2, ..., s\} \quad 0 < \alpha_i < 1
\end{equation}
Shifted Krylov methods may still suffer from slow convergence when the damping factor approaches 1, requiring larger search spaces to converge with satisfactory speed, which in turn may lead to unaffordable storage requirements for large-scale engineering applications. As an attempt of a possible remedy in this situation, we present a framework that combines. shifted stationary iterative methods and shifted Krylov subspace methods. In detail, we derive the implementation of the
Power method that solves the PageRank problem with multiple damping factors at almost the same computational cost of the standard Power method for solving one single system. Furthermore, we demonstrate that this shifted Power method generates collinear residual vectors. Based on this result, we use the shifted Power iterations to provide smooth initial solutions for running shifted Krylov subspace methods such as GMRES. Besides, we discuss how to apply seed system choosing strategy and extrapolation techniques to further speed up the iterative process.
We know from literature that the Shifted Krylov methods may still suffer from slow convergence when the damping factor approaches 1, requiring larger search spaces to converge with satisfactory speed. In \cite{SHEN2022126799} is suggest that, to overcome this problem, we can combine stationary iterative methods and shifted Krylov subspace methods. They derive an implementation of the Power method that solves the PageRank problem with multiple dumpling factors at almost the same computational time of the standard Power method for solving one single system. They also demonstrate that this shifted Power method generates collinear residual vectors. Based on this result, they use the shifted Power iterations to provide smooth initial solutions for running shifted Krylov subspace methods such as \texttt{GMRES}. Besides, they discuss how to apply seed system choosing strategy and extrapolation techniques to further speed up the iterative process.
% As an attempt of a possible remedy in this situation, we present a framework that combines. shifted stationary iterative methods and shifted Krylov subspace methods. In detail, we derive the implementation of the Power method that solves the PageRank problem with multiple damping factors at almost the same computational cost of the standard Power method for solving one single system. Furthermore, we demonstrate that this shifted Power method generates collinear residual vectors. Based on this result, we use the shifted Power iterations to provide smooth initial solutions for running shifted Krylov subspace methods such as GMRES. Besides, we discuss how to apply seed system choosing strategy and extrapolation techniques to further speed up the iterative process.
\subsection{Overview of the classical PageRank problem}
The Power method is considered one of the algorithms of choice for solving either the eigenvalue \ref{eq:pr} or the linear system \ref{eq:pr2} formulation of the PageRank problem, as it was originally used by Google. Power iterations write as
@ -86,24 +93,23 @@ The Power method is considered one of the algorithms of choice for solving eithe
\end{equation}
The convergence behavior is determined mainly by the ratio between the two largest eigenvalues of A. When $\alpha$ gets closer to $1$, though, the convergence can slow down significantly. \\
\noindent As stated in \cite{SHEN2022126799} The number of iterations required to reduce the initial residual down to a tolerance $\tau$, measured as $\tau = \lVert Ax_k - x_k \rVert = \lVert x_{k+1} - x_k \rVert$ can be estimated as $\frac{\log_{10} \tau}{\log_{10} \alpha}$. For example, when $\tau = 10^{-8}$ the Power method requires about 175 steps to converge for $\alpha = 0.9$ but the iteration count rapidly grows to 1833 for $\alpha = 0.99$. Therefore, for values of the damping parameter very close to 1 more robust alternatives to the simple Power algorithm should be used.
\noindent As stated in \cite{SHEN2022126799} The number of iterations required to reduce the initial residual down to a tolerance $\tau$, measured as $\tau = \lVert Ax_k - x_k \rVert = \lVert x_{k+1} - x_k \rVert$ can be estimated as $\frac{\log_{10} \tau}{\log_{10} \alpha}$. The authors provide an example: when $\tau = 10^{-8}$ the Power method requires about 175 steps to converge for $\alpha = 0.9$ but the iteration count rapidly grows to 1833 for $\alpha = 0.99$. Therefore, for values of the damping parameter very close to 1 more robust alternatives to the simple Power algorithm should be used.
\clearpage
\section{The shifted power method for PageRank computations}
In this section we consider extensions of stationary iterative methods for the solution of PageRank problems with multiple damping factors. We look in particular at the Power method, the Gauss-Seidel method, and the GIO iteration scheme. We are concerned with how these methods can be executed with the highest efficiency for solving such problems, especially with the question: for each method, whether there exist an implementation such that the computational cost of solving the PageRank problem with multiple damping factor is comparable to that of solving the ordinary PageRank problem with single damping factor.
In this section presents the extensions of stationary iterative methods for the solution of PageRank problems with multiple damping factors, as presented in \cite{SHEN2022126799}. We are interested in knowing if, for each method, there exists an implementation such that the computational cost of solving the PageRank problem with multiple damping factor is comparable to that of solving the ordinary PageRank problem with single damping factor.
\subsection{The implementation of the shifted power method}
Inspired by the reason why shifted Krylov subspaces can save computational cost, we investigate whether there are duplications in the calculations of multiple linear systems in this problem class by the stationary iterative methods, so that the duplications in the computation can be deleted, or in other words, the associate operations can be computed only once and used for all systems. We first analyze the Power method applied to the sequence of linear systems in \ref{eq:pr2}. It computes
at the kth iteration approximate solutions $x_k (i) (1 \leq i \leq s)$ of the form
Inspired by the reason why shifted Krylov subspaces can save computational cost, the authors of \cite{SHEN2022126799} investigate whether there are duplications in the calculations of multiple linear systems in this problem class by the stationary iterative methods, so that the duplications in the computation can be deleted and used for all systems. It's some sort of dynamic programming approach. Firstly, they analyze the Power method applied to the sequence of linear systems in \ref{eq:pr2}. It computes at the \emph{i-th} iteration approximate solutions $x_k (i) (1 \leq i \leq s)$ of the form
\begin{equation}
\alpha_i^k \tilde P^k x_k^{(i)} + (1 - \alpha_i^k) \sum_{j=0}^{k-1} \alpha_i^j \tilde P^j v
\end{equation}
If the s systems in \ref{eq:pr2} are solved synchronously, that is all $x^{(i)}_k$ are computed only after all previous approximations $x^(j)_{k-1}$ are available, then the computation can be rearranged efficiently as follows:
If the $s$ systems in \ref{eq:pr2} are solved synchronously, that is all $x^{(i)}_k$ are computed only after all previous approximations $x^{(j)}_{k-1}$ are available, then the computation can be rearranged efficiently as follows:
\begin{itemize}
\item at the first iterations
\begin{itemize}
\item compute and store $\mu_1 = \tilde P x_0$ and $\mu_2 = v$;
\item compute and store $x_1^(i) = \alpha_i \mu_1 + (1-\alpha_i)\mu_2;$
\item compute and store $x_1^{(i)} = \alpha_i \mu_1 + (1-\alpha_i)\mu_2;$
\end{itemize}
\item at any other subsequent iteration $k>1$
\begin{itemize}
@ -112,15 +118,19 @@ If the s systems in \ref{eq:pr2} are solved synchronously, that is all $x^{(i)}_
\item compute and store $x_k^{(i)} = \alpha_i \mu_1 + x_k^{(i)} + (1-\alpha_i)\alpha^{k-1}_i \mu_2$.
\end{itemize}
\end{itemize}
This implementation requires at most 2 matrix-vector products at each step, which is a significant gain compared to the $s$ matrix-vector products required by the standard Power method to compute $x^{(i)}_{k+1}$ , especially when $s \gg 2$. This is close to the computational cost, i.e. 1 matrix-vector product per iteration, of using the Power method for computing PageRank with single damping factor. \\
This implementation requires at most $2$ matrix-vector products at each step, which is a significant gain compared to the $s$ matrix-vector products required by the standard Power method to compute $x^{(i)}_{k+1}$ , especially when $s \gg 2$. \vspace{0.4cm}
\noindent An efficient implementation can compute and store $\mu = \tilde Pv -v$ at the first iteration and store $\mu = \tilde P^{k-1}(\tilde P v - v) = \tilde P \cdot (\tilde P^{k-2}(\tilde P v - v))$ at each \emph{kth} iteration ($k > 1$), and finally from each approximate solution as $x_k^{(i)} = \alpha_i^k \mu + x_{k-1}^{(i)}$. The residual vector $r_k^{(i)}$ associated with the approximate solution $x_k^{(i)}$ has the following expression
\noindent An efficient implementation can compute and store $\mu = \tilde Pv -v$ at the first iteration and store
$$\mu = \tilde P^{k-1}(\tilde P v - v) = \tilde P \cdot (\tilde P^{k-2}(\tilde P v - v))$$
At each \emph{k-th} iteration ($k > 1$), and finally from each approximate solution as $x_k^{(i)} = \alpha_i^k \mu + x_{k-1}^{(i)}$. The residual vector $r_k^{(i)}$ associated with the approximate solution $x_k^{(i)}$ has the following expression
\begin{equation}
r_k^{(i)} = A x_k^{(i)} - x_k^{(i)} = x_{k+1}^{(i)} - x_k^{(i)} = \alpha_i^{k+1} \tilde P^k (\tilde P v - v)
\end{equation}
Since in general each of the $s$ linear systems may require a different number of Power iterations to converge, the $s$ residual norms have to be monitored separately to test the convergence. We summarize the efficient implementation of the Power method that we presented in this section for solving problem \ref{eq:pr2} in Algorithm 1, and we call it the shifted Power method hereafter.
Since in general each of the $s$ linear systems may require a different number of Power iterations to converge, the $s$ residual norms have to be monitored separately to test the convergence. \vspace{0.4cm}
\begin{algorithm}
\noindent Now we can summarize the efficient implementation of the Power method presented in this section for solving problem \ref{eq:pr2} in Algorithm \ref{alg:algo1}, and we call it the shifted Power method hereafter.
\begin{algorithm}\label{alg:algo1}
\caption{Shifted-Power method for PageRank with multiple damping factors}\label{alg:algo1}
\begin{algorithmic}
\Require $\tilde P, ~v, ~\tau, ~\max_{mv}, ~\alpha_i ~ (1 \leq i \leq s)$
@ -151,7 +161,29 @@ Since in general each of the $s$ linear systems may require a different number o
\end{algorithmic}
\end{algorithm}
\noindent Where $mv$ is an integer that counts the number of matrix-vector products performed by the algorithm. The algorithm stops when either all the residual norms are smaller than the tolerance $\tau$ or the maximum number of matrix-vector products is reached.
\noindent Where $mv$ is an integer that counts the number of matrix-vector products performed by the algorithm. The algorithm stops when either all the residual norms are smaller than the tolerance $\tau$ or the maximum number of matrix-vector products is reached. An implementation of this algorithm written in Python is available in the github repository \cite{ShfitedPowGMRES} of this project.

@ -24,3 +24,14 @@ abstract = {}
month=1,
day=1,
}
# cite this repo made by Luca Lombardo https://github.com/lukefleed/ShfitedPowGMRES
@misc{ShfitedPowGMRES,
author = {Lombardo, Luca},
title = {Shifted power method for solving PageRank with multiple damping factors},
year = {2022},
publisher = {},
journal = {GitHub repository},
howpublished = {\url{https://github.com/lukefleed/ShfitedPowGMRES}}
}

Loading…
Cancel
Save