{
"cells": [
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import networkx as nx\n",
"import time\n",
"import math\n",
"import pandas as pd\n",
"import scipy as sp\n",
"import plotly.express as px\n",
"import plotly.graph_objs as go\n",
"from scipy.sparse import *\n",
"from scipy import linalg\n",
"from scipy.sparse.linalg import norm\n",
"from scipy.optimize import least_squares"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's build the graphs from the edge lists downloaded from the SNAP database. Only `web-Stanford` is loaded here; the `web-BerkStan` line is left commented out."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"G1 = nx.read_edgelist('../data/web-Stanford.txt', create_using=nx.DiGraph(), nodetype=int)\n",
"\n",
"# G2 = nx.read_edgelist('../data/web-BerkStan.txt', create_using=nx.DiGraph(), nodetype=int)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Creating the transition probability matrix"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Build the n x n sparse matrix P, where n is the number of nodes in the graph.\n",
"# P[i, j] = 1/outdeg(i) if node i links to node j, and 0 otherwise.\n",
"# Node labels are assumed to be the integers 1..n, hence the shift by one.\n",
"\n",
"def create_matrix(G):\n",
"    n = G.number_of_nodes()\n",
"    P = sp.sparse.lil_matrix((n, n))\n",
"    for i in G.nodes():\n",
"        out_degree = len(G[i])  # G[i] holds the successors of i in a DiGraph\n",
"        for j in G[i]:\n",
"            P[i-1, j-1] = 1/out_degree\n",
"    return P"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To ensure that the random process has a unique stationary distribution and does not stagnate, the transition matrix $P$ is usually modified into an irreducible stochastic matrix $A$ (called the Google matrix) as follows\n",
"\n",
"$$ A = \\alpha \\tilde{P} + (1-\\alpha)v e^T$$\n",
"\n",
"where $\\tilde{P}$ is defined as\n",
"\n",
"$$ \\tilde{P} = P + v d^T$$\n",
"\n",
"Here $d \\in \\mathbb{N}^{n \\times 1}$ is a binary vector marking the dangling web pages, i.e. $d(i) = 1$ if the $i$-th page has no outgoing hyperlinks, $v \\in \\mathbb{R}^{n \\times 1}$ is a probability vector, $e = [1, 1, \\ldots, 1]^T$, and $0 < \\alpha < 1$ is the so-called damping factor, the probability in the model that the surfer moves by clicking a hyperlink rather than in some other way."
]
},
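{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small worked example may make the construction concrete. The sketch below is not part of the original experiment: it builds $\\tilde{P}$ and $A$ densely for a hypothetical 4-page graph, assuming the column-stochastic convention of the formula above (column $j$ holds the out-links of page $j$), and prints the column sums of $A$, which should all be one up to rounding. The names `toy_google_matrix`, `Pt_toy` and `A_toy` are made up for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def toy_google_matrix(alpha_toy=0.85):\n",
"    # Hypothetical 4-page web: page 0 -> 1, 2; page 1 -> 2; page 2 -> 0; page 3 has no out-links.\n",
"    links = {0: [1, 2], 1: [2], 2: [0], 3: []}\n",
"    n_toy = len(links)\n",
"    P_toy = np.zeros((n_toy, n_toy))\n",
"    for j, targets in links.items():\n",
"        for i in targets:\n",
"            P_toy[i, j] = 1/len(targets)  # column j spreads the weight of page j over its out-links\n",
"    d_toy = (P_toy.sum(axis=0) == 0).astype(float)  # 1 for dangling pages, 0 otherwise\n",
"    v_toy = np.full(n_toy, 1/n_toy)\n",
"    e_toy = np.ones(n_toy)\n",
"    Pt_toy = P_toy + np.outer(v_toy, d_toy)\n",
"    A_toy = alpha_toy*Pt_toy + (1 - alpha_toy)*np.outer(v_toy, e_toy)\n",
"    return Pt_toy, A_toy\n",
"\n",
"Pt_toy, A_toy = toy_google_matrix()\n",
"print(A_toy.sum(axis=0))  # column sums of the toy Google matrix"
]
},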
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"n = G1.number_of_nodes()\n",
"P = create_matrix(G1) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The vector `d` marks the dangling nodes, i.e. the pages with no outgoing links."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# d is the n x 1 sparse vector of dangling nodes: d(i) = 1 if row i of P is all zeros\n",
"# (node i has no outgoing links), otherwise d(i) = 0.\n",
"\n",
"row_sums = np.asarray(P.sum(axis=1)).ravel()\n",
"d = sp.sparse.lil_matrix((n, 1))\n",
"for i in np.flatnonzero(row_sums == 0):\n",
"    d[i, 0] = 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The vector `v` is a probability vector: its entries are non-negative and must sum to one."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# v is the n x 1 probability vector with every entry equal to 1/n (uniform distribution)\n",
"# https://en.wikipedia.org/wiki/Probability_vector\n",
"\n",
"v = sp.sparse.lil_matrix(np.full((n, 1), 1/n))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can compute the modified transition matrix $\\tilde{P}$.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
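"# P is row-oriented (row i holds the out-links of node i), while the formula above needs\n",
"# column j to hold the out-links of page j, so P is transposed before adding v d^T.\n",
"# Each column of Pt then sums to one.\n",
"Pt = (P.T + v.dot(d.T)).tocsr()\n",
"\n",
"# Pt is a sparse matrix too"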
]
},
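{
"cell_type": "markdown",
"metadata": {},
"source": [
"Storing $v d^T$ explicitly adds $n$ entries for every dangling page. As a possible alternative, the product $\\tilde{P} u$ can be evaluated as $P u + v (d^T u)$ without forming $\\tilde{P}$ at all. The cell below is only a sketch of that identity on a hypothetical 4-page example with dense vectors; `apply_Pt_implicitly` and the `*_demo` names are made up for illustration and are not used elsewhere in the notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import scipy.sparse as sparse\n",
"\n",
"def apply_Pt_implicitly(P_col, d_flags, v_prob, u):\n",
"    # (P + v d^T) u = P u + v (d^T u): one sparse product plus a scalar multiple of v\n",
"    return P_col @ u + v_prob*(d_flags @ u)\n",
"\n",
"# Hypothetical column-stochastic example; page 3 is dangling, so column 3 of P_demo is empty.\n",
"P_demo = sparse.csr_matrix(np.array([\n",
"    [0.0, 0.0, 1.0, 0.0],\n",
"    [0.5, 0.0, 0.0, 0.0],\n",
"    [0.5, 1.0, 0.0, 0.0],\n",
"    [0.0, 0.0, 0.0, 0.0],\n",
"]))\n",
"d_demo = np.array([0.0, 0.0, 0.0, 1.0])\n",
"v_demo = np.full(4, 0.25)\n",
"u_demo = np.array([0.1, 0.2, 0.3, 0.4])\n",
"\n",
"explicit = (P_demo.toarray() + np.outer(v_demo, d_demo)) @ u_demo\n",
"implicit = apply_Pt_implicitly(P_demo, d_demo, v_demo, u_demo)\n",
"print(np.allclose(explicit, implicit))  # the two products should agree"
]
},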
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# e is the 1 x n sparse row vector of ones (it plays the role of e^T in the formula for A)\n",
"e = sp.sparse.lil_matrix(np.ones((1, n)))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# # v*eT is a nxn sparse matrix filled all with 1/n, let's call it B\n",
"\n",
"# B = sp.sparse.lil_matrix((n,n))\n",
"# for i in range(n):\n",
"# for j in range(n):\n",
"# B[i,j] = 1/n\n",
"\n",
"# A = alpha*Pt + (1-alpha)*B"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Algorithm 1: shifted power method for PageRank with multiple damping factors"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# pandas dataframe to store the results\n",
"df = pd.DataFrame(columns=['alpha', 'iterations', 'tau', 'time'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Shifted power method (Algorithm 1 in the paper) for several damping factors at once.\n",
"# Returns mv (the number of matrix-vector products), the lists x and r holding, for each\n",
"# damping factor in a, the PageRank vector and the last residual vector, and the elapsed time.\n",
"\n",
"def Algorithm1(Pt, v, tau, max_mv, a: list):\n",
"\n",
"    start_time = time.time()\n",
"\n",
"    u = Pt.dot(v) - v\n",
"    mv = 1  # number of matrix-vector products\n",
"    Res = np.zeros(len(a))  # residual norm for each damping factor\n",
"    r = [None]*len(a)\n",
"    x = [v.copy() for _ in a]\n",
"\n",
"    for i in range(len(a)):\n",
"        r[i] = a[i]*u\n",
"        Res[i] = norm(r[i])\n",
"\n",
"        if Res[i] >= tau:\n",
"            x[i] = r[i] + v\n",
"\n",
"    while Res.max() >= tau and mv < max_mv:\n",
"        # u is overwritten at every step: once mv is incremented, u = Pt^(mv-1) (Pt v - v)\n",
"        u = Pt.dot(u)\n",
"        mv += 1\n",
"\n",
"        for i in range(len(a)):\n",
"            if Res[i] >= tau:\n",
"                # the mv-th power-method update for a[i] is a[i]^mv * u\n",
"                r[i] = (a[i]**mv)*u\n",
"                Res[i] = norm(r[i])\n",
"\n",
"                if Res[i] >= tau:\n",
"                    x[i] = r[i] + x[i]\n",
"\n",
"    if Res.max() >= tau:\n",
"        print(\"The algorithm didn't converge in \", max_mv, \" iterations\")\n",
"    else:\n",
"        print(\"The algorithm converged in \", mv, \" iterations\")\n",
"\n",
"    total_time = time.time() - start_time\n",
"    print(\"The algorithm took \", total_time, \" seconds\")\n",
"\n",
"    return mv, x, r, total_time"
]
},
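{
"cell_type": "markdown",
"metadata": {},
"source": [
"The idea behind the shifted power method is that the power iteration $x_{k+1} = \\alpha \\tilde{P} x_k + (1-\\alpha) v$ started from $x_0 = v$ satisfies $x_{k+1} - x_k = \\alpha^{k+1} \\tilde{P}^k (\\tilde{P} v - v)$, so the vectors $\\tilde{P}^k (\\tilde{P} v - v)$ can be shared by all damping factors. The cell below is a minimal dense sketch of this recurrence on a hypothetical 3-page matrix, compared against a direct solve of $(I - \\alpha \\tilde{P}) x = (1-\\alpha) v$. The function `toy_shifted_power` and the `*_small` names are made up for illustration; this is not the implementation used for the experiments."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def toy_shifted_power(Pt_dense, v_dense, alphas, tau=1e-12, max_mv=2000):\n",
"    u = Pt_dense @ v_dense - v_dense  # shared by every damping factor\n",
"    xs = [v_dense.copy() for _ in alphas]\n",
"    mv = 1  # number of matrix-vector products\n",
"    while True:\n",
"        steps = [(alpha**mv)*u for alpha in alphas]\n",
"        for x_i, step in zip(xs, steps):\n",
"            x_i += step  # add the mv-th power-method update\n",
"        if max(np.linalg.norm(step) for step in steps) < tau or mv >= max_mv:\n",
"            return xs, mv\n",
"        u = Pt_dense @ u\n",
"        mv += 1\n",
"\n",
"# Hypothetical column-stochastic matrix: every column sums to one.\n",
"Pt_small = np.array([[0.0, 0.5, 1/3],\n",
"                     [1.0, 0.0, 1/3],\n",
"                     [0.0, 0.5, 1/3]])\n",
"v_small = np.full(3, 1/3)\n",
"alphas_small = [0.85, 0.95]\n",
"\n",
"xs_small, mv_small = toy_shifted_power(Pt_small, v_small, alphas_small)\n",
"for alpha, x_alpha in zip(alphas_small, xs_small):\n",
"    direct = np.linalg.solve(np.eye(3) - alpha*Pt_small, (1 - alpha)*v_small)\n",
"    print(alpha, np.abs(x_alpha - direct).max())  # the difference should be close to zero"
]
},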
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# list of alpha values, from 0.85 to 0.99 with step 0.01\n",
"a = [i/100 for i in range(85, 100)]\n",
"\n",
"max_mv = 1000\n",
"\n",
"# run the algorithm for tau = 10^-5, 10^-6, ..., 10^-9\n",
"for i in range(5, 10):\n",
"    tau = 10**(-i)\n",
"    print(\"\\ntau = \", tau)\n",
"    mv, x, r, total_time = Algorithm1(Pt, v, tau, max_mv, a)\n",
"    # DataFrame.append was removed in pandas 2.0, so the new row is added with pd.concat\n",
"    row = pd.DataFrame([{'alpha': a, 'iterations': mv, 'tau': tau, 'time': total_time}])\n",
"    df = pd.concat([df, row], ignore_index=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# save the results in a csv file\n",
"df.to_csv('../data/results/algo1/different_tau.csv', index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Plotting the results of the algorithm for different values of tau, and fixed alpha"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# tau and the measured values are read in the same row order, so each point pairs a tau with its own result\n",
"x = df['tau'].tolist()\n",
"y = df['iterations'].tolist()\n",
"\n",
"# tau spans several orders of magnitude, so it is shown on a logarithmic axis\n",
"fig1 = go.Figure(data=go.Scatter(x=x, y=y, mode='lines+markers'),\n",
"                 layout=go.Layout(title='Iterations needed for the convergence', xaxis_title='tau', xaxis_type='log', yaxis_title='iterations'))\n",
"\n",
"# save the plot in a html file\n",
"fig1.write_html(\"../data/results/algo1/taus_over_iterations.html\")\n",
"\n",
"##### RESULTS OVER TIME #####\n",
"\n",
"x1 = df['tau'].tolist()\n",
"y1 = df['time'].tolist()\n",
"\n",
"fig2 = go.Figure(data=go.Scatter(x=x1, y=y1, mode='lines+markers'),\n",
"                 layout=go.Layout(title='Time needed for the convergence', xaxis_title='tau', xaxis_type='log', yaxis_title='time (seconds)'))\n",
"\n",
"# save the plot in a html file\n",
"fig2.write_html(\"../data/results/algo1/taus_over_time.html\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To view a plot, open the corresponding HTML file in a browser, for example\n",
"\n",
"```bash\n",
"firefox taus_over_iterations.html\n",
"```\n",
"or\n",
"\n",
"```bash\n",
"firefox taus_over_time.html\n",
"```\n",
"\n",
"_Run these commands from the `data/results/algo1` folder, where the files are saved._"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"def Arnoldi(A, v, m):  # Arnoldi process with modified Gram-Schmidt, Algorithm 2 in the paper\n",
"    v = np.asarray(v.toarray() if sp.sparse.issparse(v) else v, dtype=float).ravel()\n",
"    beta = np.linalg.norm(v)\n",
"    V = np.zeros((v.size, m+1))  # orthonormal basis vectors, stored as columns\n",
"    h = np.zeros((m+1, m))       # upper Hessenberg matrix\n",
"    V[:, 0] = v/beta\n",
"\n",
"    for j in range(m):\n",
"        w = A.dot(V[:, j])\n",
"        for i in range(j+1):\n",
"            h[i, j] = V[:, i].dot(w)\n",
"            w = w - h[i, j]*V[:, i]\n",
"\n",
"        h[j+1, j] = np.linalg.norm(w)\n",
"\n",
"        if h[j+1, j] < 1e-12:  # breakdown: the Krylov subspace is invariant\n",
"            print('Breakdown at step', j+1)\n",
"            m = j+1\n",
"            break\n",
"        else:\n",
"            V[:, j+1] = w/h[j+1, j]\n",
"\n",
"    # V is n x (m+1), h is (m+1) x m, and A V[:, :m] = V h\n",
"    return V[:, :m+1], h[:m+1, :m], m, beta, j"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"A = sp.sparse.rand(100,100, density=0.5, format='lil')\n",
"v = sp.sparse.rand(100,1, density=1, format='lil')\n",
"m = 100"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"v, h, m, beta, j = Arnoldi(A, v, m)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Restarted shifted GMRES for the systems ((1/a_i) I - Pt) x_i = ((1 - a_i)/a_i) v, one per damping factor.\n",
"# Every iterate starts from zero, so that all residuals are multiples of one vector;\n",
"# the argument x is kept in the signature but is not used.\n",
"\n",
"def Algo4(Pt, v, m, a: list, tau, maxit: int, x=None):\n",
"\n",
"    n = Pt.shape[0]\n",
"    a = np.asarray(a, dtype=float)\n",
"    v = np.asarray(v.toarray() if sp.sparse.issparse(v) else v, dtype=float).ravel()\n",
"    I = sp.sparse.identity(n, format='csr')\n",
"\n",
"    it = 1\n",
"    mv = 0\n",
"    x = np.zeros((n, len(a)))  # x[:, i] is the iterate for a[i]\n",
"    base = v.copy()            # every residual is r_i = coef[i]*base\n",
"    coef = (1 - a)/a\n",
"    res = a*np.abs(coef)*np.linalg.norm(base)\n",
"\n",
"    while res.max() >= tau and it < maxit:\n",
"        k = int(np.argmax(res))  # seed system: the one with the largest residual\n",
"        gamma = coef/coef[k]     # r_i = gamma[i]*r_k\n",
"        V, H, mk, beta, j = Arnoldi((1/a[k])*I - Pt, coef[k]*base, m)\n",
"        mv += mk\n",
"\n",
"        # seed system: least squares problem min || beta*e1 - H y ||\n",
"        rhs = np.zeros(mk+1)\n",
"        rhs[0] = beta\n",
"        y = np.linalg.lstsq(H, rhs, rcond=None)[0]\n",
"        z = rhs - H.dot(y)\n",
"        norm_z = np.linalg.norm(z)\n",
"\n",
"        for i in np.flatnonzero(res >= tau):\n",
"            if i == k:\n",
"                x[:, k] += V[:, :mk].dot(y)\n",
"                coef[k] = 1.0\n",
"            else:\n",
"                # shifted system: [H_i, z] [y_i; g_i] = gamma_i*beta*e1 keeps r_i = g_i*r_k\n",
"                Hi = H.copy()\n",
"                Hi[:mk, :mk] += (1/a[i] - 1/a[k])*np.eye(mk)\n",
"                sol = np.linalg.lstsq(np.column_stack([Hi, z]), gamma[i]*rhs, rcond=None)[0]\n",
"                x[:, i] += V[:, :mk].dot(sol[:mk])\n",
"                coef[i] = sol[mk]\n",
"            res[i] = a[i]*abs(coef[i])*norm_z\n",
"\n",
"        base = V.dot(z)  # new seed residual, the common direction for the next cycle\n",
"        it += 1\n",
"\n",
"    return x, res, mv\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.10.6 64-bit",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}