arnoldi-distribuito/typst-report/main.typ

#import "theme.typ": ams-article, definition, theorem, proposition, proof, todo

#import "@preview/algorithmic:0.1.0"
#import algorithmic: algorithm

#show: ams-article.with(
  paper-size: "a4",
  title: [Distributed Arnoldi Method for Eigenvalues using PETSc & Lapack],
  authors: (
    (
      name: "Antonio De Lucreziis",
      organization: [Dipartimento di Matematica],
      location: [Pisa, Italia],
      email: "antonio.delucreziis@gmail.com",
      url: "https://poisson.phc.dm.unipi.it/~delucreziis/"
    ),
  ),
  abstract: [
    In this project we implement the Arnoldi Method for eigenvalues using #link("https://www.netlib.org/lapack/")[Lapack] #link("https://petsc.org/release/")[PETSc], #link("https://www.open-mpi.org/")[OpenMPI] libraries for distributed matrix and vector operations and run some numerical simulations on a small cluster of $tilde.basic 30$ nodes of #link("https://radxa.com/products/rock4/4cp")[Rock 4C+] boards we built for the course of #link("https://fdurastante.github.io/courses/highperfmathematics24.html")[High-Performance Mathematics]. The code can be found at #link("https://github.com/aziis98/arnoldi-distribuito")[github.com/aziis98/arnoldi-distribuito].
  ]
)

#outline()

= Arnoldi Method for Eigenvalues

== Theory of Krylov subspaces

#definition[
  The Krylov subspace of order $ell$ associated with $A$ and $b$ is the subspace
  $
    cal(K)_ell (A, b) colon.eq op("span")(b, A b, dots, A^(ell-1) b)
  $
]

#definition[
  If $x_ell$ is an approximation of the solution to $A x = b$ then the residual is defined as
  $
    r_ell colon.eq b - A x_ell
  $
]

There are various criteria that can be used to identify the "optimal" solution

- *FOM* -- Have the residual is orthogonal to $cal(K)_ell(A, b)$.

- *GMRES* -- Have the residual minimized such that $x_ell$ is

$
  x_ell colon.eq op("argmin")_(x in cal(K)_ell(A, b)) norm(b - A x)_2
$

We are now interested in studying the *FOM* case and the behavior when $ell << n$.

#pagebreak()

== Arnoldi Iteration

We now want to construct a nice basis for $cal(K)_ell(A, b)$ as the one directly given by $A^j b$ has very poor computational properties. We need a way of constructing an orthogonal basis of the Krylov subspace without using the QR algorithm

Let's define the Arnoldi iteration that works as follows

- First let's define $v_1 colon.eq b slash norm(b)_2$

- For $j = 1, dots, ell$

  - Let's define $w_(j+1) colon.eq A v_j$

  - The to obtain the next vector let's orthogonalize it to the current basis $v_1, dots,  v_j$ and let

    $
    pi_j(w_(j+1)) colon.eq sum_(i <= j) (v_i^* w_(j+1)) v_i \
    v_(j+1)
    colon.eq
    (w_(j+1) - pi_j(w_(j+1)))/(norm(w_(j+1) - pi_j(w_(j+1)))_2)
    $

Let's also define the following *upper Hessenberg matrix* composed of all the projection coefficients

$
  h_(i,j) colon.eq v_i^* w_(j+1) = v_i^* A v_j \
  H_ell colon.eq
  mat(
    h_(1,1), dots, dots, h_(1,ell);
    h_(2,1), h_(2,2), "", dots.v;
    "", dots.down, dots.down, dots.v;
    "", "", h_(ell,ell-1), h_(ell,ell);
  )
$

The final algorithm for the *Arnoldi iteration* is

#align(center,
  block(
    width: 18em,
    align(left,
      algorithm({
        import algorithmic: *
        Function("Arnoldi-Iteration", args: ($A$, $b$, $ell$), {
          Assign[$q_1$][$b \/ norm(b)_2$]
          For(cond: $k = 2, .., ell$, {
            Assign[$q_k$][$A q_(k-1)$]
            For(cond: $j = 1, ..., k - 1$, {
              Assign[$h_(j,k-1)$][$q_j^* q_k$]
              Assign[$q_k$][$q_k - h_(j,k-1) q_j$]
            })
            Assign[$h_(k,k-1)$][$norm(q_k)_2$]
            Assign[$q_k$][$q_k \/ h_(k,k-1)$]
          })
        })
      })
    )
  )
)

== Arnoldi iteration for eigenvalues and eigenvectors

To use the Arnoldi iteration to compute the eigenvalues of $A$ we can use the following procedure

- First use the Arnoldi iteration to compute the upper Hessenberg matrix $H_ell$ for the Krylov subspace $cal(K)_ell(A, b)$.

- Compute the eigenvalues and eigenvectors of this Hessenberg matrix by computing a QR decomposition of the matrix $H_ell$.

The following theorem states the relationship between the eigenvalues of $H_ell$ and $A$

#theorem[
  Let $H_ell$ be the upper Hessenberg matrix from the Arnoldi process for $A$ (without breakdown), starting from $b$. Then, the characteristic polynomial $p(lambda) = det(lambda I - H_ell)$ satisfies
  $
    min_(p in cal(P)_ell \ p_ell = 1) norm(p(A) b)_2
  $
]

So the characteristic polynomial of $H_ell$ is a good approximation of the characteristic polynomial of $A$.

To benchmark the performance of our implementation of the distributed Arnoldi method we will use the test problem of finding the eigenvalues of the 3D Laplacian that has a nice closed form solution.

== 3D Laplacian

The matrix we will consider comes up from the Laplace equation $u : RR^n -> RR$, for example with zero Dirichlet boundary condition
on the unit cube $[0, 1]^n$ we get the following problem

$
cases(
  laplace u(x) = 0 #h(4em) & x in [0, 1]^3,
  u(x) = u_0(x) & x in partial [0, 1]^3
)
$

Let's discretize this problem using the finite difference method. We will use a uniform grid with $n_i$ points in the $i$-th direction (and also let $h_i colon.eq 1 slash (n_i + 1)$). In dimension $n = 1$ the problem is just the second derivative of $u$ with respect to $x$ and the finite difference approximation is

$
  D_(x x) u(x) = (u(x + h_x) - 2 u(x) + u(x - h_x)) / (h_x^2)
$

where $h_x colon.eq 1 slash (n_1 + 1)$ is the distance between two consecutive points. Let's notice that here the stencil has shape $mat(1, -2, 1, delim: "[")$. If we consider the matrix $upright(bold(D))_(x x)$ associated with the finite difference operator, we can write the problem as $upright(bold(D))_(x x) overline(u) = 0$ where $overline(u)$ is the vector of values of $u$ at the grid points. Where $upright(bold(D))_(x x)$ is the following tridiagonal matrix

$
  upright(bold(D))_(x x) =
  1 / h_1^2
  mat(
    -2, 1,  , , ;
    1, -2, 1, , ;
    , dots.down, dots.down, dots.down, ;
    , , 1, -2, 1;
    , , , 1, -2;
    delim: "["
  )
$

A tridiagonal matrix with constant diagonals of $a$, $b$, $c$ has the following eigenvalues
$
  lambda_k = a - 2 sqrt(b c) cos((pi k) / (n + 1))
$
for $k = 1, dots, n$. In this case we have $a = -2$, $b = 1$, $c = 1$ so the eigenvalues are given by the following expression and have the following bounds
$
  lambda_k
  &= -2 - 2 cos((pi k) / (n + 1)) \
  & = -2 (1 + cos(pi k / (n + 1)), in [-1, 1]) \
  &= -4 underbrace(cos^2(pi / 2 dot k / (n + 1)), in [0, 1])
  ==> -4 < lambda_k < 0 \
$
// #todo(msg: "Check math")[
//   To easily compute solutions to the Laplace equation we can use the fact that the solution can be written as a linear combination of the eigenvectors of the Laplacian as follows, let $u_0(x)$ be the initial condition, then we can write the solution as
//   $
//     u(x) = sum_(i = 1)^(n_1) a_i v_i(x)
//     a_i = v_i^* u_0(x)
//   $
//   where $v_i(x)$ is the eigenvector of the Laplacian and $a_i$ is the coefficient of the linear combination.
// ]

For dimension $n = 2$ we have the following stencils

$
  mat(
    , 1,  ;
    1, -4, 1;
    , 1,  ;
    delim: "["
  )
  =
  mat(1, -2, 1; delim: "[")
  +
  mat(1;-2;1; delim: "[")
  \
  #rotate(90deg, $ arrow.squiggly $)
  \
  (D_(x x) + D_(y y)) u(x, y) = (u(x - h_1, y) + u(x + h_1, y) + u(x, y - h_y) + u(x, y + h_y) - 4 u(x, y)) / (h_1^2)
$

In this case the matrix associated with the finite difference operator can be written compactly as the sum of two Kronecker products
$
  upright(bold(D))_(x x) + upright(bold(D))_(y y) =
  upright(bold(I)) times.circle upright(bold(D))_(y y)
  +
  upright(bold(D))_(x x) times.circle upright(bold(I))
$

where $upright(bold(I))$ is the identity matrix. The Kronecker product $upright(bold(I)) times.circle upright(bold(D))_(y y)$ is a block diagonal matrix with copies of $upright(bold(D))_(y y)$ on the diagonal and $upright(bold(I)) times.circle upright(bold(D))_(x x)$ is a block diagonal matrix with copies of $upright(bold(D))_(x x)$ on the diagonal.

This can also be generalized to the case of $n = 3$ and more dimensions. The matrix for the discretized 3D Laplacian can be written as

$
  L = upright(bold(D_(x x))) times.circle upright(bold(I)) times.circle upright(bold(I))
    + upright(bold(I)) times.circle upright(bold(D_(y y))) times.circle upright(bold(I))
    + upright(bold(I)) times.circle upright(bold(I)) times.circle upright(bold(D_(z z)))
$

To compute the eigenvalues and eigenvectors of this matrix we can use the following facts about the Kronecker products.

#proposition[
  The Kronecker product has the following properties

  - The Kronecker product distributes over the matrix product, i.e.
    $
      (A times.circle B) (C times.circle D) = (A C) times.circle (B D)
    $

  - Conjugation commutes with the Kronecker product, i.e.
    $
      (A times.circle B)^* = A^* times.circle B^*
    $

  - Let $A = A_1 times.circle A_2 times.circle dots times.circle A_n$ be the Kronecker product of $n$ matrices. Then, the eigenvalues of $A$ are given by the products of the eigenvalues of the $A_i$
]

#proposition[
  Given two diagonalizable matrices $A$ and $B$ with eigenvalues $lambda_i$ and $mu_j$, respectively the expression

  $
    A times.circle I + I times.circle B
  $

  has eigenvalues $lambda_i + mu_j$ and a matrix of eigenvectors is $V times.circle W$ where $V$ and $W$ are the matrices of eigenvectors of $A$ and $B$, respectively

  $
    #grid(
      columns: 2,
      rows: 2,
      align: center + horizon,
      row-gutter: 1em,
      column-gutter: 2em,
      $
        A = V D_A V^*
      $,
      $
        B = W D_B W^*
      $,
      $
        D_A = mat(
          dots.down, , ;
          , lambda_i, ;
          , , dots.down;
          delim: "["
        )
      $,
      $
        D_B = mat(
          dots.down, , ;
          , mu_j, ;
          , , dots.down;
          delim: "["
        )
      $
    )
  $
]

#proof[
  Let's consider $D_A = V^* D_A V$ and $D_B = W^* D_B W$. Then, by conjugation with $V times.circle W$ we have
  $
    & (V times.circle W)^* (A times.circle I + I times.circle B) (V times.circle W) \
    &= (V^* times.circle W^*) (A times.circle I + I times.circle B) (V times.circle W) \
    &= (V^* times.circle W^*) (A times.circle I) (V times.circle W) + (V^* times.circle W^*) (I times.circle B) (V times.circle W) \
    &= (V^* A V times.circle I) + (I times.circle W^* B W) \
    &= D_A times.circle I + I times.circle D_B
  $
  to conclude let's analyze the block structure of this last term
  $
    D_A times.circle I + I times.circle D_B
    = mat(
      dots.down, , ;
      , lambda_i #rect(stroke: 0.5pt, $I$), ;
      , , dots.down;
      delim: "["
    )
    +
    mat(
      delim: "[",
      dots.down, , ;
      , #rect(stroke: 0.5pt, $D_B$), ;
      , , dots.down;
    )
    = mat(
      dots.down, , ;
      , #rect(stroke: 0.5pt, $lambda_i I + D_B$), ;
      , , dots.down;
      delim: "["
    )
  $
  so we can produce all possible combinations of sums of $lambda_i$ and $mu_j$ so the final eigenvalues are given by
  $
    = mat(
      dots.down, , ;
      , lambda_i + mu_j, ;
      , , dots.down;
      delim: "["
    )
  $
]

This fact can also be generalized to the case of $n$ dimensions. So recalling that the eigenvalues of the 1D Laplacian are given by
$
  lambda_i = -4 cos^2((pi i) / (2(n + 1)))
$
we can write the eigenvalues of the 3D Laplacian as
$
  lambda_(i_1, i_2, i_3) =
  -4 (cos^2((pi i_1) / (2(n_1 + 1))) +
  cos^2((pi i_2) / (2(n_2 + 1))) +
  cos^2((pi i_3) / (2(n_3 + 1))))

$
for $(i_1, i_2, i_3)$ in $[1, dots, n_1]$, $[1, dots, n_2]$, $[1, dots, n_3]$ respectively. This is approximately in the range $[-12, 0]$.

// TODO

// The eigenvector for the 1D Laplacian is given by
// $
//   v_(i; thin j) =
//   sqrt(2 / (n + 1)) sin((pi i j) / (n + 1))
// $
// for $i = 1, dots, n_1$ so the eigenvector for the 3D Laplacian is
// $
//   v_(i_1, i_2, i_3; thin j_1, j_2, j_3) =
//   sqrt(2 / (n_1 + 1)) sin((pi i_1 j_1) / (n_1 + 1))
//   sqrt(2 / (n_2 + 1)) sin((pi i_2 j_2) / (n_2 + 1))
//   sqrt(2 / (n_3 + 1)) sin((pi i_3 j_3) / (n_3 + 1))
// $
// for the eigenvalue with multi-index $(i_1, i_2, i_3)$ and where $(j_1, j_2, j_3)$ is the multi-index for the various components of the eigenvector.

#pagebreak()

= PETSc Implementation

Let's look again at the algorithm for the Arnoldi iteration. There are various operations that can be done in parallel.

#align(center,
  block(
    width: 18em,
    align(left,
      algorithm({
        import algorithmic: *
        Function("Arnoldi-Iteration", args: ($A$, $b$, $ell$), {
          Assign[$q_1$][$b \/ norm(b)_2$]
          For(cond: $k = 2, .., ell$, {
            Assign[$q_k$][$A q_(k-1)$]
            For(cond: $j = 1, ..., k - 1$, {
              Assign[$h_(j,k-1)$][$q_j^* q_k$]
              Assign[$q_k$][$q_k - h_(j,k-1) q_j$]
            })
            Assign[$h_(k,k-1)$][$norm(q_k)_2$]
            Assign[$q_k$][$q_k \/ h_(k,k-1)$]
          })
        })
      })
    )
  )
)

If we store the matrix $A$ and the vector $b$ in a distributed way, we can compute the following operations in parallel

- The normalization of $b$ using the function #link("https://petsc.org/release/manualpages/Vec/VecNormalize/")[`VecNormalize`].

- The matrix-vector product $A q_k$ using #link("https://petsc.org/release/manualpages/Mat/MatMult/")[`MatMult`].

- The inner product $q_j^* q_k$ using #link("https://petsc.org/release/manualpages/Vec/VecDot/")[`VecDot`].

- The operation $q_k - h_(j,k-1) q_j$ can be done in parallel as well. More precisely, this can be done as a _fused multiply-add_ operation using the PETSc #link("https://petsc.org/release/manualpages/Vec/VecAXPY/")[`VecAXPY`] function.

- The norm of $q_k$ can be computed using #link("https://petsc.org/release/manualpages/Vec/VecNorm/")[`VecNorm`].

The final algorithm written in C is the following, we will omit all calls to `PetscCall(...)` and other PETSc related infrastructure for the sake of brevity.

#align(center, box(fill: luma(93%), radius: 5pt, inset: 5pt)[
  #set text(size: 8.5pt)
  #set par(leading: 4pt)
  ```c
  PetscErrorCode ArnoldiIteration(Mat A, Vec b, PetscInt n, PetscInt m, Vec *Q, double *h) {
      Vec q;
      for (PetscInt i = 0; i < n + 1; i++) {
          for (PetscInt j = 0; j < n; j++) {
              h[i * n + j] = 0.0;
          }
      }
      VecDuplicate(b, &q);
      VecCopy(b, q);
      VecNormalize(q, NULL);
      Q[0] = q;
      for (PetscInt k = 1; k < n + 1; k++) {
          Vec v;
          VecDuplicate(b, &v);
          MatMult(A, Q[k - 1], v);
          PetscScalar h_ij;
          for (PetscInt j = 0; j < k; j++) {
              VecDot(Q[j], v, &h_ij);
              h[j * n + k - 1] = h_ij;
              VecAXPY(v, -h_ij, Q[j]);
          }
          VecNorm(v, NORM_2, &h_ij);
          h[k * n + k - 1] = h_ij;
          if (h_ij > eps) {
              VecNormalize(v, NULL);
              Q[k] = v;
          } else {
              break; // Early breakdown
          }
      }
  }
  ```
])

The matrix $A$ is stored in a distributed way using the PETSc #link("https://petsc.org/release/manualpages/Mat/Mat/")[`Mat`] type. The vector $b$ is also stored in a distributed way using the PETSc #link("https://petsc.org/release/manualpages/Vec/Vec/")[`Vec`] type. In the first experiments the matrix $A$ is a 3D Laplacian matrix (symmetric case) and the vector $b$ is the vector of ones.

= Numerical Experiments

For the numerical experiments we set $ell = 25$ as size for the Krylov subspace. We will use the following test problems

-  *Strong scaling* -- Fixed size problem with and varying number of processes. We expect this to start slow and get faster as we increase the number of processes up to a certain point where the overhead of communication will start to deteriorate the computation time.

- *Weak scaling* -- Varying number of processes and varying size of the problem (constant amount of work per node). We expect this to start fast and get slower as we increase the size of the problem as the cost of communication will deteriorate the computation time.

- *SuiteSparse Experiment* -- Varying number of processes on a matrix from the SuiteSparse matrix collection. We chose the matrix `atmosmodd` which has a symmetric pattern of non-zeros but is itself non-symmetric to see how the Hessenberg matrix is affected by the non-symmetry (we get a proper, non tridiagonal hessenberg matrix as with the symmetric Laplacian case).

== Experiment: 3D Laplacian

All the sparse laplacian matrices are generated using the following Julia code, this generates the 3D Laplacian discretization for a grid of side $N$. The matrix is generated sparse using the `spdiagm` function from `SparseArrays`, `kron` is the Kronecker product function from `LinearAlgebra` and `matwrite` is a function from the `MAT` package to write the matrix to a file in the HDF5 format as needed by PETSc.

#align(center, box(fill: luma(93%), radius: 5pt, inset: 5pt)[
  #set text(size: 10pt)
  #set par(leading: 4pt)
  ```jl
  using LinearAlgebra
  using SparseArrays
  using MAT

  N = parse(Int, ARGS[1])

  ex = fill(1, N - 1)
  ey = fill(1, N - 1)
  ez = fill(1, N - 1)

  Dxx = spdiagm(-1 => ex, 0 => -2 * [ex; 1], +1 => ex)
  Dyy = spdiagm(-1 => ey, 0 => -2 * [ey; 1], +1 => ey)
  Dzz = spdiagm(-1 => ez, 0 => -2 * [ez; 1], +1 => ez)

  Ix = spdiagm(0 => [ex; 1])
  Iy = spdiagm(0 => [ey; 1])
  Iz = spdiagm(0 => [ez; 1])

  L = kron(Dxx, Iy, Iz) + kron(Ix, Dyy, Iz) + kron(Ix, Iy, Dzz)

  matwrite("laplacian_$N.mat", Dict("A" => L))
  ```
])

We dropped the factor $1 slash h_i^2$ from the matrix and computed its eigenvalues for a grid of size $N times N times N$. From the previous discussion we can see that the eigenvalues of the 3D Laplacian are given by the following expression, for $i_1, i_2, i_3 = 1, dots, N$
$
  lambda_(i_1, i_2, i_3) =
  - 4 (
    cos^2((pi i_1) / (2(n_1 + 1))) +
    cos^2((pi i_2) / (2(n_2 + 1))) +
    cos^2((pi i_3) / (2(n_3 + 1)))
  )
$

The factor $cos^2(...)$ is bounded in $[0, 1]$ so the eigenvalues are bounded in $[-12, 0]$. First we can check the correctness of the implementation by checking the eigenvalues and eigenvectors returned by our implementation. In @laplacian-hessenberg we can see the resulting Hessenberg matrix while in @laplacian-schur we have the Schur form returned by `dhesqr`.

#figure(
  image("plots/cropped/laplacian-hessenberg.png", width: 75%),
  caption: [
    Resulting Hessenberg matrix for the 3D Laplacian with $N = 20$ and $ell = 25$.
  ]
) <laplacian-hessenberg>

#figure(
  image("plots/cropped/laplacian-schur.png", width: 75%),
  caption: [
    Resulting Schur matrix for the 3D Laplacian with $N = 20$ and $ell = 25$ after calling `dhesqr`
  ]
) <laplacian-schur>

And we get the following eigenvalues.

$
Lambda = {
  && -11.73,
  && -11.43,
  && -11.07,
  && -10.64,
  && -10.13, \
  && -9.55,
  && -8.91,
  && -8.21,
  && -7.47,
  && -6.82, \
  && -6.16,
  && -5.49,
  && -4.81,
  && -4.11,
  && -3.59, \
  && -3.09,
  && -2.64,
  && -2.16,
  && -1.61,
  && -1.12, \
  && -0.91,
  && -0.6,
  && -0.43,
  && -0.24,
  && -0.07
}
$

=== Strong Scaling

First we tried to run the program starting from $N = 20$ and increasing the number of nodes by doubling. As we can see in @laplacian-strong-scaling-3 the performance starts bad but starts to improve as we increase the size of the problem (initially the problem is too small to be able to take advantage of the parallelism and the communication overhead is too high).

#figure(
  image("plots/cropped/laplacian-strong-scaling-3.png", width: 75%),
  caption: [
    Times for varying node count with $N in {20, 40, 80}$, $ell = 25$.
  ]
) <laplacian-strong-scaling-3>

We get our final result in @laplacian-strong-scaling-4 for problem size up to $N = 160$ and we can see that the performance is quite good. The log-log plot in @laplacian-strong-scaling-4-loglog better shows the performance in all the range of problem sizes.

#figure(
  image("plots/cropped/laplacian-strong-scaling-4.png", width: 75%),
  caption: [
    Times for varying node count with $N in {20, 40, 80, 160}$ and $ell = 25$.
  ]
) <laplacian-strong-scaling-4>

#figure(
  image("plots/cropped/laplacian-strong-scaling-4-loglog.png", width: 75%),
  caption: [
    Times for varying node count with $N in {20, 40, 80, 160}$ and $ell = 25$.
  ]
) <laplacian-strong-scaling-4-loglog>

From these results we can also see how the performance stabilizes as we increase the node count, at a certain point adding more nodes does not improve as much the performance.

=== Weak Scaling

For the weak scaling we generated the matrices for $N = ceil(root(3, i times 20000))$ for $i in {1, ..., 25}$ to get a _constant amount of work per node_. We can see in @laplacian-weak-scaling that the performance degrades as we increase the number of nodes. This is expected as the problem size increases and the communication overhead becomes more significant.

#figure(
  image("plots/cropped/laplacian-weak-scaling.png", width: 75%),
  caption: [
    Times with $N = ceil(root(3, i times 20000)) = {28, ..., 80}$ for $i in {1, ..., 25}$ and $ell = 25$.
  ]
) <laplacian-weak-scaling>

== Experiment: `atmosmodd` from SuiteSparse

The matrix `atmosmodd` is a non-symmetric matrix with a symmetric pattern of non-zeros.

[TODO: More info on the matrix]

In @atmosmodd-hessenberg we can see that the resulting Hessenberg matrix looks tridiagonal but when looking at the log-log plot we actually see that the matrix is a proper Hessenberg matrix. The resulting Schur matrix is shown in @atmosmodd-schur and the computed eigenvalues are the following

$
  Lambda = {
    && -187078.67,
    && -181294.17,
    && -173634.23,
    && -167725.76,
    && -160474.49, \
    && -152024.13,
    && -142505.34,
    && -132037.86,
    && -120709.76,
    && -108473.08, \
    && -93768.3,
    && -83850.61,
    && -75095.74,
    && -65634.56,
    && -56074.01, \
    && -46728.52,
    && -37822.94,
    && -29538.87,
    && -22025.98,
    && -15371.79, \
    && -9907.26,
    && -5664.09,
    && -2596.04,
    && -739.49,
    && -60.07
  }
$

#figure(
  image("plots/cropped/suitesparse-atmosmodd-hessenberg.png", width: 75%),
  caption: [
    Resulting Hessenberg matrix for the SuiteSparse matrix `atmosmodd` with $ell = 25$.
  ]
) <atmosmodd-hessenberg>

#figure(
  image("plots/cropped/suitesparse-atmosmodd-schur.png", width: 75%),
  caption: [
    Resulting Schur matrix for the SuiteSparse matrix `atmosmodd` with $ell = 25$ after calling `dhesqr`
  ]
) <atmosmodd-schur>

Finally we can see the same strong scaling performance for this non-symmetric matrix as for symmetric the 3D Laplacian.

#figure(
  image("plots/cropped/suitesparse-atmosmodd.png", width: 75%),
  caption: [
    Times of the Arnoldi method for the SuiteSparse matrix `atmosmodd` with $ell = 25$ with varying number of nodes.
  ]
) <atmosmodd-varying-nodecount>