8.6. MINRES and conjugate gradients#

We have seen before that certain matrix properties enhance solutions to linear algebra problems. One of the most important of these is when \(\mathbf{A}^*=\mathbf{A}\); i.e., \(\mathbf{A}\) is hermitian. The Arnoldi iteration has a particularly useful specialization to this case. While in this section we describe the resulting algorithms, we do not present them in detail or show implementations.

Lanczos iteration#

Starting from (8.4.6), we left-multiply by \(\mathbf{Q}_m^*\) to get

\[\mathbf{Q}_m^* \mathbf{A} \mathbf{Q}_m = \mathbf{Q}_m^* \mathbf{Q}_{m+1} \mathbf{H}_m = \tilde{\mathbf{H}}_m,\]

where \(\tilde{\mathbf{H}}_m\) is rows 1 through \(m\) of \(\mathbf{H}_m\). If \(\mathbf{A}\) is hermitian, then so is the left side of this equation, hence \(\tilde{\mathbf{H}}_m\) is hermitian too. But it is also upper Hessenberg, meaning that the \((i,j)\) element is zero if \(i > j+1\). By symmetry, this means that elements are zero when \(j > i+1\) as well.

Observation 8.6.1

For a hermitian (or real symmetric) matrix, the upper Hessenberg matrix \(\mathbf{H}_m\) produced by the Arnoldi iteration is tridiagonal.
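
This is easy to check numerically. The sketch below assumes that the arnoldi function from Section 8.4 is available through the FNC package with the call signature used there (an assumption, not something defined in this section):

# Assumes FNC and LinearAlgebra are loaded, as elsewhere in this chapter, and that
# FNC.arnoldi(A, u, m) returns the orthonormal basis Q and the (m+1) × m
# upper Hessenberg matrix H, as in Section 8.4.
A = FNC.poisson(10)                    # real symmetric 100 × 100 matrix
Q, H = FNC.arnoldi(A, ones(100), 20)
@show norm(triu(H, 2));                # entries above the superdiagonal are ≈ 0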

Equation (8.4.3) of the Arnoldi iteration now simplifies to a much shorter expression:

(8.6.1)#\[\mathbf{A} \mathbf{q}_m = H_{m-1,m} \,\mathbf{q}_{m-1} + H_{mm} \,\mathbf{q}_m + H_{m+1,m}\,\mathbf{q}_{m+1}.\]

As before in deriving the Arnoldi iteration, when given the first \(m\) vectors we can solve for the entries in column \(m\) of \(\mathbf{H}\) and then for \(\mathbf{q}_{m+1}\). The resulting process is known as the Lanczos iteration. Its most important practical advantage is that while Arnoldi needs \(O(m)\) steps to get \(\mathbf{q}_{m+1}\) from the previous vectors, Lanczos needs only \(O(1)\) steps, so restarting isn’t required for symmetric matrices.1
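
The following is a bare-bones sketch of the resulting recurrence for a real symmetric matrix; the function name lanczos and its details are ours, not part of the FNC package, and a robust implementation needs additional safeguards (see the footnote).

using LinearAlgebra

# Minimal Lanczos sketch for real symmetric A: build an orthonormal basis of the
# Krylov subspace and the tridiagonal matrix of recurrence coefficients.
# No reorthogonalization is performed, so this is illustrative rather than robust.
function lanczos(A, b, m)
    n = length(b)
    Q = zeros(n, m+1)
    α = zeros(m)                     # diagonal entries H[k,k]
    β = zeros(m)                     # subdiagonal entries H[k+1,k]
    Q[:, 1] = b / norm(b)
    for k in 1:m
        v = A * Q[:, k]
        if k > 1
            v -= β[k-1] * Q[:, k-1]  # subtract the H[k-1,k] term
        end
        α[k] = dot(Q[:, k], v)
        v -= α[k] * Q[:, k]          # subtract the H[k,k] term
        β[k] = norm(v)
        Q[:, k+1] = v / β[k]
    end
    return Q, SymTridiagonal(α, β[1:m-1])
end

For a symmetric test matrix, Q[:,1:m]'*A*Q[:,1:m] should reproduce the returned tridiagonal matrix up to roundoff, in accordance with Observation 8.6.1.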

MINRES#

When \(\mathbf{A}\) is hermitian and the Arnoldi iteration is reduced to Lanczos, the analog of GMRES is known as MINRES. Like GMRES, MINRES minimizes the residual \(\|\mathbf{b}-\mathbf{A}\mathbf{x}\|\) over increasingly larger Krylov spaces.

MINRES is also more theoretically tractable than GMRES. The following result relies on some advanced approximation theory. Recall that the eigenvalues of a hermitian matrix are real.

Theorem 8.6.2 :  Convergence of MINRES (indefinite case)

Suppose \(\mathbf{A}\) is hermitian, invertible, and indefinite. Divide its eigenvalues into positive and negative sets \(\Lambda_+\) and \(\Lambda_-\), and define

\[ \kappa_+ = \frac{ \max_{\lambda \in \Lambda_+} |\lambda| }{ \min_{\lambda \in \Lambda_+} |\lambda| }, \qquad \kappa_- = \frac{ \max_{\lambda \in \Lambda_-} |\lambda| }{ \min_{\lambda \in \Lambda_-} |\lambda| }. \]

Then \(\mathbf{x}_m\), the \(m\)th solution estimate of MINRES, satisfies

(8.6.2)#\[\frac{\|\mathbf{r}_m\|_2}{\|\mathbf{b}\|_2} \le \left( \frac{\sqrt{\kappa_+\kappa_-} - 1}{\sqrt{\kappa_+\kappa_-} + 1} \right)^{\lfloor m/2\rfloor},\]

where \(\lfloor m/2\rfloor\) means to round \(m/2\) down to the nearest integer.

The upper bound (8.6.2) on the residual describes linear convergence. As the product \(\kappa_+\kappa_-\) grows, the convergence rate approaches 1, so eigenvalues close to the origin (relative to the largest eigenvalues) are expected to slow the convergence. The bound for a definite matrix, given below in Theorem 8.6.6, is better.

Example 8.6.3

Suppose \(\mathbf{A}\) has \(\kappa_+=60\) and \(\kappa_-=15\). Then to achieve a guaranteed reduction in the relative residual of \(10^{-3}\), we require

\[ \left( \frac{\sqrt{900} - 1}{\sqrt{900} + 1} \right)^{\lfloor m/2\rfloor} \le 10^{-3}, \]
\[ {\lfloor m/2\rfloor} \log_{10} \left( \frac{29}{31} \right) \le -3, \]
\[ m \ge 2 \left\lceil \frac{3}{\log_{10}(31/29)} \right\rceil = 208. \]

Because the theorem gives an upper bound, MINRES may converge faster. All we can say is that 208 is certain to be enough iterations.
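
The arithmetic in this example is easy to reproduce; the following sketch simply evaluates the bound numerically.

κ₊, κ₋ = 60, 15
ρ = (sqrt(κ₊*κ₋) - 1) / (sqrt(κ₊*κ₋) + 1)    # equals 29/31
m = 2 * ceil(Int, -3 / log10(ρ))             # guaranteed-sufficient m from the bound
@show ρ, m;                                  # m = 208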

Demo 8.6.4

The following matrix is indefinite.

A = FNC.poisson(10) - 20I
λ = eigvals(Matrix(A))
isneg = @. λ < 0
@show sum(isneg),sum(.!isneg);
(sum(isneg), sum(.!(isneg))) = (13, 87)

We can compute the relevant quantities from Theorem 8.6.2.

mn,mx = extrema(-λ[isneg])
κ₋ = mx/mn
mn,mx = extrema(λ[.!isneg])
κ₊ = mx/mn
ρ = (sqrt(κ₋*κ₊)-1) / (sqrt(κ₋*κ₊)+1)
0.9026418585584018

Because the iteration number \(m\) is halved in (8.6.2), the rate constant of linear convergence is the square root of this number, which makes it even closer to 1.

Now we apply MINRES to a linear system with this matrix, and compare the observed convergence to the upper bound from the theorem.

b = rand(100)
x,hist = minres(A,b,reltol=1e-10,maxiter=51,log=true);

relres = hist[:resnorm] / norm(b)
m = 0:length(relres)-1
plot(m,relres,label="observed",leg=:left,
	xaxis=L"m",yaxis=(:log10,"relative residual"),
	title=("Convergence of MINRES") )
plot!(m,ρ.^(m/2),l=:dash,label="upper bound")

The upper bound turns out to be pessimistic here, especially in the later iterations. However, you can see a slow linear phase in the convergence that tracks the bound closely.

Conjugate gradients#

Given positive definiteness in addition to symmetry, we arrive at perhaps the most famous Krylov subspace method for \(\mathbf{A}\mathbf{x}=\mathbf{b}\), called conjugate gradients.

Suppose now that \(\mathbf{A}\) is hermitian and positive definite (HPD). Then \(\mathbf{A}\) has a Cholesky factorization, which in the complex case is \(\mathbf{A}=\mathbf{R}^*\mathbf{R}\). Therefore, for any vector \(\mathbf{u}\),

\[ \mathbf{u}^*\mathbf{A}\mathbf{u} = (\mathbf{R}\mathbf{u})^*(\mathbf{R}\mathbf{u})=\|\mathbf{R} \mathbf{u}\|^2, \]

which is nonnegative and zero only when \(\mathbf{u}=\boldsymbol{0}\), provided \(\mathbf{A}\) (and therefore \(\mathbf{R}\)) is nonsingular. Hence we can define a special vector norm relative to \(\mathbf{A}\):

(8.6.3)#\[\| \mathbf{u} \|_{\mathbf{A}} = \left( \mathbf{u}^*\mathbf{A}\mathbf{u} \right)^{1/2}.\]

Definition 8.6.5 :  Method of conjugate gradients (CG)

For each \(m=1,2,3,\ldots\), minimize \(\|\mathbf{x}_m-\mathbf{x}\|_{\mathbf{A}}\) over all \(\mathbf{x}_m\) in the Krylov subspace \(\mathcal{K}_m\), where \(\mathbf{x}\) is the exact solution.
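
As a small illustration of the norm in (8.6.3), the following sketch (the function name anorm is chosen just for this example) checks the equivalence with the Cholesky factor numerically.

using LinearAlgebra

# A-norm of u for an SPD matrix A, computed directly from the definition (8.6.3).
anorm(A, u) = sqrt(dot(u, A * u))

n = 6
B = randn(n, n)
A = B' * B + n * I               # SPD by construction
u = randn(n)
R = cholesky(Symmetric(A)).U     # A = R' * R
@show anorm(A, u) ≈ norm(R * u);  # true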

Convergence#

The convergence of CG and MINRES is dependent on the eigenvalues of \(\mathbf{A}\). In the HPD case the eigenvalues are real and positive, and they equal the singular values. Hence the condition number \(\kappa\) is equal to the ratio of the largest eigenvalue to the smallest one. The following theorem suggests that MINRES and CG are not so different in convergence.

Theorem 8.6.6 :  MINRES and CG convergence (definite case)

Let \(\mathbf{A}\) be real and SPD with 2-norm condition number \(\kappa\). For MINRES define \(R(m)=\|\mathbf{r}_m\|_2/\|\mathbf{b}\|_2\), and for CG define \(R(m)=\|\mathbf{x}_m-\mathbf{x}\|_{\mathbf{A}}/\|\mathbf{x}\|_{\mathbf{A}}\), where \(\mathbf{r}_m\) and \(\mathbf{x}_m\) are the residual and solution approximation associated with the space \(\mathcal{K}_m\). Then

(8.6.4)#\[R(m) \le 2\, \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^m.\]

Theorem 8.6.6 characterizes the convergence of MINRES and CG similarly, differing only in whether the measurement is of the residual or the \(\mathbf{A}\)-norm of the error, respectively. While these are different quantities, in practice one may not find a consistent advantage for one method over the other.

As in the indefinite case with MINRES, a larger condition number is associated with slower convergence in the positive definite case. Specifically, to make the bound in (8.6.4) less than a number \(\epsilon\) requires

\[\begin{gather*} 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^m \approx \epsilon, \\ m \log \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right) \approx \log\Bigl( \frac{\epsilon}{2} \Bigr). \end{gather*}\]

We estimate

\[\begin{align*} \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} &= (1 - \kappa^{-1/2}\,) (1 + \kappa^{-1/2}\,)^{-1}\\ &= (1 - \kappa^{-1/2}\,) (1 - \kappa^{-1/2} + \kappa^{-1} + \cdots)\\ &= 1 - 2\kappa^{-1/2} + O(\kappa^{-1}) \quad \text{ as $\kappa \rightarrow \infty$.} \end{align*}\]

With the Taylor expansion \(\log(1+x) = x - (x^2/2) + \cdots\), we finally conclude

\[\begin{gather*} -2 m \kappa^{-1/2} \approx \log\Bigl( \frac{\epsilon}{2} \Bigr), \text{ or } m = O(\sqrt{\kappa}), \end{gather*}\]

as an estimate of the number of iterations needed to achieve a fixed accuracy.

Observation 8.6.7

As a rule of thumb, the number of iterations required for MINRES or CG to converge is \(O(\sqrt{\kappa})\), where \(\kappa\) is the condition number.

This estimate fails for very large \(\kappa\), however.
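
The scaling can be seen directly from the bound (8.6.4). The sketch below computes, for several values of κ, the smallest m that makes the bound fall below a fixed tolerance; each quadrupling of κ roughly doubles the count, consistent with m = O(√κ).

# Smallest m such that 2*((√κ-1)/(√κ+1))^m ≤ ϵ, from the bound (8.6.4).
iters(κ, ϵ) = ceil(Int, log(ϵ/2) / log((sqrt(κ) - 1) / (sqrt(κ) + 1)))
for κ in [100, 400, 1600, 6400]
    println("κ = $κ:  predicted m = ", iters(κ, 1e-6))
end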

Demo 8.6.8

We will compare MINRES and CG on some quasi-random SPD problems. The first matrix has a condition number of 100.

n = 5000
density = 0.001
A = FNC.sprandsym(n,density,1/100)
x = (1:n)/n
b = A*x;

Now we apply both methods and compare the convergence of the system residuals, using implementations imported from IterativeSolvers.

plt = plot(title="Convergence of MINRES and CG",
    xaxis=("Krylov dimension"),yaxis=(:log10,"relative residual norm"))
for method in [minres,cg]
    x̃,history = method(A,b,reltol=1e-6,maxiter=1000,log=true);
    relres = history[:resnorm] / norm(b)
    plot!(0:length(relres)-1,relres,label="$method")
    err = round( norm( x̃ - x ) / norm(x), sigdigits=4 )
    println("$method error: $err")
end
plt
minres error: 1.203e-5
cg error: 4.144e-6

There is little difference between the two methods here. Next, we increase the condition number of the matrix by a factor of 25. The rule of thumb predicts that the number of iterations required should increase by a factor of about 5.

A = FNC.sprandsym(n,density,1/2500)
b = A*x;
plt = plot(title="Convergence of MINRES and CG",
    xaxis=("Krylov dimension"),yaxis=(:log10,"relative residual norm"))
for method in [minres,cg]
    x̃,history = method(A,b,reltol=1e-6,maxiter=1000,log=true);
    relres = history[:resnorm] / norm(b)
    plot!(0:length(relres)-1,relres,label="$method")
    err = round( norm( x̃ - x ) / norm(x), sigdigits=4 )
    println("$method error: $err")
end
plt
minres error: 0.0002484
cg error: 4.083e-5

Both methods have an early superlinear phase that allows them to finish slightly sooner than the factor of 5 predicted: Theorem 8.6.6 gives an upper bound, not necessarily a close approximation. Both methods ultimately achieve the same reduction in the residual; MINRES stops earlier, but with a slightly larger error.

Exercises#

  1. ✍ For each part, the eigenvalues of \(\mathbf{A}\) are given. Suppose MINRES is applied to solve \(\mathbf{A}\mathbf{x}=\mathbf{b}\). Use (8.6.2) or (8.6.4), whichever is most appropriate, to determine a lower bound on \(m\) to guarantee reduction of the residual norm by a factor \(10^{-4}\).

    (a) \(-100,-99,\ldots,-1,1,2,\ldots,100\)

    (b) \(-100,1,2,\ldots,100\)

    (c) \(1,2,\ldots,100\)

  2. ⌨ Let \(\mathbf{b}\) be a random unit vector of length 200. Define the matrix

    u = range(-200,-5,length=100); 
    v = range(10,100,length=100);
    A = diagm([u;v]);
    

    (a) Apply 120 iterations of minres to solve \(\mathbf{A}\mathbf{x}=\mathbf{b}\). Compute the relative error of the answer, and plot the norm of the residual as a function of \(m\) on a log-linear scale.

    (b) Add to your graph the line representing the upper bound (8.6.2). (Ignore the rounding in the exponent.) This line should stay strictly on or above the convergence curve.

  3. ⌨ Let \(\mathbf{b}\) be a random unit vector of length 500. Define the matrix

    A = spdiagm(range(4,10000,length=500));
    

    (a) Apply 100 iterations of minres to solve \(\mathbf{A}\mathbf{x}=\mathbf{b}\). Compute the relative error of the answer. Plot the norm of the residual as a function of \(m\).

    (b) Add to your graph the line representing the upper bound (8.6.4). This line should stay strictly on or above the convergence curve.

    (c) Add a convergence curve for 100 iterations of cg.

  4. ✍ Suppose a family of SPD matrices \(\mathbf{A}\) is parameterized by \(t\), and that the condition numbers of the matrices scale like \(O(t^2)\) as \(t\to\infty\). Given that CG takes 60 iterations to reach a certain reduction in the error of a linear system when \(t=200\), estimate the number of iterations CG will need to reach the same accuracy at \(t=300\).

  5. ✍ Given real \(n\times n\) symmetric \(\mathbf{A}\) and vector \(\mathbf{b}=\mathbf{A}\mathbf{x}\), we can define the scalar-valued function

    \[\varphi(\mathbf{u}) = \mathbf{u}^T \mathbf{A} \mathbf{u} - 2 \mathbf{u}^T \mathbf{b}, \qquad \mathbf{u}\in\mathbb{R}^n.\]

    (a) Expand and simplify the expression \(\varphi(\mathbf{x}+\mathbf{v})-\varphi(\mathbf{x})\), keeping in mind that \(\mathbf{A}\mathbf{x}=\mathbf{b}\).

    (b) Using the result of part (a), prove that if \(\mathbf{A}\) is an SPD matrix, \(\varphi\) has a global minimum at \(\mathbf{x}\).

    (c) Show that for any vector \(\mathbf{u}\), \(\|\mathbf{u}-\mathbf{x}\|_{\mathbf{A}}^2-\varphi(\mathbf{u})\) is constant.

    (d) Using the result of part (c), prove that CG minimizes \(\varphi(\mathbf{u})\) over Krylov subspaces.

  6. ⌨ The following linear system arises from the Helmholtz equation for wave propagation:

    A = FNC.poisson(n) - k^2*I
    b = -ones(n^2)
    

    (a) Apply both MINRES and CG to the linear system for \(n=50\) and \(k=1.3\), solving to a relative residual tolerance of \(10^{-5}\). Plot their convergence curves together.

    (b) Repeat part (a) for \(k=8\).

    (c) Use eigs on the matrix from part (b) to show that it is indefinite. (Hint: Use which=:SR and which=:LR to get the smallest real and largest real values.) This helps explain why the CG convergence curve for this matrix looks rather strange.


1

In principle, the implementation of the Lanczos iteration is a minor change from Arnoldi, but numerical stability requires some extra analysis and effort. We do not present the details.