7.5. Dimension reduction#

The SVD has another important property that proves very useful in a variety of applications. Let \(\mathbf{A}\) be a real \(m\times n\) matrix with SVD \(\mathbf{A}=\mathbf{U}\mathbf{S}\mathbf{V}^T\) and (momentarily) \(m\ge n\). Another way of writing the thin form of the SVD is

(7.5.1)#\[\begin{split}\begin{split} \mathbf{A} = \hat{\mathbf{U}}\hat{\mathbf{S}}\mathbf{V}^T &= \begin{bmatrix} \rule[-0.3em]{0pt}{1em} \mathbf{u}_1 & \mathbf{u}_2 & \cdots & \mathbf{u}_n \end{bmatrix} \: \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_n \end{bmatrix} \: \begin{bmatrix} \mathbf{v}_1^T \\ \vdots \\ \mathbf{v}_n^T \end{bmatrix}\ \\ &= \begin{bmatrix} \rule[-0.3em]{0pt}{1em} \sigma_1\mathbf{u}_1 & \cdots & \sigma_n\mathbf{u}_n \end{bmatrix}\: \begin{bmatrix} \mathbf{v}_1^T \\ \vdots \\ \mathbf{v}_n^T \end{bmatrix} \\ &= \sigma_1 \mathbf{u}_{1}\mathbf{v}_{1}^T + \cdots + \sigma_r \mathbf{u}_{r}\mathbf{v}_{r}^T = \sum_{i=1}^r \sigma_i \mathbf{u}_{i}\mathbf{v}_{i}^T, \end{split}\end{split}\]

where \(r\) is the rank of \(\mathbf{A}\). The final formula also holds for the case \(m<n\).

Each outer product \(\mathbf{u}_{i}\mathbf{v}_{i}^T\) is a rank-1 matrix of unit 2-norm. Thanks to the ordering of singular values, then, Equation (7.5.1) expresses \(\mathbf{A}\) as a sum of decreasingly important contributions. This motivates the definition, for \(1\le k \le r\),

(7.5.2)#\[\mathbf{A}_k = \sum_{i=1}^k \sigma_i \mathbf{u}_{i}\mathbf{v}_{i}^T = \mathbf{U}_k \mathbf{S}_k \mathbf{V}_k^T,\]

where \(\mathbf{U}_k\) and \(\mathbf{V}_k\) are the first \(k\) columns of \(\mathbf{U}\) and \(\mathbf{V}\), respectively, and \(\mathbf{S}_k\) is the upper-left \(k\times k\) submatrix of \(\mathbf{S}\).

The rank of a sum of matrices is always less than or equal to the sum of the ranks, so \(\mathbf{A}_k\) is a rank-\(k\) approximation to \(\mathbf{A}\). It turns out that \(\mathbf{A}_k\) is the best rank-\(k\) approximation of \(\mathbf{A}\), as measured in the matrix 2-norm.

Theorem 7.5.1

Suppose \(\mathbf{A}\) has rank \(r\) and let \(\mathbf{A}=\mathbf{U}\mathbf{S}\mathbf{V}^T\) be an SVD. Let \(\mathbf{A}_k\) be as in (7.5.2) for \(1\le k < r\). Then

  1. \(\| \mathbf{A} - \mathbf{A}_k \|_2 = \sigma_{k+1}, \quad k=1,\ldots,r-1\), and

  2. If the rank of \(\mathbf{B}\) is \(k\) or less, then \(\| \mathbf{A}-\mathbf{B} \|_2\ge \sigma_{k+1}\).


(part 1 only) Note that (7.5.2) is identical to (7.5.1) with \(\sigma_{k+1},\ldots,\sigma_r\) all set to zero. This implies that

\[\mathbf{A} - \mathbf{A}_k = \mathbf{U}(\mathbf{S}-\hat{\mathbf{S}})\mathbf{V}^T,\]

where \(\hat{\mathbf{S}}\) has those same values of \(\sigma_i\) replaced by zero. But that makes the above an SVD of \(\mathbf{A} - \mathbf{A}_k\), with singular values \(0,\ldots,0,\sigma_{k+1},\ldots,\sigma_r\), the largest of which is \(\sigma_{k+1}\). That proves the first claim.


If the singular values of \(\mathbf{A}\) decrease sufficiently rapidly, then \(\mathbf{A}_{k}\) may capture the most significant behavior of the matrix for a reasonably small value of \(k\).

Demo 7.5.2

We make an image from some text, then reload it as a matrix.

plot(annotations=(0.5,0.5,text("Hello world",44,:center,:middle)),
img = load("hello.png")
A = @. Float64(Gray(img))

Next we show that the singular values decrease until they reach zero (more precisely, until they are about \(\epsilon_\text{mach}\) times the norm of the matrix) at around \(k=45\).

U,σ,V = svd(A)
scatter(σ,xaxis=(L"i"), yaxis=(:log10,L"\sigma_i"),
    title="Singular values")
Singular values

The rapid decrease suggests that we can get fairly good low-rank approximations.

plt = plot(layout=(2,2),frame=:none,aspect_ratio=1,titlefontsize=10)
for i in 1:4
    k = 3i
    Ak = U[:,1:k]*diagm(σ[1:k])*V[:,1:k]'
    plot!(Gray.(Ak),subplot=i,title="rank = $k")
rank = 3 rank = 6 rank = 9 rank = 12

Consider how little data is needed to reconstruct these images. For rank-9, for instance, we have 9 left and right singular vectors plus 9 singular values, for a compression ratio of better than 12:1.

m,n = size(A)
compression = m*n / (9*(m+n+1))


  1. ✍ Suppose that \(\mathbf{A}\) is an \(n\times n\) matrix. Explain why \(\sigma_n\) is the distance (in 2-norm) from \(\mathbf{A}\) to the set of all singular matrices.

  2. ✍ Suppose \(\mathbf{A}\) is a \(7\times 4\) matrix and the eigenvalues of \(\mathbf{A}^*\mathbf{A}\) are 3, 4, 7, and 10. How close is \(\mathbf{A}\) in the 2-norm to (a) a rank-3 matrix? (b) a rank-2 matrix?

  3. (a) ⌨ Find the rank-1 matrix closest to

    \[\begin{split} \mathbf{A}=\displaystyle \begin{bmatrix} 1 & 5 \\ 5 & 1 \end{bmatrix}, \end{split}\]

    as measured in the 2-norm.

    (b) ⌨ Repeat part (a) for

    \[\begin{split} \mathbf{A}=\displaystyle \begin{bmatrix} 1 & 5 \\ 0 & 1 \end{bmatrix}. \end{split}\]
  4. ✍ Find the rank-1 matrix closest to

    \[\begin{split} \mathbf{A}=\displaystyle \begin{bmatrix} 1 & b \\ b & 1 \end{bmatrix}, \end{split}\]

    as measured in the 2-norm, where \(b>0\).

  5. ⌨ Following Demo 7.5.2 as a guide, load the “mandrill” test image and convert it to a matrix of floating-point pixel grayscale intensities. Using the SVD, display as images the best approximations of rank 5, 10, 15, and 20.


In statistics this quantity may be interpreted as the fraction of explained variance.