7.5. Dimension reduction#

The SVD has another important property that proves very useful in a variety of applications. Let \(\mathbf{A}\) be a real \(m\times n\) matrix with SVD \(\mathbf{A}=\mathbf{U}\mathbf{S}\mathbf{V}^T\) and (momentarily) \(m\ge n\). Another way of writing the thin form of the SVD is

(7.5.1)#\[\begin{split} \mathbf{A} = \hat{\mathbf{U}}\hat{\mathbf{S}}\mathbf{V}^T &= \begin{bmatrix} \mathbf{u}_1 & \mathbf{u}_2 & \cdots & \mathbf{u}_n \end{bmatrix} \: \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_n \end{bmatrix} \: \begin{bmatrix} \mathbf{v}_1^T \\ \vdots \\ \mathbf{v}_n^T \end{bmatrix} \\ &= \begin{bmatrix} \sigma_1\mathbf{u}_1 & \cdots & \sigma_n\mathbf{u}_n \end{bmatrix}\: \begin{bmatrix} \mathbf{v}_1^T \\ \vdots \\ \mathbf{v}_n^T \end{bmatrix} \\ &= \sigma_1 \mathbf{u}_{1}\mathbf{v}_{1}^T + \cdots + \sigma_r \mathbf{u}_{r}\mathbf{v}_{r}^T = \sum_{i=1}^r \sigma_i \mathbf{u}_{i}\mathbf{v}_{i}^T, \end{split}\]

where \(r\) is the rank of \(\mathbf{A}\). The final formula also holds for the case \(m<n\).
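
The outer-product form is easy to confirm numerically; here is a minimal sketch on a small random matrix (the matrix is an arbitrary stand-in, not taken from the demos below).

using LinearAlgebra
A = randn(6,4)
U,σ,V = svd(A)
r = rank(A)
sum( σ[i]*U[:,i]*V[:,i]' for i in 1:r ) ≈ A    # the rank-1 terms in (7.5.1) reconstruct A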

Each outer product \(\mathbf{u}_{i}\mathbf{v}_{i}^T\) is a rank-1 matrix of unit 2-norm. Because the singular values are ordered from largest to smallest, Equation (7.5.1) expresses \(\mathbf{A}\) as a sum of contributions of decreasing importance. This motivates the definition, for \(1\le k \le r\),

(7.5.2)#\[\mathbf{A}_k = \sum_{i=1}^k \sigma_i \mathbf{u}_{i}\mathbf{v}_{i}^T = \mathbf{U}_k \mathbf{S}_k \mathbf{V}_k^T,\]

where \(\mathbf{U}_k\) and \(\mathbf{V}_k\) are the first \(k\) columns of \(\mathbf{U}\) and \(\mathbf{V}\), respectively, and \(\mathbf{S}_k\) is the upper-left \(k\times k\) submatrix of \(\mathbf{S}\).
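
In code, the truncated factors are just the leading columns of \(\mathbf{U}\) and \(\mathbf{V}\). Here is a minimal sketch, with an arbitrary random matrix and an arbitrary choice of \(k\) (none of these values come from the demo below).

using LinearAlgebra
A = randn(8,5)
U,σ,V = svd(A)
opnorm( U[:,1]*V[:,1]' )                     # each outer product uᵢvᵢᵀ has unit 2-norm
k = 2
Ak = U[:,1:k] * diagm(σ[1:k]) * V[:,1:k]'    # Uₖ Sₖ Vₖᵀ from (7.5.2)
rank(Ak)                                     # equals k, since σ₁,…,σₖ > 0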

The rank of a sum of matrices is never greater than the sum of the ranks, so \(\mathbf{A}_k\) has rank at most \(k\); in fact its rank is exactly \(k\), because \(\sigma_1,\ldots,\sigma_k\) are all positive. It turns out that \(\mathbf{A}_k\) is the best rank-\(k\) approximation of \(\mathbf{A}\), as measured in the matrix 2-norm.

Theorem 7.5.1

Suppose \(\mathbf{A}\) has rank \(r\) and let \(\mathbf{A}=\mathbf{U}\mathbf{S}\mathbf{V}^T\) be an SVD. Let \(\mathbf{A}_k\) be as in (7.5.2) for \(1\le k < r\). Then

  1. \(\| \mathbf{A} - \mathbf{A}_k \|_2 = \sigma_{k+1}, \quad k=1,\ldots,r-1\), and

  2. If the rank of \(\mathbf{B}\) is \(k\) or less, then \(\| \mathbf{A}-\mathbf{B} \|_2\ge \sigma_{k+1}\).

Proof

(part 1 only) Note that (7.5.2) is identical to (7.5.1) with \(\sigma_{k+1},\ldots,\sigma_r\) all set to zero. This implies that

\[\mathbf{A} - \mathbf{A}_k = \mathbf{U}(\mathbf{S}-\tilde{\mathbf{S}})\mathbf{V}^T,\]

where \(\tilde{\mathbf{S}}\) is \(\mathbf{S}\) with \(\sigma_{k+1},\ldots,\sigma_r\) replaced by zero, so that \(\mathbf{A}_k=\mathbf{U}\tilde{\mathbf{S}}\mathbf{V}^T\). But that makes the right-hand side an SVD of \(\mathbf{A} - \mathbf{A}_k\), whose singular values are \(\sigma_{k+1},\ldots,\sigma_r\) together with \(k\) zeros; the largest of these is \(\sigma_{k+1}\). That proves the first claim.
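
Part 1 of the theorem is easy to check numerically. A minimal sketch on an arbitrary random matrix (the matrix and the value of \(k\) here are illustrative placeholders):

using LinearAlgebra
A = randn(40,30)
U,σ,V = svd(A)
k = 10
Ak = U[:,1:k] * diagm(σ[1:k]) * V[:,1:k]'    # truncated SVD as in (7.5.2)
opnorm(A - Ak) ≈ σ[k+1]                      # true, up to roundoff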

Compression#

If the singular values of \(\mathbf{A}\) decrease sufficiently rapidly, then \(\mathbf{A}_{k}\) may capture the most significant behavior of the matrix for a reasonably small value of \(k\).

Demo 7.5.2

We make an image from some text, then reload it as a matrix.

using Plots, Images
plot(annotations=(0.5,0.5,text("Hello world",44,:center,:middle)),
    grid=:none,frame=:none,size=(400,150))
savefig("hello.png")
img = load("hello.png")
A = @. Float64(Gray(img))    # matrix of grayscale intensities
Gray.(A)

Next we show that the singular values decrease until they reach zero (more precisely, until they are about \(\epsilon_\text{mach}\) times the norm of the matrix) at around \(k=45\).

using LinearAlgebra
U,σ,V = svd(A)
scatter(σ,xaxis=(L"i"), yaxis=(:log10,L"\sigma_i"),
    title="Singular values")
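
One way to quantify where the values hit that floor, using the \(\sigma\) computed above, is to count the singular values that remain above roughly \(\epsilon_\text{mach}\) times the largest one:

count( σ .> eps()*σ[1] )    # numerical rank estimate; roughly 45 for this image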

The rapid decrease suggests that we can get fairly good low-rank approximations.

plt = plot(layout=(2,2),frame=:none,aspect_ratio=1,titlefontsize=10)
for i in 1:4
    k = 3i
    Ak = U[:,1:k]*diagm(σ[1:k])*V[:,1:k]'    # best rank-k approximation from (7.5.2)
    plot!(Gray.(Ak),subplot=i,title="rank = $k")
end
plt

Consider how little data is needed to reconstruct these images. For the rank-9 approximation, for instance, we need only 9 left singular vectors, 9 right singular vectors, and 9 singular values, for a compression ratio of better than 12:1.

m,n = size(A)
compression = m*n / (9*(m+n+1))
12.099213551119178
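
More generally, storing the rank-\(k\) factors requires \(k(m+n+1)\) numbers instead of \(mn\), so the same ratio follows for any \(k\); here is a small hypothetical helper (reusing \(m\) and \(n\) from above):

compression_ratio(m,n,k) = m*n / (k*(m+n+1))    # storage ratio mn : k(m+n+1)
compression_ratio(m,n,9)                        # reproduces the rank-9 ratio above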

Exercises#

  1. ✍ Suppose that \(\mathbf{A}\) is an \(n\times n\) matrix. Explain why \(\sigma_n\) is the distance (in 2-norm) from \(\mathbf{A}\) to the set of all singular matrices.

  2. ✍ Suppose \(\mathbf{A}\) is a \(7\times 4\) matrix and the eigenvalues of \(\mathbf{A}^*\mathbf{A}\) are 3, 4, 7, and 10. How close is \(\mathbf{A}\) in the 2-norm to (a) a rank-3 matrix? (b) a rank-2 matrix?

  3. (a) ⌨ Find the rank-1 matrix closest to

    \[\begin{split} \mathbf{A}=\displaystyle \begin{bmatrix} 1 & 5 \\ 5 & 1 \end{bmatrix}, \end{split}\]

    as measured in the 2-norm.

    (b) ⌨ Repeat part (a) for

    \[\begin{split} \mathbf{A}=\displaystyle \begin{bmatrix} 1 & 5 \\ 0 & 1 \end{bmatrix}. \end{split}\]

  4. ✍ Find the rank-1 matrix closest to

    \[\begin{split} \mathbf{A}=\displaystyle \begin{bmatrix} 1 & b \\ b & 1 \end{bmatrix}, \end{split}\]

    as measured in the 2-norm, where \(b>0\).

  5. ⌨ Following Demo 7.5.2 as a guide, load the “mandrill” test image and convert it to a matrix of floating-point pixel grayscale intensities. Using the SVD, display as images the best approximations of rank 5, 10, 15, and 20.

