Visual intuition behind SVD

Nikolay Khramov
Jun 20, 2020

SVD is arguably one of the most popular matrix factorization approaches.

It represents any matrix A of size (m × n) as a product of 3 matrices: UΣVᵀ, where:

  • U is an (m × m) orthogonal matrix
  • Σ is an (m × n) diagonal matrix of singular values
  • V is an (n × n) orthogonal matrix
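
As a minimal sketch of these shapes (numpy, with an arbitrary example matrix M of my own), we can verify them directly:

import numpy as np

M = np.random.rand(5, 3)               # an example (m × n) matrix with m=5, n=3
U, s, Vt = np.linalg.svd(M)            # full_matrices=True by default

print(U.shape)                         # (5, 5), orthogonal
print(s.shape)                         # (3,), the singular values (the diagonal of Σ)
print(Vt.shape)                        # (3, 3), this is Vᵀ, and V is orthogonal

Sigma = np.zeros((5, 3))               # rebuild the (m × n) diagonal Σ
np.fill_diagonal(Sigma, s)
print(np.allclose(U @ Sigma @ Vt, M))  # True: M = UΣVᵀ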

For me it was always easy to believe that we can do this, but I felt puzzled by questions like: why 3 matrices? Why singular values? Why these particular sizes? And why would one want to decompose a matrix like this anyway?

Looking through the internet I found a lot of articles and videos about how to calculate the SVD, but I was not able to find one providing a good visual intuition of what’s going on. So here I will not talk about how to actually find these matrices (you won’t do it by hand anyway, right?), but will try to answer, in a visual way, the questions I was puzzled by when I first encountered SVD. I’ll try to avoid special terms and formulas as much as I can.

But what is a matrix?

So, first things first: what is a visual representation of a matrix? We can think of a matrix as a table (an array) where each row contains the coordinates of a point; the matrix then represents a set of points by their coordinates. We can easily visualize this if there are no more than 3 coordinates per point.

import numpy as np

A = np.array([
    [0, 0, 0],
    [2, 2, 1],
    [1, 0, 2],
    [3, 4, 3]])


So each point is represented by 3 coordinates in 3-dimensional space, and these coordinates essentially say how far to go along each axis from the origin to get to that point.

Now let’s use the Iris dataset for our experiments. It has 4-dimensional data, but we can simply ignore one of the parameters to simplify visualization.

Let’s use these columns as our parameters

X_train = df[['petal_length', 'petal_width', 'sepal_width']].values
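
(The post doesn’t show where df comes from; assuming it is the Iris dataset loaded e.g. via seaborn, whose column names match the ones above, it could be:)

import seaborn as sns

df = sns.load_dataset('iris')   # columns: sepal_length, sepal_width, petal_length, petal_width, species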

Now we can visualize this data as points in 3D space

Here the color just represents the different classes of objects in the dataset, and we can easily see why a visual representation is so cool: similar objects are immediately seen to cluster together.

Now let me state the following: SVD effectively finds another orthonormal basis for our space, together with the representation of the original matrix in that basis.
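
As a rough sketch (the variable names are mine), computing this decomposition for our Iris slice and checking the “orthonormal basis” claim could look like this:

# Economy-size SVD: U is (150 × 3), s holds 3 singular values, Vt is (3 × 3)
U, s, Vt = np.linalg.svd(X_train, full_matrices=False)

# The rows of Vt (equivalently, the columns of V = Vt.T) are the new basis vectors,
# expressed in the original petal_length / petal_width / sepal_width coordinates
print(np.allclose(Vt @ Vt.T, np.eye(3)))   # True: the new basis vectors are orthonormal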

Change of basis

So we can freely change the basis, and then we’ll need new coordinates that tell us how far to go along each of the new directions we chose in order to get to each point.

Let’s consider a new basis:

T = np.array([
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
])

So the new basis keeps 2 vectors from the standard one (namely i and j), but it has the vector (1, 1, 1) instead of the standard vector k.

The thin lines on the picture show how far you need to go along each of the new axes from the origin to get to a given point. Note that apart from directions, a basis also defines the size of the “standard step along the axis”, and what a coordinate tells us is how many of these steps we need to take.
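
As a small sketch of this arithmetic (reusing the toy matrix A and the basis T from above), the new coordinates of a point p are found by solving T·c = p:

# Columns of T are the new basis vectors: i, j and (1, 1, 1).
# For a point p written as a column, its new coordinates c satisfy T @ c = p.
new_coords = np.linalg.solve(T, A.T).T

print(new_coords[1])                      # the point (2, 2, 1) becomes (1, 1, 1) in the new basis
print(np.allclose(new_coords @ T.T, A))   # True: going back is just p = T @ c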

What is an orthonormal basis?

  1. In our terms it’s a basis where, when you move along one axis, you do not move along any other. In the standard basis, if you move along k (e.g. up or down), your horizontal position does not change. However, in the basis from our example, if you move along the blue vector you will also move horizontally. One useful consequence of this property is that if we want to project a point onto a plane spanned by some of these axes (one of those grey “walls” on the pictures), we just need to set the corresponding coordinate(s) to 0 (see the sketch after this list).
  2. The “standard step along the axis” is one for all axes.
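
Here is a toy sketch of that projection property (with an orthonormal basis Q of my own, the standard basis rotated by 45° around the vertical axis):

# Columns of Q form an orthonormal basis
Q = np.array([
    [1/np.sqrt(2), -1/np.sqrt(2), 0],
    [1/np.sqrt(2),  1/np.sqrt(2), 0],
    [0,             0,            1],
])

p = np.array([3.0, 4.0, 3.0])
coords = Q.T @ p            # coordinates of p in the Q basis are plain dot products

coords[2] = 0               # "forget" the 3rd coordinate
print(Q @ coords)           # [3. 4. 0.]: p projected onto the plane of the first two new axes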

Ok, let’s see what it looks like for our Iris dataset.

As we can see, the vectors are orthogonal (in other words, perpendicular to each other).

These vectors are exactly what the matrix V gives us! In particular, each column of that matrix contains the coordinates of one new basis vector, expressed in the standard basis.

Scaling

The matrix Σ has non-zero elements only on its diagonal, and all it tells us is how much to stretch the data along each of the new axes. Note that by construction all of its diagonal elements are non-negative, so no axis gets flipped. They are also sorted in decreasing order, which means we stretch the data along our first axis the most, and less and less along each following axis. This implies that the original data is mostly spread along the first of our new axes.
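
With the decomposition of X_train from the sketch above, this can be checked directly:

print(s)                        # the singular values: non-negative, in decreasing order
Sigma = np.diag(s)              # the (3 × 3) diagonal Σ of the economy-size SVD
print(np.all(np.diff(s) <= 0))  # True: each value is no larger than the previous one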

Here we can compare how data is distributed along our 1st and 3rd axes.

New coordinates

Now that we have the new basis and the scaling, we need to find how to express our data in the new way. We need to know how many steps to take along the new axes to get to each of our dots. This is what the matrix U tells us: it essentially contains the new coordinates of all our dots.

Let’s see how our data looks in the new world.

These are the unstretched values (note the very small values of the coordinates).

Essentially this is just plotting the first 3 columns of U, just as we plotted the original data.
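
A rough sketch of that plot, assuming the seaborn dataframe from above and matplotlib for the 3D scatter:

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # only needed on older matplotlib versions

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# each row of U holds the (unscaled) coordinates of one Iris sample in the new basis
ax.scatter(U[:, 0], U[:, 1], U[:, 2], c=df['species'].astype('category').cat.codes)
plt.show()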

After stretching it becomes

Not much of a difference, until we pay attention to the axes’ scales: clearly x has the biggest range, then goes y, and z was almost untouched by the scaling.

This is what we get by computing (ΣUᵀ)ᵀ. Remember that linear algebra treats point coordinates as column vectors, while in the data each entry is usually represented by a row, hence all these ᵀ.
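
In code, the stretched coordinates from above are simply:

stretched = (Sigma @ U.T).T                 # stretch the new coordinates by the singular values
print(np.allclose(stretched, U @ Sigma))    # True: the same thing written row-wise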

Reconstructing original matrix

Then we apply the backward conversion to get back to our original coordinates:

(VΣUᵀ)ᵀ = UΣVᵀ = A

So we first transpose U to make the columns represent points, then apply all the transformations, and then transpose back to the “normal” data format. As Σ is diagonal, Σᵀ = Σ (technically, if it is not square, transposing changes its shape, but the main part, filled with values, remains the same).
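
A quick sanity check with the arrays from the earlier sketches:

reconstructed = (Vt.T @ Sigma @ U.T).T               # (VΣUᵀ)ᵀ, note that Vt.T is V
print(np.allclose(reconstructed, U @ Sigma @ Vt))    # True
print(np.allclose(reconstructed, X_train))           # True: the original data is recovered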

But why?

Sometimes we don’t really need all the information we have in data.

Remember this picture?

Here the distribution along our 3rd basis vector gives us almost nothing: all the points are almost aligned. So why don’t we just drop this information? We will then get a 2D picture of the same data. 2D is easier to look at, easier to store and easier to process.
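
In terms of the earlier sketches, dropping the third direction could look like this:

coords_2d = U[:, :2] * s[:2]                      # the 2D picture: each row is one (stretched) point

# the same points, projected back into the original 3D feature space
X_proj = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]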

Here it is:

Let’s also take a look at what this actually means in terms of the original data

Crosses are the projected data points with thin cyan lines tracing the original positions.

Here we can clearly see that all the dots were projected onto the plane formed by the 2 remaining basis vectors.

And looking from this angle, the pattern seen on the 2D plot is clearly recognizable (though I’m bad as a cameraman).

Quick recap

So SVD gives us another point of view on the data, in which the data is spread the most along the first few axes.

Given the formula A = UΣVᵀ

The rows of Vᵀ (that is, the columns of V) contain the vectors forming the new basis (hence the shape n × n, as we express n vectors in n dimensions).

The rows of U contain our data expressed in terms of the new basis (we actually need only n columns if m > n; numpy.linalg.svd() with full_matrices=False indeed returns U as an (m × n) matrix, but to make U truly orthogonal it can be padded up to (m × m), which is what the default full_matrices=True does).

The elements on the main diagonal of Σ tell us how much to stretch U along each of the new axes.

If we only use a subset of the new vectors to represent our data, we effectively project the data onto the (hyper)plane formed by the remaining vectors (remember that useful property of orthonormal bases?). It’s a useful technique to reduce dimensionality and computational complexity, as well as to represent multi-dimensional data (we can hardly draw 4D data, and 10D is absolutely beyond our imagination, yet this technique may allow us to visually spot some dependencies).

Conceptually, if we speak about data entries described by a set of features, we can think of SVD as creating a new set of features by combining the original ones, features which (hopefully) describe the data better. For example, instead of speaking of length and width for screens, we combine them and speak in terms of diagonal size, and this single feature is good enough.

