In Machine Learning and Statistics, Principal Component Analysis (PCA, or linear PCA) is a classic and widely used technique for data transformation. PCA can, for example, transform your high-dimensional data into points in two- or three-dimensional space so that they can be easily visualized in a scatter plot, or it can serve as a preprocessing step in supervised learning tasks (classification, regression) so that the data is projected into a low-dimensional space where the new features are decorrelated, potentially leading to better accuracy. This tutorial will explain the math behind PCA and walk through a step-by-step example of performing PCA on a synthetic data matrix.

Prerequisites

Here it is assumed that you have some basic understanding of Linear Algebra (e.g. matrix arithmetic, the geometric interpretation of the inner product), Calculus (e.g. Lagrange multipliers) and Statistics (e.g. what mean and variance are). In addition, the definitions of some math concepts and theorems (listed below) are needed to prove the properties of the resulting transformed data matrix.

Definition 1

A scalar $\lambda$ is called an eigenvalue of a matrix $A$ if there is a nontrivial (non-zero) solution $\mathbf{x}$ of $A\mathbf{x} = \lambda\mathbf{x}$. Such an $\mathbf{x}$ is called an eigenvector corresponding to the eigenvalue $\lambda$.

Definition 2

A real-valued symmetric matrix $A$ is said to be positive semidefinite if the scalar $\mathbf{z}^T A \mathbf{z}$ is non-negative for every non-zero real-valued column vector $\mathbf{z}$.

Theorem 1

If $A$ is a real-valued symmetric matrix, then the eigenvectors corresponding to different eigenvalues must be orthogonal to each other.

[http://www.math.hawaii.edu/~lee/linear/eigen.pdf]

Theorem 2

A real-valued symmetric matrix $A$ is positive semidefinite if and only if all of its eigenvalues are non-negative.

[http://theanalysisofdata.com/probability/C4.html]

The Math

Let’s say we have a data matrix $X \in \mathbb{R}^{n \times m}$, where $n$ is the number of observations and $m$ is the number of features: each observation is described by $m$ attributes.

Now we wish to describe each observation by a single attribute. In other words, the $n \times m$ matrix $X$ will be turned into an $n \times 1$ column matrix (i.e. a vector) $Y$, and we want the variance of the sample composed of the elements of this vector to be as large as possible.


Project points from 2-D space to 1-D space

One way to do this (in the case of PCA) is to project the collection of points from $m$-dimensional space onto 1-dimensional space, or equivalently to right-multiply $X$ with a column vector

$$\mathbf{w} = (w_1, w_2, \dots, w_m)^T,$$

giving rise to the transformed data matrix

$$Y = X\mathbf{w} \in \mathbb{R}^{n \times 1}.$$

Note: For mathematical convenience, the data matrix $X$ is assumed to have been centered (each column has zero mean): $\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij} = 0$ for $j = 1, \dots, m$, where $\bar{x}_j$ denotes the mean of the $j$-th column.

In case $X$ has not been centered, you can simply subtract the column mean from each element in each column.

Given that $X$ has been centered, the transformed data matrix $Y$ will be centered as well:

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m} x_{ij} w_j = \frac{1}{n}\sum_{j=1}^{m} w_j \sum_{i=1}^{n} x_{ij} = 0.$$
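As a quick numerical sanity check (a minimal sketch on a small random matrix, with variable names of our own choosing), centering X and projecting it onto an arbitrary unit-length vector w indeed produces a Y with zero mean:

import numpy as np

rng = np.random.default_rng(0)

# A small synthetic data matrix: n = 5 observations, m = 3 features
X = rng.normal(size=(5, 3))

# Center each column by subtracting its mean
X = X - X.mean(axis=0)

# An arbitrary unit-length projection vector w
w = rng.normal(size=3)
w = w / np.linalg.norm(w)

# Project onto 1-D: Y = Xw
Y = X.dot(w)

print(np.isclose(Y.mean(), 0.0))  # True: Y is centered as well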

Remember we wanted to make the variance of the sample composed of the elements of $Y$ as large as possible. In other words, we want to maximize $\mathrm{Var}(Y)$ (over the space of all $\mathbf{w}$), which can be expressed in terms of $\mathbf{w}$ and $X$:

$$\mathrm{Var}(Y) = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 = \frac{1}{n-1}\sum_{i=1}^{n} y_i^2 = \frac{1}{n-1} Y^T Y = \frac{1}{n-1}\,\mathbf{w}^T X^T X \mathbf{w}.$$

Here we can see that $\mathrm{Var}(Y)$ is proportional to the squared norm of $Y$, which in turn depends on the norm of $\mathbf{w}$. If we were to allow the norm of $\mathbf{w}$ to be unbounded, $\mathrm{Var}(Y)$ could be arbitrarily large!

To fix this we need to put some constraint on $\mathbf{w}$: we make it a unit-length vector:

$$\mathbf{w}^T\mathbf{w} = 1.$$

So this turns the original unconstrained optimization problem into one with an equality constraint, which can be solved using a Lagrange multiplier:

$$L(\mathbf{w}, \lambda) = \frac{1}{n-1}\,\mathbf{w}^T X^T X \mathbf{w} - \lambda\,(\mathbf{w}^T\mathbf{w} - 1).$$

Taking the derivative of the Lagrangian function with respect to $\mathbf{w}$ and setting it to zero,

$$\frac{\partial L}{\partial \mathbf{w}} = \frac{2}{n-1} X^T X \mathbf{w} - 2\lambda\mathbf{w} = 0,$$

we end up with the condition that $\mathbf{w}$ must satisfy:

$$\Sigma\mathbf{w} = \lambda\mathbf{w},$$

where

$$\Sigma = \frac{1}{n-1} X^T X.$$

According to Definition 1, $\mathbf{w}$ must be an eigenvector of $\Sigma$ corresponding to the eigenvalue $\lambda$.

We can further show that the variance of $Y$ happens to be equal to $\lambda$:

$$\mathrm{Var}(Y) = \frac{1}{n-1}\,\mathbf{w}^T X^T X \mathbf{w} = \mathbf{w}^T \Sigma \mathbf{w} = \mathbf{w}^T (\lambda\mathbf{w}) = \lambda\,\mathbf{w}^T\mathbf{w} = \lambda.$$
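This identity is easy to verify numerically. The sketch below is a minimal illustration on random data (np.linalg.eigh is used here because the covariance matrix is symmetric): projecting the centered data onto an eigenvector of the covariance matrix gives a sample whose variance matches the corresponding eigenvalue.

import numpy as np

rng = np.random.default_rng(1)

# Centered data matrix with n = 200 observations and m = 3 features
X = rng.normal(size=(200, 3))
X = X - X.mean(axis=0)
n = X.shape[0]

# Covariance matrix Sigma = X^T X / (n - 1)
sigma = X.T.dot(X) / (n - 1)

# eigh is tailored to symmetric matrices; it returns eigenvalues in ascending order
eigval, eigvec = np.linalg.eigh(sigma)

# Project onto the eigenvector with the largest eigenvalue
w = eigvec[:, -1]
Y = X.dot(w)

# The sample variance of Y (with the 1/(n-1) convention) equals that eigenvalue
print(np.isclose(Y.var(ddof=1), eigval[-1]))  # True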

The matrix $\Sigma$ (i.e. the covariance matrix of $X$) has the following properties:

  • $\Sigma$ is symmetric: $\Sigma^T = \left(\frac{1}{n-1} X^T X\right)^T = \frac{1}{n-1} X^T X = \Sigma$

  • $\Sigma$ is positive semidefinite: for any real-valued non-zero vector $\mathbf{z}$,

$$\mathbf{z}^T \Sigma \mathbf{z} = \frac{1}{n-1}(X\mathbf{z})^T(X\mathbf{z}) = \frac{1}{n-1}\|X\mathbf{z}\|^2 \ge 0$$

(both properties are also checked numerically in the short sketch after this list).
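A minimal numerical check of both properties (the data matrix is random and purely illustrative):

import numpy as np

rng = np.random.default_rng(2)

# Covariance matrix of a random, centered data matrix
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)
sigma = X.T.dot(X) / (X.shape[0] - 1)

# Symmetry: Sigma equals its own transpose
print(np.allclose(sigma, sigma.T))  # True

# Positive semidefiniteness: z^T Sigma z >= 0 for an arbitrary vector z
z = rng.normal(size=4)
print(z.dot(sigma).dot(z) >= 0)  # True

# Equivalently (Theorem 2), all eigenvalues of Sigma are non-negative
print(np.all(np.linalg.eigvalsh(sigma) >= -1e-12))  # True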

Suppose $\lambda_1, \lambda_2, \dots, \lambda_m$ are the eigenvalues of $\Sigma$, and $\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_m$ are the eigenvectors corresponding to each eigenvalue. Then $\lambda$ must be one of these eigenvalues, and $\mathbf{w}$ must be an eigenvector corresponding to $\lambda$.

According to Theorem 1, we know that eigenvectors corresponding to different eigenvalues are orthogonal to each other.
According to Theorem 2, we know that the eigenvalues are non-negative:

$$\lambda_i \ge 0, \quad i = 1, \dots, m.$$

Based on the properties of $\Sigma$, its eigenvalues $\lambda_i$, its eigenvectors $\mathbf{w}_i$, and the transformed matrix $Y$ derived above, we can apply PCA to $X$ in the following steps:

  1. Compute the covariance matrix $\Sigma = \frac{1}{n-1} X^T X$, where $X$ should have been centered.
  2. Find the eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_m$ of $\Sigma$, as well as the corresponding eigenvectors $\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_m$. The eigenvalues should be in descending order: $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m \ge 0$.
  3. Determine the number of features $k$ ($1 \le k \le m$) to keep in the transformed matrix $Y$.
  4. Build the projection matrix $W = [\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_k]$, whose columns are the top $k$ eigenvectors.
  5. Compute the transformed matrix $Y = XW$ (see the sketch after this list).
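Putting the five steps together, a compact reference implementation might look like the sketch below (the function name pca and its signature are ours, not part of any library):

import numpy as np

def pca(X, k):
    """Project the (n, m) data matrix X onto its top-k principal components."""
    # Step 1: center X and compute the covariance matrix Sigma
    X = X - X.mean(axis=0)
    sigma = X.T.dot(X) / (X.shape[0] - 1)

    # Step 2: eigendecomposition; eigh returns eigenvalues in ascending order,
    # so reverse both outputs to get descending order
    eigval, eigvec = np.linalg.eigh(sigma)
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]

    # Steps 3-4: keep the top-k eigenvectors as the projection matrix W
    W = eigvec[:, :k]

    # Step 5: compute the transformed matrix Y = XW
    return X.dot(W)

Calling pca(X, 2) on the synthetic dataset in the example below reproduces the result derived there step by step, up to a possible sign flip of each column (the sign of an eigenvector is arbitrary).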

And we can prove that the columns of $Y$ have the properties we desired:

  • The first column $X\mathbf{w}_1$ has the largest variance $\lambda_1$, followed by $X\mathbf{w}_2$ with variance $\lambda_2$, and so on.
  • The columns of $Y$ are orthogonal to each other. Consider $X\mathbf{w}_i$ and $X\mathbf{w}_j$, where $i \ne j$. It follows that

$$(X\mathbf{w}_i)^T (X\mathbf{w}_j) = \mathbf{w}_i^T X^T X \mathbf{w}_j = (n-1)\,\mathbf{w}_i^T \Sigma \mathbf{w}_j = (n-1)\,\lambda_j\,\mathbf{w}_i^T \mathbf{w}_j = 0$$

(both claims are checked numerically in the sketch below).
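The following minimal sketch (random data; all variable names are ours) confirms that the column variances equal the sorted eigenvalues and that the columns are mutually orthogonal:

import numpy as np

rng = np.random.default_rng(3)

# Random centered data matrix and its covariance matrix
X = rng.normal(size=(500, 3))
X = X - X.mean(axis=0)
n = X.shape[0]
sigma = X.T.dot(X) / (n - 1)

# Eigendecomposition, sorted in descending order of eigenvalue
eigval, eigvec = np.linalg.eigh(sigma)
eigval, eigvec = eigval[::-1], eigvec[:, ::-1]

# Keep all components: Y = XW
Y = X.dot(eigvec)

# Column variances equal the eigenvalues, largest first
print(np.allclose(Y.var(axis=0, ddof=1), eigval))  # True

# Columns of Y are orthogonal: Y^T Y is (n - 1) times a diagonal matrix
print(np.allclose(Y.T.dot(Y), (n - 1) * np.diag(eigval)))  # True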

Example

We conclude our discussion with an example where we’ll generate a synthetic dataset and apply PCA as prescribed above.

First we will generate a 2-D dataset of 1,000 points by sampling from a Gaussian with covariance matrix $\begin{bmatrix} 0.1 & 0 \\ 0 & 1 \end{bmatrix}$, rotating the points by $\pi/4$, and shifting them to mean $(-1, 1)$ (shown in the figure below):

import numpy as np
import matplotlib.pyplot as plt


def generate_2Ddata(size, covariance, theta, offset):
    """Sample `size` points from a centered 2-D Gaussian, rotate them by `theta`, and shift by `offset`."""
    X = np.random.multivariate_normal([0, 0], covariance, size)
    # 2-D rotation matrix for the angle theta
    rot_mat = np.array([[np.cos(theta), -np.sin(theta)],
                        [np.sin(theta),  np.cos(theta)]])
    return X.dot(rot_mat) + offset

size = 1000
X = generate_2Ddata(size, [[0.1, 0.], [0., 1.]], np.pi/4, [-1., 1.])

Compute the covariance matrix of the centered data matrix:

# Center the data 
X = X - X.mean(axis=0)
# Compute covariance matrix
sigma = X.T.dot(X) / (size - 1)

Find the eigenvalues and eigenvectors, and sort them in descending order of eigenvalues.

# eigval[i] is the ith eigenvalue
# eigvec[:, i] is the eigenvector corresponding to eigval[i]
eigval, eigvec = np.linalg.eig(sigma)

# Sort both in descending order of eigenvalue so they stay aligned
indices = np.argsort(-eigval)
eigval = eigval[indices]
eigvec = eigvec[:, indices]

For now we’ll keep all principal components, but you can keep only the top $k$ components with the largest variance:

eigvec = eigvec[:, :k]

Apply PCA as a linear transformation on X

X_pca = X.dot(eigvec)

Display the PCA-transformed data matrix in a scatter plot:

plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.xlim([-5, 5])
plt.ylim([-5, 5])
plt.xlabel('1st Principal Component')
plt.ylabel('2nd Principal Component')
plt.show()
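If scikit-learn is installed, the result can also be cross-checked against its PCA class (an optional sanity check, not part of the derivation above); the signs of individual components are arbitrary, so columns may be flipped relative to ours:

import numpy as np
from sklearn.decomposition import PCA

# sklearn centers the data internally, so passing the already-centered X is fine
X_pca_sklearn = PCA(n_components=2).fit_transform(X)

# Agreement up to a per-column sign flip
print(np.allclose(np.abs(X_pca_sklearn), np.abs(X_pca)))  # True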


Left: Original, uncentered. Middle: Centered. Right: PCA-transformed