Principal components analysis (PCA)#

Let \(\boldsymbol X \in \mathbb R^{n\times d}\) be a feature matrix with rows \(\boldsymbol x_i^\mathsf{T}\) consisting of centered data: \(\sum\limits_{i=1}^n \boldsymbol x_i = \boldsymbol 0\). We would like to approximate each \(\boldsymbol x_i\) by a low-dimensional representation \(\boldsymbol z_i \in \mathbb R^\ell\), \(\ell < d\). In PCA we assume that each \(\boldsymbol x_i\) can be approximately represented as a linear combination of vectors \(\boldsymbol w_1,\ldots, \boldsymbol w_\ell \in \mathbb R^d\):

\[ \boldsymbol x_i \approx \sum \limits_{k=1}^\ell z_{ik} \boldsymbol w_k. \]

Let \(\boldsymbol Z = (z_{ik})\) be an \(n\times\ell\) matrix of coefficients and

\[ \boldsymbol W = [\boldsymbol w_1 \ldots \boldsymbol w_\ell] \in \mathbb R^{d \times \ell}. \]

In PCA we minimize the loss function

\[ \mathcal L(\boldsymbol W) = \frac 1n\big\Vert \boldsymbol X - \boldsymbol{ZW}^\mathsf{T}\big\Vert_F^2 = \frac 1n\big\Vert \boldsymbol X^\mathsf{T} - \boldsymbol{WZ}^\mathsf{T}\big\Vert_F^2 = \frac 1n\sum\limits_{i=1}^n \big\Vert \boldsymbol x_i - \boldsymbol {Wz}_i\big\Vert_2^2. \]

For \(\boldsymbol W\) with orthonormal columns and optimally chosen codes \(\boldsymbol z_i = \boldsymbol W^\mathsf{T}\boldsymbol x_i\), this loss function attains its minimum at

\[ \widehat{\boldsymbol W} = \boldsymbol U_\ell = [\boldsymbol u_1 \ldots \boldsymbol u_\ell] \]

where \(\boldsymbol u_1, \ldots, \boldsymbol u_\ell\) are the eigenvectors of \(\boldsymbol X^\mathsf{T}\boldsymbol X\) corresponding to its \(\ell\) largest eigenvalues.
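
One way to see this (a sketch of the argument): plugging the optimal codes \(\boldsymbol Z = \boldsymbol X\boldsymbol W\) into the loss and using \(\boldsymbol W^\mathsf{T}\boldsymbol W = \boldsymbol I\) gives

\[ n\,\mathcal L(\boldsymbol W) = \big\Vert \boldsymbol X - \boldsymbol{XWW}^\mathsf{T}\big\Vert_F^2 = \operatorname{tr}(\boldsymbol X^\mathsf{T}\boldsymbol X) - \operatorname{tr}(\boldsymbol W^\mathsf{T}\boldsymbol X^\mathsf{T}\boldsymbol{XW}), \]

so minimizing \(\mathcal L\) is equivalent to maximizing \(\operatorname{tr}(\boldsymbol W^\mathsf{T}\boldsymbol X^\mathsf{T}\boldsymbol{XW})\), and this trace is maximized by taking the top \(\ell\) eigenvectors of \(\boldsymbol X^\mathsf{T}\boldsymbol X\) as the columns of \(\boldsymbol W\).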

One can also show from the singular value decomposition of \(\boldsymbol X\),

\[ \boldsymbol X = \boldsymbol{U \Sigma V}^\mathsf{T} \]

that \(\widehat{\boldsymbol W}\) equals the first \(\ell\) columns of \(\boldsymbol V\).
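
To make the formulas concrete, here is a minimal NumPy sketch on synthetic data (the variable names `X_toy`, `W_hat` are illustrative): it computes \(\widehat{\boldsymbol W}\) from the SVD, forms the codes \(\boldsymbol Z = \boldsymbol X\widehat{\boldsymbol W}\), evaluates the reconstruction loss, and checks that the columns of \(\widehat{\boldsymbol W}\) are eigenvectors of \(\boldsymbol X^\mathsf{T}\boldsymbol X\).

import numpy as np

rng = np.random.default_rng(0)
n, d, ell = 200, 10, 3
X_toy = rng.normal(size=(n, d))
X_toy -= X_toy.mean(axis=0, keepdims=True)         # center the data

# SVD: X = U Sigma V^T; W-hat is the first ell columns of V
U, S, Vt = np.linalg.svd(X_toy, full_matrices=False)
W_hat = Vt[:ell].T
Z = X_toy @ W_hat                                   # optimal codes z_i = W^T x_i
loss = np.sum((X_toy - Z @ W_hat.T) ** 2) / n
print("Reconstruction loss:", loss)

# Cross-check: columns of W-hat are eigenvectors of X^T X (up to sign)
eigvals, eigvecs = np.linalg.eigh(X_toy.T @ X_toy)  # eigenvalues in ascending order
top = eigvecs[:, ::-1][:, :ell]                     # eigenvectors of the ell largest eigenvalues
print(np.allclose(np.abs(np.sum(W_hat * top, axis=0)), 1.0))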

PCA on MNIST#

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

%config InlineBackend.figure_format = 'svg'

X, y = fetch_openml('mnist_784', return_X_y=True, parser='auto')
X = X.astype(float).values / 255
y = y.astype(int).values
X -= np.mean(X, axis=0, keepdims=True)

X_train, X_vis, y_train, y_vis = train_test_split(X, y, test_size=10000, stratify=y, random_state=12)

Perform PCA with 2 components:

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_vis)
print("Explained variance:", pca.explained_variance_) 
print("Explained variance ratio:", pca.explained_variance_ratio_.sum())
Explained variance: [5.16209074 3.77201984]
Explained variance ratio: 0.1693879257542866
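
Two components retain only about 17% of the total variance. As a side sketch (the component count of 50 below is an arbitrary choice for illustration, and `pca_50` is just an illustrative name), one can plot how the cumulative explained variance ratio grows with the number of components:

pca_50 = PCA(n_components=50)
pca_50.fit(X_vis)
plt.plot(np.cumsum(pca_50.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance ratio');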

Visualize the result of PCA:

scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_vis, cmap='tab10',
                      s=15, alpha=0.8, edgecolors='k')
legend_labels = [str(i) for i in range(10)]
plt.legend(handles=scatter.legend_elements()[0], title='Digits', labels=legend_labels)
plt.title('PCA on MNIST')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2');
(Figure: 2D PCA projection of the MNIST subset, colored by digit class.)

t-SNE on MNIST#

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X_vis)
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_vis, cmap='tab10',
                      s=15, alpha=0.8, edgecolors='k')
legend_labels = [str(i) for i in range(10)]
plt.legend(handles=scatter.legend_elements()[0], title='Digits', labels=legend_labels)
plt.title('t-SNE on MNIST')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2');
(Figure: 2D t-SNE embedding of the MNIST subset, colored by digit class.)