Lecture 23: Dimensionality Reduction and PCA¶
Announcements:¶
- Friday: Practice Quiz 9; ask me anything (or don't)
Goals:¶
- Be able to give example use cases for dimensionality reduction
- Know the meaning of intrinsic dimensionality and ambient dimensionality
- Be able to use sklearn's PCA module to reduce the dimensionality of a dataset
- Get some intuition for random projections and their surprising distance-preservation properties.
# the packages for making 3D plots in jupyter aren't in the 311 env yet =(
# this will install them locally for your user
!pip install ipympl ipywidgets
Defaulting to user installation because normal site-packages is not writeable Requirement already satisfied: ipympl in /cluster/home/wehrwes/.local/lib/python3.12/site-packages (0.10.0) Requirement already satisfied: ipywidgets in /opt/miniforge/lib/python3.12/site-packages (8.1.8) Requirement already satisfied: ipython<10 in /opt/miniforge/lib/python3.12/site-packages (from ipympl) (9.12.0) Requirement already satisfied: matplotlib<4,>=3.5.0 in /opt/miniforge/lib/python3.12/site-packages (from ipympl) (3.10.8) Requirement already satisfied: numpy in /opt/miniforge/lib/python3.12/site-packages (from ipympl) (2.4.3) Requirement already satisfied: pillow in /opt/miniforge/lib/python3.12/site-packages (from ipympl) (12.2.0) Requirement already satisfied: traitlets<6 in /opt/miniforge/lib/python3.12/site-packages (from ipympl) (5.14.3) Requirement already satisfied: comm>=0.1.3 in /opt/miniforge/lib/python3.12/site-packages (from ipywidgets) (0.2.3) Requirement already satisfied: widgetsnbextension~=4.0.14 in /opt/miniforge/lib/python3.12/site-packages (from ipywidgets) (4.0.15) Requirement already satisfied: jupyterlab_widgets~=3.0.15 in /opt/miniforge/lib/python3.12/site-packages (from ipywidgets) (3.0.16) Requirement already satisfied: decorator>=5.1.0 in /opt/miniforge/lib/python3.12/site-packages (from ipython<10->ipympl) (5.2.1) Requirement already satisfied: ipython-pygments-lexers>=1.0.0 in /opt/miniforge/lib/python3.12/site-packages (from ipython<10->ipympl) (1.1.1) Requirement already satisfied: jedi>=0.18.2 in /opt/miniforge/lib/python3.12/site-packages (from ipython<10->ipympl) (0.19.2) Requirement already satisfied: matplotlib-inline>=0.1.6 in /opt/miniforge/lib/python3.12/site-packages (from ipython<10->ipympl) (0.2.1) Requirement already satisfied: pexpect>4.6 in /opt/miniforge/lib/python3.12/site-packages (from ipython<10->ipympl) (4.9.0) Requirement already satisfied: prompt_toolkit<3.1.0,>=3.0.41 in /opt/miniforge/lib/python3.12/site-packages (from ipython<10->ipympl) (3.0.52) Requirement already satisfied: pygments>=2.14.0 in /opt/miniforge/lib/python3.12/site-packages (from ipython<10->ipympl) (2.20.0) Requirement already satisfied: stack_data>=0.6.0 in /opt/miniforge/lib/python3.12/site-packages (from ipython<10->ipympl) (0.6.3) Requirement already satisfied: contourpy>=1.0.1 in /opt/miniforge/lib/python3.12/site-packages (from matplotlib<4,>=3.5.0->ipympl) (1.3.3) Requirement already satisfied: cycler>=0.10 in /opt/miniforge/lib/python3.12/site-packages (from matplotlib<4,>=3.5.0->ipympl) (0.12.1) Requirement already satisfied: fonttools>=4.22.0 in /opt/miniforge/lib/python3.12/site-packages (from matplotlib<4,>=3.5.0->ipympl) (4.62.0) Requirement already satisfied: kiwisolver>=1.3.1 in /opt/miniforge/lib/python3.12/site-packages (from matplotlib<4,>=3.5.0->ipympl) (1.5.0) Requirement already satisfied: packaging>=20.0 in /opt/miniforge/lib/python3.12/site-packages (from matplotlib<4,>=3.5.0->ipympl) (26.0) Requirement already satisfied: pyparsing>=3 in /opt/miniforge/lib/python3.12/site-packages (from matplotlib<4,>=3.5.0->ipympl) (3.3.2) Requirement already satisfied: python-dateutil>=2.7 in /opt/miniforge/lib/python3.12/site-packages (from matplotlib<4,>=3.5.0->ipympl) (2.9.0.post0) Requirement already satisfied: wcwidth in /opt/miniforge/lib/python3.12/site-packages (from prompt_toolkit<3.1.0,>=3.0.41->ipython<10->ipympl) (0.6.0) Requirement already satisfied: parso<0.9.0,>=0.8.4 in /opt/miniforge/lib/python3.12/site-packages (from jedi>=0.18.2->ipython<10->ipympl) (0.8.6) Requirement already satisfied: ptyprocess>=0.5 in /opt/miniforge/lib/python3.12/site-packages (from pexpect>4.6->ipython<10->ipympl) (0.7.0) Requirement already satisfied: six>=1.5 in /opt/miniforge/lib/python3.12/site-packages (from python-dateutil>=2.7->matplotlib<4,>=3.5.0->ipympl) (1.17.0) Requirement already satisfied: executing>=1.2.0 in /opt/miniforge/lib/python3.12/site-packages (from stack_data>=0.6.0->ipython<10->ipympl) (2.2.1) Requirement already satisfied: asttokens>=2.1.0 in /opt/miniforge/lib/python3.12/site-packages (from stack_data>=0.6.0->ipython<10->ipympl) (3.0.1) Requirement already satisfied: pure_eval in /opt/miniforge/lib/python3.12/site-packages (from stack_data>=0.6.0->ipython<10->ipympl) (0.2.3)
import scipy
import sklearn
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import PCA
np.random.seed(311)
fig = None
pcafig = None
figvar = None
fig2d = None
figcircle = None
figbean = None
Toy example: a 2D subspace embedded in 3D¶
We'll construct a dataset where the data truly lives in a 2D subspace (a plane) inside 3D space, with a small amount of noise pushing points slightly off the plane.
If we can discover that the data is essentially 2D, we can represent it with 2 numbers per point instead of 3 — without losing much information.
# Dataset generation - you can ignore this code, but if you have
# the linear algebra background, you'll see what I did here.
n = 200
# Subspace basis vectors
u1 = np.array([1, 0.5, 0]) # points in the xy-diagonal
u1 = u1 / np.linalg.norm(u1) # make it unit length
u2 = np.linalg.cross(np.array([0, 1, 1]), u1) # find something orthogonal to it
u2 = u2 / np.linalg.norm(u2) # make that unit length
# Generate random points on the subspace plane
a = np.random.normal(0, 3.0, n) # large spread along u1
b = np.random.normal(0, 1.5, n) # smaller spread along u2
# Embed in the 3D space using the basis vectors and add a bit of noise
noise_scale = 0.2
X = np.outer(a, u1) + np.outer(b, u2) + np.random.normal(0, noise_scale, (n, 3))
%matplotlib widget
if not fig:
fig = plt.figure(figsize=(8, 6))
else:
fig.clf()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], s=10, alpha=0.4, color='steelblue')
ax.set_xlabel('x'); ax.set_xlim3d([-8, 8])
ax.set_ylabel('y'); ax.set_ylim3d([-8, 8])
ax.set_zlabel('z'); ax.set_zlim3d([-8, 8])
plt.tight_layout()
plt.show()
Principal Components Analysis: Intuition-only¶
PCA finds the directions (orthonormal basis vectors) in the ambient space (3D, in this case) along which the data has the highest variance.
pca = PCA(n_components=3)
pca.fit(X)
print("Principal component directions (rows):")
for i, (pc, var) in enumerate(zip(pca.components_, pca.explained_variance_ratio_)):
print(f" PC{i+1}: [{pc[0]:+.3f}, {pc[1]:+.3f}, {pc[2]:+.3f}] "
f"explains {var*100:.1f}% of variance")
Principal component directions (rows): PC1: [+0.877, +0.480, -0.033] explains 76.0% of variance PC2: [+0.380, -0.649, +0.659] explains 23.7% of variance PC3: [-0.294, +0.591, +0.751] explains 0.4% of variance
Notice that PC1 and PC2 explain nearly all the variance — the third component (off-plane noise) accounts for almost nothing.
Let's visualize what these directions look like:
%matplotlib widget
if not pcafig:
pcafig = plt.figure(figsize=(8, 6))
else:
pcafig.clf()
centroid = X.mean(axis=0)
arrow_colors = ['crimson', 'darkorange', 'gray']
arrow_labels = ['PC1 (most variance)', 'PC2', 'PC3 (noise direction)']
# Scale arrows by standard deviation along each component for visual clarity
arrow_scales = np.sqrt(pca.explained_variance_) * 1.5
ax = pcafig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], s=10, alpha=0.25, color='steelblue', label='data')
for i, (pc, color, label, scale) in enumerate(
zip(pca.components_, arrow_colors, arrow_labels, arrow_scales)):
end = centroid + scale * pc
ax.quiver(*centroid, *(end - centroid),
color=color, linewidth=2.5, arrow_length_ratio=0.00, label=label)
ax.set_xlabel('x'); ax.set_xlim3d([-8, 8])
ax.set_ylabel('y'); ax.set_ylim3d([-8, 8])
ax.set_zlabel('z'); ax.set_zlim3d([-8, 8])
ax.set_title('Principal component directions')
ax.legend(loc='upper left', fontsize=9)
# plt.tight_layout()
plt.show()
PC1 and PC2 lie in the plane of the data. PC3 is perpendicular to the plane — it points in the direction of the noise, where there is very little variance.
Explained variance¶
PCA gives us the direction (basis) vectors, and also tells us how much variance (information) each component captures:
if figvar is None:
figvar, ax = plt.subplots(figsize=(6, 4))
else:
figvar.clf(); ax=figvar.gca()
bars = ax.bar([f'PC{i+1}' for i in range(3)],
pca.explained_variance_ratio_ * 100,
color=['crimson', 'darkorange', 'gray'], edgecolor='black')
for bar, val in zip(bars, pca.explained_variance_ratio_ * 100):
ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.5,
f'{val:.1f}%', ha='center', va='bottom', fontsize=11)
ax.set_ylabel('Explained variance (%)')
ax.set_title('Variance explained by each principal component')
ax.set_ylim(0, 100)
plt.tight_layout()
plt.show()
The first two components account for ~99% of the variance. PC3 captures only the noise. That's because the data lies (almost) on a plane - in other words, its intrinsic dimensionality is 2, even though its ambient dimensionality is 3.
Projecting to 2D: actually reducing dimensionality¶
If we drop PC3 and represent each point using only its coordinates along PC1 and PC2. This is the reduced representation.
pca2 = PCA(n_components=2)
X_2d = pca2.fit_transform(X) # shape: (n, 2)
if not fig2d:
fig2d, ax = plt.subplots(figsize=(6, 5))
else:
fig2d.clf(); ax=fig2d.gca()
ax.scatter(X_2d[:, 0], X_2d[:, 1], s=15, alpha=0.5, color='steelblue')
ax.set_xlabel('PC1 coordinate')
ax.set_ylabel('PC2 coordinate')
ax.quiver((0, 0), (0, 0), (0, 1), (1, 0), color=['darkorange', 'crimson'])
ax.set_title('Data projected onto first 2 principal components')
plt.tight_layout()
plt.show()
Think about applying some kind of ML algorithm - whether it's clustering, nearest-neighbor classification, or whatever else to this. Hopefully you're convinced that the results would be about the same on the reduced 2D data as it would have been on the higher-dimensional 3D data.
This worked really nicely in this case, because the subspace where all the data lives is linear.
When might this not work? If the subspace is not linear:
from sklearn.datasets import make_circles
import matplotlib.pyplot as plt
X, y = make_circles(n_samples=200, noise=0.02, factor=0.5, random_state=42)
X = X[y == 0]
plt.figure(figsize=(5, 5))
plt.scatter(X[:, 0], X[:, 1], s=20, alpha=0.7)
plt.axis("equal")
plt.title("make_circles dataset")
plt.tight_layout()
plt.show()
Part 2: Random Projections¶
PCA finds the optimal low-dimensional linear subspace by solving an eigenvalue problem — which requires computing a covariance matrix and doing an SVD. That's expensive for very high-dimensional data.
A surprising alternative: project onto random directions. It turns out that random projections also preserve pairwise distances quite well, even though they're not optimized at all.
This is formalized by the Johnson-Lindenstrauss lemma: for any set of $n$ points in high-dimensional space, there exists a projection to $k = O(\log n \;/\; \varepsilon^2)$ dimensions such that all pairwise distances are preserved within a factor of $(1 \pm \varepsilon)$. The remarkable part: $k$ depends only on $n$ and the desired accuracy $\varepsilon$, not on the original dimensionality.
Here, we're going to demo this with an example where we reduce a 500D dataset to a lower dimensionality, and see how the pairwise distances among points change.
Create dataset and compute true pairwise distances¶
np.random.seed(311)
# Dataset 1: 200 points in 500 dimensions
# with intrinsic dimensionality 500
n_points = 200
d_original = 500
X = np.random.randn(n_points, d_original)
# compute pairwise distances among all pairs of rows
orig_dists = scipy.spatial.distance.pdist(X)
Randomly project to a lower dimensionality¶
A random projection maps each point from $d$-dimensional space to $k$-dimensional space by multiplying by a random matrix $R$ of shape $(d, k)$ whose entries are i.i.d. $\mathcal{N}(0,\, 1/k)$.
The $1/k$ variance scaling ensures that the expected squared distance between any two projected points equals the original squared distance.
If you speak linear algebra, notice: this is not an orthonormal basis; it's a randomly chosen basis whose vectors can point any which way!
def random_project(X, k):
""" Build a random projection matrix R and project X onto the lower-dimensional basis
Return the projected data and the pairwise distances among the projected points """
R = np.random.randn(X.shape[1], k) * np.sqrt(1/k)
X_proj = X @ R
proj_dists = scipy.spatial.distance.pdist(X_proj)
return X_proj, proj_dists
X_proj, proj_dists = random_project(X, 50)
Calculate the ratio of (distance in projected space) / (distance in original space)¶
ratios = proj_dists / orig_dists
Visualize the distribution of these ratios:
def plot_distance_ratios(orig_dists, proj_dists):
ratios = proj_dists / orig_dists
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
ax = axes[0]
ax.scatter(orig_dists, proj_dists, s=2, alpha=0.3, color='steelblue')
lim = max(orig_dists.max(), proj_dists.max()) * 1.05
ax.plot([0, lim], [0, lim], 'r--', linewidth=1.5, label='perfect preservation')
ax.set_xlim([0, lim])
ax.set_xlim([0, lim])
ax.set_xlabel('Original distance (500D)')
ax.set_ylabel('Projected distance (50D)')
ax.set_title('Pairwise distance preservation')
ax.legend(fontsize=9)
ax = axes[1]
ax.hist(ratios, bins=50, color='steelblue', edgecolor='white')
ax.axvline(1.0, color='red', linestyle='--', linewidth=1.5, label='ratio = 1 (perfect)')
ax.set_xlabel('proj_dist / orig_dist')
ax.set_ylabel('Count')
ax.set_title('Distribution of distance ratios')
ax.legend(fontsize=9)
plt.tight_layout()
plt.show()
plot_distance_ratios(orig_dists, proj_dists)
Some things to notice:
- The true original data doesn't lie on a lower-dimensional subspace!
- But the distances are still pretty well preserved. If it did lie on a subspace, distances would be preserved even better!
- Intuition: random projection of a planar subspace in 3D
- This isn't perfect, but it is cheap, and sometimes that's what you need.
Dimensionality Reduction for Visualization¶
beans = pd.read_csv('/cluster/academic/DATA311/202620/drybean.csv')
features = beans.drop(columns="Class")
labels = beans["Class"]
X = sklearn.preprocessing.StandardScaler().fit_transform(features)
sns.pairplot(data=features)
/opt/miniforge/lib/python3.12/site-packages/seaborn/axisgrid.py:123: UserWarning: Tight layout not applied. tight_layout cannot make Axes width small enough to accommodate all Axes decorations self._figure.tight_layout(*args, **kwargs)
<seaborn.axisgrid.PairGrid at 0x14ff751849e0>
plt.show()
Another application of dimensionality reduction: I just want to look at the data!
X_pca = sklearn.decomposition.PCA().fit_transform(X)
X_pca.shape
(13611, 16)
sns.relplot(x=X_pca[:,0], y=X_pca[:,1], hue=labels)
plt.show()