PCA: What "Explained Variance" Actually Means

The geometric interpretation, the assumptions that fail silently, and why 90% explained variance might be 90% noise.

Principal Component Analysis is everywhere. Dimensionality reduction, feature extraction, noise filtering, visualization. The standard tutorial tells you to compute the covariance matrix, find its eigenvectors, project your data, and celebrate when the first few components explain "90% of the variance."

That last part — the explained variance — is where most practitioners stop thinking. It's also where the trouble begins.

The core problem: "Explained variance" measures how much of the total spread in your data is captured. It says nothing about whether that spread is signal or noise, structure or artifact.

What PCA Actually Computes

Let's be precise. Given a data matrix \(X\) with \(n\) observations and \(p\) features (centered to zero mean), PCA finds the directions of maximum variance.

The covariance matrix is:

$$\Sigma = \frac{1}{n-1} X^T X$$

PCA solves the eigenvalue problem:

$$\Sigma v_i = \lambda_i v_i$$

The eigenvector \(v_1\) with the largest eigenvalue \(\lambda_1\) points in the direction of maximum variance. The second eigenvector \(v_2\) points in the direction of maximum remaining variance, orthogonal to \(v_1\). And so on.

"Explained variance" for the first \(k\) components is:

$$\text{Explained Variance} = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{p} \lambda_i}$$

This is the ratio of variance captured by your chosen components to total variance. Geometrically: how much of the data's spread lies along the subspace you're keeping.
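The two formulas above translate directly into numpy (a minimal sketch on made-up random data; `eigh` is the right routine because \(\Sigma\) is symmetric):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))       # 500 observations, 5 features (illustrative)
X = X - X.mean(axis=0)              # center each feature

Sigma = X.T @ X / (X.shape[0] - 1)  # covariance matrix: (1/(n-1)) X^T X

# eigh handles symmetric matrices; eigenvalues come back in ascending order
eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvals = eigvals[::-1]             # sort descending: lambda_1 >= lambda_2 >= ...

# Explained variance ratio of the first k components
k = 2
explained = eigvals[:k].sum() / eigvals.sum()
print(f"first {k} components explain {explained:.1%} of total variance")
```

A useful sanity check on any implementation: the eigenvalues must sum to the trace of \(\Sigma\), i.e. total variance.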

The Interpretation Problem

Here's where the textbook stops and reality starts.

Variance ≠ Signal. If your data has high-variance noise, PCA will faithfully capture that noise in the top components. The algorithm doesn't know what you care about. It finds spread, not meaning.

Consider a simple example: you have 100 features, one of which is your actual signal with variance 1. The other 99 are pure noise, each with variance 0.5. Total variance is \(1 + 99 \times 0.5 = 50.5\). The signal accounts for \(1/50.5 \approx 2\%\) of total variance.

Run PCA on a modest sample. The first component might explain 5% of the variance, dominated not by your signal but by spurious sample correlations among the noise features. Your signal is buried. "90% explained variance" could mean you've captured 90% of the noise.
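A small simulation of exactly this setup (the sample sizes below are arbitrary choices for illustration) shows both halves of the problem: with plenty of data, PC1 does recover the signal but still accounts for only about 2% of variance; with few observations, PC1's share is inflated by sampling noise:

```python
import numpy as np

rng = np.random.default_rng(0)

def top_share(n):
    signal = rng.normal(scale=1.0, size=(n, 1))           # variance 1
    noise = rng.normal(scale=np.sqrt(0.5), size=(n, 99))  # variance 0.5 each
    X = np.hstack([signal, noise])
    X = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(X.T @ X / (n - 1))
    return eigvals[-1] / eigvals.sum()  # largest eigenvalue's share of variance

big = top_share(10_000)  # plenty of data: PC1 is the signal, near the true ~2%
small = top_share(150)   # few observations: PC1's share inflated by sampling noise
print(f"n=10000: PC1 share = {big:.1%}")
print(f"n=150:   PC1 share = {small:.1%}")
```

Either way, no amount of data pushes PC1's share anywhere near 90%: the signal simply isn't where the variance is.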

The Hidden Assumption

PCA implicitly assumes that directions of high variance are directions of interest. This is true when noise is isotropic (equal in all directions) and smaller than signal. In financial data, neither assumption typically holds.

Financial Data: Where It Breaks

Financial time series violate PCA's assumptions in specific, predictable ways:

1. Non-Stationarity

PCA assumes your data is drawn from a fixed distribution. Financial returns are not. Volatility clusters. Correlations shift. The covariance matrix you estimate from the past may not apply to the future.

Worse: the eigenvectors themselves are unstable. Run PCA on rolling windows and watch the principal components rotate. The "market factor" you found last month might point in a different direction today.
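One way to see this rotation is a sketch with synthetic "returns" whose factor loadings switch halfway through the sample (a contrived regime change, not real market data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy returns: one latent factor, but its loadings change mid-sample
n, p = 500, 10
factor = rng.normal(size=(n, 1))
load_a = np.r_[np.ones(5), np.zeros(5)][None, :]  # regime A: first 5 assets
load_b = np.r_[np.zeros(5), np.ones(5)][None, :]  # regime B: last 5 assets
X = np.vstack([
    factor[:250] @ load_a + 0.5 * rng.normal(size=(250, p)),
    factor[250:] @ load_b + 0.5 * rng.normal(size=(250, p)),
])

def pc1(window):
    W = window - window.mean(axis=0)
    return np.linalg.eigh(W.T @ W)[1][:, -1]  # eigenvector of largest eigenvalue

within = abs(pc1(X[:125]) @ pc1(X[125:250]))  # same regime: directions agree
across = abs(pc1(X[:250]) @ pc1(X[250:]))     # across regimes: rotated away
print(f"|cos| within regime: {within:.2f}, across regimes: {across:.2f}")
```

A cosine near 1 means the two windows found the same direction; a cosine near 0 means the "same" principal component now points somewhere else entirely.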

2. Heavy Tails

Covariance is sensitive to outliers. Financial returns have fat tails — extreme moves occur more often than Gaussian assumptions predict. A few outliers can dominate your covariance estimate and distort the principal components.

The eigenvalues are affected too. In small samples with heavy-tailed data, the largest eigenvalues are systematically biased upward. You think the first component explains more than it actually does.
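A rough simulation (sample sizes arbitrary) makes the bias visible. With a true covariance equal to the identity, every true eigenvalue is 1, yet the largest sample eigenvalue overshoots even for Gaussian data, and overshoots further when the entries are heavy-tailed:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, trials = 100, 50, 200

def median_top_eig(sampler):
    tops = []
    for _ in range(trials):
        X = sampler((n, p))
        X = X - X.mean(axis=0)
        tops.append(np.linalg.eigvalsh(X.T @ X / (n - 1))[-1])
    return float(np.median(tops))

gauss = lambda shape: rng.normal(size=shape)
# Student-t with 3 degrees of freedom, rescaled to unit variance (Var of t_3 is 3)
t3 = lambda shape: rng.standard_t(3, size=shape) / np.sqrt(3)

# True covariance is the identity in both cases: every true eigenvalue is 1
gauss_top = median_top_eig(gauss)
t3_top = median_top_eig(t3)
print(f"median top eigenvalue, Gaussian: {gauss_top:.2f}")  # well above 1
print(f"median top eigenvalue, t(3):    {t3_top:.2f}")      # larger still
```

The Gaussian overshoot is pure small-sample effect (it grows as \(p/n\) grows); the extra overshoot under t(3) is the fat-tail penalty on top of it.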

3. The Curse of Dimensionality

With \(p\) assets and \(n\) observations, you're estimating \(p(p+1)/2\) covariance parameters. If \(p\) is anywhere close to \(n\), your estimate is garbage — and so are the eigenvectors derived from it.

This isn't hypothetical. If you have 500 stocks and 252 trading days of data, you're estimating 125,250 parameters from 126,000 data points. The covariance matrix is nearly singular. The small eigenvalues are noise. The eigenvectors associated with them are meaningless.
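This is easy to demonstrate at roughly those dimensions: draw data whose true covariance is the identity, so every true eigenvalue equals 1 and all spread in the sample eigenvalues is estimation noise:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 252, 500  # 252 "trading days", 500 "stocks"

# True covariance is the identity: any structure found below is pure noise
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(X.T @ X / (n - 1))[::-1]

n_zero = int(np.sum(eigvals < 1e-8))
print(f"largest sample eigenvalue: {eigvals[0]:.2f}")  # far above the true value 1
print(f"zero eigenvalues: {n_zero}")  # rank deficit: p - (n - 1) = 249
```

With \(p > n\), the sample covariance cannot even be full rank: after centering, 249 of the 500 eigenvalues are exactly zero, and the rest are smeared far from their true common value of 1.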

4. Non-Linear Structure

PCA finds linear combinations. Financial relationships are often non-linear. Option prices depend non-linearly on underlying prices. Volatility relationships are non-linear. Credit spreads behave non-linearly near default thresholds.

PCA will miss this structure entirely, or worse, find spurious linear approximations that break down exactly when you need them most — during regime changes.
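A two-feature toy example makes the point. Put points on a parabola, a fully deterministic non-linear relationship, and PC1 can still report a high explained-variance share while the relationship itself sits in the discarded component:

```python
import numpy as np

rng = np.random.default_rng(4)

# Points on a parabola: y is an exact non-linear function of x
x = rng.uniform(-1, 1, size=2000)
X = np.column_stack([x, x**2])
Xc = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / (len(x) - 1))
explained = eigvals[-1] / eigvals.sum()
print(f"PC1 explains {explained:.1%} of variance")

# Reconstruct from PC1 alone: the relationship y = x^2 lives almost
# entirely in the second component we just threw away
v1 = eigvecs[:, -1:]
resid = Xc - Xc @ v1 @ v1.T
print(f"max reconstruction error: {np.abs(resid).max():.3f}")
```

A nonzero reconstruction error here is not "lost noise": it is a deterministic relationship that the one-component linear model cannot represent at all.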

When PCA Works

This isn't a case against PCA. It's a case for understanding when it's appropriate:

  1. Noise is roughly isotropic and smaller than the signal, so high-variance directions really are the interesting ones.
  2. The data is approximately stationary, so a single covariance matrix describes the whole sample.
  3. Observations far outnumber features (\(n \gg p\)), so the eigenvalue and eigenvector estimates are reliable.
  4. The structure you care about is linear, or close to it over the range of your data.

The Alternative: Ask Different Questions

Instead of "how much variance do these components explain?", ask:

  1. Are the components stable across subsamples and across time?
  2. Do the loadings mean something you can defend, or are they artifacts of the estimation?
  3. Does keeping these components actually improve the downstream task, out of sample?

A Practitioner's Checklist

Before running PCA on financial data:

  1. Check stationarity. Plot rolling means and variances. If they drift, your covariance matrix is a moving target.
  2. Check the ratio \(n/p\). If it's less than 10, be skeptical of eigenvalue estimates.
  3. Examine eigenvalue distribution. If many eigenvalues are close together, the corresponding eigenvectors are poorly determined.
  4. Test stability. Bootstrap or use rolling windows. If PC1 today isn't recognizably PC1 last month, don't build strategies on it.
  5. Validate on the actual task. Explained variance is not your objective function.
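Items 2 through 4 of the checklist can be wrapped in a small diagnostic helper (a sketch: the function name, defaults, and thresholds are illustrative choices, not an established API):

```python
import numpy as np

def pca_diagnostics(X, n_boot=200, seed=0):
    """Pre-PCA sanity checks (illustrative sketch, not a library function)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / (n - 1))
    v1 = eigvecs[:, -1]

    # Checklist item 2: observations per feature
    n_over_p = n / p

    # Checklist item 3: relative gap between the top two eigenvalues;
    # a small gap means PC1's direction is poorly determined
    top_gap = (eigvals[-1] - eigvals[-2]) / eigvals[-1]

    # Checklist item 4: bootstrap stability of PC1's direction
    cosines = []
    for _ in range(n_boot):
        B = Xc[rng.integers(0, n, size=n)]
        B = B - B.mean(axis=0)
        w = np.linalg.eigh(B.T @ B / (n - 1))[1][:, -1]
        cosines.append(abs(v1 @ w))
    return {"n_over_p": n_over_p, "top_gap": top_gap,
            "pc1_stability": float(np.mean(cosines))}

# Example: a strong one-factor structure should score high on pc1_stability
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 1)) @ np.ones((1, 8)) + 0.3 * rng.normal(size=(300, 8))
diag = pca_diagnostics(X)
print(diag)
```

A `pc1_stability` well below 1, or a `top_gap` near 0, is exactly the warning sign items 3 and 4 describe: the component you are about to build on is not well determined by the data.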

Summary

PCA is a linear algebra operation, not a magic wand. It finds directions of maximum variance. Whether those directions are useful depends on your data and your problem — not on the explained variance percentage.

In financial data specifically:

  1. Non-stationarity makes the covariance matrix, and the components derived from it, a moving target.
  2. Heavy tails bias the largest eigenvalues upward and let outliers distort the components.
  3. High dimensionality (\(p\) close to \(n\)) turns most eigenvectors into noise.
  4. Non-linear structure is invisible to a linear decomposition.

"90% explained variance" is not a success metric. It's a description of how much spread you've captured — signal and noise alike. The real question is always: does this decomposition help with what you're actually trying to do?
