PCA — What "Explained Variance" Actually Means

What PCA actually finds, what it misses, and why 90% explained variance is a description, not a success metric

The naive question

Principal Component Analysis is everywhere — dimensionality reduction, feature extraction, noise filtering, visualisation. The standard tutorial tells you to compute the covariance matrix, find its eigenvectors, project the data, and celebrate when the first few components explain "90% of the variance".

That last part — the explained variance — is where most practitioners stop thinking. It is also where the trouble begins.

Key point: "Explained variance" measures how much of the total spread in the data is captured. It says nothing about whether that spread is signal or noise, structure or artifact.

What PCA actually computes

Be precise. Given a data matrix $X$ with $n$ observations and $p$ features (centred to zero mean), PCA finds the directions of maximum variance.

The sample covariance matrix is

$$\Sigma = \frac{1}{n-1} \, X^\top X.$$

PCA solves the eigenvalue problem

$$\Sigma \, v_i = \lambda_i \, v_i.$$

The eigenvector $v_1$ with the largest eigenvalue $\lambda_1$ points in the direction of maximum variance. The second eigenvector $v_2$ points in the direction of maximum remaining variance, orthogonal to $v_1$. And so on through $v_p$.

The "explained variance" of the first $k$ components is the ratio

$$\text{EV}(k) = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{p} \lambda_i}.$$

Geometrically: how much of the data's spread lies along the subspace formed by the first $k$ principal axes. Mechanically: a number between zero and one that grows monotonically as more components are added.
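
As a concrete reference, here is a minimal NumPy sketch of the computation just described; the function and variable names are illustrative, not from any particular library:

```python
import numpy as np

def pca_explained_variance(X, k):
    """Eigendecomposition of the sample covariance and the EV(k) ratio."""
    Xc = X - X.mean(axis=0)                    # centre each feature
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)        # sample covariance, p x p
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]          # sort eigenvalues descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ev_k = eigvals[:k].sum() / eigvals.sum()   # explained-variance ratio of the first k
    scores = Xc @ eigvecs[:, :k]               # projection onto the first k axes
    return eigvals, eigvecs, scores, ev_k

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                 # 500 observations, 10 features
*_, ev2 = pca_explained_variance(X, k=2)
print(f"EV(2) = {ev2:.1%}")
```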

The original formulation goes back to Pearson (1901), who framed the problem as fitting a hyperplane of closest approach to a cloud of points. Hotelling (1933) gave it the modern statistical clothing — covariance, eigenvectors, the language we still use today. The textbook treatment is Jolliffe (2002); the most cited recent review is Jolliffe and Cadima (2016).

The interpretation problem

This is where the textbook stops and reality starts.

Variance is not signal. If the data has high-variance noise, PCA will faithfully capture that noise in the top components. The algorithm does not know what you care about. It finds spread, not meaning.

Consider a stylised example: 100 features, one of which is the actual signal with variance 1; the other 99 are pure noise, each with variance 0.5. Total variance is $1 + 99 \times 0.5 = 50.5$. The signal accounts for $1/50.5 \approx 2\%$ of the total. Run PCA. The first component will explain perhaps 5% of the variance — and it will be dominated by accidental correlations among the noise features. The signal is buried. "90% explained variance" in this setup means 90% of the noise has been captured.
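
The effect is easy to reproduce. Below is a quick simulation of the setup just described; the number of observations (250) is an assumption of mine, chosen so that the sample size is comparable to the number of features:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
n_obs = 250                                               # assumed; not specified in the text
signal = rng.normal(scale=1.0, size=(n_obs, 1))           # 1 signal feature, variance 1
noise = rng.normal(scale=np.sqrt(0.5), size=(n_obs, 99))  # 99 noise features, variance 0.5
X = np.hstack([signal, noise])

pca = PCA(n_components=5).fit(X)
print("explained variance of PC1:", round(pca.explained_variance_ratio_[0], 3))
# How much does PC1 load on the true signal (column 0)? Typically very little:
print("|loading of the signal feature on PC1|:", round(abs(pca.components_[0, 0]), 3))
```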

Warning

PCA implicitly assumes that directions of high variance are directions of interest. This is true when noise is isotropic (equal in all directions) and smaller in magnitude than the signal. In financial data, neither assumption typically holds.

Where it breaks: financial data

Financial time-series violate PCA's assumptions in specific, predictable ways.

1. Non-stationarity

PCA assumes the data is drawn from a fixed distribution. Financial returns are not. Volatility clusters. Correlations shift. The covariance matrix estimated from the past does not always apply to the future.

The eigenvectors themselves are unstable. Run PCA on rolling windows and watch the principal components rotate. The "market factor" identified last month may point in a different direction today.
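
A rolling-window sketch makes the rotation visible. The returns below are synthetic heavy-tailed noise standing in for real data, and the 120-day window is an arbitrary choice; with real returns, the leading eigenvector of each window can be compared the same way:

```python
import numpy as np

def leading_eigvec(window_returns):
    """First principal axis of a (days x assets) return matrix."""
    cov = np.cov(window_returns, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, -1]            # eigenvector of the largest eigenvalue

rng = np.random.default_rng(1)
returns = rng.standard_t(df=4, size=(1000, 50)) * 0.01   # synthetic stand-in for daily returns

window = 120
angles = []
prev = leading_eigvec(returns[:window])
for start in range(window, returns.shape[0] - window + 1, window):
    cur = leading_eigvec(returns[start:start + window])
    cos = abs(prev @ cur)            # absolute value: the sign of an eigenvector is arbitrary
    angles.append(np.degrees(np.arccos(np.clip(cos, 0.0, 1.0))))
    prev = cur

print("angle between successive PC1 estimates (degrees):", np.round(angles, 1))
```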

2. Heavy tails

Sample covariance is sensitive to outliers. Financial returns have fat tails — extreme moves occur far more often than Gaussian assumptions predict. A handful of outliers can dominate the covariance estimate and distort the principal components.

The eigenvalues are affected too. In small samples with heavy-tailed data, the largest eigenvalues are systematically biased upward. The first component appears to explain more than it actually does, and re-sampling tends to shrink it.
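
A small experiment illustrates the point: contaminating a handful of rows in an otherwise well-behaved Gaussian sample is enough to manufacture a dominant component. The contamination scheme below is deliberately crude and purely illustrative:

```python
import numpy as np

def top_eigenvalue(X):
    """Largest eigenvalue of the sample covariance."""
    return np.linalg.eigvalsh(np.cov(X, rowvar=False))[-1]

rng = np.random.default_rng(7)
n_obs, n_feat = 500, 20
X = rng.normal(size=(n_obs, n_feat))        # clean Gaussian data, unit variance

X_contaminated = X.copy()
X_contaminated[:5] += 10.0                  # 5 of 500 rows get a large common shock

print("top eigenvalue, clean data:       ", round(top_eigenvalue(X), 2))
print("top eigenvalue, 1% contamination: ", round(top_eigenvalue(X_contaminated), 2))
```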

3. The curse of dimensionality and the noise floor

With $p$ assets and $n$ observations, the sample covariance has $p(p+1)/2$ free parameters. When $p$ is anywhere close to $n$, the estimate is unreliable and the eigenvectors derived from it are meaningless.

This is not hypothetical. Five hundred stocks across one trading year — about 252 days — means estimating roughly 125,000 parameters from 126,000 data points. The matrix is nearly singular; the small eigenvalues are statistical noise, and the eigenvectors associated with them carry no information.

Random matrix theory gives the principled answer to which eigenvalues correspond to signal and which to noise. For the sample covariance of $n$ observations of $p$ purely random Gaussian variables, the eigenvalue density follows the Marchenko-Pastur distribution, supported on $[\lambda_-, \lambda_+]$ with

$$\lambda_\pm = \sigma^2 \left(1 \pm \sqrt{p/n}\right)^2,$$

where $\sigma^2$ is the per-variable variance. Any eigenvalue inside this band is consistent with pure noise; any eigenvalue substantially above $\lambda_+$ is candidate signal. Applied to financial returns (Laloux et al. 1999; Plerou et al. 1999), this draws a hard line under most of the spectrum: the bulk of the eigenvalues sits inside the Marchenko-Pastur range and carries no usable information. Only the few outliers above the band — typically dominated by a market mode and a small number of sector or style modes — correspond to identifiable structure.

Key point: In high-dimensional financial data, most eigenvalues are noise by construction. Marchenko-Pastur tells you exactly where the noise floor is.

4. Non-linear structure

PCA finds linear combinations. Financial relationships are often non-linear. Option prices depend non-linearly on underlying prices. Volatility relationships are non-linear. Credit spreads behave non-linearly near default thresholds.

PCA misses non-linear structure entirely or finds spurious linear approximations that break down exactly when they are most needed — during regime changes.

Robust alternatives

When the standard PCA assumptions fail, the literature offers principled responses.

Shrinkage estimators for the covariance matrix are the first line of defence in the high-dimensional regime. Ledoit and Wolf (2004) gave a closed-form, optimally weighted shrinkage of the sample covariance toward a structured target — typically a scaled identity. The shrunk matrix has bounded condition number, well-defined inverses, and eigenvalues no longer biased toward the extremes. The cost is a small loss of fit on the data; the gain is decisive in any downstream optimisation.
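
A minimal sketch using scikit-learn's LedoitWolf estimator on synthetic returns; the point is the conditioning of the resulting matrix, not the particular numbers:

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def condition_number(cov):
    """Ratio of largest to smallest eigenvalue."""
    eig = np.linalg.eigvalsh(cov)
    return eig[-1] / eig[0]

rng = np.random.default_rng(5)
n_obs, n_assets = 252, 200                     # p close to n: the sample covariance is fragile
returns = rng.normal(scale=0.01, size=(n_obs, n_assets))

sample_cov = np.cov(returns, rowvar=False)
lw = LedoitWolf().fit(returns)                 # shrinks toward a scaled identity

print("shrinkage intensity:", round(lw.shrinkage_, 3))
print(f"condition number, sample covariance: {condition_number(sample_cov):.1e}")
print(f"condition number, shrunk covariance: {condition_number(lw.covariance_):.1e}")
```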

Probabilistic PCA (Tipping and Bishop 1999) reformulates PCA as a Gaussian latent-variable model with isotropic noise. It recovers the same principal directions in the noise-free limit and adds a likelihood, principled handling of missing data, and a basis for comparing models with different numbers of components.
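
The maximum-likelihood solution has a closed form, which keeps a sketch short. This follows the Tipping and Bishop formulas for the loading matrix and the noise variance; the missing-data handling and model-comparison machinery are omitted:

```python
import numpy as np

def ppca_ml(X, k):
    """Closed-form maximum-likelihood probabilistic PCA (Tipping & Bishop 1999)."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order
    sigma2 = eigvals[k:].mean()                          # ML noise variance: mean of discarded eigenvalues
    W = eigvecs[:, :k] * np.sqrt(np.maximum(eigvals[:k] - sigma2, 0.0))
    return W, sigma2                                     # W spans the same subspace as ordinary PCA

rng = np.random.default_rng(11)
X = rng.normal(size=(300, 8))
W, sigma2 = ppca_ml(X, k=2)
print("loading matrix shape:", W.shape, "| estimated noise variance:", round(sigma2, 3))
```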

Robust PCA trades efficiency for resistance to outliers — minimum covariance determinant estimators, projection pursuit, low-rank-plus-sparse decompositions. They are slower and less crisp than vanilla PCA but survive heavy-tailed contamination.

For non-linear structure, the standard escape hatch is kernel PCA — replace dot products with a kernel and work in an implicit higher-dimensional feature space. Kernel PCA recovers non-linear principal manifolds at the cost of interpretability and computational scaling.
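
A sketch of the contrast on a deliberately non-linear toy dataset, points on a noisy circle, where no single linear direction captures the structure. The kernel and its bandwidth are arbitrary choices here, and the attribute names follow recent scikit-learn versions:

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

rng = np.random.default_rng(2)
theta = rng.uniform(0, 2 * np.pi, size=500)
X = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(scale=0.05, size=(500, 2))

linear = PCA(n_components=2).fit(X)
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=2.0).fit(X)

# Linear PCA splits the variance roughly 50/50: no single direction describes a circle.
print("linear PCA explained-variance ratios:", np.round(linear.explained_variance_ratio_, 2))
# Kernel PCA has no explained_variance_ratio_; its eigenvalues live in the implicit feature space.
print("kernel PCA leading eigenvalues:", np.round(kpca.eigenvalues_[:2], 2))
```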

When PCA works

This is not a case against PCA. It is a case for understanding when the method is appropriate. PCA earns its keep when its assumptions roughly hold: the noise is close to isotropic and smaller in magnitude than the signal, the relationships of interest are approximately linear, the covariance structure is stable over the estimation window, and there are many more observations than features. Under those conditions the leading components are stable, interpretable, and genuinely useful for compression and visualisation.

Asking different questions

Instead of "how much variance do these components explain?", the questions worth asking are:

  - Are the leading components stable across time and across resamples of the data?
  - Do they rise above the noise floor predicted by random matrix theory, and do they correspond to identifiable structure?
  - Does the decomposition actually help the downstream task: the forecast, the hedge, the risk estimate?

A practitioner's checklist

Before running PCA on financial data (a sketch automating steps 2–4 follows the checklist):

  1. Check stationarity. Plot rolling means, variances, and pairwise correlations. If they drift, the covariance is a moving target.
  2. Check the ratio $n/p$. If it is below 10, treat eigenvalue magnitudes as approximate. Below 2, assume the small eigenvalues are pure noise.
  3. Compute the noise floor. Marchenko-Pastur upper edge $\lambda_+ = \sigma^2(1 + \sqrt{p/n})^2$. Components whose eigenvalues fall below this edge are indistinguishable from noise.
  4. Test stability. Bootstrap or rolling windows. If $\text{PC}_1$ today is not recognisably $\text{PC}_1$ last month, it is unsafe to build a strategy on it.
  5. Validate on the downstream task. Explained variance is a description, not an objective.
  6. Consider shrinkage before raw sample covariance whenever $p/n$ is non-trivial.
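
A sketch automating steps 2 to 4 of the checklist. Synthetic heavy-tailed returns stand in for real data, and the bootstrap count is an arbitrary choice; the thresholds are the ones quoted above:

```python
import numpy as np

def pca_sanity_report(returns, n_boot=200, seed=0):
    """Checklist steps 2-4: n/p ratio, Marchenko-Pastur floor, bootstrap stability of PC1."""
    rng = np.random.default_rng(seed)
    n, p = returns.shape
    Z = (returns - returns.mean(axis=0)) / returns.std(axis=0)   # standardise so sigma^2 = 1

    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    lam_plus = (1 + np.sqrt(p / n)) ** 2                         # Marchenko-Pastur upper edge
    pc1 = eigvecs[:, -1]

    # Resample observations and measure how well the bootstrapped PC1 aligns with the original.
    alignments = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        _, v = np.linalg.eigh(np.cov(Z[idx], rowvar=False))
        alignments.append(abs(pc1 @ v[:, -1]))

    return {
        "n_over_p": round(n / p, 2),                              # step 2: want this well above 2, ideally above 10
        "eigenvalues_above_MP": int((eigvals > lam_plus).sum()),  # step 3: candidate signal components
        "PC1_bootstrap_alignment": round(float(np.mean(alignments)), 3),  # step 4: near 1 means stable
    }

rng = np.random.default_rng(4)
fake_returns = rng.standard_t(df=5, size=(252, 50)) * 0.01
print(pca_sanity_report(fake_returns))
```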

When to use what

| Goal | Vanilla PCA | Shrunk-covariance PCA | Probabilistic PCA | Kernel PCA | RMT-cleaned PCA |
| --- | --- | --- | --- | --- | --- |
| 2D visualisation | yes | partial | partial | partial | |
| Multicollinear regression | yes | yes | partial | | |
| Risk model in high dimensions | partial | yes | | | yes |
| Missing data | | | yes | | |
| Non-linear structure | | | | yes | |
| Signal-vs-noise separation | partial | partial | | | yes |

Summary

PCA is a linear algebra operation, not a magic wand. It finds the directions of maximum variance. Whether those directions are useful depends on the data and the problem — not on the explained-variance percentage.

In financial data specifically:

  - the covariance structure drifts, so components estimated from the past need not describe the future;
  - heavy tails inflate the leading eigenvalues and distort the eigenvectors;
  - when the number of assets approaches the number of observations, most of the spectrum is noise by construction;
  - linear components miss the non-linear relationships that matter most during regime changes.

"90% explained variance" is not a success metric. It is a description of how much spread has been captured — signal and noise alike. The honest question is always whether the decomposition helps with the downstream task.

References

Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24, 417–441.

Jolliffe, I. T. (2002). Principal Component Analysis, 2nd edition. Springer.

Jolliffe, I. T. and Cadima, J. (2016). Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A, 374, 20150202.

Laloux, L., Cizeau, P., Bouchaud, J.-P. and Potters, M. (1999). Noise dressing of financial correlation matrices. Physical Review Letters, 83, 1467–1470.

Ledoit, O. and Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88, 365–411.

Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2, 559–572.

Plerou, V., Gopikrishnan, P., Rosenow, B., Amaral, L. A. N. and Stanley, H. E. (1999). Universal and nonuniversal properties of cross correlations in financial time series. Physical Review Letters, 83, 1471–1474.

Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61, 611–622.
