Principal Component Analysis is everywhere. Dimensionality reduction, feature extraction, noise filtering, visualization. The standard tutorial tells you to compute the covariance matrix, find its eigenvectors, project your data, and celebrate when the first few components explain "90% of the variance."
That last part — the explained variance — is where most practitioners stop thinking. It's also where the trouble begins.
The core problem: "Explained variance" measures how much of the total spread in your data is captured. It says nothing about whether that spread is signal or noise, structure or artifact.
What PCA Actually Computes
Let's be precise. Given a data matrix \(X\) with \(n\) observations and \(p\) features (centered to zero mean), PCA finds the directions of maximum variance.
The covariance matrix is:

\[ C = \frac{1}{n-1} X^\top X \]
PCA solves the eigenvalue problem:

\[ C v_i = \lambda_i v_i \]
The eigenvector \(v_1\) with the largest eigenvalue \(\lambda_1\) points in the direction of maximum variance. The second eigenvector \(v_2\) points in the direction of maximum remaining variance, orthogonal to \(v_1\). And so on.
"Explained variance" for the first \(k\) components is:

\[ \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{p} \lambda_i} \]
This is the ratio of variance captured by your chosen components to total variance. Geometrically: how much of the data's spread lies along the subspace you're keeping.
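The whole pipeline fits in a few lines of NumPy. This is a minimal sketch on synthetic data (seeded for reproducibility); the matrix names follow the definitions above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # 200 observations, 5 features
X -= X.mean(axis=0)                      # center to zero mean

C = X.T @ X / (X.shape[0] - 1)           # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)     # eigh returns ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending

k = 2
explained = eigvals[:k].sum() / eigvals.sum()   # explained variance ratio
scores = X @ eigvecs[:, :k]                     # projection onto top-k subspace
print(f"top-{k} explained variance ratio: {explained:.3f}")
```

Note that the eigenvalue sum equals the trace of \(C\), so the denominator is just the total variance of the (centered) data.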
The Interpretation Problem
Here's where the textbook stops and reality starts.
Variance ≠ Signal. If your data has high-variance noise, PCA will faithfully capture that noise in the top components. The algorithm doesn't know what you care about. It finds spread, not meaning.
Consider a simple example: you have 100 features, one of which is your actual signal with variance 1. The other 99 are pure noise, each with variance 0.5. Total variance is \(1 + 99 \times 0.5 = 50.5\). The signal accounts for \(1/50.5 \approx 2\%\) of total variance.
Run PCA. The first component might explain 5% of variance — and be dominated by noise correlations. Your signal is buried. "90% explained variance" could mean you've captured 90% of the noise.
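A quick simulation of exactly this setup makes the point concrete (synthetic seeded data; the precise numbers vary with the seed, but the qualitative outcome does not):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 120, 100
X = np.sqrt(0.5) * rng.normal(size=(n, p))   # 99 noise features, variance 0.5
X[:, 0] = rng.normal(size=n)                 # feature 0: the signal, variance 1
X -= X.mean(axis=0)

C = X.T @ X / (n - 1)
eigvals, eigvecs = np.linalg.eigh(C)         # ascending order

signal_share = C[0, 0] / np.trace(C)         # ~2% of total variance
top_eig = eigvals[-1]                        # largest sample eigenvalue
pc1_on_signal = abs(eigvecs[0, -1])          # PC1's loading on the signal feature

print(f"signal share of total variance: {signal_share:.1%}")
print(f"largest sample eigenvalue: {top_eig:.2f} (signal's true variance: 1.0)")
print(f"|PC1 loading on signal|: {pc1_on_signal:.2f}")
```

With only 120 observations, spurious correlations among the 99 noise features push the top sample eigenvalue above the signal's true variance of 1, so PC1 points into the noise, not at the signal.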
PCA implicitly assumes that directions of high variance are directions of interest. This is true when noise is isotropic (equal in all directions) and smaller than signal. In financial data, neither assumption typically holds.
Financial Data: Where It Breaks
Financial time series violate PCA's assumptions in specific, predictable ways:
1. Non-Stationarity
PCA assumes your data is drawn from a fixed distribution. Financial returns are not. Volatility clusters. Correlations shift. The covariance matrix you estimate from the past may not apply to the future.
Worse: the eigenvectors themselves are unstable. Run PCA on rolling windows and watch the principal components rotate. The "market factor" you found last month might point in a different direction today.
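One way to see the rotation: simulate two "windows" whose common factor loads on different halves of the asset universe, and compare their first eigenvectors via the absolute cosine. This is a synthetic sketch, not real market data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 250, 20

def pc1(X):
    # Eigenvector of the largest eigenvalue of the sample covariance
    return np.linalg.eigh(np.cov(X, rowvar=False))[1][:, -1]

# Window A: the common factor loads on the first 10 assets.
# Window B: it loads on the last 10. Idiosyncratic noise in both.
load_a = np.r_[np.ones(10), np.zeros(10)]
load_b = np.r_[np.zeros(10), np.ones(10)]
win_a = rng.normal(size=(n, 1)) @ load_a[None, :] + 0.5 * rng.normal(size=(n, p))
win_b = rng.normal(size=(n, 1)) @ load_b[None, :] + 0.5 * rng.normal(size=(n, p))

overlap = abs(pc1(win_a) @ pc1(win_b))   # |cosine|: 1 = same axis, 0 = orthogonal
print(f"|cos(PC1_A, PC1_B)| = {overlap:.2f}")
```

An overlap near zero means the two windows' "first components" are essentially orthogonal: the label "PC1" persists, the direction does not.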
2. Heavy Tails
Covariance is sensitive to outliers. Financial returns have fat tails — extreme moves occur more often than Gaussian assumptions predict. A few outliers can dominate your covariance estimate and distort the principal components.
The eigenvalues are affected too. In small samples with heavy-tailed data, the largest eigenvalues are systematically biased upward. You think the first component explains more than it actually does.
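A small simulation illustrates the bias: draw heavy-tailed data whose true covariance is the identity, so every true eigenvalue equals 1, and look at the largest sample eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 50                           # small sample relative to dimension

# Heavy-tailed draws: Student-t with 3 degrees of freedom, rescaled so each
# feature has unit variance (Var(t_3) = 3). True covariance is the identity.
X = rng.standard_t(df=3, size=(n, p)) / np.sqrt(3.0)
X -= X.mean(axis=0)

eigvals = np.linalg.eigvalsh(X.T @ X / (n - 1))
print(f"true top eigenvalue: 1.00 | sample top eigenvalue: {eigvals[-1]:.2f}")
```

The largest sample eigenvalue lands well above 1 while the smallest lands well below it: the spectrum spreads out, and the top component appears to "explain" variance that is pure estimation error.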
3. The Curse of Dimensionality
With \(p\) assets and \(n\) observations, you're estimating \(p(p+1)/2\) covariance parameters. If \(p\) is anywhere close to \(n\), your estimate is garbage — and so are the eigenvectors derived from it.
This isn't hypothetical. If you have 500 stocks and 252 trading days of data, you're estimating 125,250 parameters from 126,000 data points. The covariance matrix is nearly singular. The small eigenvalues are noise. The eigenvectors associated with them are meaningless.
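The rank deficiency is easy to verify directly. A sketch with the same dimensions as the example above (synthetic Gaussian data standing in for returns):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 252, 500                          # one year of daily data, 500 assets
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)

C = X.T @ X / (n - 1)                    # 500 x 500, from 252 observations
eigvals = np.linalg.eigvalsh(C)
rank = int(np.sum(eigvals > 1e-10))      # count numerically nonzero eigenvalues

print(f"parameters in C: {p * (p + 1) // 2:,}")
print(f"rank of C: {rank} (centering caps it at n - 1 = {n - 1})")
```

At least \(p - n + 1\) eigenvalues are exactly zero, so the corresponding eigenvectors are arbitrary directions in the null space: pure artifacts of the sample.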
4. Non-Linear Structure
PCA finds linear combinations. Financial relationships are often non-linear. Option prices depend non-linearly on underlying prices. Volatility relationships are non-linear. Credit spreads behave non-linearly near default thresholds.
PCA will miss this structure entirely, or worse, find spurious linear approximations that break down exactly when you need them most — during regime changes.
When PCA Works
This isn't a case against PCA. It's a case for understanding when it's appropriate:
- Dimensionality reduction for visualization: If you just want to plot high-dimensional data in 2D, the exact interpretation matters less. PCA gives you a reasonable projection.
- Preprocessing for regression: If you have multicollinear features, PCA can reduce dimensionality before fitting. But use cross-validation to choose the number of components — don't trust explained variance.
- Factor decomposition (with caveats): In equity markets, the first principal component often resembles a "market factor." But it's not the same as the market return, and it's not stable over time.
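For the preprocessing use case, choosing the number of components by cross-validated predictive skill rather than explained variance might look like this sketch (assumes scikit-learn is available; the data is synthetic with a planted three-factor structure):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(5)
n, p = 300, 30

# Planted structure: three latent factors drive both X and y.
F = rng.normal(size=(n, 3))
X = F @ rng.normal(size=(3, p)) + 0.3 * rng.normal(size=(n, p))
y = F @ np.array([1.0, -0.5, 0.25]) + 0.1 * rng.normal(size=n)

# Choose n_components by cross-validated R^2, not by explained variance.
pipe = make_pipeline(PCA(), Ridge())
grid = {"pca__n_components": [2, 3, 5, 10, 20]}
search = GridSearchCV(pipe, grid, cv=5).fit(X, y)

print("best n_components:", search.best_params_["pca__n_components"])
print(f"cross-validated R^2: {search.best_score_:.3f}")
```

The selection criterion here is out-of-sample fit on the downstream target, which is the quantity you actually care about.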
The Alternative: Ask Different Questions
Instead of "how much variance do these components explain?", ask:
- Are the components stable? Run PCA on different subsets of your data. If the eigenvectors rotate significantly, your decomposition isn't robust.
- Do the components predict what you care about? If you're using PCA for a downstream task (trading, risk modeling), evaluate on that task. Explained variance is a proxy at best.
- Is the covariance estimate reliable? If \(p/n > 0.1\), consider shrinkage estimators or factor models instead of sample covariance.
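A shrinkage estimator can be nearly a one-line swap if scikit-learn is available. A sketch on synthetic data whose true covariance is the identity (so every true eigenvalue is 1):

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(6)
n, p = 60, 40                            # p/n = 0.67, far above 0.1
X = rng.normal(size=(n, p))              # true covariance: identity

sample_eigs = np.linalg.eigvalsh(np.cov(X, rowvar=False))
lw = LedoitWolf().fit(X)                 # shrinks toward a scaled identity
shrunk_eigs = np.linalg.eigvalsh(lw.covariance_)

print(f"sample eigenvalues: [{sample_eigs[0]:.2f}, {sample_eigs[-1]:.2f}]")
print(f"shrunk eigenvalues: [{shrunk_eigs[0]:.2f}, {shrunk_eigs[-1]:.2f}]")
print(f"shrinkage intensity: {lw.shrinkage_:.2f}")
```

Ledoit-Wolf shrinkage pulls the overspread sample spectrum back toward its mean, which is exactly the correction the biased eigenvalues need.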
A Practitioner's Checklist
Before running PCA on financial data:
- Check stationarity. Plot rolling means and variances. If they drift, your covariance matrix is a moving target.
- Check the ratio \(n/p\). If it's less than 10, be skeptical of eigenvalue estimates.
- Examine eigenvalue distribution. If many eigenvalues are close together, the corresponding eigenvectors are poorly determined.
- Test stability. Bootstrap or use rolling windows. If PC1 today isn't recognizably PC1 last month, don't build strategies on it.
- Validate on the actual task. Explained variance is not your objective function.
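The stability item on the checklist can be sketched as a bootstrap over observations (synthetic one-factor data here, chosen so PC1 *should* be stable; on real returns, a low median overlap is the warning sign):

```python
import numpy as np

rng = np.random.default_rng(7)

def pc1(X):
    # First principal component: top eigenvector of the sample covariance
    return np.linalg.eigh(np.cov(X, rowvar=False))[1][:, -1]

# Synthetic one-factor data: a strong common factor plus idiosyncratic noise.
n, p = 500, 15
X = rng.normal(size=(n, 1)) @ np.ones((1, p)) + 0.5 * rng.normal(size=(n, p))

v_full = pc1(X)
overlaps = [abs(pc1(X[rng.integers(0, n, size=n)]) @ v_full)   # resample rows
            for _ in range(200)]

print(f"median |cos| vs full-sample PC1: {np.median(overlaps):.3f}")
```

A median overlap near 1 means the decomposition is robust to resampling; values that drift toward 0.5 or below mean PC1 is an artifact of the particular sample.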
Summary
PCA is a linear algebra operation, not a magic wand. It finds directions of maximum variance. Whether those directions are useful depends on your data and your problem — not on the explained variance percentage.
In financial data specifically:
- Non-stationarity makes eigenvectors unstable over time
- Heavy tails bias eigenvalue estimates
- High dimensionality makes covariance estimation unreliable
- Non-linear relationships are invisible to PCA
"90% explained variance" is not a success metric. It's a description of how much spread you've captured — signal and noise alike. The real question is always: does this decomposition help with what you're actually trying to do?