The naive question
Principal Component Analysis is everywhere — dimensionality reduction, feature extraction, noise filtering, visualisation. The standard tutorial tells you to compute the covariance matrix, find its eigenvectors, project the data, and celebrate when the first few components explain "90% of the variance".
That last part — the explained variance — is where most practitioners stop thinking. It is also where the trouble begins.
Key point: "Explained variance" measures how much of the total spread in the data is captured. It says nothing about whether that spread is signal or noise, structure or artifact.
What PCA actually computes
Be precise. Given a data matrix $X$ with $n$ observations and $p$ features (centred to zero mean), PCA finds the directions of maximum variance.
The sample covariance matrix is

$$S = \frac{1}{n-1} X^\top X.$$
PCA solves the eigenvalue problem

$$S v_i = \lambda_i v_i, \qquad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0.$$
The eigenvector $v_1$ with the largest eigenvalue $\lambda_1$ points in the direction of maximum variance. The second eigenvector $v_2$ points in the direction of maximum remaining variance, orthogonal to $v_1$. And so on through $v_p$.
The "explained variance" of the first $k$ components is the ratio

$$\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{p} \lambda_i}.$$
Geometrically: how much of the data's spread lies along the subspace formed by the first $k$ principal axes. Mechanically: a number between zero and one that grows monotonically as more components are added.
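The mechanics above can be sketched in a few lines of NumPy (the data and dimensions here are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))            # 500 observations, 4 features
X = X - X.mean(axis=0)                   # centre to zero mean

S = X.T @ X / (X.shape[0] - 1)           # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)     # eigh returns eigenvalues in ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending

explained = np.cumsum(eigvals) / eigvals.sum()
print(explained)                         # grows monotonically, ends at 1.0
```

The cumulative ratio always reaches 1.0 with all $p$ components — which is exactly why it measures spread captured, not anything about what that spread means.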
The original formulation goes back to Pearson (1901), who framed the problem as fitting a hyperplane of closest approach to a cloud of points. Hotelling (1933) gave it the modern statistical clothing — covariance, eigenvectors, the language we still use today. The textbook treatment is Jolliffe (2002); the most cited recent review is Jolliffe and Cadima (2016).
The interpretation problem
This is where the textbook stops and reality starts.
Variance is not signal. If the data has high-variance noise, PCA will faithfully capture that noise in the top components. The algorithm does not know what you care about. It finds spread, not meaning.
Consider a stylised example: 100 features, one of which is the actual signal with variance 1; the other 99 are pure noise, each with variance 0.5. Total variance is $1 + 99 \times 0.5 = 50.5$. The signal accounts for $1/50.5 \approx 2\%$ of the total. Run PCA. The first component will explain perhaps 5% of the variance — and it will be dominated by accidental correlations among the noise features. The signal is buried. "90% explained variance" in this setup means 90% of the noise has been captured.
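This is easy to simulate. The sketch below mirrors the stylised example, with a deliberately small sample size so the noise dominates (all sizes and seeds are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 120, 100                          # few observations, many features
X = np.empty((n, p))
X[:, 0] = rng.normal(scale=1.0, size=n)                      # the one signal feature, variance 1
X[:, 1:] = rng.normal(scale=np.sqrt(0.5), size=(n, p - 1))   # 99 noise features, variance 0.5

X = X - X.mean(axis=0)
S = X.T @ X / (n - 1)
eigvals, eigvecs = np.linalg.eigh(S)     # ascending order

share = eigvals[-1] / eigvals.sum()      # variance share of the first component
loading = abs(eigvecs[0, -1])            # PC1 weight on the true signal feature
print(share, loading)                    # a few percent of variance, small loading
```

Running this shows PC1 capturing only a few percent of total variance, with most of its weight spread across noise features rather than concentrated on the signal.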
PCA implicitly assumes that directions of high variance are directions of interest. This is true when noise is isotropic (equal in all directions) and smaller in magnitude than the signal. In financial data, neither assumption typically holds.
Where it breaks: financial data
Financial time-series violate PCA's assumptions in specific, predictable ways.
1. Non-stationarity
PCA assumes the data is drawn from a fixed distribution. Financial returns are not. Volatility clusters. Correlations shift. The covariance matrix estimated from the past does not always apply to the future.
The eigenvectors themselves are unstable. Run PCA on rolling windows and watch the principal components rotate. The "market factor" identified last month may point in a different direction today.
2. Heavy tails
Sample covariance is sensitive to outliers. Financial returns have fat tails — extreme moves occur far more often than Gaussian assumptions predict. A handful of outliers can dominate the covariance estimate and distort the principal components.
The eigenvalues are affected too. In small samples with heavy-tailed data, the largest eigenvalues are systematically biased upward. The first component appears to explain more than it actually does, and re-sampling tends to shrink it.
3. The curse of dimensionality and the noise floor
With $p$ assets and $n$ observations, the sample covariance has $p(p+1)/2$ free parameters. When $p$ is anywhere close to $n$, the estimate is unreliable and the eigenvectors derived from it are meaningless.
This is not hypothetical. Five hundred stocks across one trading year — about 252 days — means estimating roughly 125,000 parameters from 126,000 data points. The matrix is nearly singular; the small eigenvalues are statistical noise, and the eigenvectors associated with them carry no information.
Random matrix theory gives the principled answer to which eigenvalues correspond to signal and which to noise. For the sample covariance of $n$ observations of $p$ purely random Gaussian variables, the eigenvalue density follows the Marchenko-Pastur distribution, supported on $[\lambda_-, \lambda_+]$ with

$$\lambda_\pm = \sigma^2 \left(1 \pm \sqrt{p/n}\right)^2,$$
where $\sigma^2$ is the per-variable variance. Any eigenvalue inside this band is consistent with pure noise; any eigenvalue substantially above $\lambda_+$ is candidate signal. Applied to financial returns (Laloux et al. 1999; Plerou et al. 1999), this draws a hard line under most of the spectrum: the bulk of the eigenvalues sits inside the Marchenko-Pastur range and carries no usable information. Only the few outliers above the band — typically dominated by a market mode and a small number of sector or style modes — correspond to identifiable structure.
Key point: In high-dimensional financial data, most eigenvalues are noise by construction. Marchenko-Pastur tells you exactly where the noise floor is.
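The band can be checked directly against simulated pure noise (dimensions and seed are illustrative):

```python
import numpy as np

def mp_edges(n, p, sigma2=1.0):
    """Marchenko-Pastur support edges for the sample covariance of pure noise."""
    q = p / n
    return sigma2 * (1 - q ** 0.5) ** 2, sigma2 * (1 + q ** 0.5) ** 2

rng = np.random.default_rng(2)
n, p = 1000, 250
X = rng.normal(size=(n, p))              # pure noise: no signal anywhere
eigvals = np.linalg.eigvalsh(X.T @ X / n)

lo, hi = mp_edges(n, p)
inside = np.mean((eigvals >= lo - 0.05) & (eigvals <= hi + 0.05))
print(lo, hi, inside)                    # nearly every eigenvalue sits in the band
```

With $p/n = 0.25$ the band runs from $0.25$ to $2.25$ — note how wide it is even at a comfortable 4:1 observation ratio. Any eigenvalue a real dataset produces inside this range is indistinguishable from noise.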
4. Non-linear structure
PCA finds linear combinations. Financial relationships are often non-linear. Option prices depend non-linearly on underlying prices. Volatility relationships are non-linear. Credit spreads behave non-linearly near default thresholds.
PCA misses non-linear structure entirely or finds spurious linear approximations that break down exactly when they are most needed — during regime changes.
Robust alternatives
When the standard PCA assumptions fail, the literature offers principled responses.
Shrinkage estimators for the covariance matrix are the first line of defence in the high-dimensional regime. Ledoit and Wolf (2004) gave a closed-form, optimally weighted shrinkage of the sample covariance toward a structured target — typically a scaled identity. The shrunk matrix has bounded condition number, well-defined inverses, and eigenvalues no longer biased toward the extremes. The cost is a small loss of fit on the data; the gain is decisive in any downstream optimisation.
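A minimal sketch of the idea, with a fixed illustrative intensity standing in for the data-driven Ledoit-Wolf weight:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 50                            # p close to n: ill-conditioned regime
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
S = X.T @ X / (n - 1)

# Shrink toward a scaled-identity target. Ledoit and Wolf derive the
# optimal weight from the data itself; alpha here is fixed for illustration.
target = (np.trace(S) / p) * np.eye(p)
alpha = 0.3
S_shrunk = (1 - alpha) * S + alpha * target

print(np.linalg.cond(S), np.linalg.cond(S_shrunk))
```

The condition number collapses by orders of magnitude, which is what makes the shrunk matrix safe to invert in a downstream optimiser. In practice, `sklearn.covariance.LedoitWolf` computes the optimal intensity directly rather than fixing it by hand.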
Probabilistic PCA (Tipping and Bishop 1999) reformulates PCA as a Gaussian latent-variable model with isotropic noise. It recovers the same principal directions in the noise-free limit and adds a likelihood, principled handling of missing data, and a basis for comparing models with different numbers of components.
Robust PCA trades efficiency for resistance to outliers — minimum covariance determinant estimators, projection pursuit, low-rank-plus-sparse decompositions. They are slower and less crisp than vanilla PCA but survive heavy-tailed contamination.
For non-linear structure, the standard escape hatch is kernel PCA — replace dot products with a kernel and work in an implicit higher-dimensional feature space. Kernel PCA recovers non-linear principal manifolds at the cost of interpretability and computational scaling.
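A hand-rolled sketch of the kernel trick on data with purely non-linear structure — a noisy circle, where no linear direction captures the geometry (the RBF bandwidth is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(4)
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(scale=0.05, size=(200, 2))

# RBF kernel matrix: implicit dot products in a high-dimensional feature space
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists)                         # gamma = 1, illustrative

# Centre the kernel matrix -- centring must happen in feature space, not input space
ones = np.full_like(K, 1.0 / len(K))
Kc = K - ones @ K - K @ ones + ones @ K @ ones

eigvals = np.linalg.eigvalsh(Kc)[::-1]
print(eigvals[:3])                            # leading kernel principal components
```

The eigenvectors of the centred kernel matrix play the role of principal components, but they live in the implicit feature space — which is precisely where the interpretability cost comes from.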
When PCA works
This is not a case against PCA. It is a case for understanding when the method is appropriate.
- Dimensionality reduction for visualisation. Plotting high-dimensional data in 2D rarely depends on the exact interpretation of the axes. PCA produces a reasonable projection.
- Preprocessing for regression. With multicollinear features, PCA can reduce dimensionality before fitting. Use cross-validation to choose the number of components — do not trust the explained-variance percentage.
- Factor decomposition with caveats. In equity markets, the first principal component often resembles a "market factor". It is not the same as the market return, and it is not stable over time. Cleaned versions — using Marchenko-Pastur thresholds or shrinkage — tend to be more durable than raw PCA.
Asking different questions
Instead of "how much variance do these components explain?", the questions worth asking are:
- Are the components stable? Run PCA on different subsets of the data. If the eigenvectors rotate significantly between subsets, the decomposition is not robust. Bootstrap confidence intervals on eigenvectors are an honest way to express this.
- Do the components predict what is downstream? If PCA feeds a trading or risk model, evaluate on that model. Explained variance is a proxy at best.
- Is the covariance estimate trustworthy? When $p/n > 0.1$, prefer shrinkage estimators or factor models over the sample covariance. When $p/n > 0.5$, the sample covariance is essentially unusable on its own.
- Where is the noise floor? Compute the Marchenko-Pastur upper edge for the working dimensions. Any component below it should be presumed noise.
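The first question above — stability — can be checked directly. A sketch on heavy-tailed synthetic data with no real factor structure: compare the leading eigenvector across two halves of the sample (sizes and distribution are illustrative).

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 400, 20
X = rng.standard_t(df=3, size=(n, p))    # heavy-tailed, no genuine factor structure

def top_eigvec(data):
    data = data - data.mean(axis=0)
    S = data.T @ data / (len(data) - 1)
    return np.linalg.eigh(S)[1][:, -1]   # eigenvector of the largest eigenvalue

v_a = top_eigvec(X[: n // 2])
v_b = top_eigvec(X[n // 2:])
alignment = abs(v_a @ v_b)               # 1 = same direction, 0 = orthogonal
print(alignment)                         # low here: PC1 is not reproducible
```

When the data contains a genuine factor, the alignment stays close to 1 across subsets; when PC1 is noise, it drops toward the value expected for two random directions.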
A practitioner's checklist
Before running PCA on financial data:
- Check stationarity. Plot rolling means, variances, and pairwise correlations. If they drift, the covariance is a moving target.
- Check the ratio $n/p$. If it is below 10, treat eigenvalue magnitudes as approximate. Below 2, assume the small eigenvalues are pure noise.
- Compute the noise floor. Marchenko-Pastur upper edge $\lambda_+ = \sigma^2(1 + \sqrt{p/n})^2$. Components below this are not signal.
- Test stability. Bootstrap or rolling windows. If $\text{PC}_1$ today is not recognisably $\text{PC}_1$ last month, it is unsafe to build a strategy on it.
- Validate on the downstream task. Explained variance is a description, not an objective.
- Consider shrinkage before raw sample covariance whenever $p/n$ is non-trivial.
When to use what
| Goal | Vanilla PCA | Shrunk-covariance PCA | Probabilistic PCA | Kernel PCA | RMT-cleaned PCA |
|---|---|---|---|---|---|
| 2D visualisation | ✓ | partial | partial | ✓ | partial |
| Multicollinear regression | ✓ | ✓ | ✓ | ✓ | partial |
| Risk model in high dimensions | — | ✓ | partial | — | ✓ |
| Missing data | — | — | ✓ | — | — |
| Non-linear structure | — | — | — | ✓ | — |
| Signal-vs-noise separation | — | partial | partial | — | ✓ |
Summary
PCA is a linear algebra operation, not a magic wand. It finds the directions of maximum variance. Whether those directions are useful depends on the data and the problem — not on the explained-variance percentage.
In financial data specifically:
- Non-stationarity makes eigenvectors unstable across time
- Heavy tails bias eigenvalue estimates upward
- High dimensionality makes the sample covariance unreliable; most eigenvalues are noise by Marchenko-Pastur
- Non-linear relationships are invisible to a linear method
"90% explained variance" is not a success metric. It is a description of how much spread has been captured — signal and noise alike. The honest question is always whether the decomposition helps with the downstream task.
References
- Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2), 245–276.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. 2nd ed. Springer. Chapter 14 on unsupervised learning, including PCA.
- Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6), 417–441 and 24(7), 498–520.
- Jolliffe, I. T. (2002). Principal Component Analysis. 2nd ed. Springer.
- Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A, 374(2065), 20150202.
- Laloux, L., Cizeau, P., Bouchaud, J.-P., & Potters, M. (1999). Noise dressing of financial correlation matrices. Physical Review Letters, 83(7), 1467–1470.
- Ledoit, O., & Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2), 365–411.
- Marchenko, V. A., & Pastur, L. A. (1967). Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4), 457–483.
- Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine, 6th series, 2(11), 559–572.
- Plerou, V., Gopikrishnan, P., Rosenow, B., Amaral, L. A. N., & Stanley, H. E. (1999). Universal and nonuniversal properties of cross correlations in financial time series. Physical Review Letters, 83(7), 1471–1474.
- Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B, 61(3), 611–622.