ML Interview Q Series: How do you perform Principal Component Analysis (PCA)?
Comprehensive Explanation
Principal Component Analysis (PCA) is a dimensionality reduction method that identifies directions (principal components) in the data that capture the maximum variance. The first principal component is the direction of greatest variance in the dataset, the second principal component is orthogonal (i.e., uncorrelated) to the first and captures the next highest variance, and so on.
A standard way to perform PCA involves centering the data, computing the covariance matrix (or sometimes using Singular Value Decomposition on the data matrix directly), and then obtaining its eigenvectors (principal components) and eigenvalues (variance explained by each principal component).
Steps to Perform PCA
Start by gathering your dataset X (N rows as samples and D columns as features).
Mean-center each column of X by subtracting the mean of each feature from the respective column so the data has zero mean. Optionally, you can also standardize each feature (divide by its standard deviation) if the scales of features differ significantly.
Compute the covariance matrix of the centered (or standardized) dataset. The covariance matrix captures how each feature co-varies with every other feature.
Compute the eigenvalues and eigenvectors of this covariance matrix. Each eigenvector is a principal component direction, and the associated eigenvalue is the amount of variance captured in that direction.
Sort the eigenvalues in descending order and pick the top k eigenvectors to project your data onto a k-dimensional space that retains the most variance.
Core Mathematical Formulations
The covariance matrix Sigma of a zero-mean dataset can be expressed as:
Sigma = (1/N) * sum_{i=1}^{N} (x_i - bar{x}) (x_i - bar{x})^T
Here, N is the number of samples, x_i is the i-th sample, and bar{x} is the mean vector of the dataset (the zero vector once the data has been centered). The eigen-decomposition of Sigma gives you the principal components and the variance explained by each component.
Practical Implementation Considerations
In practice, many libraries such as scikit-learn (Python), TensorFlow, or PyTorch can perform PCA efficiently without manually coding all these steps. For example, in Python with scikit-learn:
import numpy as np
from sklearn.decomposition import PCA
# Suppose X is your data matrix of shape (N, D)
pca = PCA(n_components=2) # reduce to 2 components
X_transformed = pca.fit_transform(X)
# X_transformed now contains the data in 2D principal component space
# pca.explained_variance_ratio_ gives the variance explained by each component
Alternatively, if you use the covariance-matrix-eigen-decomposition route yourself, you might do:
import numpy as np
# X is (N, D), mean-center the data
X_centered = X - np.mean(X, axis=0)
# Covariance matrix
cov_matrix = np.cov(X_centered, rowvar=False)
# Eigen decomposition
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# Sort eigenvalues/eigenvectors in descending order
sorted_idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[sorted_idx]
eigenvectors = eigenvectors[:, sorted_idx]
# Project the data onto the top k principal components
k = 2
X_pca = np.dot(X_centered, eigenvectors[:, :k])
Benefits and Limitations
PCA can remove noise and reveal latent structure if the principal components correspond to meaningful directions in the data. However:
It is a linear method and can miss non-linear structure.
It may be sensitive to outliers and scaling of features.
With correlated real-world features, meaningful structure can remain spread across many principal components, so a small number of components may not capture it cleanly.
Potential Follow-up Questions
Can you explain the meaning of the eigenvalues and eigenvectors in PCA?
Eigenvectors of the covariance matrix represent the principal component directions (i.e., axes of maximum variance), while the corresponding eigenvalues represent the magnitude of variance captured along those directions. If you sort the eigenvalues in descending order, the largest eigenvalue corresponds to the first principal component, which is the direction along which the data has the highest variance. Subsequent eigenvalues (and associated eigenvectors) account for decreasing amounts of variance and are orthogonal to previously chosen components.
How do you decide how many principal components to keep?
A common approach is to analyze the cumulative variance explained by the eigenvalues. By summing the eigenvalues in descending order and dividing by the total sum of all eigenvalues (the total variance), you see how much of the variance each principal component captures cumulatively. You can pick the number of components k such that the cumulative explained variance exceeds some threshold (e.g., 90% or 95%). Alternatively, some domains might specify a maximum dimension or a maximum reconstruction error to remain below.
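As a minimal sketch with scikit-learn (assuming X is your (N, D) data matrix and a 95% threshold), you can pick the smallest k that reaches the threshold:
import numpy as np
from sklearn.decomposition import PCA

# Fit PCA with all components and inspect the cumulative explained variance
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches 95%
k = int(np.searchsorted(cumulative, 0.95) + 1)
scikit-learn can also do this selection for you: passing a float such as PCA(n_components=0.95) keeps just enough components to explain 95% of the variance.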
How do you handle PCA when the number of features is very large?
When the dimensionality is very large (e.g., text data with tens of thousands of features), computing the covariance matrix can be computationally expensive. One approach is to use randomized algorithms (e.g., randomized SVD) that can handle high-dimensional data efficiently. Another approach is to use iterative methods like online PCA, which processes chunks of data incrementally rather than loading the entire dataset into memory.
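As a sketch of both options with scikit-learn (assuming X is a large (N, D) matrix; the component counts and batch size are illustrative):
from sklearn.decomposition import PCA, IncrementalPCA

# Randomized solver: approximates the top components without a full decomposition
pca = PCA(n_components=50, svd_solver="randomized", random_state=0)
X_reduced = pca.fit_transform(X)

# Incremental PCA: fits on mini-batches so the full dataset never has to sit in memory
ipca = IncrementalPCA(n_components=50, batch_size=1000)
X_reduced_incremental = ipca.fit_transform(X)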
What is the difference between using PCA on the covariance matrix vs. using SVD on the data matrix?
Both approaches yield the same principal components but differ in how you implement them. Computing the covariance matrix and then performing an eigendecomposition is often straightforward for relatively small or medium-dimensional problems. However, for very high dimensions, directly using an SVD (singular value decomposition) of the centered data matrix can be more numerically stable and more computationally efficient, depending on your implementation. In fact, the principal components are the right singular vectors from the SVD of the centered data matrix, and the singular values squared (divided by N-1 for the sample covariance) correspond to the eigenvalues of the covariance matrix.
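A minimal NumPy sketch of the SVD route (assuming X is an (N, D) array); the right singular vectors match the covariance eigenvectors up to sign:
import numpy as np

X_centered = X - X.mean(axis=0)

# Thin SVD of the centered data matrix
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

components = Vt                          # rows are the principal component directions
eigenvalues = S**2 / (X.shape[0] - 1)    # matches the eigenvalues of np.cov (which divides by N-1)

# Project onto the top k components
k = 2
X_pca = X_centered @ Vt[:k].T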
How do outliers affect PCA and how might you address them?
Outliers can disproportionately influence the principal components because the covariance matrix is sensitive to extreme values. This can distort the directions of maximum variance. Possible strategies to address this include removing outliers or reducing their impact (e.g., using robust scalers or robust PCA methods that incorporate techniques like M-estimators or that minimize the L1 norm instead of the L2 norm).
How do you interpret principal components in real-world datasets?
Principal components are directions of maximal variance, but they are often linear combinations of original features that can be difficult to interpret. Domain knowledge is crucial in relating principal components back to meaningful patterns in the data. Loading vectors (the eigenvectors themselves) can be examined to see which features contribute most strongly to each principal component. In some fields, it is common to rotate the principal components using a technique like varimax rotation to achieve a more interpretable factor structure.
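As an illustrative sketch (assuming X is an (N, D) array and feature_names is a hypothetical list of its column names), you can inspect which features load most strongly on each component:
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=2).fit(X)

# Each row of components_ holds the loadings (feature weights) of one principal component
# feature_names is a hypothetical list of the D column names
for j, component in enumerate(pca.components_):
    top = np.argsort(np.abs(component))[::-1][:5]  # indices of the 5 largest |loadings|
    print(f"PC{j + 1}:", [(feature_names[i], round(component[i], 3)) for i in top])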
Can PCA be applied to non-linear data structures?
Classical PCA is inherently linear. Non-linear relationships may not be captured well. Kernel PCA or other manifold learning algorithms (e.g., t-SNE or UMAP) can be used to capture more complex relationships. Kernel PCA maps the data into a higher-dimensional feature space where linear PCA is then performed, effectively capturing some non-linear patterns if the kernel is chosen appropriately.
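For example, a minimal Kernel PCA sketch with scikit-learn (the RBF kernel and gamma value here are illustrative choices you would tune):
from sklearn.decomposition import KernelPCA

# RBF-kernel PCA: implicitly maps X into a higher-dimensional space, then does linear PCA there
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)
X_kpca = kpca.fit_transform(X)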
Below are additional follow-up questions
How do you handle missing values when performing PCA?
In many real-world scenarios, your dataset might contain missing entries. Classical PCA cannot directly work with missing data, so you must either impute or remove those samples. Simple approaches include mean imputation (replacing missing values with the mean of that feature) or dropping rows or columns with missing values entirely. However, discarding too much data may reduce statistical power or bias your analysis. More sophisticated methods include using iterative algorithms that estimate missing entries while performing PCA, often referred to as "PCA with missing data" or "EM-based PCA." When applying imputation, you must be mindful of how it might distort the true structure of your dataset. Overly simplistic or incorrect imputation can lead to misleading principal components, so balancing an accurate imputation strategy with computational feasibility is key.
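A minimal sketch of the simple-imputation route with scikit-learn (assuming missing entries in X are encoded as np.nan):
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# Mean-impute missing entries, then reduce to 2 principal components
pipeline = make_pipeline(SimpleImputer(strategy="mean"), PCA(n_components=2))
X_reduced = pipeline.fit_transform(X)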
Potential pitfalls:
Inconsistent or biased imputations can shift directions of maximum variance.
Dropping rows or columns can reduce your dataset substantially, affecting the robustness of PCA.
Advanced imputation techniques (e.g., MICE - Multiple Imputation by Chained Equations) might be computationally expensive for large datasets.
In practice, how do you track or measure the amount of variance each component captures?
You can use the explained variance ratio for each principal component to see how much of the total dataset variance is explained by that component. If the eigenvalues of the covariance matrix are lambda_1, lambda_2, ..., lambda_D in descending order, then the fraction of variance captured by the j-th principal component is lambda_j / (sum of all lambda_i). By looking at these ratios, you gain insight into how many components you need to retain to represent most of the variance.
Potential pitfalls:
Focusing only on cumulative variance explained might cause you to overlook interpretability and domain relevance.
If data is heavily skewed or has outliers, the top principal components may capture outlier-driven variance rather than the underlying structure.
In a production environment, how might you apply the same PCA transform to new data?
In production scenarios, you typically fit PCA on a training set and then apply the same principal component projections to new incoming data. That means you must store:
The mean (and possibly the standard deviation if you used feature scaling).
The principal components (eigenvectors).
When new data arrives, you subtract the stored mean and project onto the stored eigenvectors to obtain principal component scores. This is critical so that new data is embedded in the same feature space learned during training.
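A minimal sketch of this workflow with scikit-learn and joblib (X_train, X_new, and the file path are assumptions for illustration):
from sklearn.decomposition import PCA
import joblib

# Fit on training data only; the fitted object stores the mean and the components
pca = PCA(n_components=10).fit(X_train)
joblib.dump(pca, "pca.joblib")

# Later, in serving code: load the fitted PCA and apply the same centering and projection
pca_loaded = joblib.load("pca.joblib")
X_new_scores = pca_loaded.transform(X_new)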
Potential pitfalls:
Forgetting to apply the same mean-centering or scaling can drastically change the principal component space and ruin model performance.
Drifting data distributions (data drift) may render the original PCA components suboptimal over time, necessitating periodic re-training or updating of the PCA model.
Is it possible that PCA can degrade the predictive performance of a downstream model?
Yes, it is possible. Although PCA is often used for dimensionality reduction, the directions of maximum variance might not necessarily align with the directions that separate classes or predict a regression target well. For instance, if your predictive target depends on a low-variance feature, PCA might bury that feature in later components or discard it if you only keep a few top principal components. Therefore, while PCA can help reduce overfitting or computational complexity, you should always validate its effect on the downstream predictive performance.
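One way to check this is to compare cross-validated scores with and without PCA in the pipeline (X, y, the classifier, and the component count below are illustrative assumptions):
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Baseline model on the raw features vs. the same model on 10 principal components
baseline = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
with_pca = cross_val_score(
    make_pipeline(PCA(n_components=10), LogisticRegression(max_iter=1000)), X, y, cv=5
).mean()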
Potential pitfalls:
Relying solely on variance-based feature extraction can eliminate low-variance but crucial predictors.
Overly reducing the dimension (choosing a small number of components) can cause underfitting.
How do you deal with negative loadings in principal components?
When you examine the principal components, you'll see that each eigenvector can have both positive and negative coefficients (often referred to as loadings or weights). A negative loading simply means that, in that principal component, an increase in that feature’s value tends to move the data point in the opposite direction of a feature with a positive loading. There is nothing inherently problematic with negative loadings; they are part of how PCA balances directions to maximize variance.
Potential pitfalls:
Misinterpreting negative loadings as “bad” or erroneous. In reality, they are just indicative of direction.
Overinterpreting signs: a principal component can be multiplied by -1 and still represent the same axis of variance, so the sign of each component vector is arbitrary.
In which cases might correlation-based PCA (i.e., using the correlation matrix) be more appropriate than covariance-based PCA?
By default, PCA is performed on the covariance matrix, which means features with larger numerical ranges can dominate the principal components. If your features are measured on different scales (e.g., some in kilograms, others in kilometers), performing PCA on the correlation matrix (equivalent to normalizing each feature to have unit variance) may be more appropriate. This is equivalent to standardizing each feature (zero mean, unit variance) before applying covariance-based PCA.
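A minimal sketch of the correlation-based route via standardization (assuming X is an (N, D) array):
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# Standardizing each feature first makes PCA operate on the correlation structure
corr_pca = make_pipeline(StandardScaler(), PCA(n_components=2))
X_corr_pca = corr_pca.fit_transform(X)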
Potential pitfalls:
If you inadvertently scale features that shouldn’t be scaled, you might destroy meaningful proportions or relationships (e.g., in certain ratio-based or count-based data where absolute scale matters).
If the features are already on comparable scales, correlation-based PCA might add extra noise from standardizing each feature.
Why might you standardize or normalize data before PCA, and when might that be harmful?
If your features differ drastically in their ranges, the feature with the largest scale can dominate the covariance structure, causing the principal components to be biased toward capturing variance primarily from that feature. Standardizing (subtracting the mean and dividing by the standard deviation) ensures all features contribute more equally to the covariance matrix.
However, in situations where the magnitude of the features is inherently meaningful (for instance, absolute revenue vs. a normalized usage metric), standardizing can mask important signals. It might make sense to keep the data in original scale if domain knowledge suggests certain absolute measurements are crucial.
If the dataset has fewer samples than features (N << D), can we still apply PCA effectively?
Yes, though with caution. This scenario is often encountered in high-dimensional contexts such as genomics or text processing. Classical PCA via the covariance matrix can be numerically unstable and prone to overfitting because the sample covariance matrix is rank-deficient (singular) when N < D, so at most N - 1 components carry non-zero variance. Instead, SVD-based approaches or randomized PCA methods are typically more stable and computationally feasible. You can also use regularized PCA variants that impose shrinkage or constraints on the components to handle rank deficiencies more robustly.
Potential pitfalls:
Overfitting the components when you have fewer data points than features.
High computational cost and instability in eigen-decomposition of a large, sparse, or singular covariance matrix.
How do you evaluate if the PCA components are stable?
Stability refers to how sensitive the principal components are to small perturbations in the data. One practical approach is to perform PCA on bootstrapped or cross-validated subsamples of your data. If the directions and explained variances vary significantly across subsamples, that indicates instability. Another approach is to measure the angle between principal components found in different random subsets of data; a small angle means they are more consistent.
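One illustrative sketch of a bootstrap stability check for the first component (assuming X is an (N, D) array; the number of resamples is arbitrary):
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
pc1_full = PCA(n_components=1).fit(X).components_[0]

similarities = []
for _ in range(20):
    # Bootstrap resample the rows of X and refit PCA
    idx = rng.integers(0, X.shape[0], size=X.shape[0])
    pc1_boot = PCA(n_components=1).fit(X[idx]).components_[0]
    # Absolute cosine similarity; abs() handles the arbitrary sign of each component
    similarities.append(abs(np.dot(pc1_boot, pc1_full)))

# Values consistently near 1 suggest a stable first principal component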
Potential pitfalls:
If data is noisy or you have many correlated features, small changes in the dataset can lead to large changes in the orientation of principal components.
Overemphasis on the raw shape of principal components. Even if the orientation changes slightly, the subspace spanned by the top components might still capture similar variance.
Can PCA be used to detect anomalies?
Yes. One approach is to reduce data to a lower-dimensional principal component subspace and then measure the reconstruction error or the distance of a point to that subspace. If a new sample lies far from the span of the principal components that contain most of the variance, it might be considered an outlier or anomaly. Related statistics include Hotelling's T-squared (an outlier score within the principal subspace) and the Q-statistic or squared prediction error (the reconstruction error, i.e., the distance to the subspace).
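A minimal reconstruction-error sketch with scikit-learn (X_train, X_new, the component count, and the percentile threshold are illustrative assumptions):
import numpy as np
from sklearn.decomposition import PCA

# Fit on data assumed to be mostly "normal"
pca = PCA(n_components=5).fit(X_train)

# Reconstruction error = distance of each point to the principal subspace
def reconstruction_error(model, data):
    return np.linalg.norm(data - model.inverse_transform(model.transform(data)), axis=1)

# Flag new points whose error exceeds a high percentile of the training errors
threshold = np.percentile(reconstruction_error(pca, X_train), 99)
anomalies = reconstruction_error(pca, X_new) > threshold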
Potential pitfalls:
If anomalies dominate the variance in the dataset, PCA might align components to capture those outliers instead of the “normal” structure. In this case, outlier detection becomes less effective.
Defining a threshold for what distance or error qualifies as an anomaly can be subjective or domain-dependent.
How do you reconstruct the original data from the principal components, and what might be lost in this reconstruction?
After projecting your data onto the top k principal components, you can attempt to reconstruct it from those components. Let x_i be the mean-centered version of your original sample, and let e_j represent the j-th principal component (eigenvector). The reconstruction of x_i using the top k components is:
hat{x}_i = sum_{j=1}^{k} (x_i^T e_j) e_j
Where (x_i^T e_j) is the coefficient for how much x_i aligns with component e_j, and e_j is the direction vector itself.
Afterward, you usually add back the mean of your features if you subtracted it initially. Because you only keep the first k < D principal components, some variance has been intentionally discarded. That discarded variance is the information “lost” in this reconstruction, often referred to as the reconstruction error. The more principal components you keep, the lower that reconstruction error will be, but at the expense of higher dimensionality.
Potential pitfalls:
If k is too small, the reconstruction can be very poor, obscuring meaningful signals in your data.
If your data has strong non-linearities, linear reconstruction from principal components might not capture key structures.