ML Interview Q Series: How do you perform Principal Component Analysis (PCA)?
Comprehensive Explanation
Principal Component Analysis (PCA) is a dimensionality reduction method that identifies directions (principal components) in the data that capture the maximum variance. The first principal component is the direction of greatest variance in the dataset, the second principal component is orthogonal (i.e., uncorrelated) to the first and captures the next highest variance, and so on.
A standard way to perform PCA involves centering the data, computing the covariance matrix (or sometimes using Singular Value Decomposition on the data matrix directly), and then obtaining its eigenvectors (principal components) and eigenvalues (variance explained by each principal component).
Steps to Perform PCA
Start by gathering your dataset X (N rows as samples and D columns as features).
Mean-center each column of X by subtracting the mean of each feature from the respective column so the data has zero mean. Optionally, you can also standardize each feature (divide by its standard deviation) if the scales of features differ significantly.
Compute the covariance matrix of the centered (or standardized) dataset. The covariance matrix captures how each feature co-varies with every other feature.
Compute the eigenvalues and eigenvectors of this covariance matrix. Each eigenvector is a principal component direction, and the associated eigenvalue is the amount of variance captured in that direction.
Sort the eigenvalues in descending order and pick the top k eigenvectors to project your data onto a k-dimensional space that retains the most variance.
Core Mathematical Formulations
The covariance matrix Sigma of a zero-mean dataset can be expressed as:
Sigma = (1/N) * sum_{i=1}^{N} (x_i - bar{x}) (x_i - bar{x})^T
Here, N is the number of samples, x_i is the i-th sample, and bar{x} is the mean vector of the dataset (the zero vector once the data has been centered). The eigen-decomposition of Sigma gives you the principal components and the variance explained by each component.
Practical Implementation Considerations
In practice, many libraries such as scikit-learn (Python), TensorFlow, or PyTorch can perform PCA efficiently without manually coding all these steps. For example, in Python with scikit-learn:
import numpy as np
from sklearn.decomposition import PCA
# Suppose X is your data matrix of shape (N, D)
pca = PCA(n_components=2) # reduce to 2 components
X_transformed = pca.fit_transform(X)
# X_transformed now contains the data in 2D principal component space
# pca.explained_variance_ratio_ gives the variance explained by each component
Alternatively, if you use the covariance-matrix-eigen-decomposition route yourself, you might do:
import numpy as np
# X is (N, D), mean-center the data
X_centered = X - np.mean(X, axis=0)
# Covariance matrix
cov_matrix = np.cov(X_centered, rowvar=False)
# Eigen decomposition
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# Sort eigenvalues/eigenvectors in descending order
sorted_idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[sorted_idx]
eigenvectors = eigenvectors[:, sorted_idx]
# Project the data onto the top k principal components
k = 2
X_pca = np.dot(X_centered, eigenvectors[:, :k])
Benefits and Limitations
PCA can remove noise and reveal latent structure if the principal components correspond to meaningful directions in the data. However:
It is a linear method and can miss non-linear structure.
It may be sensitive to outliers and scaling of features.
With correlated real-world features, meaningful structure can remain spread across many principal components, so a small number of components may not capture it cleanly.
Potential Follow-up Questions
Can you explain the meaning of the eigenvalues and eigenvectors in PCA?
Eigenvectors of the covariance matrix represent the principal component directions (i.e., axes of maximum variance), while the corresponding eigenvalues represent the magnitude of variance captured along those directions. If you sort the eigenvalues in descending order, the largest eigenvalue corresponds to the first principal component, which is the direction along which the data has the highest variance. Subsequent eigenvalues (and associated eigenvectors) account for decreasing amounts of variance and are orthogonal to previously chosen components.
How do you decide how many principal components to keep?
A common approach is to analyze the cumulative variance explained by the eigenvalues. By summing the eigenvalues in descending order and dividing by the total sum of all eigenvalues (the total variance), you see how much of the variance each principal component captures cumulatively. You can pick the number of components k such that the cumulative explained variance exceeds some threshold (e.g., 90% or 95%). Alternatively, some domains might specify a maximum dimension or a maximum reconstruction error to remain below.
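As a minimal sketch with scikit-learn (assuming X is your (N, D) data matrix and a 95% threshold), you can pick the smallest k that reaches the threshold:
import numpy as np
from sklearn.decomposition import PCA

# Fit PCA with all components and inspect the cumulative explained variance
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches 95%
k = int(np.searchsorted(cumulative, 0.95) + 1)
scikit-learn can also do this selection for you: passing a float such as PCA(n_components=0.95) keeps just enough components to explain 95% of the variance.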
How do you handle PCA when the number of features is very large?
When the dimensionality is very large (e.g., text data with tens of thousands of features), computing the covariance matrix can be computationally expensive. One approach is to use randomized algorithms (e.g., randomized SVD) that can handle high-dimensional data efficiently. Another approach is to use iterative methods like online PCA, which processes chunks of data incrementally rather than loading the entire dataset into memory.
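As a sketch of both options with scikit-learn (assuming X is a large (N, D) matrix; the component counts and batch size are illustrative):
from sklearn.decomposition import PCA, IncrementalPCA

# Randomized solver: approximates the top components without a full decomposition
pca = PCA(n_components=50, svd_solver="randomized", random_state=0)
X_reduced = pca.fit_transform(X)

# Incremental PCA: fits on mini-batches so the full dataset never has to sit in memory
ipca = IncrementalPCA(n_components=50, batch_size=1000)
X_reduced_incremental = ipca.fit_transform(X)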
What is the difference between using PCA on the covariance matrix vs. using SVD on the data matrix?
Both approaches yield the same principal components but differ in how you implement them. Computing the covariance matrix and then performing an eigendecomposition is often straightforward for relatively small or medium-dimensional problems. However, for very high dimensions, directly using an SVD (singular value decomposition) of the centered data matrix can be more numerically stable and more computationally efficient, depending on your implementation. In fact, the principal components are the right singular vectors from the SVD of the centered data matrix, and the singular values squared (divided by N-1 for the sample covariance) correspond to the eigenvalues of the covariance matrix.
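A minimal NumPy sketch of the SVD route (assuming X is an (N, D) array); the right singular vectors match the covariance eigenvectors up to sign:
import numpy as np

X_centered = X - X.mean(axis=0)

# Thin SVD of the centered data matrix
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

components = Vt                          # rows are the principal component directions
eigenvalues = S**2 / (X.shape[0] - 1)    # matches the eigenvalues of np.cov (which divides by N-1)

# Project onto the top k components
k = 2
X_pca = X_centered @ Vt[:k].T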
How do outliers affect PCA and how might you address them?
Outliers can disproportionately influence the principal components because the covariance matrix is sensitive to extreme values. This can distort the directions of maximum variance. Possible strategies to address this include removing outliers or reducing their impact (e.g., using robust scalers or robust PCA methods that incorporate techniques like M-estimators or that minimize the L1 norm instead of the L2 norm).
How do you interpret principal components in real-world datasets?
Principal components are directions of maximal variance, but they are often linear combinations of original features that can be difficult to interpret. Domain knowledge is crucial in relating principal components back to meaningful patterns in the data. Loading vectors (the eigenvectors themselves) can be examined to see which features contribute most strongly to each principal component. In some fields, it is common to rotate the principal components using a technique like varimax rotation to achieve a more interpretable factor structure.
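As an illustrative sketch (assuming X is an (N, D) array and feature_names is a hypothetical list of its column names), you can inspect which features load most strongly on each component:
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=2).fit(X)

# Each row of components_ holds the loadings (feature weights) of one principal component
# feature_names is a hypothetical list of the D column names
for j, component in enumerate(pca.components_):
    top = np.argsort(np.abs(component))[::-1][:5]  # indices of the 5 largest |loadings|
    print(f"PC{j + 1}:", [(feature_names[i], round(component[i], 3)) for i in top])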
Can PCA be applied to non-linear data structures?
Classical PCA is inherently linear. Non-linear relationships may not be captured well. Kernel PCA or other manifold learning algorithms (e.g., t-SNE or UMAP) can be used to capture more complex relationships. Kernel PCA maps the data into a higher-dimensional feature space where linear PCA is then performed, effectively capturing some non-linear patterns if the kernel is chosen appropriately.
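For example, a minimal Kernel PCA sketch with scikit-learn (the RBF kernel and gamma value here are illustrative choices you would tune):
from sklearn.decomposition import KernelPCA

# RBF-kernel PCA: implicitly maps X into a higher-dimensional space, then does linear PCA there
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)
X_kpca = kpca.fit_transform(X)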
Below are additional follow-up questions
How do you handle missing values when performing PCA?
In many real-world scenarios, your dataset might contain missing entries. Classical PCA cannot directly work with missing data, so you must either impute or remove those samples. Simple approaches include mean imputation (replacing missing values with the mean of that feature) or dropping rows or columns with missing values entirely. However, discarding too much data may reduce statistical power or bias your analysis. More sophisticated methods include using iterative algorithms that estimate missing entries while performing PCA, often referred to as "PCA with missing data" or "EM-based PCA." When applying imputation, you must be mindful of how it might distort the true structure of your dataset. Overly simplistic or incorrect imputation can lead to misleading principal components, so balancing an accurate imputation strategy with computational feasibility is key.
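A minimal sketch of the simple-imputation route with scikit-learn (assuming missing entries in X are encoded as np.nan):
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# Mean-impute missing entries, then reduce to 2 principal components
pipeline = make_pipeline(SimpleImputer(strategy="mean"), PCA(n_components=2))
X_reduced = pipeline.fit_transform(X)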
Potential pitfalls:
Inconsistent or biased imputations can shift directions of maximum variance.
Dropping rows or columns can reduce your dataset substantially, affecting the robustness of PCA.
Advanced imputation techniques (e.g., MICE - Multiple Imputation by Chained Equations) might be computationally expensive for large datasets.
In practice, how do you track or measure the amount of variance each component captures?
You can use the explained variance ratio for each principal component to see how much of the total dataset variance is explained by that component. If the eigenvalues of the covariance matrix are lambda_1, lambda_2, ..., lambda_D in descending order, then the fraction of variance captured by the j-th principal component is lambda_j / (sum of all lambda_i). By looking at these ratios, you gain insight into how many components you need to retain to represent most of the variance.
Potential pitfalls:
Focusing only on cumulative variance explained might cause you to overlook interpretability and domain relevance.
If data is heavily skewed or has outliers, the top principal components may capture outlier-driven variance rather than the underlying structure.
In a production environment, how might you apply the same PCA transform to new data?
In production scenarios, you typically fit PCA on a training set and then apply the same principal component projections to new incoming data. That means you must store:
The mean (and possibly the standard deviation if you used feature scaling).
The principal components (eigenvectors).
When new data arrives, you subtract the stored mean and project onto the stored eigenvectors to obtain principal component scores. This is critical so that new data is embedded in the same feature space learned during training.
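A minimal sketch of this workflow with scikit-learn and joblib (X_train, X_new, and the file path are assumptions for illustration):
from sklearn.decomposition import PCA
import joblib

# Fit on training data only; the fitted object stores the mean and the components
pca = PCA(n_components=10).fit(X_train)
joblib.dump(pca, "pca.joblib")

# Later, in serving code: load the fitted PCA and apply the same centering and projection
pca_loaded = joblib.load("pca.joblib")
X_new_scores = pca_loaded.transform(X_new)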
Potential pitfalls:
Forgetting to apply the same mean-centering or scaling can drastically change the principal component space and ruin model performance.
Drifting data distributions (data drift) may render the original PCA components suboptimal over time, necessitating periodic re-training or updating of the PCA model.
Is it possible that PCA can degrade the predictive performance of a downstream model?
Yes, it is possible. Although PCA is often used for dimensionality reduction, the directions of maximum variance might not necessarily align with the directions that separate classes or predict a regression target well. For instance, if your predictive target depends on a low-variance feature, PCA might bury that feature in later components or discard it if you only keep a few top principal components. Therefore, while PCA can help reduce overfitting or computational complexity, you should always validate its effect on the downstream predictive performance.
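One way to check this is to compare cross-validated scores with and without PCA in the pipeline (X, y, the classifier, and the component count below are illustrative assumptions):
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Baseline model on the raw features vs. the same model on 10 principal components
baseline = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
with_pca = cross_val_score(
    make_pipeline(PCA(n_components=10), LogisticRegression(max_iter=1000)), X, y, cv=5
).mean()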
Potential pitfalls:
Relying solely on variance-based feature extraction can eliminate low-variance but crucial predictors.
Overly reducing the dimension (choosing a small number of components) can cause underfitting.
How do you deal with negative loadings in principal components?
When you examine the principal components, you'll see that each eigenvector can have both positive and negative coefficients (often referred to as loadings or weights). A negative loading simply means that, in that principal component, an increase in that feature’s value tends to move the data point in the opposite direction of a feature with a positive loading. There is nothing inherently problematic with negative loadings; they are part of how PCA balances directions to maximize variance.
Potential pitfalls:
Misinterpreting negative loadings as “bad” or erroneous. In reality, they are just indicative of direction.
Overinterpreting signs: a principal component can be multiplied by -1 and still represent the same axis of variance, so the sign of each component vector is arbitrary.
In which cases might correlation-based PCA (i.e., using the correlation matrix) be more appropriate than covariance-based PCA?
By default, PCA is performed on the covariance matrix, which means features with larger numerical ranges can dominate the principal components. If your features are measured on different scales (e.g., some in kilograms, others in kilometers), performing PCA on the correlation matrix (equivalent to normalizing each feature to have unit variance) may be more appropriate. This is equivalent to standardizing each feature (zero mean, unit variance) before applying covariance-based PCA.
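A minimal sketch of the correlation-based route via standardization (assuming X is an (N, D) array):
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# Standardizing each feature first makes PCA operate on the correlation structure
corr_pca = make_pipeline(StandardScaler(), PCA(n_components=2))
X_corr_pca = corr_pca.fit_transform(X)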
Potential pitfalls:
If you inadvertently scale features that shouldn’t be scaled, you might destroy meaningful proportions or relationships (e.g., in certain ratio-based or count-based data where absolute scale matters).
If the features are already on comparable scales, correlation-based PCA might add extra noise from standardizing each feature.
Why might you standardize or normalize data before PCA, and when might that be harmful?
If your features differ drastically in their ranges, the feature with the largest scale can dominate the covariance structure, causing the principal components to be biased toward capturing variance primarily from that feature. Standardizing (subtracting the mean and dividing by the standard deviation) ensures all features contribute more equally to the covariance matrix.
However, in situations where the magnitude of the features is inherently meaningful (for instance, absolute revenue vs. a normalized usage metric), standardizing can mask important signals. It might make sense to keep the data in original scale if domain knowledge suggests certain absolute measurements are crucial.
If the dataset has fewer samples than features (N << D), can we still apply PCA effectively?
Yes, though with caution. This scenario is often encountered in high-dimensional contexts such as genomics or text processing. Classical PCA via the covariance matrix can be numerically unstable and prone to overfitting because the sample covariance matrix is rank-deficient (singular) when N < D, so at most N - 1 components carry non-zero variance. Instead, SVD-based approaches or randomized PCA methods are typically more stable and computationally feasible. You can also use regularized PCA variants that impose shrinkage or constraints on the components to handle rank deficiencies more robustly.
Potential pitfalls:
Overfitting the components when you have fewer data points than features.
High computational cost and instability in eigen-decomposition of a large, sparse, or singular covariance matrix.
How do you evaluate if the PCA components are stable?
Stability refers to how sensitive the principal components are to small perturbations in the data. One practical approach is to perform PCA on bootstrapped or cross-validated subsamples of your data. If the directions and explained variances vary significantly across subsamples, that indicates instability. Another approach is to measure the angle between principal components found in different random subsets of data; a small angle means they are more consistent.
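One illustrative sketch of a bootstrap stability check for the first component (assuming X is an (N, D) array; the number of resamples is arbitrary):
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
pc1_full = PCA(n_components=1).fit(X).components_[0]

similarities = []
for _ in range(20):
    # Bootstrap resample the rows of X and refit PCA
    idx = rng.integers(0, X.shape[0], size=X.shape[0])
    pc1_boot = PCA(n_components=1).fit(X[idx]).components_[0]
    # Absolute cosine similarity; abs() handles the arbitrary sign of each component
    similarities.append(abs(np.dot(pc1_boot, pc1_full)))

# Values consistently near 1 suggest a stable first principal component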
Potential pitfalls:
If data is noisy or you have many correlated features, small changes in the dataset can lead to large changes in the orientation of principal components.
Overemphasis on the raw shape of principal components. Even if the orientation changes slightly, the subspace spanned by the top components might still capture similar variance.
Can PCA be used to detect anomalies?
Yes. One approach is to reduce data to a lower-dimensional principal component subspace and then measure the reconstruction error or the distance of a point to that subspace. If a new sample lies far from the span of the principal components that contain most of the variance, it might be considered an outlier or anomaly. Related statistics include Hotelling's T-squared (an outlier score within the principal subspace) and the Q-statistic or squared prediction error (the reconstruction error, i.e., the distance to the subspace).
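A minimal reconstruction-error sketch with scikit-learn (X_train, X_new, the component count, and the percentile threshold are illustrative assumptions):
import numpy as np
from sklearn.decomposition import PCA

# Fit on data assumed to be mostly "normal"
pca = PCA(n_components=5).fit(X_train)

# Reconstruction error = distance of each point to the principal subspace
def reconstruction_error(model, data):
    return np.linalg.norm(data - model.inverse_transform(model.transform(data)), axis=1)

# Flag new points whose error exceeds a high percentile of the training errors
threshold = np.percentile(reconstruction_error(pca, X_train), 99)
anomalies = reconstruction_error(pca, X_new) > threshold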
Potential pitfalls:
If anomalies dominate the variance in the dataset, PCA might align components to capture those outliers instead of the “normal” structure. In this case, outlier detection becomes less effective.
Defining a threshold for what distance or error qualifies as an anomaly can be subjective or domain-dependent.
How do you reconstruct the original data from the principal components, and what might be lost in this reconstruction?
After projecting your data onto the top k principal components, you can attempt to reconstruct it from those components. Let x_i be the mean-centered version of your original sample, and let e_j represent the j-th principal component (eigenvector). The reconstruction of x_i using the top k components is:
hat{x}_i = sum_{j=1}^{k} (x_i^T e_j) e_j
Where (x_i^T e_j) is the coefficient for how much x_i aligns with component e_j, and e_j is the direction vector itself.
Afterward, you usually add back the mean of your features if you subtracted it initially. Because you only keep the first k < D principal components, some variance has been intentionally discarded. That discarded variance is the information “lost” in this reconstruction, often referred to as the reconstruction error. The more principal components you keep, the lower that reconstruction error will be, but at the expense of higher dimensionality.
Potential pitfalls:
If k is too small, the reconstruction can be very poor, obscuring meaningful signals in your data.
If your data has strong non-linearities, linear reconstruction from principal components might not capture key structures.