ML Interview Q Series: Under what circumstances would you prefer to use manifold learning approaches instead of PCA?
Comprehensive Explanation
Manifold learning and PCA are both dimensionality reduction strategies, but they rely on different assumptions about the data. PCA is a linear technique that identifies directions of maximal variance in the data via an eigenvalue decomposition or SVD of the covariance matrix. This approach assumes that the data lies (at least approximately) on a linear subspace. Manifold learning, on the other hand, aims to uncover nonlinear low-dimensional structures (manifolds) that could be embedded in a higher-dimensional space.
When dealing with data that clearly exhibits nonlinear relationships—structures like curved manifolds, complex surfaces, or data that is naturally arranged in a nonlinear fashion—manifold learning methods such as t-SNE, UMAP, Isomap, or LLE can often preserve the data’s intrinsic geometry more effectively than PCA. In many real-world tasks, images, text embeddings, or other high-dimensional representations may occupy complicated manifolds, and purely linear projections (like those from PCA) might not capture these non-linear characteristics well.
Manifold learning algorithms preserve either local or global distances in ways that reflect nonlinear data relationships. They are often better at creating visually meaningful two- or three-dimensional embeddings for complex datasets. However, these methods can be more computationally expensive, may require careful hyperparameter tuning (such as selecting the number of neighbors in LLE or Isomap, or the perplexity in t-SNE), and do not always offer a straightforward way to transform out-of-sample data points once the embedding is learned.
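To make the contrast concrete, here is a minimal sketch (assuming scikit-learn and the classic swiss-roll toy dataset; the neighbor count and plotting choices are illustrative, not a prescription) that reduces the same nonlinear 3D data with PCA and with Isomap:

```python
# Contrast a linear projection (PCA) with a nonlinear manifold method (Isomap)
# on the swiss-roll dataset, where the data lies on a curved 2D surface.
import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

# Generate a 3D swiss roll; `color` encodes position along the roll.
X, color = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)

# Linear projection: keep the two directions of largest variance.
X_pca = PCA(n_components=2).fit_transform(X)

# Nonlinear embedding: approximate geodesic distances along the manifold.
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=color, s=5)
axes[0].set_title("PCA (linear)")
axes[1].scatter(X_iso[:, 0], X_iso[:, 1], c=color, s=5)
axes[1].set_title("Isomap (manifold)")
plt.show()
```

On a toy example like this, the linear projection overlaps distant parts of the roll, while the manifold method "unrolls" the surface into a faithful 2D layout.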
Key Mathematical Formulation for t-SNE
One of the well-known manifold learning algorithms is t-SNE. It aims to minimize the divergence between probability distributions defined in the high-dimensional space and in the low-dimensional space. The core cost function C that t-SNE optimizes is expressed as a Kullback-Leibler (KL) divergence:

C = KL(P || Q) = sum over i != j of p_{ij} log( p_{ij} / q_{ij} )
Here, p_{ij} is the pairwise similarity of data points i and j in the original high-dimensional space, and q_{ij} is the corresponding pairwise similarity in the low-dimensional embedding. The objective is to find an embedding that makes q_{ij} reflect p_{ij} as closely as possible. Since t-SNE handles distances in a nonlinear fashion (through Gaussian-based similarities and a student-t distribution in the low-dimensional space), it is particularly well-suited for unveiling complex manifold structures in data.
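As a usage sketch (assuming scikit-learn's TSNE on the digits dataset; the perplexity value and initialization here are illustrative choices, not recommendations):

```python
# Minimal t-SNE example: perplexity roughly controls the effective
# neighborhood size used when computing the p_{ij} similarities.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_2d = tsne.fit_transform(X)   # minimizes KL(P || Q) by gradient descent

print(X_2d.shape)              # (1797, 2)
print(tsne.kl_divergence_)     # final value of the cost function C
```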
Potential Follow-up Questions
How is PCA mathematically formulated, and what are its assumptions?
PCA can be derived via an eigen-decomposition of the covariance matrix. Consider a dataset X of dimension N x D, where N is the number of examples and D is the dimensionality of each example. After mean-centering, PCA forms the covariance matrix (proportional to X^T X) and performs an eigen-decomposition of it, or equivalently an SVD of the centered data matrix, to find the principal components.
In PCA, we assume data lies near a linear subspace. This means that the directions of maximal variance are sufficient to describe the underlying structure of the data. If the data is distributed in a curved manifold, PCA’s linear projection might not preserve essential structure.
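A small NumPy sketch of this formulation, using a randomly generated matrix purely for illustration:

```python
# PCA via SVD of the mean-centered data matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # N x D data matrix

Xc = X - X.mean(axis=0)                # mean-center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt[:2]                    # top-2 principal directions (rows)
X_proj = Xc @ components.T             # project data onto those directions

explained_var = S**2 / (len(X) - 1)    # eigenvalues of the covariance matrix
print(explained_var[:2])
```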
Is manifold learning always better than PCA?
Manifold learning is not universally better. While it is powerful for uncovering complex, nonlinear structures, it has limitations:
• It is computationally more expensive.
• It usually requires tuning hyperparameters such as the number of neighbors, the perplexity, or min_dist.
• It is more prone to overfitting or distorted embeddings if parameters are not chosen carefully.
• Out-of-sample extension is often not straightforward.
PCA is simpler, often faster, and offers clear interpretability of each principal component. For many tasks where the data behaves linearly (or nearly so), PCA may be sufficient and is easier to deploy.
How do you handle new data points with manifold learning methods?
In PCA, it is straightforward to project a new data point onto the pre-computed principal components. In manifold learning (LLE, Isomap, t-SNE, UMAP), a new sample is not automatically embedded unless the algorithm has a specific out-of-sample extension method. Techniques like landmark Isomap or using precomputed nearest neighbors can help approximate the embedding for new samples. Alternatively, one might train an auxiliary model, such as a neural network, to learn the mapping from the original space to the low-dimensional space after manifold learning is done on a training set.
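As an illustration of the auxiliary-model idea, here is a hedged sketch that fits t-SNE on a training split and then trains a small neural-network regressor to approximate the mapping for unseen points; the model choice and hyperparameters are assumptions for demonstration only.

```python
# Out-of-sample approximation: learn the embedding on training data,
# then fit a regressor that maps raw features to embedding coordinates.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

X, _ = load_digits(return_X_y=True)
X_train, X_new = train_test_split(X, test_size=0.2, random_state=0)

# Learn the 2D embedding on the training set only.
emb_train = TSNE(n_components=2, random_state=0).fit_transform(X_train)

# Auxiliary model approximates the (nonlinear) mapping X -> embedding.
mapper = MLPRegressor(hidden_layer_sizes=(128,), max_iter=2000, random_state=0)
mapper.fit(X_train, emb_train)

# Approximate embedding coordinates for previously unseen points.
emb_new = mapper.predict(X_new)
print(emb_new.shape)   # (n_new_samples, 2)
```

By contrast, some implementations (for example UMAP in the umap-learn package, or scikit-learn's Isomap and LLE) expose a transform step for new data directly.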
Why might local distance preservation matter more in certain datasets?
Many manifold learning techniques (e.g., LLE, t-SNE, UMAP) rely on preserving local neighborhood relationships instead of focusing on large global distances. If the data forms complicated clusters or non-linear neighborhoods, preserving local structure is critical for a meaningful low-dimensional representation. For example, in image recognition or natural language processing embeddings, the distances between very distant points might not be as informative as the relationships between close points in high-dimensional space.
Does manifold learning guarantee a unique embedding?
Most manifold learning methods do not guarantee a globally unique solution, especially if the objective is non-convex. Techniques like t-SNE may produce different results depending on initialization, random seeds, and hyperparameters (such as perplexity). Consequently, interpreting each axis in the 2D or 3D output might be less meaningful. The primary goal in manifold learning is to cluster similar points together and separate dissimilar ones, rather than produce a unique coordinate system as PCA does.
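A quick sketch of this sensitivity, assuming scikit-learn's TSNE and the digits dataset: two runs that differ only in the random seed generally give different coordinates even when the cluster structure looks similar.

```python
# Demonstrate that t-SNE embeddings depend on initialization / random seed.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)

emb_a = TSNE(n_components=2, init="random", random_state=0).fit_transform(X)
emb_b = TSNE(n_components=2, init="random", random_state=1).fit_transform(X)

# The coordinate systems differ, even if clusters are qualitatively similar.
print(np.allclose(emb_a, emb_b))  # typically False
```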
How do you choose between t-SNE, UMAP, LLE, or Isomap?
The choice depends on several factors:
• t-SNE is widely used for visualization in 2D/3D; it is good at creating visually separated clusters but is sensitive to hyperparameters (perplexity) and can lose global structure.
• UMAP is often faster than t-SNE and can preserve both local and some global structure reasonably well.
• LLE preserves local linear relationships and can be easier to interpret, but might not work well if the manifold is highly non-linear.
• Isomap preserves geodesic distances on a manifold but can struggle when the manifold is not truly connected or has complex curvature.
In practice, you often have to experiment with different methods and tune parameters. The choice is guided by dataset properties (size, dimension, geometry) and the downstream objective (visualization vs. capturing local distances for clustering).
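A hedged sketch of such experimentation, running several scikit-learn manifold learners on the same dataset (UMAP lives in the separate umap-learn package, so it is only noted in a comment; the neighbor counts and perplexity are placeholder values):

```python
# Fit several manifold learning methods on the same data for comparison.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE, Isomap, LocallyLinearEmbedding

X, y = load_digits(return_X_y=True)

models = {
    "t-SNE": TSNE(n_components=2, perplexity=30, random_state=0),
    "Isomap": Isomap(n_neighbors=10, n_components=2),
    "LLE": LocallyLinearEmbedding(n_neighbors=10, n_components=2, random_state=0),
    # "UMAP": umap.UMAP(n_neighbors=15, min_dist=0.1)  # requires umap-learn
}

for name, model in models.items():
    Z = model.fit_transform(X)
    print(name, Z.shape)
```

Each embedding can then be inspected visually or scored against the downstream objective (e.g., clustering quality).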
Are there any considerations regarding data preprocessing before manifold learning?
It is often beneficial to:
• Scale or normalize features to ensure no single dimension dominates distance computations.
• Remove outliers if they significantly distort local neighborhood relationships.
• Possibly do a preliminary dimensionality reduction with PCA to reduce computational cost, especially if the initial dimension is extremely large; some practitioners run manifold learning on the top principal components, as sketched below.
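A common recipe along these lines, sketched with scikit-learn (the 50-component cutoff is an assumption, not a rule):

```python
# Scale the features, reduce to ~50 principal components, then run t-SNE
# on the compressed representation to cut the cost of neighbor computations.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)

X_scaled = StandardScaler().fit_transform(X)           # comparable feature scales
X_pca = PCA(n_components=50).fit_transform(X_scaled)   # cheap linear pre-reduction
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X_pca)

print(X_2d.shape)
```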
How does interpretability differ between PCA and manifold learning?
PCA produces principal components, which can be interpreted by examining the loadings (weights) each original feature contributes to a principal component. With manifold learning methods, the resulting coordinates often do not have a direct linear interpretation in terms of the original features. They are primarily meaningful in how they preserve local or global distances. If you need direct feature-level interpretability, PCA or other linear methods might be more suitable.
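For example, with scikit-learn the loadings are exposed via the components_ attribute, as in this small sketch on the iris dataset:

```python
# Inspect the weight each original feature contributes to each principal component.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

data = load_iris()
pca = PCA(n_components=2).fit(data.data)

for i, component in enumerate(pca.components_):
    weights = dict(zip(data.feature_names, component.round(3)))
    print(f"PC{i + 1} loadings:", weights)
```

No comparable feature-level readout exists for the coordinates produced by t-SNE or UMAP.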
Could you apply manifold learning for large-scale datasets?
Manifold learning can be computationally heavy for large datasets, because it often requires constructing a nearest neighbor graph or calculating pairwise distances. Techniques such as approximate nearest neighbors, mini-batch approximations, or GPU-accelerated libraries can help. Methods like UMAP are generally more scalable than t-SNE, although they still might become expensive if the dataset is extremely large. Also, memory usage is a consideration, since we typically store distances or probabilities for all pairs of samples.
How do you determine the dimensionality for the embedding?
In PCA, the number of components can be chosen based on the explained variance. For manifold learning, there is no straightforward variance criterion. Common practice is to embed into 2D or 3D for visualization purposes. If you are using the reduced dimension for further downstream modeling, you might need to try multiple dimensions and evaluate performance on a validation set or use reconstruction error (for those methods that allow it).
Manifold learning can sometimes reveal the data's intrinsic dimensionality. For instance, if the manifold is truly a two-dimensional surface folded into a higher-dimensional space, exploring different embedding dimensions can help confirm whether an embedding dimension of 2 or 3 captures the underlying structure.
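A short sketch of the explained-variance criterion for PCA (the 95% threshold is a conventional but arbitrary choice); for manifold methods, the analogous step is usually a sweep over candidate embedding dimensions evaluated on a validation task.

```python
# Choose the number of PCA components needed to retain 95% of the variance.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

pca = PCA().fit(X)                                    # keep all components
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.95) + 1)

print("Components needed for 95% variance:", n_components)
```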