ML Case-study Interview Question: Hybrid Topic Modeling & Neural Networks for Customer Feedback Theme Discovery & Monitoring.
Case-Study Question
A large retailer collects text feedback from customers across multiple channels. They want an automated pipeline to group survey responses into key themes and identify unexpected or emerging issues. They have unstructured text data from multiple regions and business units. They already have internal partial labels for some responses and plan to add more labels with minimal human effort. They want a scalable system that uses topic modeling, contextual embeddings, and a neural network approach to refine discovered topics into interpretable themes that align with business needs. Propose an end-to-end solution, outlining the core models, data labeling strategies, and how you would monitor topic shifts over time to catch anomalies. Explain your full reasoning in detail and discuss how you would structure your pipeline, ensuring interpretability, human feedback loops, and minimal manual labeling.
Detailed Solution
Overview of the Pipeline
Use a multi-step pipeline with:
Text preprocessing
Latent Dirichlet Allocation (LDA) for initial topic discovery
A contextual embedding-based model (for example, BERTopic) for improved semantic coherence
A neural network projector to unify the discovered topics with partial human-labeled topics
A large language model (LLM) for few-shot labeling to reduce manual work
Monitoring and alerting mechanisms for shifts or spikes in topic frequencies
Text Preprocessing
Convert each survey response into a clean text representation. Remove stopwords, punctuation, and other noise. Tokenize the cleaned text. Store results in a document-term matrix for LDA, and maintain a token list for embedding-based methods.
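A minimal preprocessing sketch, assuming scikit-learn is available; the `responses` variable is a hypothetical stand-in for the raw survey text:

```python
# Preprocessing sketch: scikit-learn's CountVectorizer lowercases text,
# strips punctuation via its token pattern, and removes English stopwords.
from sklearn.feature_extraction.text import CountVectorizer

responses = [
    "Loved the new loyalty points program!",
    "Checkout was slow and the discount code failed.",
]

vectorizer = CountVectorizer(stop_words="english", min_df=1)
doc_term_matrix = vectorizer.fit_transform(responses)  # input for LDA

# Keep a parallel token list for embedding-based methods and coherence checks.
analyzer = vectorizer.build_analyzer()
tokenized_docs = [analyzer(doc) for doc in responses]
```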
Latent Dirichlet Allocation
Model each document as a mixture of topics. Words that appear together frequently indicate potential themes.
Parameters (plain-text notation):
d: index over documents
n: index over words in a document
z_{dn}: topic assignment for the n-th word in document d
w_{dn}: the n-th word in document d
theta_d: distribution over topics for document d
alpha, beta: Dirichlet priors on the document-topic and topic-word distributions, respectively
LDA provides an initial set of topics. However, it relies only on term frequencies and may ignore nuanced word meanings.
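A sketch of the LDA step with scikit-learn, reusing `doc_term_matrix` from the preprocessing sketch; the topic count and priors are illustrative values, not tuned ones:

```python
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(
    n_components=10,        # number of topics; tune via coherence (see follow-ups)
    doc_topic_prior=0.1,    # alpha: Dirichlet prior on document-topic distributions
    topic_word_prior=0.01,  # beta: Dirichlet prior on topic-word distributions
    random_state=42,
)
doc_topic_dist = lda.fit_transform(doc_term_matrix)  # rows approximate theta_d
```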
Contextual Embeddings (BERTopic or Similar)
Obtain vector representations for each document using a deep language model. Cluster these embeddings to generate topics that capture semantic relationships. Output a topic distribution for each document. This step resolves homonyms and polysemy better than LDA.
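A sketch assuming the open-source `bertopic` package; any embedding-plus-clustering approach works similarly, and the embedding model name is just one common choice:

```python
from bertopic import BERTopic

# calculate_probabilities=True yields a full topic distribution per document,
# which the projector network consumes in the next step.
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2",
                       calculate_probabilities=True)
topics, probs = topic_model.fit_transform(responses)

print(topic_model.get_topic_info())  # inspect topics and representative terms
```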
Neural Network Topic Projection
Train a feed-forward neural network to map LDA and embedding-based distributions onto human-labeled categories. The model learns how machine-discovered topics correspond to a curated set of known themes. It still retains the ability to surface new topics when old labels do not apply.
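A hypothetical PyTorch sketch of the projector; the topic and theme counts are placeholders. A low maximum softmax score can flag documents that fit no curated theme, which is how new topics surface:

```python
import torch
import torch.nn as nn

N_LDA_TOPICS, N_EMB_TOPICS, N_THEMES = 10, 25, 12  # placeholder dimensions

class TopicProjector(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_LDA_TOPICS + N_EMB_TOPICS, 64),
            nn.ReLU(),
            nn.Linear(64, N_THEMES),  # logits over curated themes
        )

    def forward(self, lda_dist, emb_dist):
        # Concatenate both topic distributions into one input vector.
        return self.net(torch.cat([lda_dist, emb_dist], dim=-1))

model = TopicProjector()
criterion = nn.CrossEntropyLoss()  # trained on the partially labeled subset
```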
Few-Shot Labeling with Large Language Models
Use a small set of manually labeled examples as prompt references. The LLM labels new samples under the same categorization schema, expanding labeled-data coverage. Verify out-of-distribution edge cases via active sampling: if a new theme is underrepresented or mislabeled, route it for human review. Retrain or fine-tune the projector model if labeling distributions shift.
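A prompt-construction sketch; `call_llm` is a hypothetical wrapper around whichever LLM API is in use, and the example themes are illustrative:

```python
FEW_SHOT_EXAMPLES = [
    ("The discount code never applied at checkout.", "Pricing & Promotions"),
    ("Staff at the returns desk were very helpful.", "In-Store Service"),
]

def build_prompt(new_response: str) -> str:
    shots = "\n".join(f'Response: "{t}"\nTheme: {y}' for t, y in FEW_SHOT_EXAMPLES)
    return (
        "Assign each customer response to exactly one theme "
        "from the curated schema.\n\n"
        f"{shots}\n\n"
        f'Response: "{new_response}"\nTheme:'
    )

# label = call_llm(build_prompt("My loyalty points disappeared."))  # hypothetical call
```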
Monitoring Topic Shifts and Anomalies
Aggregate topic probabilities by day or week. Flag any topic whose share exceeds a threshold or grows unusually fast in one region. Trigger alerts to investigate root causes. Compare topic proportions across timeframes to spot trends.
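A monitoring sketch with pandas: weekly theme proportions plus a simple z-score spike check. The column names and threshold are assumptions:

```python
import pandas as pd

# df holds one row per response with columns ["date", "region", "theme"],
# where "date" is a datetime column.
def weekly_theme_shares(df: pd.DataFrame) -> pd.DataFrame:
    weekly = (
        df.groupby([pd.Grouper(key="date", freq="W"), "theme"])
          .size()
          .unstack(fill_value=0)
    )
    return weekly.div(weekly.sum(axis=1), axis=0)  # proportions per week

def flag_spikes(shares: pd.DataFrame, z_thresh: float = 3.0) -> pd.Series:
    z = (shares - shares.mean()) / shares.std(ddof=0)
    return z.iloc[-1].gt(z_thresh)  # themes that spiked in the latest week
```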
Practical Example
If a new loyalty feature launches, scan feedback for spikes in topics referencing “points,” “membership,” or “discount.” The pipeline can detect that a new cluster has emerged, flag it as a new or shifted theme, and incorporate partial labels if needed.
Model Deployment
Implement the pipeline in a cloud-based environment. Use a job orchestration service to run data ingestion, modeling, labeling, and reporting. Store historical topic distributions in a database to facilitate time series analysis.
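An orchestration sketch assuming Apache Airflow (2.4+); the DAG id is hypothetical and the task callables are placeholders for the real pipeline stages:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG("feedback_topic_pipeline", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=lambda: None)
    model_topics = PythonOperator(task_id="model_topics", python_callable=lambda: None)
    llm_label = PythonOperator(task_id="llm_label", python_callable=lambda: None)
    report = PythonOperator(task_id="report", python_callable=lambda: None)
    ingest >> model_topics >> llm_label >> report  # linear dependency chain
```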
Possible Follow-Up Questions
How would you determine the optimal number of LDA topics?
Use topic coherence metrics (for example, coherence using normalized pointwise mutual information) or perplexity-based approaches. Train multiple models with varying topic counts. Select the count yielding consistent, interpretable topics without over-fragmentation. Confirm with a small manual check to see if topics align well with known categories.
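A sketch of the coherence sweep using gensim; `tokenized_docs` comes from the preprocessing step, and the candidate topic counts are arbitrary:

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

dictionary = Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

for k in (5, 10, 20, 40):
    lda_k = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                     random_state=42, passes=5)
    cm = CoherenceModel(model=lda_k, texts=tokenized_docs,
                        dictionary=dictionary, coherence="c_npmi")
    print(k, cm.get_coherence())  # pick the k with the best coherence
```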
Why use both LDA and a contextual embedding model instead of just one?
LDA captures coarse themes through word co-occurrence. Embedding methods capture semantic context. Combining both preserves classical statistical interpretability from LDA while leveraging the nuanced meaning from embeddings. The neural network projector unifies them to align with business-relevant human labels.
How do you handle mislabeled data from the few-shot LLM approach?
Review a small subset of automatically labeled samples. Identify systematic errors where the LLM fails. Collect human-labeled corrections for those specific cases. Re-prompt the LLM with more representative examples. Retrain or fine-tune the projector model to accommodate refined labels.
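A sketch of routing documents to human review; it assumes `llm_labels` are integer-encoded theme ids and `projector_probs` is the projector's softmax output, with an illustrative confidence threshold:

```python
import numpy as np

def review_queue(llm_labels, projector_probs, conf_thresh=0.6):
    preds = projector_probs.argmax(axis=1)        # projector's predicted theme
    confidence = projector_probs.max(axis=1)      # its confidence in that theme
    disagree = preds != np.asarray(llm_labels)    # LLM and projector disagree
    unsure = confidence < conf_thresh             # projector is uncertain
    return np.flatnonzero(disagree | unsure)      # indices to send to humans
```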
How do you ensure interpretability when using deep models?
Maintain transparency at each stage. For LDA, show top words in each topic. For the embedding-based model, show sample texts or key phrases that cluster together. For the projector network, document how each machine-learned topic maps onto your curated labels. Provide dashboards with both aggregated metrics and sample responses.
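A small sketch for the LDA stage, reusing the fitted scikit-learn model and vectorizer from earlier to print each topic's top words:

```python
import numpy as np

def show_topics(lda, vectorizer, top_n=10):
    vocab = np.array(vectorizer.get_feature_names_out())
    for k, weights in enumerate(lda.components_):
        top_words = vocab[np.argsort(weights)[::-1][:top_n]]
        print(f"Topic {k}: {', '.join(top_words)}")

show_topics(lda, vectorizer)
```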
How would you scale this pipeline to large volumes of text data from multiple sources?
Distribute the pipeline across multiple compute resources. Parallelize text preprocessing and embedding computation. Use a distributed environment (for example, Apache Spark or a managed cloud solution) for large-scale LDA. Store final topics and distributions in a scalable database. Monitor resource usage and latency, optimizing batch or streaming modes as needed.
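A distributed LDA sketch with PySpark MLlib; the input path and column names are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("feedback-topics").getOrCreate()
df = spark.read.parquet("s3://bucket/feedback/")  # hypothetical path; needs a "text" column

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="tokens"),
    StopWordsRemover(inputCol="tokens", outputCol="filtered"),
    CountVectorizer(inputCol="filtered", outputCol="features"),
    LDA(k=20, maxIter=20, featuresCol="features"),
])
model = pipeline.fit(df)  # topic distributions come from model.transform(df)
```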