ML Case-study Interview Question: Real-Time Fashion Recommendations: Leveraging Vector Search and Metadata Filters
Case-Study question
A leading online second-hand fashion marketplace wants to develop a multi-stage recommendation system that surfaces relevant item listings on user homepages. They start with an approximate nearest neighbor retrieval approach using vector embeddings derived from a two-tower model. The goal is to handle user preferences, allow filtering on item metadata, and keep retrieval latency low. They initially try a known vector similarity library but face challenges with real-time updates, metadata filtering, and overall throughput. They explore different vector search engines (including one that they already use for text-based search) and eventually decide on another open-source engine that handles both dense and sparse retrieval well. They run extensive benchmarks, discover performance differences, and then run an AB test to compare approximate versus exact retrieval.
Describe how you would design and implement this recommendation system. Show how you would evaluate various vector search engines, support metadata-based filtering, and maintain minimal latency. Propose your end-to-end architecture (including training and deployment). Outline steps to measure business impact, including user satisfaction and engagement. Explain how you would handle iterative experimentation, including AB tests comparing approximate search to exact search.
Detailed solution
A recommendation system for second-hand fashion listings can combine embedding-based matching, metadata filtering, and real-time updates. The system can load user preferences, generate item embeddings, and filter items based on brand, size, or price. The design must handle fast indexing of new listings and removal of sold items. A multi-stage pipeline fits well.
The first stage retrieves potentially relevant items using approximate nearest neighbor methods. Storing item embeddings in a vector search engine enables quick retrieval. A two-tower model outputs user embeddings and item embeddings, and the distance between them represents relevance. Low retrieval latency is paramount, and the engine must also support prefiltering on metadata for user-specified conditions. Real-time updates let new items appear quickly and let sold items disappear. Finally, subsequent stages refine the retrieved set.
Two-tower embedding model
One tower produces listing embeddings from metadata like brand, price, and size. Another tower produces user embeddings from their past interactions. A distance metric, such as cosine similarity or Euclidean distance, measures user-item affinity. A typical L2 distance can be expressed as:
d(U, L) = ||U − L||_2 = sqrt( Σ_i (U_i − L_i)^2 )
Here U is the user embedding and L is the listing embedding. The model is trained so that smaller distances indicate higher affinity.
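As a concrete illustration, below is a minimal PyTorch sketch of such a two-tower architecture. The layer sizes, feature handling, and normalization choices are illustrative assumptions rather than the marketplace's actual model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Tower(nn.Module):
    """Maps a raw feature vector to an L2-normalized embedding."""

    def __init__(self, input_dim: int, embedding_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, embedding_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-length embeddings


class TwoTowerModel(nn.Module):
    """User tower + listing tower; affinity is the L2 distance between their outputs."""

    def __init__(self, user_dim: int, listing_dim: int, embedding_dim: int = 64):
        super().__init__()
        self.user_tower = Tower(user_dim, embedding_dim)
        self.listing_tower = Tower(listing_dim, embedding_dim)

    def forward(self, user_feats: torch.Tensor, listing_feats: torch.Tensor) -> torch.Tensor:
        u = self.user_tower(user_feats)        # (batch, embedding_dim)
        l = self.listing_tower(listing_feats)  # (batch, embedding_dim)
        return torch.linalg.norm(u - l, dim=-1)  # smaller distance = higher affinity
```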
Vector search engine evaluation
Evaluations can measure:
Indexing throughput: how fast new listings are indexed.
Query throughput and latency: how many queries per second the engine can handle before latency grows.
Ability to do prefiltering: some engines only do approximate nearest neighbor without index-level metadata filtering.
Hardware resource usage: CPU or memory consumption at target throughput.
Implementation details
Start with real data containing listing embeddings. Spin up a test environment for each candidate engine. Insert a few million embeddings to test indexing throughput. Simulate query load (including user metadata filters) and collect performance metrics. Compare average and tail (P99) latency.
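A rough single-threaded harness along the lines below can collect average and P99 latency per candidate engine. Here `query_fn` is a placeholder for the engine-specific client call (HTTP, gRPC, or a native SDK), and a real benchmark would also drive concurrent load to measure throughput under contention.

```python
import time

import numpy as np


def benchmark(query_fn, queries, warmup: int = 100) -> dict:
    """Measure per-query latency for a candidate engine.

    query_fn(q) is assumed to issue one retrieval request (vector plus
    metadata filter) and block until the response arrives.
    """
    for q in queries[:warmup]:  # warm caches before measuring
        query_fn(q)

    latencies_ms = []
    for q in queries[warmup:]:
        start = time.perf_counter()
        query_fn(q)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    latencies_ms = np.array(latencies_ms)
    return {
        "serial_qps": 1000.0 / latencies_ms.mean(),
        "avg_ms": float(latencies_ms.mean()),
        "p99_ms": float(np.percentile(latencies_ms, 99)),
    }
```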
One engine might excel at indexing and handle 3-4x higher query throughput at lower latency. Another might have simpler setup or better synergy with existing search infrastructure. You then pick the best fit for your use case.
For deployment, configuring the engine cluster through Docker images and environment variables simplifies rollout. Scale horizontally by adding more nodes. The engine’s built-in monitoring (such as Prometheus metrics) helps track ingestion, queries, and error rates.
Handling approximate vs exact retrieval
Approximate nearest neighbor search uses structures like HNSW that sacrifice some recall for speed. Some engines expose this as a per-query toggle (approximate: true vs. approximate: false). Exact retrieval might add roughly 40% more latency on large datasets, so it is important to check whether that extra cost actually improves user satisfaction.
An AB test can compare approximate to exact. Run half of users with approximate, the other half with exact. Monitor click-through rates, purchase rates, and session times. If user gains from exact search are negligible, approximate might be best for resource efficiency.
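One simple way to implement a stable 50/50 split is to hash the user ID, so a given user always lands in the same variant across sessions. The function below is an illustrative sketch, not a full experimentation platform.

```python
import hashlib


def assign_variant(user_id: str, experiment: str = "ann_vs_exact") -> str:
    """Deterministically bucket a user into 'approximate' or 'exact'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable value in [0, 100)
    return "approximate" if bucket < 50 else "exact"
```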
AB testing and measuring impact
Implement feature flags or environment toggles so that some subset of traffic exercises the new retrieval pipeline. Monitor top-line metrics: items viewed, purchases, user retention. If metrics like purchases and user satisfaction improve, expand the experiment.
User feedback might reveal that new recommendations feel more tailored. The system might see a sharp increase in click-through rates from the homepage, showing improved personalization. Direct conversations with power users can provide extra validation.
Follow-up questions
1) How would you handle real-time updates for both new listings and sold/deleted items?
Real-time updates require a search engine that can incrementally index or remove documents without lengthy rebuilds. An engine with live ingestion can apply partial updates. Each item is a document: when a listing is sold, remove that document by ID; when a new listing appears, insert it immediately. The engine’s replication ensures high availability. To keep the user experience consistent, a quick ingest pipeline can parse item metadata, compute embeddings with your model, and push them to the search engine’s indexing endpoint. Log indexing failures and monitor insertion latency.
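A minimal ingest sketch, assuming a generic REST-style indexing endpoint; the URL, paths, and payload shape below are hypothetical, and a real deployment would use the chosen engine’s own client library or bulk API.

```python
import requests

ENGINE_URL = "http://vector-engine:8080"  # hypothetical endpoint


def index_listing(listing: dict, embed_fn) -> None:
    """Compute the listing embedding and upsert it as a single document."""
    doc = {
        "id": listing["id"],
        "vector": embed_fn(listing),  # listing-tower inference
        "brand": listing["brand"],
        "size": listing["size"],
        "price": listing["price"],
    }
    resp = requests.post(f"{ENGINE_URL}/documents", json=doc, timeout=2)
    resp.raise_for_status()  # surface indexing failures to monitoring


def remove_listing(listing_id: str) -> None:
    """Delete the document as soon as the item is sold."""
    resp = requests.delete(f"{ENGINE_URL}/documents/{listing_id}", timeout=2)
    resp.raise_for_status()
```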
2) How do you evaluate whether approximate nearest neighbor or exact nearest neighbor is best for your final product?
Run a controlled experiment. Approximate retrieval can have 60-70% recall compared to exact search but often yields near-identical top recommendations. Evaluate user-level metrics: are interactions or conversions meaningfully different? If the difference is minimal, choose approximate search for its latency and throughput benefits. If your domain demands stricter ranking, exact search might be worth the higher cost. Confirm the trade-off with repeated AB tests, focusing on precision and user engagement metrics.
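The recall gap can be quantified offline by comparing an HNSW index against brute-force search over the same vectors. The sketch below uses hnswlib and random data purely to illustrate the methodology; the production engine, embeddings, and index parameters would differ.

```python
import hnswlib
import numpy as np


def recall_at_k(ann_ids: np.ndarray, exact_ids: np.ndarray) -> float:
    """Fraction of exact top-k neighbors that the ANN index also returned."""
    hits = sum(len(set(a) & set(e)) for a, e in zip(ann_ids, exact_ids))
    return hits / exact_ids.size


dim, n_items, n_queries, k = 64, 20_000, 200, 10
items = np.random.rand(n_items, dim).astype(np.float32)
queries = np.random.rand(n_queries, dim).astype(np.float32)

# Exact top-k via brute-force squared L2 distances.
d2 = (queries ** 2).sum(1, keepdims=True) + (items ** 2).sum(1) - 2.0 * queries @ items.T
exact_ids = np.argsort(d2, axis=1)[:, :k]

# Approximate top-k via HNSW.
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=n_items, ef_construction=200, M=16)
index.add_items(items, np.arange(n_items))
index.set_ef(50)  # higher ef -> better recall, slower queries
ann_ids, _ = index.knn_query(queries, k=k)

print(f"recall@{k}: {recall_at_k(ann_ids, exact_ids):.3f}")
```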
3) How would you optimize the two-tower model for accuracy and performance?
Work on data preprocessing: remove noise, ensure user event signals are consistent. Tune model architecture: refine the embedding dimension, batch size, or learning rate. Collect user feedback signals: track both positive (purchases, clicks) and negative signals (skips). Evaluate offline with standard ranking metrics, then confirm online performance with AB tests. Deploy a new model version in a canary fashion. Compare changes in real user behavior and key metrics.
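As one example of a tuning lever, a common training objective for two-tower models is an in-batch sampled softmax, where the other items in a batch serve as negatives for each user. This sketch illustrates the idea and is not necessarily the exact loss used here.

```python
import torch
import torch.nn.functional as F


def in_batch_softmax_loss(user_emb: torch.Tensor, item_emb: torch.Tensor,
                          temperature: float = 0.05) -> torch.Tensor:
    """In-batch sampled softmax: row i of each tensor is a positive user-item pair.

    Both tensors are (batch, dim) and assumed L2-normalized.
    """
    logits = user_emb @ item_emb.T / temperature  # (batch, batch) similarity matrix
    labels = torch.arange(user_emb.size(0), device=user_emb.device)
    return F.cross_entropy(logits, labels)  # diagonal entries are the positives
```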
4) How do you handle custom filters (brand, size, etc.) when performing vector retrieval?
Use the engine’s capability to prefilter on metadata indexes. For instance, a query can require brand="X" or size="S" at the index level before the approximate nearest neighbor step. This avoids returning mismatched listings. If an engine cannot prefilter, you might retrieve a broader set of top-k items and filter afterward, but that risks empty results for highly specific filters. A system that filters at retrieval yields better user experience and performance.
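When an engine cannot prefilter, the over-fetch-then-filter fallback looks roughly like the sketch below (brute-force distances stand in for the ANN call). It makes the failure mode explicit: for rare filter values, even a large over-fetch can return fewer than k items.

```python
import numpy as np


def post_filter_topk(query: np.ndarray, item_vecs: np.ndarray, item_brands: list,
                     wanted_brand: str, k: int = 10, overfetch: int = 5) -> list:
    """Fallback when index-level prefiltering is unavailable: over-fetch, then filter."""
    d2 = ((item_vecs - query) ** 2).sum(axis=1)
    candidates = np.argsort(d2)[: k * overfetch]  # stand-in for an ANN top-(k*overfetch) call
    matching = [int(i) for i in candidates if item_brands[i] == wanted_brand]
    return matching[:k]  # may hold fewer than k results if the brand is rare
```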
5) How would you troubleshoot and debug latency spikes during high-load scenarios?
Gather traces from the engine for slow queries. Inspect the internal steps: network overhead, segment lookups, cache hits/misses, or concurrency limits. Observe system-level metrics: CPU, memory, disk usage. Test specific user queries or filters that trigger slow paths. A large or complex filter might degrade performance. Work with the engine’s community or support channels if logs show deeper indexing or search engine issues. Profile code paths in the service layer. Check if the embedding generation pipeline is throttling requests or if your client library is causing overhead.
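At the service layer, a lightweight decorator that logs slow retrieval calls together with their filters can help isolate problematic query shapes. The `filters` keyword below is an assumed argument name, and a production setup would more likely rely on distributed tracing (e.g., OpenTelemetry) than this minimal sketch.

```python
import logging
import time
from functools import wraps

logger = logging.getLogger("retrieval")


def log_slow_queries(threshold_ms: float = 200.0):
    """Decorator: log any retrieval call whose latency exceeds the threshold."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                if elapsed_ms > threshold_ms:
                    # 'filters' is an assumed kwarg of the wrapped retrieval function
                    logger.warning("slow retrieval: %.1f ms, filters=%s",
                                   elapsed_ms, kwargs.get("filters"))
        return wrapper
    return decorator
```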
6) Could you combine sparse and dense signals for more robust retrieval?
Yes. Hybrid search merges signals from text-based indices (sparse) and vector-based embeddings (dense). The search engine can combine textual relevance with embedding distance. This approach handles cases where textual match is critical. For instance, certain brand names, item categories, or user-entered queries might weigh heavily. Dense embeddings capture a broader semantic context, while sparse signals capture exact matches. This combo can improve coverage and relevance.
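Reciprocal rank fusion is one simple, engine-agnostic way to merge a dense ranking with a sparse one; many engines also support weighted score combination natively. A minimal sketch:

```python
def reciprocal_rank_fusion(dense_ranking: list, sparse_ranking: list,
                           k: int = 60, top_n: int = 20) -> list:
    """Merge two rankings (lists of item IDs, best first) with reciprocal rank fusion."""
    scores: dict = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, item_id in enumerate(ranking):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```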
7) How can you ensure the system remains scalable and cost-efficient as the user base grows?
Observe scaling constraints: CPU usage, memory usage, network throughput. Split the search cluster into more shards. Each shard indexes a subset of the data. Auto-scale the cluster based on metrics like query latency or queue length. Apply incremental improvements like quantization or compressed embeddings if memory usage is high. Periodically measure cost vs performance. If certain dimension sizes are unnecessary, reduce them to cut memory overhead.
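To illustrate the memory lever, 8-bit scalar quantization cuts embedding storage roughly 4x relative to float32 at some cost in distance fidelity. Most engines offer scalar or product quantization natively, so this numpy sketch is only conceptual.

```python
import numpy as np


def quantize_uint8(embeddings: np.ndarray):
    """Per-dimension 8-bit scalar quantization of float32 embeddings (~4x smaller)."""
    lo, hi = embeddings.min(axis=0), embeddings.max(axis=0)
    scale = np.maximum((hi - lo) / 255.0, 1e-12)
    q = np.round((embeddings - lo) / scale).astype(np.uint8)
    return q, scale, lo  # scale/offset are needed to approximately reconstruct vectors


def dequantize(q: np.ndarray, scale: np.ndarray, lo: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale + lo
```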
8) What additional aspects would you prioritize for continuous improvement?
Refine embeddings by capturing richer features: image data, deeper text features, user persona signals. Experiment with other vector indexing methods: tweak HNSW parameters or investigate different approximate algorithms. Incorporate user feedback loops or re-ranking models that refine results with deeper neural networks. Keep track of emerging vector databases and open-source solutions for new features or performance gains.