ML Case-study Interview Question: Matching Documents to Saved Queries with Elasticsearch Percolator
Case-Study question
You are given a scenario where a large organization maintains a broad, federated graph of data about its movie titles. They have a search system that can handle standard queries (like “movies with language = Spanish and city = Mexico City”), returning matches from a main Elasticsearch index. They need to invert this flow, discovering which saved queries would match a new or updated movie. For example, whenever a movie changes, they want to find all saved queries that would now include that movie. The team has implemented this by storing user-created queries in a dedicated percolate index, then sending new or updated documents (movies) to that percolate index to retrieve the matching queries. How would you design and implement this reverse search solution, ensuring minimal impact on the broader federated ecosystem while handling features such as index versioning, indexing pipelines, and schema evolution?
Provide a detailed solution approach. Address architecture, data flows, system design concerns, user experience aspects for subscribing to dynamic query results, and practical issues such as performance or versioning. Propose how to generate and store the user’s saved query, and how to match a changing document against those saved queries. Explain any trade-offs and potential expansions.
Detailed Solution
Graph-based microservices at this organization continuously generate documents about their movies. A separate search system handles standard indexing and querying. To solve the reverse search requirement, they introduce a parallel percolate index on Elasticsearch. They store user-defined search filters in that percolate index as queries, not as data documents. Whenever a movie changes, the system sends that movie’s document to the percolate index to retrieve all matching saved queries.
Separate Percolate Index
They maintain two distinct Elasticsearch indices for every version of a domain: one for normal searching, and one for percolating user-saved queries. This separation allows different performance tuning and mapping definitions. The main index uses the standard pipeline for indexing movie documents. The percolate index contains a special field (type = percolator) that holds the Elasticsearch query itself.
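To make the pairing concrete, here is a minimal sketch of the two mappings. The field names are illustrative (taken from the example document later in this write-up), not the organization's real schema; the key point is that both indices share the movie field definitions, and the percolate index adds one extra field of type `percolator` that stores the saved query itself.

```python
# Illustrative movie fields shared by both indices (not the real schema).
MOVIE_FIELDS = {
    "movieTitle": {"type": "text"},
    "language": {"type": "keyword"},
    "country": {"type": "keyword"},
    "isArchived": {"type": "boolean"},
}

def main_index_mapping():
    # Standard index: documents are movies.
    return {"mappings": {"properties": dict(MOVIE_FIELDS)}}

def percolate_index_mapping():
    # Percolate index: documents are queries, but Elasticsearch still
    # needs the movie field mappings in order to parse and index the
    # stored queries against those fields.
    props = dict(MOVIE_FIELDS)
    props["percolate_query"] = {"type": "percolator"}
    return {"mappings": {"properties": props}}
```

Both bodies would be sent as `PUT <index-name>` requests when the index version is created; keeping the movie fields identical in the two mappings is what lets a single document JSON serve both indexing and percolation.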
Storing User Filters
They store each saved query in a relational database, persisting only the organization's custom DSL string. When a new or updated entry appears in that table, a change data capture event flows into a pipeline that determines which domain index the saved query belongs to. The pipeline calls a service that translates the DSL string into an Elasticsearch-compatible query, then indexes the translated query document into the percolate index.
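The pipeline step can be sketched as follows. The `translate_dsl` function is a hypothetical stand-in for the real translation service, handling only simple `field = value AND ...` expressions; the actual DSL and translator are internal to the organization.

```python
def translate_dsl(dsl_string):
    # Hypothetical toy translator: converts a DSL string such as
    # "language = Spanish AND country = Mexico" into an Elasticsearch
    # bool/filter query. The real translator is a separate service.
    clauses = []
    for part in dsl_string.split(" AND "):
        field, _, value = (p.strip() for p in part.partition("="))
        clauses.append({"term": {field: value}})
    return {"bool": {"filter": clauses}}

def build_percolate_doc(saved_query_id, dsl_string):
    # The document indexed into the percolate index: the translated
    # query lives under the percolator-typed field.
    return {
        "_id": saved_query_id,
        "percolate_query": translate_dsl(dsl_string),
    }
```

Because only the DSL string is the source of truth, the translated document can always be regenerated, which is exactly what makes replay during versioning possible.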
Querying
When a movie document changes, the system constructs a JSON representation that aligns with the same field mappings used by the main index. It sends this to Elasticsearch as a percolate query. Elasticsearch returns the list of saved queries that match.
Versioning
They create new versions of both the main index and the percolate index when mappings change, such as when a new field is added. The existing version remains active until the new version is fully populated. To carry older saved queries forward, the pipeline replays them from the log-compacted topic, translating their DSL for the updated index's format and indexing them into the new percolate index. Once all data is reindexed, the system flips the alias from the old index to the new one, ensuring a clean cutover.
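The cutover itself can use Elasticsearch's `_aliases` endpoint, which applies a remove-and-add in one atomic request so readers never see a half-populated index. A minimal sketch of the request body (index and alias names are illustrative):

```python
def alias_flip_actions(alias, old_index, new_index):
    # Body for POST /_aliases: atomically moves the alias from the old
    # index version to the fully replayed new version.
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

# e.g. requests.post(f"{es_url}/_aliases",
#                    json=alias_flip_actions("movies-percolate",
#                                            "movies-percolate-v1",
#                                            "movies-percolate-v2"))
```

Clients always address the alias, never a concrete version, so the flip is invisible to them.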
Expanded Use Case
Another engineering team uses the same percolate mechanism to classify a movie into workflows based on complex matching rules. Rather than manually assigning a movie to a certain workflow, they store classification rules as saved queries in the percolate index, then submit the movie document for matching. The returned queries dictate which workflows to trigger.
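A simple way to wire this up is to keep a mapping from saved-query id to workflow and fan out after percolation. The rule and workflow names below are illustrative:

```python
def workflows_for_movie(percolate_hits, rule_to_workflow):
    # Each matching saved query is a classification rule; its id maps
    # to the workflow it should trigger. Preserves hit order and
    # de-duplicates workflows triggered by multiple rules.
    triggered = []
    for hit in percolate_hits:
        workflow = rule_to_workflow.get(hit["_id"])
        if workflow and workflow not in triggered:
            triggered.append(workflow)
    return triggered
```

The classification logic then lives entirely in the saved queries; adding a new workflow means indexing a new rule, not redeploying the service.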
Potential Subscriptions
A future enhancement might maintain a push-based approach. Rather than a user searching repeatedly, a subscription could attach to a saved query, relying on the percolate index to alert the system whenever a movie changes in a way that now satisfies that query. This would give near real-time updates to users.
Example Python Code
```python
import requests

def percolate_movie(movie_doc):
    # Send the movie document to the percolate index; Elasticsearch
    # evaluates every stored saved query against it in one request.
    query_url = "http://elasticsearch.mycompany/percolate_index/_search"
    payload = {
        "query": {
            "percolate": {
                "field": "percolate_query",  # the percolator-typed field
                "document": movie_doc,
            }
        }
    }
    response = requests.post(query_url, json=payload)
    response.raise_for_status()
    return response.json()

movie_document = {
    "movieTitle": "Some Title",
    "isArchived": False,
    "country": "Mexico",
    "language": "Spanish",
}

matching_saved_queries = percolate_movie(movie_document)
```
This snippet shows how a service could send a new or updated movie document to Elasticsearch. The response includes a list of queries that match.
How to Handle Follow-Up Questions
What happens if a saved query references a field that no longer exists?
When a domain index removes a field, the percolate index can break if an old saved query still references that removed field. The pipeline that replays saved queries catches these failures, logs them to a dead-letter queue, and alerts the relevant teams. They can delete or update the invalid queries. The partial isolation of percolate indices also prevents invalid queries from impacting normal read operations.
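The replay step can isolate such failures per record rather than halting the whole run. A minimal sketch, with the translation service, indexer, and dead-letter sink passed in as callables (the function and event-field names are illustrative):

```python
def replay_saved_query(raw_event, translate, index_fn, dead_letter_fn):
    # Replay one saved query from the log-compacted topic into the new
    # percolate index. Translation or indexing failures (e.g. a query
    # referencing a removed field) go to the dead-letter queue with the
    # error message, so teams can fix or delete the invalid query.
    try:
        query = translate(raw_event["dsl"])
        index_fn(raw_event["id"], {"percolate_query": query})
        return True
    except Exception as exc:
        dead_letter_fn(raw_event, str(exc))
        return False
```

Alerting then keys off the dead-letter queue depth, which keeps one bad query from blocking the replay of thousands of valid ones.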
How do you minimize performance overhead in a system with many saved queries?
They avoid constant re-queries by storing queries in the percolate index, which efficiently matches many queries in a single pass. They also keep separate indices for documents and queries, allowing advanced caching and query-time performance tuning. Inserting documents or queries triggers minimal overhead because the main cost is Elasticsearch’s indexing, which is horizontally scalable.
How do you ensure consistent mappings between the normal index and the percolate index?
They rely on index templates in Elasticsearch. They create a template with the shared mapping definitions for the movie fields and apply it to both the main index and the corresponding percolate index. They add a secondary template for the percolate index that extends the shared mappings, adding the percolator field.
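With composable index templates, this layering looks roughly like the following sketch. The template names and index patterns are illustrative; the structure mirrors Elasticsearch's component-template plus index-template bodies:

```python
def shared_component_template(movie_properties):
    # Component template holding the movie field mappings shared by
    # the main index and the percolate index.
    return {"template": {"mappings": {"properties": movie_properties}}}

def percolate_index_template(shared_component_name):
    # Index template for percolate indices: composes the shared fields
    # and layers the percolator field on top.
    return {
        "index_patterns": ["movies-percolate-*"],
        "composed_of": [shared_component_name],
        "template": {
            "mappings": {
                "properties": {"percolate_query": {"type": "percolator"}}
            }
        },
    }
```

Because both templates derive the movie fields from one component, a mapping change made in one place propagates to the next version of both indices, which is what keeps percolation results consistent with normal search.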
What if the DSL language needs to evolve?
They store the original DSL in a database so that any new translation logic can be applied retroactively to older saved queries. This means they do not lose user-defined filters. A new index version triggers a replay of all saved queries with the updated translator logic. The new percolate index receives the re-translated queries.
How can this scale to thousands of updates per second?
They scale Elasticsearch horizontally, distributing documents and queries across multiple shards. They also employ an event-driven pipeline for ingesting and translating queries. Clustering ensures that percolate operations remain performant. They monitor cluster usage and add more nodes if throughput or latency worsens.
How can you extend this system to deliver updates in real-time to end users?
They can implement subscriptions in GraphQL. Whenever a document changes, the service triggers reverse search, identifies relevant queries, and pushes the updated results via a subscription layer. Users see the matching set of movie titles without manually polling, making the experience more dynamic.
Why not just store user subscriptions directly on each movie?
They would have to track which users subscribe to each movie, or constantly re-check all queries. That approach does not scale well in a federated architecture with many microservices. Reverse search isolates queries in one location, then uses a single percolate request to discover all matching queries for each change event.
How do you handle catastrophic failures during indexing?
They rely on standard data pipeline error handling. Failed index attempts go into a dead-letter queue. They have alerts when large numbers of records fail. They can replay those failed records once the issue is resolved, or manually fix them and re-ingest. The separation of concerns (document index vs percolate index) localizes failures so the rest of the system remains stable.