ML Case-study Interview Question: Designing a Multi-Agent RAG System for Accurate API-Driven Q&A on Professional Platforms
Case-Study question
You have been tasked with building an AI-powered feature for a large professional platform. Members can pose open-ended questions about posts, companies, or jobs, and your system should generate concise, context-aware answers. The system must call various internal and external APIs to fetch relevant data such as user profiles, company info, or search results. You have multiple specialized AI “agents” (for job assessment, post summaries, general knowledge, etc.), and a routing layer must decide which agent should handle each query.
Your goal: design a retrieval augmented generation (RAG) pipeline that uses large language models for answer generation, but also relies on internal APIs for up-to-date, accurate data. You must ensure minimal hallucinations, consistent tone and style, and maintain low latency at scale. You also want a streamlined, shared code base that multiple teams can build upon while keeping the user experience consistent.
How would you:
Set up a multi-agent system with routing, retrieval, and generation steps.
Incorporate internal and external data sources into your generative pipeline (e.g. a “people search” service or a web search API).
Evaluate answer quality, detect hallucinations, and mitigate them.
Manage high throughput with low latency and handle cost and infrastructure constraints.
Maintain a uniform user experience when different agents handle follow-up questions.
Explain your plan at a high level and outline any technical optimizations or design choices to ensure robust performance and user satisfaction.
Proposed Solution
Overall Architecture
Split the pipeline into three major phases: routing, retrieval, and generation. A routing layer inspects each query and chooses which specialized agent (job assessment, post summary, general knowledge, etc.) should respond. Each agent calls specialized APIs to gather data and feed that data into a large language model (LLM) to produce a final answer. This structure speeds up development by letting different teams own different agents.
Routing
Use a classification model or carefully engineered prompts to identify which agent is relevant for a user query. The routing model or prompt checks keywords, intent, and context. If the query pertains to evaluating job fit, forward it to a job assessment agent. If it is about summarizing a post, forward it to a post summarization agent.
Keep routing as lightweight as possible to maintain low latency. A smaller model or short prompt with few tokens is preferable. This approach preserves GPU capacity for more complex tasks downstream.
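A minimal sketch of such a routing layer is shown below. The ROUTING_RULES table and the keyword heuristic are purely illustrative stand-ins; a production router would use a small fine-tuned classifier or a short, few-token LLM prompt.

from enum import Enum

class Agent(Enum):
    JOB_ASSESSMENT = "job_assessment"
    POST_SUMMARY = "post_summary"
    GENERAL_KNOWLEDGE = "general_knowledge"

# Illustrative keyword heuristics; a real router would rely on a small
# classification model or a compact prompt rather than substring checks.
ROUTING_RULES = {
    Agent.JOB_ASSESSMENT: ("good fit", "qualified for", "job fit"),
    Agent.POST_SUMMARY: ("summarize this post", "key takeaways", "tl;dr"),
}

def route(query: str) -> Agent:
    """Pick the specialized agent that should handle the query."""
    normalized = query.lower()
    for agent, phrases in ROUTING_RULES.items():
        if any(phrase in normalized for phrase in phrases):
            return agent
    return Agent.GENERAL_KNOWLEDGE  # default when no rule matches

print(route("Am I a good fit for this role?"))  # Agent.JOB_ASSESSMENT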
Retrieval
Have each agent gather data from relevant internal and external APIs. Implement a RAG pattern by injecting retrieved content directly into the LLM prompt. For instance, the job assessment agent calls an internal “people search” service to see member attributes, plus an external search API to retrieve company data. The resulting information is collated into a structured “dossier” passed to the generation step.
Embed each piece of retrieved data and store the embeddings. If a query recurs or is very similar, reuse the previously retrieved embeddings and text instead of calling the APIs again. This cache acts as a lightweight, in-memory stand-in for fine-tuning: agents see real, verified context and stay grounded.
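A rough sketch of this retrieval step follows. The people_search and web_search clients are hypothetical placeholders for internal and external services, and an exact-match hash cache stands in for a real embedding-similarity lookup.

import hashlib
from dataclasses import dataclass, field

@dataclass
class Dossier:
    """Structured context handed to the generation step."""
    member_profile: dict = field(default_factory=dict)
    company_info: dict = field(default_factory=dict)

# Hypothetical API clients; real agents would call internal RPC services
# and an external search API.
def people_search(member_id: str) -> dict:
    return {"member_id": member_id, "skills": ["python", "ml"]}

def web_search(company: str) -> dict:
    return {"company": company, "summary": "A mid-size analytics firm."}

_retrieval_cache = {}

def cache_key(query: str) -> str:
    # Placeholder for an embedding-similarity lookup: here results are only
    # reused for exact repeats of a normalized query.
    return hashlib.sha256(query.strip().lower().encode()).hexdigest()

def build_dossier(query: str, member_id: str, company: str) -> Dossier:
    key = cache_key(query)
    if key not in _retrieval_cache:
        _retrieval_cache[key] = Dossier(
            member_profile=people_search(member_id),
            company_info=web_search(company),
        )
    return _retrieval_cache[key]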
Generation
Prompt a larger LLM with the retrieved context. Instruct it to synthesize a concise, empathetic answer without hallucinating. Provide instructions to adhere to a specific style and to present the results in a structured format. The model’s generation is streamed to the user’s device to reduce perceived latency.
Chain of Thought (CoT) can improve correctness, but be mindful of token overhead: long CoT sequences inflate generation time. Strike a deliberate trade-off between the thoroughness of the chain of thought and user-facing latency.
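As an illustration, the generation step might look like the sketch below, where llm_stream is a hypothetical stand-in for a streaming LLM client and the prompt template encodes the grounding and style instructions.

from typing import Iterator

PROMPT_TEMPLATE = """You are a helpful assistant on a professional platform.
Answer concisely and empathetically. Use ONLY the context below; if the
context is insufficient, say "I do not have enough information."

Context:
{context}

Question: {question}
Answer:"""

def llm_stream(prompt: str) -> Iterator[str]:
    # Hypothetical stand-in for a streaming LLM client that yields tokens.
    for token in ("Based", " on", " your", " profile", ", ..."):
        yield token

def answer(question: str, context: str) -> Iterator[str]:
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    # Stream tokens to the client as soon as they arrive so that
    # time-to-first-token stays low even for long answers.
    yield from llm_stream(prompt)

for chunk in answer("Am I a fit for this role?", "Member skills: python, ml"):
    print(chunk, end="", flush=True)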
Evaluation and Quality Control
As queries scale, manual quality checks become the bottleneck. Develop a three-tiered approach:
Engineer Self-Check: quick, informal tests on a small batch of queries for immediate feedback.
Annotators: a specialized team with guidelines on style, empathy, correctness, etc. They produce daily metrics (e.g. hallucination rate, user satisfaction).
Automatic Evaluation: a model-based evaluator that scores new outputs for coherence, harmful content, or factuality. Retrain it on ground-truth examples to accelerate iteration.
Use prompt engineering to reduce hallucinations. Instruct the LLM to say “I do not have enough information” when lacking context. If the LLM returns an invalid or incomplete output, quickly re-prompt or run a defensive parser to fix common mistakes (e.g. invalid YAML).
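One possible shape for the automatic-evaluation tier combined with a re-prompt guard is sketched below. The evaluate function is a hypothetical model-based scorer; in practice it would be a small model trained on annotator labels.

from typing import Callable

def evaluate(answer: str) -> dict:
    # Hypothetical model-based evaluator (the third tier). A production
    # version would score coherence, harmful content, and factuality
    # using a model trained on ground-truth annotations.
    return {"coherence": 0.9, "factuality": 0.85, "harmful": 0.0}

def generate_with_retry(
    prompt: str,
    generate: Callable[[str], str],
    factuality_threshold: float = 0.8,
    max_retries: int = 1,
) -> str:
    """Generate an answer and re-prompt once if the evaluator flags it."""
    answer = generate(prompt)
    for _ in range(max_retries):
        scores = evaluate(answer)
        if scores["factuality"] >= factuality_threshold and scores["harmful"] == 0.0:
            break
        # Re-prompt with a stricter grounding instruction on low quality.
        answer = generate(
            prompt + "\nAnswer only from the provided context, or say "
                     '"I do not have enough information."'
        )
    return answer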
Calling Internal APIs
Wrap each RPC API in a “skill” object with an LLM-friendly schema: clearly document input parameters and output format in ways that the LLM can understand. The LLM returns parameters in structured YAML, and a parser then calls the actual RPC service. A defensive YAML parser patches predictable LLM mistakes before parsing.
import re
import yaml

def robust_yaml_parser(response_str):
    """Defensively parse YAML emitted by the LLM, patching common mistakes."""
    # Fix-up: strip stray whitespace that the model sometimes inserts
    # before the colon in "key : value" pairs.
    patched_str = re.sub(r'[^\S\r\n]+:', ':', response_str)
    # Additional corrections (stray backticks, unquoted special characters,
    # etc.) can be added here as new failure patterns are observed.
    try:
        return yaml.safe_load(patched_str)
    except yaml.YAMLError:
        # Signal failure so the caller can re-prompt the LLM.
        return None
Track the roughly 10% of cases where the LLM does not produce valid YAML, patch them on the fly, and drive the residual error rate down to a negligible level.
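A sketch of the skill abstraction follows, with a hypothetical people_search_rpc standing in for the real service; the defensive robust_yaml_parser above could replace the plain yaml.safe_load call.

from dataclasses import dataclass
from typing import Callable, Optional

import yaml

@dataclass
class Skill:
    """Wraps one RPC endpoint with an LLM-friendly description."""
    name: str
    description: str            # injected into the prompt so the LLM sees the schema
    parameters: dict            # parameter name -> human-readable type/description
    call: Callable[..., dict]   # the actual RPC client

# Hypothetical internal service; a real skill would invoke an RPC stub.
def people_search_rpc(member_id: str, fields: list) -> dict:
    return {"member_id": member_id, "fields": fields}

people_search_skill = Skill(
    name="people_search",
    description="Look up attributes of a member profile.",
    parameters={"member_id": "string", "fields": "list of profile fields"},
    call=people_search_rpc,
)

def invoke_skill(skill: Skill, llm_yaml_response: str) -> Optional[dict]:
    """Parse the LLM's YAML parameter block and call the wrapped RPC."""
    try:
        params = yaml.safe_load(llm_yaml_response)
    except yaml.YAMLError:
        return None  # fall back to re-prompting the LLM
    return skill.call(**params)

print(invoke_skill(people_search_skill,
                   "member_id: '123'\nfields: [headline, skills]"))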
Capacity and Latency
Deploy large models behind load balancers and distribute queries across a GPU cluster. Stream tokens as soon as they are generated. Avoid sending huge prompts for simple queries. Maintain an asynchronous, non-blocking pipeline so threads do not stall. Tune concurrency settings and watch the ratio between time-to-first-token (TTFT) and time-between-tokens (TBT).
Move repeated or simpler tasks to smaller in-house models to save cost. If a user frequently asks for a short factual snippet, respond using a distilled, cheaper model. Only invoke the largest models when absolutely needed.
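A simplified sketch of this tiered model selection, with hypothetical small_model and large_model clients, a crude complexity heuristic, and an in-process cache for repeated queries:

from functools import lru_cache

# Hypothetical model clients; swap in a distilled in-house model and the
# large hosted model respectively.
def small_model(prompt: str) -> str:
    return f"[small-model answer to: {prompt[:40]}]"

def large_model(prompt: str) -> str:
    return f"[large-model answer to: {prompt[:40]}]"

def is_simple(query: str) -> bool:
    # Crude complexity proxy: short factual lookups go to the cheaper
    # model; anything longer is routed to the large model.
    return len(query.split()) < 12

@lru_cache(maxsize=10_000)
def answer(query: str) -> str:
    """Cache repeated queries and pick the cheapest adequate model."""
    model = small_model if is_simple(query) else large_model
    return model(query)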
Maintaining a Uniform Experience
Coordinate prompts, style guidelines, and user interface (UI) across all agents. Store the conversation history so the user does not see abrupt context switches when a new agent takes over. A shared “prompt template” can unify the style and keep the user experience cohesive.
Use a “horizontal” engineering team to manage the global pipeline, evaluation frameworks, and UI templates. Have “vertical” teams each own one agent, guaranteeing autonomy but preserving a single codepath for how prompts, routing, and generation proceed.
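A minimal sketch of a shared prompt template, with illustrative agent names and instructions, shows how every agent renders the same style and history scaffolding while vertical teams own only their agent-specific block.

SHARED_SYSTEM_TEMPLATE = """You are an assistant on a professional platform.
Style: concise, empathetic, professional. Never speculate beyond the
provided context. End with one short suggested follow-up question.

Conversation so far:
{history}

Agent-specific instructions:
{agent_instructions}
"""

# Illustrative agent instructions; each vertical team owns its own entry
# while the surrounding template stays shared.
AGENT_INSTRUCTIONS = {
    "job_assessment": "Assess fit between the member profile and the job posting.",
    "post_summary": "Summarize the post in at most three sentences.",
}

def build_prompt(agent: str, history: list) -> str:
    """Every agent renders the same template, so tone and format stay
    consistent even when a different agent answers a follow-up."""
    return SHARED_SYSTEM_TEMPLATE.format(
        history="\n".join(history),
        agent_instructions=AGENT_INSTRUCTIONS[agent],
    )

print(build_prompt("post_summary", ["User: What is this post about?"]))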
Q1: How would you measure success for such a system beyond just manual annotation?
Measuring success requires a combination of automatic metrics and real-time user signals. Train an in-house evaluator to predict quality, factual correctness, or style adherence. Cross-validate it against human ratings. Track user engagement signals such as click-through rates on suggested follow-up questions, dwell time on answers, or the rate at which people request more detail. Monitor negative feedback metrics like “report content” or “this was unhelpful.” This combination of signals and model-based evaluation allows faster iteration than manual annotations alone.
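As a small illustration, cross-validating the automatic evaluator against human ratings can be as simple as tracking their correlation on a shared sample. The scores below are made up, and statistics.correlation requires Python 3.10+.

from statistics import correlation  # available in Python 3.10+

# Made-up paired scores: the model-based evaluator vs. human annotators,
# both on a 0-1 quality scale for the same sample of answers.
evaluator_scores = [0.9, 0.4, 0.8, 0.7, 0.95, 0.3]
human_scores = [0.85, 0.5, 0.75, 0.8, 0.9, 0.35]

# High correlation suggests the automatic evaluator can stand in for
# day-to-day annotation; a drop signals that it needs retraining.
print(f"evaluator/human correlation: "
      f"{correlation(evaluator_scores, human_scores):.2f}")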
Q2: How would you handle a sudden increase in traffic that strains LLM capacity?
Use auto-scaling on GPU-backed instances to handle load spikes if capacity is available. Implement graceful degradation. If usage spikes beyond GPU capacity, route simpler queries to smaller models or a fallback system. Cache frequently requested answers so repeated queries can be served without re-running expensive generation. If needed, serve partial answers or break up the conversation flow to distribute GPU usage. Monitor queue lengths and keep concurrency settings within safe limits.
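A minimal sketch of graceful degradation under load, assuming a hypothetical GPU-slot limit and fallback model; a real system would combine this with auto-scaling and response caching.

import threading

# Hypothetical capacity limit: how many generations the GPU pool can run
# concurrently before latency degrades.
GPU_SLOTS = threading.BoundedSemaphore(value=64)

def large_model(prompt: str) -> str:
    return f"[large-model answer: {prompt[:30]}]"

def fallback_model(prompt: str) -> str:
    return f"[distilled fallback answer: {prompt[:30]}]"

def handle(query: str) -> str:
    """Use the large model while a GPU slot is free; otherwise degrade
    gracefully to a cheaper distilled model instead of queueing."""
    if GPU_SLOTS.acquire(blocking=False):
        try:
            return large_model(query)
        finally:
            GPU_SLOTS.release()
    return fallback_model(query)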
Q3: How do you reduce hallucinations when the LLM has incomplete data?
Inject retrieved context into the prompt. Instruct the LLM explicitly to respond with “I do not have the information” when data is missing. For highly sensitive or fact-intensive queries, run a specialized “factual verification” step: after generation, parse the output and compare named entities, dates, or claims to the retrieved data. If mismatches occur, automatically refine or shorten the final answer. Add domain-specific disclaimers if the system is not entirely certain.
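A crude sketch of such a verification step, using regex-based entity and number extraction in place of a real NER pipeline:

import re

def extract_claims(text: str) -> set:
    """Crude claim extraction: capitalized names plus numbers and years.
    A production verifier would use NER and date/number normalization."""
    entities = set(re.findall(r"\b[A-Z][a-zA-Z]+(?:\s[A-Z][a-zA-Z]+)*\b", text))
    numbers = set(re.findall(r"\b\d[\d,.%]*\b", text))
    return entities | numbers

def unsupported_claims(answer: str, context: str) -> set:
    """Claims in the answer that never appear in the retrieved context."""
    return {claim for claim in extract_claims(answer) if claim not in context}

context = "Acme Corp was founded in 2012 and has 250 employees."
answer = "Acme Corp, founded in 2012, has 400 employees."
print(unsupported_claims(answer, context))  # {'400'}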
Q4: How would you approach fine-tuning?
Collect diverse training examples of queries and good responses. Include examples that address corner cases, ambiguous queries, or highly factual questions. Perform supervised fine-tuning first. Then optionally use reinforcement learning with human feedback (RLHF) to optimize for style, correctness, and empathy. Set up A/B tests for changes in the fine-tuned model. Ensure that carefully curated data covers user persona diversity and typical queries. This approach yields a more robust model than just prompt engineering alone.
Q5: What if different teams produce inconsistent styles or partial solutions?
Maintain a global style guide and shared prompt templates. Keep a repository of agent prompts with standardized instructions and references. Adopt a cross-team review process where each new agent or feature must align with global guidelines. Provide a centralized service that wraps each agent’s outputs with consistent formatting, disclaimers, or relevant follow-up questions. Log user feedback at a global level so style mismatches surface quickly. This alignment ensures a uniform feel even though teams work in parallel.