ML Case-study Interview Question: Generating Accurate SQL from Natural Language Using RAG and LLMs
Case-Study Question
You work at a company with massive operational data stored in a data warehouse. Many non-technical teams spend significant time writing SQL queries to answer business questions. You must design a system to convert plain English questions into accurate, production-ready SQL queries. The system should handle large datasets and table schemas, reduce query authoring time, and ensure minimal errors or “hallucinations” in the generated queries. How would you build, refine, and evaluate this system?
Provide your solution approach in detail, from data ingestion to query generation, including how to handle large schemas, improve relevance, prune columns, and evaluate success. Justify design decisions and show partial code or pseudo-code where relevant. Suggest optimizations to reduce false positives. Explain how you would deal with queries that break due to missing partition filters or invalid column references.
Proposed Detailed Solution
Non-technical teams often struggle with the domain knowledge needed to write these queries. A Retrieval-Augmented Generation (RAG) approach is effective here. The high-level workflow chains an intent agent, a table agent, a column prune agent, and a final query generation agent. Each step uses a Large Language Model (LLM) in a focused manner, rather than relying on a single broad prompt.
System Overview
A curated set of “workspaces” groups tables and sample queries by business domain (for example, “Mobility”). When a user enters a question, an intent agent classifies it into a domain. A table agent suggests relevant tables, then prompts the user to confirm or refine. A column prune agent trims unneeded columns to reduce token size. Finally, a query generator agent produces the SQL statement and an explanation of how it arrived at that result.
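For concreteness, a workspace can be modeled as a small container that holds a domain's tables and curated sample queries. The sketch below is illustrative; the class name and fields are assumptions, not a production data model.

from dataclasses import dataclass, field

@dataclass
class Workspace:
    # Hypothetical representation of one business domain (for example, "Mobility").
    name: str
    tables: dict                                         # table name -> list of column names
    sample_queries: list = field(default_factory=list)   # curated SQL examples used as few-shot context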
Intent Agent
A dedicated LLM call classifies the prompt. This step helps narrow the search space to a single workspace. If the prompt is ambiguous, the system can propose multiple possible domains for the user’s approval. This approach reduces irrelevant data being passed to subsequent steps.
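A minimal sketch of the intent-agent call, reusing the legacy openai.ChatCompletion interface from the table-agent snippet further below; the function name and prompt wording are illustrative.

import openai  # legacy (<1.0) SDK interface, matching the table-agent snippet

def classify_intent(user_question, workspace_names):
    # Ask the LLM to map the question to exactly one workspace, or flag ambiguity.
    prompt = (
        "Classify the question into exactly one of these business domains: "
        + ", ".join(workspace_names) + ".\n"
        "If none clearly applies, answer 'ambiguous'.\n"
        "Question: " + user_question + "\n"
        "Answer with the domain name only."
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic classification
    )
    return response["choices"][0]["message"]["content"].strip()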
Table Agent
After domain identification, a table agent suggests tables by looking at textual signals in the question (for example, “How many rides were canceled” implies a trip table). Users verify or modify the table list. Keeping a human in the loop ensures the system does not generate queries against the wrong data.
Column Prune Agent
Large schemas exceed token limits. A dedicated agent, using the user’s question and knowledge of the domain, filters out columns that are unlikely to be used. This lowers cost and response latency. Many columns carry extra operational metrics or rarely used fields that add tokens without improving the query.
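One way to implement the prune step is to ask the LLM for the columns it expects to need, then intersect the answer with the real schema so hallucinated column names never reach the query generator. The sketch below assumes a JSON response contract; the function signature is illustrative.

import json
import openai

def prune_columns(user_question, table_schemas):
    # table_schemas: dict of table name -> list of column names.
    prompt = (
        "User question: " + user_question + "\n"
        "Table schemas: " + json.dumps(table_schemas) + "\n"
        "Return a JSON object mapping each table to only the columns needed to "
        "answer the question. Always keep join keys and partition/date columns."
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Assumes the model returns valid JSON; production code would validate and retry.
    suggested = json.loads(response["choices"][0]["message"]["content"])
    # Keep only columns that actually exist in the schema.
    return {
        table: [c for c in cols if c in set(table_schemas.get(table, []))]
        for table, cols in suggested.items()
    }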
Query Generator
A final LLM call uses the pruned schemas and sample SQL queries from that workspace. It then creates the final SQL and offers an explanation of column usage, joins, and filters. It may apply custom instructions (for instance, requiring date filters). The generated output must be syntactically valid and reflect the user’s request accurately.
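A sketch of the final generation call, combining pruned schemas, in-domain sample queries, and custom rules (such as mandatory date filters) into one focused prompt. The function name and prompt text are illustrative.

import openai

def generate_sql(user_question, pruned_schemas, sample_queries, custom_rules):
    # pruned_schemas: output of the column prune agent; sample_queries: curated SQL from the workspace.
    prompt = (
        "You write production SQL.\n"
        "Only these tables and columns exist:\n" + str(pruned_schemas) + "\n"
        "Example queries from this domain:\n" + "\n".join(sample_queries) + "\n"
        "Rules: " + "; ".join(custom_rules) + "\n"
        "Question: " + user_question + "\n"
        "Return the SQL, then a short explanation of the tables, joins, and filters used."
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]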
Example Code Snippet (Table Agent)
import openai  # legacy (<1.0) SDK interface

def table_agent_prompt(user_question, domain_tables):
    # Ask the LLM which tables in the selected domain the final SQL should use.
    prompt = (
        "User question: " + user_question + "\n"
        "Available tables in domain: " + str(domain_tables) + "\n"
        "Suggest the relevant tables for the final SQL query."
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "system", "content": prompt}]
    )
    # Return the model's text suggestion; the user confirms or edits it downstream.
    return response['choices'][0]['message']['content']
This function returns the LLM’s suggested tables for the chosen domain. The user can then edit or confirm the list.
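A hypothetical call against a Mobility-style workspace (table names are illustrative):

domain_tables = ["trips", "drivers", "cancellations", "cities"]
suggestion = table_agent_prompt(
    "How many rides were canceled in Seattle last week?", domain_tables
)
print(suggestion)  # e.g. "trips, cancellations, cities" -- the user confirms or edits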
Evaluations
A curated set of question-SQL pairs tests each step. Each question is run in “vanilla” mode (the system picks the domain and tables) and in “decoupled” mode (the system is given a fixed domain and tables). Metrics track correctness of domain selection, table selection overlap (how many tables match the correct set), successful query execution, and query result correctness (non-empty output if data is expected).
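A compact evaluation loop over the curated pairs might look like the sketch below; pipeline and execute_fn are injected stand-ins for the agent chain and the warehouse client, and the benchmark format is an assumption.

def evaluate(benchmark, pipeline, execute_fn):
    # benchmark: list of dicts with 'question', 'expected_tables', 'golden_sql'.
    # pipeline: question -> (domain, tables, sql); execute_fn: sql -> (ok, row_count).
    results = []
    for case in benchmark:
        domain, tables, sql = pipeline(case["question"])
        expected = set(case["expected_tables"])
        overlap = len(set(tables) & expected) / max(len(expected), 1)
        ok, rows = execute_fn(sql)
        results.append({
            "question": case["question"],
            "domain": domain,                      # for domain-selection accuracy
            "table_overlap": overlap,              # suggested tables matching the gold set
            "query_ran": ok,                       # executed without error
            "non_empty_result": ok and rows > 0,   # proxy for result correctness
        })
    return results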
Reducing Hallucinations
Prompt design includes explicit instructions. If the LLM attempts to reference a nonexistent table or column, a validation step can catch and correct it. The user can iterate on the final query in chat to clarify domain details or correct filter keys.
Handling Token Issues
Pruning reduces superfluous columns. This ensures total tokens remain within model limits. High-level logic: gather the columns that strongly correlate with the user’s question. Drop everything else. This yields a small, relevant set of columns to pass into the query generator step.
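Before calling the query generator, the assembled prompt can be checked against a token budget, for example with tiktoken; the budget value below is an arbitrary illustration that leaves headroom for the completion.

import tiktoken

def fits_in_context(prompt_text, model="gpt-4", budget=6000):
    # Count tokens the same way the target model does and compare to the budget.
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(prompt_text)) <= budget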
Human-in-the-loop
User involvement is crucial for domain confirmation, table selection, and final validation. This flow lowers the error rate, especially for highly domain-specific queries. In large organizations domain definitions evolve, so the system should fall back to the user for final corrections when it is unsure.
Follow-up Question 1
How would you address scenarios where the user enters extremely short or vague prompts?
A short prompt might be: “Get yesterday’s metrics.” That lacks the detail needed to pick specific tables or columns.
Explanation
One solution is a “prompt enhancer.” It expands the user’s vague query by asking clarifying questions or inferring likely details based on domain-specific context. For instance, the prompt enhancer might add time range clarifications or business keywords. It might consult the user or a historical log of queries to guess relevant fields. This improves the final query accuracy by giving the LLM more context.
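A sketch of such a prompt enhancer; using the user’s recent questions as context is an assumption about what signal is available.

import openai

def enhance_prompt(user_question, domain_hint, recent_questions):
    # Expand a terse question with a likely time range and metric before the table agent runs.
    prompt = (
        "Rewrite this analytics question so it is specific enough to generate SQL.\n"
        "Domain: " + domain_hint + "\n"
        "Recent questions from this user: " + "; ".join(recent_questions) + "\n"
        "Original question: " + user_question + "\n"
        "Add an explicit time range and metric if missing, and list any "
        "assumptions so the user can confirm them."
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]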
Follow-up Question 2
How would you systematically detect hallucinated columns or tables during query generation?
Explanation
A validation agent can parse the generated SQL’s table and column references. It compares them against a known schema registry. If a mismatch occurs, the system tries a repair loop by removing invalid columns or substituting the correct ones. This agent can also cross-check partition keys or mandatory filters. If the user’s environment does not allow automated repair, it can prompt the user to fix references.
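A sketch of the reference check using a SQL parser such as sqlglot (one possible choice; the source does not name a parser). It extracts table and column references from the generated query and compares them against the schema registry.

import sqlglot
from sqlglot import exp

def find_hallucinated_references(sql, schema_registry):
    # schema_registry: dict of table name -> set of valid column names.
    parsed = sqlglot.parse_one(sql)
    tables = {t.name for t in parsed.find_all(exp.Table)}
    columns = {c.name for c in parsed.find_all(exp.Column)}

    unknown_tables = tables - set(schema_registry)
    known_columns = set()
    for t in tables:
        known_columns |= set(schema_registry.get(t, set()))
    unknown_columns = columns - known_columns
    # Non-empty results feed the repair loop or a prompt back to the user.
    return unknown_tables, unknown_columns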
Follow-up Question 3
How can you measure query quality beyond basic syntactic correctness?
Explanation
A comprehensive evaluation includes checking if the query returns non-empty results, aligns semantically with a reference SQL, and uses correct joins and filters. Compare with a set of golden queries. Use an LLM-based similarity checker to assign a score from 0 to 1. A high score indicates consistent structure and selection of columns. Also verify that the output table row counts match expected ranges.
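An LLM-based grader can be as simple as the sketch below; the prompt wording and the zero-score fallback are illustrative choices.

import openai

def semantic_similarity_score(generated_sql, golden_sql):
    # Ask the LLM to grade structural and semantic agreement on a 0-1 scale.
    prompt = (
        "Compare these two SQL queries. Score from 0 to 1 how closely the "
        "generated query matches the reference in tables, joins, filters, and "
        "selected columns. Respond with only the number.\n"
        "Reference:\n" + golden_sql + "\n"
        "Generated:\n" + generated_sql
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    try:
        return float(response["choices"][0]["message"]["content"].strip())
    except ValueError:
        return 0.0  # treat an unparseable grade as a failed comparison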
Follow-up Question 4
What optimization steps reduce run-time cost in a high-traffic environment?
Explanation
Caching frequent prompts and answers helps. Pruned columns save tokens and cost. A specialized vector database can reduce the similarity search overhead when matching table schemas. System-level concurrency controls or micro-batching requests can also cut latency. High-value or repeated questions can be served from a stored result set, refreshing at intervals.
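A toy in-process cache illustrates the idea; a production deployment would more likely use a shared store such as Redis with proper eviction.

import hashlib
import json
import time

class PromptCache:
    # Caches generated SQL keyed on (normalized question, domain) with a TTL.
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.store = {}

    def _key(self, question, domain):
        payload = json.dumps([question.lower().strip(), domain])
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, question, domain):
        entry = self.store.get(self._key(question, domain))
        if entry and time.time() - entry["at"] < self.ttl:
            return entry["sql"]
        return None

    def put(self, question, domain, sql):
        self.store[self._key(question, domain)] = {"sql": sql, "at": time.time()}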
Follow-up Question 5
How would you handle user feedback if the final query produces unexpected or zero results?
Explanation
A chat interface allows iterative refinement. If zero rows appear, the user might specify a different date range or remove certain filters. A feedback loop logs these corrections, improving future suggestions. The system can also ask: “The output was empty. Did you mean to filter by status=‘Completed’ instead of status=‘Finished’?” This guidance helps correct minor filter errors.
Follow-up Question 6
What strategies would you propose to ensure minimal training data drift?
Explanation
Large organizations frequently update table schemas. The system should periodically refresh vector embeddings for newly added columns or tables. For changed domains, the intent classification might need a retraining pass with the new data. Version tracking on the schema registry helps detect changes. If a domain’s schema changes significantly, the system might prompt an admin for updated table-agent guidance or fresh sample queries.
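Schema-change detection can be as simple as fingerprinting each table definition and diffing registries between refreshes; the registry format below is an assumption.

import hashlib
import json

def detect_schema_changes(old_registry, new_registry):
    # Registries map table name -> list of (column, type) pairs.
    def fingerprint(schema):
        return hashlib.sha256(json.dumps(schema, sort_keys=True).encode()).hexdigest()

    old_fp = {t: fingerprint(s) for t, s in old_registry.items()}
    new_fp = {t: fingerprint(s) for t, s in new_registry.items()}

    added = set(new_fp) - set(old_fp)
    removed = set(old_fp) - set(new_fp)
    changed = {t for t in set(old_fp) & set(new_fp) if old_fp[t] != new_fp[t]}
    # Tables in 'added' or 'changed' need fresh embeddings and sample queries.
    return {"added": added, "removed": removed, "changed": changed}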
Follow-up Question 7
When is it appropriate to bypass the agents and directly prompt the LLM?
Explanation
If a user is highly technical, knows the table layout, and can phrase a precise question with explicit references, an agent-based flow may slow them down. A direct prompt might be enough in these cases. For less clear queries or new users, the step-by-step decomposition consistently reduces errors and fosters better domain matching.
Follow-up Question 8
What considerations are needed for multi-language user questions?
Explanation
Support for multiple languages requires an additional translation step. Run the prompt through a language detection routine first. Once the language is identified, the system either translates the question to English for the LLM or leverages a multilingual model. The rest of the pipeline remains mostly unchanged. Thorough domain coverage ensures that the table agent and column prune agent can map the translated question correctly.
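A sketch of that front end, using langdetect for detection and the same LLM for translation; both the library and the prompt wording are illustrative choices.

import openai
from langdetect import detect  # lightweight language detection; one possible choice

def normalize_to_english(user_question):
    # Detect the language and translate to English before the intent agent runs.
    if detect(user_question) == "en":
        return user_question
    prompt = (
        "Translate this analytics question to English, preserving table and "
        "metric names verbatim:\n" + user_question
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]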