ML Case-study Interview Question: LLM-Powered Natural Language to SQL with Domain-Specific Metadata
Case-Study question
You are a Senior Data Scientist at a large organization with multiple business verticals. Teams need quick answers to data questions using natural language without manually creating SQL queries. The current setup involves many tables, columns, and specialized metrics, spread across different domains like a food marketplace, an on-demand delivery service, and a quick-commerce vertical. Product Managers, Business Analysts, and Data Scientists require fast, accurate results without waiting for analytics teams. Design a system that transforms natural language prompts into SQL queries and returns executed results. Consider how you would handle user authentication, metadata management, query accuracy, integration with Slack or other messaging platforms, and alignment with business-specific data contexts.
Proposed solution
System Overview
A text-to-SQL workflow can be set up with a Large Language Model (LLM) to parse natural language questions and generate SQL queries. The system receives a user’s prompt in Slack, routes it to a backend that fetches relevant metadata from a knowledge base, and provides context to the LLM. The LLM produces the SQL, which is then validated and executed on a data warehouse. The final result returns to Slack.
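As a concrete illustration of the entry point, a Slack slash command can forward the prompt to the backend. This is a minimal sketch assuming the Bolt framework for Python; the command name, placeholder tokens, and the user_creds_for helper are hypothetical, and generate_sql/execute_sql are sketched later in this solution.

from slack_bolt import App

# Tokens shown are placeholders; in practice they come from a secrets store.
app = App(token="xoxb-placeholder", signing_secret="placeholder")

@app.command("/ask-data")  # hypothetical slash command name
def handle_data_question(ack, command, respond):
    ack()  # acknowledge within Slack's response window
    user_prompt = command["text"]
    # user_creds_for is a hypothetical helper that maps the Slack user
    # to their own warehouse credentials.
    sql = generate_sql(user_prompt, knowledge_base, reference_queries)
    rows = execute_sql(sql, user_creds_for(command["user_id"]))
    respond(f"Generated SQL:\n{sql}\n\nResult:\n{rows}")

if __name__ == "__main__":
    app.start(port=3000)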
Partitioning the Data
Separate each domain into “charters” so each partition only includes relevant tables, columns, and business-specific context. This reduces confusion when the model generates SQL. Each charter has its own metadata that defines metrics, schema details, and known reference queries.
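For illustration, the charter registry could be a plain mapping from each vertical to its own tables and metrics; every name below is hypothetical.

# Hypothetical charter registry: each vertical gets its own slice of
# tables and metrics so retrieval never crosses domains.
CHARTERS = {
    "food_marketplace": {
        "tables": ["orders", "restaurants", "order_items"],
        "metrics": ["gmv", "average_order_value", "cancellation_rate"],
    },
    "on_demand_delivery": {
        "tables": ["deliveries", "riders", "delivery_events"],
        "metrics": ["delivery_time_p90", "rider_utilization"],
    },
    "quick_commerce": {
        "tables": ["qc_orders", "dark_stores", "inventory_snapshots"],
        "metrics": ["fill_rate", "time_to_doorstep"],
    },
}

def charter_metadata(charter_name):
    # Return only the metadata for the requested charter so the LLM
    # never sees tables from unrelated verticals.
    return CHARTERS[charter_name]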
Metadata Collection
Metadata provides table names, column names, column descriptions, example queries, and pre-validated metrics. Clear metadata is essential for disambiguating user queries and guiding the model to the correct tables.
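One possible shape for a single table's metadata entry, combining column descriptions, pre-validated metric expressions, and example queries; the table, columns, and SQL fragments are invented for illustration.

# Illustrative metadata entry for one table in the knowledge base.
ORDERS_METADATA = {
    "table": "food_marketplace.orders",
    "columns": {
        "order_id": "Unique identifier for each order",
        "order_ts": "Timestamp when the order was placed",
        "gmv_amount": "Gross merchandise value of the order in local currency",
        "status": "Order status: placed, delivered, or cancelled",
    },
    "metrics": {
        # Pre-validated SQL fragments keep metric definitions consistent.
        "gmv": "SUM(gmv_amount)",
        "cancellation_rate": "AVG(CASE WHEN status = 'cancelled' THEN 1 ELSE 0 END)",
    },
    "example_queries": [
        "SELECT DATE(order_ts) AS day, SUM(gmv_amount) AS gmv "
        "FROM food_marketplace.orders GROUP BY 1 ORDER BY 1",
    ],
}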
Model Pipeline
Break down the user prompt step by step. First, parse the question to identify relevant metrics. Then map those metrics to required columns and tables. Use embedding-based lookups and LLM-based filtering to retrieve only the columns that matter. Append any stored reference queries that match the requested metrics. Combine these elements into a final structured prompt for the LLM. After the LLM generates SQL, run it on the warehouse. If an error occurs, send the error context back to the LLM for a retry. If still not successful, share partial results with the user so they can adjust manually.
Example Code Snippet
import snowflake.connector
import openai


def generate_sql(prompt, knowledge_base, reference_queries):
    # Identify relevant metrics, columns, and similar reference queries
    relevant_metrics = find_metrics(prompt, knowledge_base)
    relevant_columns = find_columns(relevant_metrics, knowledge_base)
    relevant_refs = find_similar_queries(prompt, reference_queries)

    # Construct a structured instruction for the LLM
    structured_prompt = f"""
    Metrics: {relevant_metrics}
    Columns: {relevant_columns}
    Reference Queries: {relevant_refs}
    User Prompt: {prompt}
    Return only valid SQL.
    """

    # Call the LLM and return the generated SQL text
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=structured_prompt,
        max_tokens=512
    )
    return response["choices"][0]["text"].strip()


def execute_sql(sql_query, user_creds):
    # Connect with the user's own credentials so warehouse privileges
    # are enforced on every query
    conn = snowflake.connector.connect(**user_creds)
    cursor = conn.cursor()
    try:
        cursor.execute(sql_query)
        return cursor.fetchall()
    except Exception as e:
        # Return the error text so the pipeline can retry with context
        return str(e)
    finally:
        cursor.close()
        conn.close()
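The find_columns and find_similar_queries helpers above are left abstract. One way to realize find_columns is embedding-based retrieval with cosine similarity, sketched below with the legacy OpenAI embeddings endpoint; the model name, scoring, and top_k cutoff are assumptions, and in practice the column embeddings would be precomputed and cached rather than recomputed per request.

import numpy as np
import openai

def embed(text):
    # Embed a short piece of text; the model name is an assumption.
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

def find_columns(relevant_metrics, knowledge_base, top_k=10):
    # Rank column descriptions by cosine similarity to the requested
    # metrics and keep only the closest matches to keep the prompt small.
    query_vec = embed(" ".join(relevant_metrics))
    scored = []
    for table, meta in knowledge_base.items():
        for column, description in meta["columns"].items():
            col_vec = embed(f"{table}.{column}: {description}")
            score = float(
                query_vec @ col_vec
                / (np.linalg.norm(query_vec) * np.linalg.norm(col_vec))
            )
            scored.append((score, f"{table}.{column}"))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

The same retrieval idea applies to find_similar_queries, comparing the user prompt against embeddings of the stored reference queries.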
Result Validation
Auto-run the generated query on the warehouse with the user’s credentials, ensuring only authorized data is returned. Log any errors, then attempt an LLM-driven correction if feasible. If it keeps failing, send a final fallback response and partial SQL to the user.
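A minimal sketch of this execute-and-retry loop, reusing the generate_sql and execute_sql helpers from the example code; the retry limit and the error-detection convention are assumptions.

def answer_question(prompt, knowledge_base, reference_queries, user_creds, max_retries=2):
    # Generate SQL, run it, and feed any warehouse error back to the LLM.
    sql = generate_sql(prompt, knowledge_base, reference_queries)
    for attempt in range(max_retries + 1):
        result = execute_sql(sql, user_creds)
        if not isinstance(result, str):  # execute_sql returns error text on failure
            return {"sql": sql, "rows": result}
        if attempt < max_retries:
            # Ask the LLM to repair the query using the error message as context.
            repair_prompt = (
                f"The SQL below failed with this error:\n{result}\n\n"
                f"SQL:\n{sql}\n\nReturn a corrected SQL query only."
            )
            sql = generate_sql(repair_prompt, knowledge_base, reference_queries)
    # Fall back to sharing the partial SQL so the user can adjust it manually.
    return {"sql": sql, "rows": None, "error": result}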
Adoption and Feedback
Product Managers, Business Analysts, and Data Scientists adopt the system through Slack commands to fetch aggregated metrics, investigate anomalies, and verify results. Direct Slack feedback on query accuracy helps refine the model, and metadata updates are automated whenever a new domain or table is added.
How would you handle large or ambiguous prompts?
An approach is to break the question into smaller sub-questions. Identify known keywords and create structured queries for each domain. Use repeated LLM calls if the user’s prompt exceeds token limits. If there is ambiguity, prompt the user with a clarification request in Slack.
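One lightweight way to trigger that clarification path is to have the LLM either commit to a charter or flag ambiguity explicitly; the CLARIFY/CHARTER convention below is an assumption, not a fixed protocol.

def plan_or_clarify(prompt, charter_names):
    # Ask the LLM to either pick a charter and restate the question,
    # or flag ambiguity so the Slack bot can ask a follow-up question.
    instruction = (
        f"Charters: {charter_names}\n"
        f"Question: {prompt}\n"
        "If the question is ambiguous or spans multiple charters, reply with "
        "'CLARIFY: <what you need to know>'. Otherwise reply with "
        "'CHARTER: <name>' on the first line and the restated question after it."
    )
    response = openai.Completion.create(
        model="text-davinci-003", prompt=instruction, max_tokens=200
    )
    text = response["choices"][0]["text"].strip()
    if text.startswith("CLARIFY:"):
        return {"needs_clarification": True, "message": text[len("CLARIFY:"):].strip()}
    return {"needs_clarification": False, "plan": text}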
How do you ensure secure data access?
Require each user to authenticate with the warehouse using their own credentials. The query runs only if the user has table privileges. The system intercepts queries at the middleware level and blocks any unauthorized accesses.
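On top of warehouse privileges, the middleware can reject anything other than read-only statements before execution. A simple keyword-based sketch is shown below; a real deployment would use a proper SQL parser rather than regular expressions.

import re

FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|MERGE|DROP|ALTER|CREATE|TRUNCATE|GRANT|REVOKE)\b",
    re.IGNORECASE,
)

def is_query_allowed(sql_query):
    # Allow only single read-only statements; everything else is blocked
    # before it ever reaches the warehouse.
    statement = sql_query.strip().rstrip(";")
    if ";" in statement:  # multiple statements chained together
        return False
    if FORBIDDEN.search(statement):
        return False
    return statement.upper().startswith(("SELECT", "WITH"))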
How do you improve accuracy over time?
Collect each query and user feedback. Identify which parts of the prompt or table metadata caused errors. Expand the knowledge base with more reference SQL queries. Enhance the LLM pipeline to incorporate user corrections as new data points. Increase coverage for domain-specific phrases in embeddings.
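A small sketch of how a user correction could flow back into the reference-query store; the JSONL file and field names are assumptions about the storage format.

import json
from datetime import datetime, timezone

def record_feedback(prompt, generated_sql, corrected_sql, path="reference_queries.jsonl"):
    # Append the corrected pair so retrieval can surface it later as a
    # reference query for similar prompts.
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "generated_sql": generated_sql,
        "corrected_sql": corrected_sql,
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")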
How do you handle time-series or trend-based questions?
Embed logic in the pipeline to detect trend keywords such as "daywise" or "monthly". Augment the final SQL with time-based grouping or window functions. Maintain canonical reference queries that group by date or track moving averages. Map user-specified durations (such as "last week") to date filters with the correct date format.
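A sketch of keyword-to-date-grain mapping that the pipeline could append as hints to the structured prompt; the keyword lists, the order_ts column, and the Snowflake date expressions are assumptions.

TIME_GRAINS = {
    "daywise": "DATE_TRUNC('day', order_ts)",
    "daily": "DATE_TRUNC('day', order_ts)",
    "weekly": "DATE_TRUNC('week', order_ts)",
    "monthly": "DATE_TRUNC('month', order_ts)",
}

RELATIVE_RANGES = {
    "last week": "order_ts >= DATEADD('day', -7, CURRENT_DATE)",
    "last month": "order_ts >= DATEADD('month', -1, CURRENT_DATE)",
}

def time_hints(prompt):
    # Collect grouping and filtering hints whenever trend-style keywords
    # appear in the question, to be appended to the LLM prompt.
    lowered = prompt.lower()
    hints = []
    for keyword, expr in TIME_GRAINS.items():
        if keyword in lowered:
            hints.append(f"Group results by {expr}")
    for phrase, condition in RELATIVE_RANGES.items():
        if phrase in lowered:
            hints.append(f"Filter with {condition}")
    return hints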
How do you handle concurrency and scaling?
Deploy the text-to-SQL pipeline as a stateless microservice that can spin up multiple instances. Use serverless or container-based orchestration for concurrency. Cache embeddings for frequently asked queries to reduce repetitive computations. For the final query execution, maintain a shared pool of warehouse connections.
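Two of those ideas, memoized embeddings and a small shared connection pool, could look like the sketch below; the pool size and the embed helper (from the retrieval sketch earlier) are assumptions, and production systems would typically rely on managed pooling instead.

from functools import lru_cache
from queue import Queue
import snowflake.connector

@lru_cache(maxsize=10_000)
def cached_embedding(text):
    # Memoize embeddings for frequently asked prompts and column descriptions.
    return tuple(embed(text))  # tuples are hashable and safe to cache

class ConnectionPool:
    # Minimal shared pool for final query execution.
    def __init__(self, creds, size=5):
        self._pool = Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(snowflake.connector.connect(**creds))

    def run(self, sql_query):
        conn = self._pool.get()
        try:
            cur = conn.cursor()
            try:
                cur.execute(sql_query)
                return cur.fetchall()
            finally:
                cur.close()
        finally:
            self._pool.put(conn)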
Why is a knowledge base plus retrieval crucial here?
It narrows the scope of the LLM’s search space. Without domain-specific metadata, the LLM might produce inaccurate queries or choose the wrong tables. The knowledge base ensures the model remains accurate for each domain. It also reduces token usage, which reduces cost and latency.