ML Case-Study Interview Question: Accelerating Data Discovery with LLMs, Semantic Search, and AI Chatbots for Data Lakes
Case-Study Question
A large technology enterprise has a data lake with hundreds of thousands of tables, plus event streams and feature stores. Data analysts often waste days looking for the right dataset, resorting to asking colleagues on internal chat platforms. The data catalog tool only provides basic keyword search, lacking the ability to handle contextual queries. Documentation coverage is also minimal, making it hard for users to understand each table’s purpose. Propose a scalable end-to-end approach to transform data discovery into a near-instant experience using Large Language Models. Explain how you would improve search relevancy, automate documentation generation, and integrate an AI chatbot with the catalog and chat platforms to help analysts quickly find data. Outline your plan, the technical underpinnings, and how you would assess success. Include your architecture design, any model tuning, and details on system prompts or embedding storage. Describe challenges you anticipate, how you plan to mitigate them, and the metrics you would use to measure adoption and performance.
Detailed Solution
Architecture and Data Flow
A straightforward architecture uses an enhanced metadata search system, an expanded documentation source, and an AI chatbot. The workflow starts with ingestion of table and column metadata into a centralized store. Documentation is auto-generated for tables lacking descriptions, using a Large Language Model. The documentation, along with structural metadata, is indexed and stored in the catalog. The chatbot layer queries these indices to serve user requests in real time.
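A minimal sketch of that ingestion flow is below; fetch_table_metadata, generate_doc, embed, and catalog_index are hypothetical stand-ins for the real catalog, LLM, and index services:

def ingest_table(table_id):
    meta = fetch_table_metadata(table_id)           # schema, owners, tags
    if not meta.get("description"):
        meta["description"] = generate_doc(meta)    # LLM-generated draft
        meta["doc_source"] = "AI-generated"         # tagged until SME review
    meta["embedding"] = embed(meta["description"])  # vector for semantic search
    catalog_index.upsert(table_id, meta)            # served to search + chatbot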
Search Improvements
A two-step approach is used. First, tune the existing keyword-based solution by boosting important tables, deboosting deprecated ones, hiding irrelevant entries, and refining relevancy parameters. Next, introduce semantic search by embedding queries and documents in a high-dimensional vector space. The search ranks candidates based on similarity. This handles vague queries like “rides in major cities” or “taxi type selection.”
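One way to combine the tuned keyword signal with semantic similarity is a weighted hybrid score. The sketch below assumes both input scores are already normalized to [0, 1]; the blend weight and tag multipliers are illustrative, not prescriptive:

# Hedged sketch of hybrid ranking with boosts; all weights are illustrative.
BOOSTS = {"certified": 1.5, "deprecated": 0.2}  # boost important, deboost deprecated

def hybrid_score(keyword_score, semantic_score, table_tags, alpha=0.4):
    # alpha blends keyword relevance with embedding similarity
    base = alpha * keyword_score + (1 - alpha) * semantic_score
    for tag in table_tags:
        base *= BOOSTS.get(tag, 1.0)
    return base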
Embedding Similarity
A standard approach computes the cosine similarity between a query vector u and a document vector v:

s = (u · v) / (||u|| ||v||)

Here, s is the similarity score, u · v is the dot product, and ||u|| and ||v|| are the magnitudes of the respective vectors. The indexing system stores embeddings for each dataset’s metadata, including AI-generated documentation, enabling deeper semantic matching.
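At the scale of hundreds of thousands of tables, a vector index is the usual storage choice. Below is a minimal sketch with FAISS (one option among several vector stores); L2-normalizing the vectors makes the inner product equal to cosine similarity. The dimension and the random vectors are placeholders:

import faiss
import numpy as np

d = 384                                                 # embedding dimension (model-dependent)
embeddings = np.random.rand(1000, d).astype("float32")  # placeholder metadata vectors
faiss.normalize_L2(embeddings)                          # unit vectors: inner product == cosine

index = faiss.IndexFlatIP(d)                            # exact inner-product index
index.add(embeddings)

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)                    # top-5 most similar datasets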
Documentation Generation
A prompt-based solution uses a Large Language Model to generate table descriptions from schema fields and sample records. The system stores these AI-generated documents under an “AI-generated” tag until a subject-matter expert approves or revises them. This approach increases coverage for critical tables, improving clarity and trust.
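A sketch of the prompt assembly is below; call_llm is a hypothetical placeholder for whichever completion API the enterprise uses:

DOC_PROMPT = """You are a data documentation assistant.
Given the table schema and sample rows below, write a concise description
of the table's purpose and each column. Use only the information provided.

Table: {table_name}
Columns: {columns}
Sample rows: {samples}
"""

def generate_table_doc(table_name, columns, samples):
    prompt = DOC_PROMPT.format(
        table_name=table_name,
        columns=", ".join(f"{c['name']} ({c['type']})" for c in columns),
        samples=samples,
    )
    doc = call_llm(prompt)                        # hypothetical LLM client call
    return {"text": doc, "tag": "AI-generated"}   # held until SME approval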
AI Chatbot Integration
A chatbot integrates with the enterprise’s chat platform. For textual queries (for instance: “Which tables contain aggregated metrics on driver earnings?”), the chatbot receives the user’s prompt, retrieves or re-ranks candidates from the metadata store, and returns relevant datasets or direct links. The chatbot references the embedded table documentation, effectively handling nuanced, context-rich queries.
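A hedged sketch of that retrieve-then-respond loop, reusing the semantic_search helper shown in the snippet below; catalog_url is a hypothetical link builder into the data catalog:

def answer_data_question(user_query, model, table_embeddings):
    query_emb = model.encode(user_query)
    candidates = semantic_search(query_emb, table_embeddings, top_k=5)
    lines = []
    for table_id, score in candidates:
        lines.append(f"{table_id} (score {score:.2f}): {catalog_url(table_id)}")
    return "Top matching tables:\n" + "\n".join(lines)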
Example Code Snippet
Below is a simplified Python approach for handling a semantic search request:
import numpy as np

def cosine_similarity(vec1, vec2):
    # Cosine similarity between two 1-D vectors
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

def semantic_search(query_embedding, table_embeddings, top_k=5):
    # Rank every table by similarity of its metadata embedding to the query
    scores = []
    for table_id, embed in table_embeddings.items():
        scores.append((table_id, cosine_similarity(query_embedding, embed)))
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_k]

# `model` is a preloaded sentence-embedding model and `table_embeddings`
# maps table IDs to precomputed metadata vectors.
query = "How to find driver earnings aggregated by city?"
query_emb = model.encode(query)  # LLM-based sentence embedding
results = semantic_search(query_emb, table_embeddings)
print("Top matching tables:", results)
The chatbot would then display these “Top matching tables” to the user with links to documentation.
Metrics and Validation
Key metrics include click-through rate, time-to-discovery, and coverage of approved documentation. User surveys help measure perceived search quality. Event logs measure how often users refine queries or abandon sessions. These data points help refine the model, system prompts, and metadata coverage.
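From raw event logs these metrics reduce to simple aggregations. A sketch, assuming each event carries a type and a datetime timestamp, and that a session contains at least one search and one click:

def click_through_rate(events):
    # events: dicts like {"session": ..., "type": "search"|"click", "ts": ...}
    searches = sum(1 for e in events if e["type"] == "search")
    clicks = sum(1 for e in events if e["type"] == "click")
    return clicks / searches if searches else 0.0

def time_to_discovery(session_events):
    # seconds from first search to first click within one session
    start = min(e["ts"] for e in session_events if e["type"] == "search")
    first_click = min(e["ts"] for e in session_events if e["type"] == "click")
    return (first_click - start).total_seconds()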
Potential Follow-Up Questions
How to handle sensitive data in the chatbot?
Ensure the chatbot recognizes and obeys column-level access controls. Build a policy layer that filters or masks sensitive columns from search results. Mark restricted tables in the metadata catalog, so the system only returns authorized data. Maintain audit logs of user queries and returned data.
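A sketch of such a policy layer; user_can_access and audit_log are hypothetical hooks into the access-control and audit services:

def filter_results(user, results):
    authorized = []
    for table_id, score in results:
        if user_can_access(user, table_id):
            authorized.append((table_id, score))
        else:
            audit_log(user, table_id, action="filtered")  # keep an audit trail
    return authorized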
How to address model hallucinations?
Use grounded generation. Retrieve relevant metadata from the search index before formulating a response. Include system prompts that direct the model to base its output strictly on retrieved facts, not to fabricate data. Log user feedback. Retrain and refine prompts as necessary when false answers are reported.
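An illustrative system prompt and wrapper for grounded generation; call_llm is the same hypothetical client as above:

SYSTEM_PROMPT = """You are a data discovery assistant.
Answer ONLY from the retrieved metadata below. If the metadata does not
contain the answer, say you could not find a matching dataset.
Never invent table names, columns, or statistics.

Retrieved metadata:
{retrieved_context}
"""

def grounded_answer(user_query, retrieved_context):
    prompt = SYSTEM_PROMPT.format(retrieved_context=retrieved_context)
    return call_llm(prompt, user_message=user_query)  # hypothetical signature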
How to manage stale documentation?
Set up a regular refresh pipeline. Monitor table usage and schema changes. Trigger re-generation or human review if a table’s schema differs from the stored representation. Mark old documentation as outdated if the table undergoes significant modifications. Automate a process that prompts domain owners to review any changes.
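One lightweight trigger is a schema fingerprint: hash the current schema and compare it against the fingerprint stored alongside the documentation. A sketch using only the standard library; mark_doc_outdated and request_owner_review are hypothetical catalog hooks:

import hashlib
import json

def schema_fingerprint(columns):
    # Stable hash of a table's schema (name/type pairs)
    canonical = json.dumps(sorted((c["name"], c["type"]) for c in columns))
    return hashlib.sha256(canonical.encode()).hexdigest()

def check_staleness(table_id, current_columns, stored_fingerprint):
    if schema_fingerprint(current_columns) != stored_fingerprint:
        mark_doc_outdated(table_id)
        request_owner_review(table_id)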
How to evaluate performance over time?
Measure improvements in click-through rates, user satisfaction, and time saved. Track monthly active users in the data discovery tool. Observe how often the AI-generated documentation is accepted or updated by domain experts. Conduct random spot checks on query logs to verify correctness and relevance. Use these signals to refine the system iteratively.