ML Case-study Interview Question: Standardizing Business Classification: A RAG Pipeline Approach with NAICS Codes
Case-Study Question
A fast-growing organization faced confusion and inefficiency due to multiple non-standard industry classification systems. Teams used an in-house taxonomy and blended external data in ad hoc ways, making it impossible to create precise, consistent, and auditable industry labels. Sales relied on broad or incorrect labels, Risk lacked consistent compliance data, and any external partner communication required messy mapping between multiple systems. The organization decided to replace its homegrown approach with a standardized taxonomy, specifically six-digit NAICS codes. They built an internal Retrieval-Augmented Generation (RAG) pipeline to generate high-quality classifications at scale. Assume you are tasked with designing and deploying this new classification system for the entire customer base.
Detailed Solution
Transition to a Standardized Taxonomy
Teams replaced the legacy taxonomies with six-digit NAICS codes, which improved precision for niche use cases. Because NAICS codes are hierarchical, teams could roll individual codes up into broader labels for flexible reporting. This eliminated confusion about how to translate categories when comparing data with external partners or other internal departments.
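Because the first two through six digits of a NAICS code identify progressively broader levels (sector, subsector, industry group, NAICS industry, national industry), rolling up is essentially prefix truncation. A minimal sketch of that idea, ignoring the handful of sectors that span multiple two-digit prefixes:

NAICS_LEVELS = {
    2: "sector",
    3: "subsector",
    4: "industry_group",
    5: "naics_industry",
    6: "national_industry",
}

def roll_up(naics_code: str, digits: int) -> str:
    """Return the broader NAICS label obtained by keeping the first `digits` digits."""
    if digits not in NAICS_LEVELS:
        raise ValueError(f"Unsupported roll-up level: {digits}")
    return naics_code[:digits]

# Example: a six-digit code rolls up to its 2-digit sector for high-level reporting.
print(roll_up("541511", 2))  # -> "54"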
Building a RAG Pipeline
They built a two-stage RAG system. The first stage generated top-k code candidates based on text embeddings. The second stage used a Large Language Model (LLM) to refine that shortlist into the final prediction. This approach constrained the LLM to valid NAICS codes and reduced hallucinations.
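A minimal sketch of how the two stages could be wired together, reusing the retrieve_top_k_codes function from the sample snippet below and a hypothetical llm_select callable that wraps the LLM call:

def classify_business(business_text, naics_descriptions, k, llm_select):
    """Two-stage RAG classification: retrieve top-k candidates, then let the LLM pick one.

    `llm_select` is a hypothetical callable that takes the business text and a dict of
    candidate codes -> descriptions and returns one code from that dict.
    """
    candidates = retrieve_top_k_codes(business_text, naics_descriptions, k)
    candidate_descriptions = {code: naics_descriptions[code] for code in candidates}
    chosen = llm_select(business_text, candidate_descriptions)
    # Constrain the output: anything outside the shortlist is treated as a hallucination.
    return chosen if chosen in candidate_descriptions else candidates[0]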
First Stage: Similarity-Based Retrieval
They embedded both the business descriptions (e.g. business summary, website text) and each NAICS code description. They computed similarity scores and retrieved a subset of top matches. Hyperparameters such as embedding models, text fields used, and k (number of recommended codes) were tuned. They measured performance with accuracy at k.
They profiled different configurations, balancing coverage against resource constraints and accounting for missing data in certain fields.
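Accuracy at k here is the fraction of labeled examples whose true code appears anywhere in the retrieved shortlist. A minimal sketch of the metric, assuming a labeled evaluation set of (business_text, true_code) pairs and reusing retrieve_top_k_codes from the sample snippet below:

def accuracy_at_k(eval_set, naics_descriptions, k):
    """Fraction of examples whose true NAICS code appears in the top-k retrieved candidates."""
    hits = 0
    for business_text, true_code in eval_set:
        candidates = retrieve_top_k_codes(business_text, naics_descriptions, k)
        if true_code in candidates:
            hits += 1
    return hits / len(eval_set) if eval_set else 0.0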
Second Stage: LLM Classification
They supplied the LLM with the shortlist of candidate NAICS codes along with relevant business context, and the LLM selected the most suitable final label. They experimented with prompt wording, which fields to include, and the number of prompting steps. To handle large context sizes, they used a multi-step approach: first ask the LLM to narrow down the recommendations, then ask it to pick the single best code. This boosted performance on a custom fuzzy-accuracy metric that awards partial credit when only some of the code's digits match.
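One way to define such a fuzzy metric is to credit the longest matching prefix of digits, since NAICS prefixes correspond to broader levels of the hierarchy. A minimal sketch (the exact weighting is an illustrative assumption, not necessarily the team's formula):

def fuzzy_score(predicted: str, actual: str) -> float:
    """Partial credit: fraction of leading digits that match (1.0 for an exact 6-digit match)."""
    matched = 0
    for p, a in zip(predicted, actual):
        if p != a:
            break
        matched += 1
    return matched / len(actual)

def fuzzy_accuracy(predictions, actuals):
    """Average fuzzy score over a labeled evaluation set."""
    scores = [fuzzy_score(p, a) for p, a in zip(predictions, actuals)]
    return sum(scores) / len(scores) if scores else 0.0

# Example: predicting 541512 when the true code is 541511 still earns 5/6 credit.
print(fuzzy_score("541512", "541511"))  # -> 0.833...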
Result
They deployed the RAG model in production. It logs intermediate results for auditing and debugging. The pipeline improved classification accuracy, consistency, and interpretability. Teams gained confidence in the new industry classification, and the system scaled well to new businesses. Because the pipeline is in-house, they can tweak hyperparameters or re-check logs anytime.
Sample Code Snippet
import math
import requests

def embed_text(text, embedding_api_endpoint):
    """Call the embedding service and return the embedding vector."""
    response = requests.post(embedding_api_endpoint, json={"text": text})
    response.raise_for_status()
    return response.json()["embedding"]

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_top_k_codes(business_text, naics_descriptions, k):
    """Return the k NAICS codes whose descriptions best match the business text."""
    business_emb = embed_text(business_text, "http://embedding-api")
    scored_codes = []
    for code, desc in naics_descriptions.items():
        # In production these embeddings would be pre-computed and cached.
        code_emb = embed_text(desc, "http://embedding-api")
        scored_codes.append((code, cosine_similarity(business_emb, code_emb)))
    scored_codes.sort(key=lambda x: x[1], reverse=True)
    return [code for code, _ in scored_codes[:k]]
Follow-Up Question 1
How would you handle businesses with sparse or missing descriptive data when generating embeddings?
Answer and Explanation: Use multiple sources. Combine short descriptions, website metadata, user-provided data, and any third-party references. Fallback logic could try aggregated industry statistics or group-level descriptors if certain fields are absent. Experiment with fine-tuning embedding models to handle minimal text, or build specialized prompts to gather context from partial data. If a record is too sparse, default to a higher-level NAICS category and prompt the user or Sales team to confirm.
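A minimal sketch of such fallback logic, with hypothetical field names and a character threshold chosen only for illustration:

def build_embedding_text(record: dict) -> str:
    """Concatenate whichever descriptive fields are present for a business record."""
    fields = ["business_summary", "website_text", "user_description", "third_party_notes"]
    parts = [record[f] for f in fields if record.get(f)]
    return " ".join(parts)

def classify_with_fallback(record, naics_descriptions, k, min_chars=20):
    """When text is sparse, only commit to a broad 2-digit sector and flag for human review."""
    text = build_embedding_text(record)
    candidates = retrieve_top_k_codes(text, naics_descriptions, k) if text else []
    if not candidates:
        return {"code": None, "needs_review": True}
    if len(text) < min_chars:
        # Not enough signal for a reliable 6-digit prediction: keep only the sector prefix.
        return {"code": candidates[0][:2], "needs_review": True}
    return {"code": candidates[0], "needs_review": False}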
Follow-Up Question 2
How would you optimize latency for the two-stage pipeline in high-traffic scenarios?
Answer and Explanation: Use pre-computed embeddings for NAICS codes, and store them in a low-latency vector database. For businesses, cache or pre-compute embeddings when possible. Keep the top-k retrieval step in memory. Minimize context for the LLM by passing only the most relevant fields. Use smaller or quantized embedding models if resources are limited. Parallelize or batch requests to embedding and LLM services to reduce overhead. Monitor response times and scale horizontally or deploy more GPU workers when load increases.
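A minimal sketch of pre-computing the NAICS embeddings once and doing the top-k lookup in memory with numpy; in production this would typically live in a vector database, and the numpy version is shown only to make the idea concrete:

import numpy as np

def precompute_code_embeddings(naics_descriptions, embed_fn):
    """Embed every NAICS description once; returns (codes, row-normalized embedding matrix)."""
    codes = list(naics_descriptions.keys())
    matrix = np.array([embed_fn(naics_descriptions[c]) for c in codes], dtype=np.float32)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return codes, matrix

def fast_top_k(business_emb, codes, matrix, k):
    """Top-k retrieval as a single matrix-vector product over pre-normalized embeddings."""
    query = np.asarray(business_emb, dtype=np.float32)
    query = query / np.linalg.norm(query)
    scores = matrix @ query
    top_idx = np.argsort(scores)[::-1][:k]
    return [codes[i] for i in top_idx]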
Follow-Up Question 3
How do you mitigate potential LLM hallucinations during the second stage?
Answer and Explanation: Constrain the LLM by only allowing valid NAICS codes for final output. Prompt the LLM to confirm that the final code appears in the candidate list. Validate codes post-prediction against a set of known valid codes. If the model returns something unknown, either set a default fallback or trigger a re-prompt with the next best candidate from the shortlist.
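A minimal sketch of that post-prediction guard, assuming a hypothetical llm_pick_code helper that returns the model's raw answer as text:

def select_code_safely(business_text, candidates, llm_pick_code, max_retries=1):
    """Accept the LLM's answer only if it is one of the retrieved candidates."""
    allowed = set(candidates)
    answer = llm_pick_code(business_text, candidates).strip()
    retries = 0
    while answer not in allowed and retries < max_retries:
        # Re-prompt, reminding the model it must answer with one of the listed codes.
        answer = llm_pick_code(business_text, candidates).strip()
        retries += 1
    # Fall back to the highest-ranked retrieval candidate if the model never complies.
    return answer if answer in allowed else candidates[0]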
Follow-Up Question 4
Why is it important to log intermediate outputs, and how would you debug errors?
Answer and Explanation: Logs reveal whether errors stem from retrieval or from LLM classification. Inspect each stage: if the correct label was retrieved but wrongly rejected, investigate the prompt or the LLM temperature settings. If retrieval missed the correct label entirely, analyze embedding coverage, missing fields, or search hyperparameters. Examine the logs for outliers, then re-run the pipeline locally on the same inputs captured in the logs to reproduce and fix the error. This approach enables fast iteration and continuous model improvement.
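A minimal sketch of the per-stage structured logging that makes this kind of debugging possible; the field names are illustrative assumptions:

import json
import logging

logger = logging.getLogger("naics_pipeline")

def log_classification(business_id, business_text, candidates, llm_choice, final_code):
    """Emit one structured record per classification so each stage can be audited later."""
    logger.info(json.dumps({
        "business_id": business_id,
        "input_chars": len(business_text),
        "retrieved_candidates": candidates,   # was the true code even retrieved?
        "llm_choice": llm_choice,             # did the LLM reject a correct candidate?
        "final_code": final_code,
    }))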
Follow-Up Question 5
How do you select an appropriate “k” in accuracy at k, and how do you prevent trivially recommending all possible codes?
Answer and Explanation: Balance coverage (higher k) against retrieval specificity (lower k). Conduct a grid search over k and measure full-system performance, not just top-k retrieval accuracy. If k is too large, the LLM sees irrelevant choices and its accuracy may degrade. If k is too small, the correct code may not appear in the shortlist. Benchmark fuzzy accuracy or standard accuracy across different k values and pick the sweet spot, keeping an upper limit so the system never suggests every code in the database.
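A minimal sketch of that sweep, reusing the fuzzy_accuracy helper sketched earlier and a hypothetical run_full_pipeline function that executes both stages for a given k and returns a predicted code:

def sweep_k(eval_set, naics_descriptions, run_full_pipeline, k_values=(3, 5, 10, 20)):
    """Measure end-to-end fuzzy accuracy for each candidate k and return the best one."""
    results = {}
    for k in k_values:
        predictions = [run_full_pipeline(text, naics_descriptions, k) for text, _ in eval_set]
        actuals = [true_code for _, true_code in eval_set]
        results[k] = fuzzy_accuracy(predictions, actuals)
    best_k = max(results, key=results.get)
    return best_k, results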
Follow-Up Question 6
How would you ensure adaptability for evolving NAICS codes or new business types?
Answer and Explanation: Periodically update the NAICS knowledge base as official changes occur. When encountering new industries, add new codes or synonyms to the knowledge base. Re-run embeddings for the updated list and store them for retrieval. Add specialized logic for rarely seen domains. Log real-world classification mismatches or user feedback, then refine the pipeline. Maintain versioning so the system can revert to prior stable configurations if needed.
Follow-Up Question 7
How would you expand beyond industry classification to other attributes (like sub-category or risk labels)?
Answer and Explanation: Leverage the same RAG framework. Build specialized knowledge bases for each attribute, embedding relevant definitions or label descriptions. Use business embeddings as before, retrieve top matches, then refine with an LLM for the final classification. Each attribute can have its own fine-tuned pipeline but re-use the main infrastructure. Maintain logs and hyperparameter tuning to handle unique classification challenges for each additional attribute.
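A minimal sketch of how the same pipeline could be parameterized by attribute, with each attribute bringing its own label knowledge base and LLM selection function; the structure and names are illustrative assumptions:

from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class AttributeClassifier:
    """Reusable RAG classifier: only the knowledge base and LLM prompt differ per attribute."""
    label_descriptions: Dict[str, str]                       # e.g. NAICS codes, sub-categories, risk labels
    retrieve: Callable[[str, Dict[str, str], int], list]     # embedding-based top-k retrieval
    llm_select: Callable[[str, Dict[str, str]], str]         # LLM refinement over the shortlist
    k: int = 10

    def classify(self, business_text: str) -> str:
        candidates = self.retrieve(business_text, self.label_descriptions, self.k)
        shortlist = {c: self.label_descriptions[c] for c in candidates}
        choice = self.llm_select(business_text, shortlist)
        return choice if choice in shortlist else candidates[0]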