ML Case-study Interview Question: Commonsense Knowledge Graphs via LLMs for Enhanced E-commerce Recommendations
Case-study question
You are a Senior Data Scientist at a large e-commerce platform. The platform needs to improve product recommendations by incorporating commonsense relationships. User queries often hint at implicit needs or connections: for example, a user searches for "shoes for pregnant women" but could be interested in slip-resistant footwear. The platform wants to build a commonsense knowledge graph from customer interaction data (including queries, purchases, and co-purchases), filter out trivial relationships, and then enrich the recommender system with the inferred relationships. How would you architect the entire solution end to end, ensuring measurable gains in recommendation performance?
Detailed solution
Data for constructing the knowledge graph comes from user queries, subsequent purchases, and co-purchases in the same session. The goal is to find implicit relationships like "used_for_event" or "used_for_audience" that connect items to real-world scenarios or consumer attributes.
A large language model (LLM) can generate hypotheses about why certain items get paired in user journeys. The system must separate valid relationships from trivial or generic statements. A filtering mechanism retains plausible, typical connections. Remaining connections become entity-relation-entity triples in the knowledge graph.
Combining product embeddings (from queries and descriptions) with knowledge graph relationships yields context-aware recommendations. It also helps the model see beyond surface textual similarity by injecting real-world logic, like "slip-resistant shoes for pregnant women." A cross-encoder that incorporates these relationships often outperforms simpler models. Even before fine-tuning, injecting the relationships gives the recommendation engine a large boost in macro F1; after fine-tuning, the gap over the baseline remains significant.
For measuring success, the system uses F1 as a key metric. The F1 score combines precision and recall:
Precision counts how many recommended items are correct among all items recommended. Recall counts how many correct items get recommended out of all possible correct ones. Large F1 jumps indicate that the knowledge graph's relationships provide valuable contextual cues that typical recommenders miss.
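In symbols, with TP, FP, and FN counting true positives, false positives, and false negatives: Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F1 = 2 * Precision * Recall / (Precision + Recall), the harmonic mean of the two.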
Under-the-hood approach
Data gathering and preprocessing: Collect query-purchase pairs within fixed time windows or click counts. Identify co-purchased products from single sessions. Remove noisy outliers (like category mismatches).
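As a sketch, the pairing step might look like the following, assuming interaction logs in a pandas DataFrame with illustrative columns (session_id, event_type, query_text, product_id, timestamp) and a 30-minute attribution window:
import pandas as pd

logs = pd.read_parquet("interaction_logs.parquet")  # illustrative source
logs = logs.sort_values(["session_id", "timestamp"])

pairs = []
for _, session in logs.groupby("session_id"):
    queries = session[session["event_type"] == "search"]
    purchases = session[session["event_type"] == "purchase"]
    for _, q in queries.iterrows():
        # Keep purchases that happen within a fixed window after the query
        window = purchases[
            (purchases["timestamp"] > q["timestamp"])
            & (purchases["timestamp"] <= q["timestamp"] + pd.Timedelta(minutes=30))
        ]
        pairs.extend((q["query_text"], p["product_id"]) for _, p in window.iterrows())

query_purchase_pairs = pd.DataFrame(pairs, columns=["query_text", "product_id"])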
Generating candidate relationships: Prompt an LLM to propose short rationales. If it returns generic or meaningless answers, filter them out. Keep those that propose connections like "used_for_function" or "capable_of."
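A sketch of this generation-plus-filtering step; call_llm stands in for whatever LLM client you use, and the prompt wording is hypothetical:
ALLOWED_RELATIONS = {"used_for_event", "used_for_audience", "used_for_function", "capable_of"}

PROMPT_TEMPLATE = (
    "A customer searched for '{query}' and then bought '{product}'. "
    "Explain the connection as one triple 'product | relation | concept', "
    "where relation is one of: used_for_event, used_for_audience, "
    "used_for_function, capable_of. Answer 'none' if no meaningful relation exists."
)

def generate_candidate(query, product, call_llm):
    # Ask the LLM for one candidate triple; drop generic or malformed output
    answer = call_llm(PROMPT_TEMPLATE.format(query=query, product=product)).strip()
    if answer.lower() == "none" or answer.count("|") != 2:
        return None
    head, relation, tail = (part.strip() for part in answer.split("|"))
    return (head, relation, tail) if relation in ALLOWED_RELATIONS else None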
Human validation: Annotate representative pairs based on plausibility and how typical they are. Use these labels to train a classifier that can filter the large set of candidate relationships.
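A minimal sketch of such a filtering classifier, using TF-IDF features over the triple text (the annotated examples and the 0.8 confidence threshold are illustrative):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical annotated triples rendered as text; 1 = plausible and typical
annotated_triples = [
    "slip-resistant shoes | used_for_audience | pregnant women",
    "camera | used_for_event | cooking dinner",
]
labels = [1, 0]

filter_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
filter_clf.fit(annotated_triples, labels)

# Keep only candidates the classifier scores as plausible with high confidence
candidates = ["maternity pillow | used_for_audience | pregnant women"]
keep_mask = filter_clf.predict_proba(candidates)[:, 1] > 0.8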
Expanded LLM instructions: Ask the model to re-check the candidate relationships under more precise guidelines gleaned from the validated data. This step refines the entity-relation-entity triples that go into the final knowledge graph.
Recommendation model integration: The cross-encoder model accepts query-product pairs and relevant knowledge graph relationships. It outputs refined relevance scores. This approach consistently outperforms baseline models.
Evaluation: Run offline experiments on a labeled dataset from past user queries. Freeze the encoder first and measure the improvement; then fine-tune the encoder on a subset of the data and measure the new gains. Compare macro F1 and micro F1 scores with and without knowledge graph relationships.
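With scikit-learn, the metric comparison reduces to a few lines; the label arrays below are illustrative:
from sklearn.metrics import f1_score

# y_true: human relevance labels; y_pred_base / y_pred_kg: predictions
# without and with knowledge graph relationships
y_true = [1, 0, 1, 1, 0]
y_pred_base = [1, 0, 0, 0, 0]
y_pred_kg = [1, 0, 1, 1, 0]

for name, y_pred in [("baseline", y_pred_base), ("with KG", y_pred_kg)]:
    print(name,
          "macro F1:", f1_score(y_true, y_pred, average="macro"),
          "micro F1:", f1_score(y_true, y_pred, average="micro"))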
Code snippet sample
Below is a minimal illustration of how you might integrate relationship data. This is a rough outline in Python:
import torch
import torch.nn as nn
import torch.optim as optim

class CrossEncoderModel(nn.Module):
    def __init__(self, encoder, relation_dim):
        super().__init__()
        self.encoder = encoder  # pre-trained or custom text encoder
        # Query and product encodings are concatenated, so the classifier
        # input is twice the encoder width plus the relation vector size.
        self.classifier = nn.Linear(2 * encoder.output_dim + relation_dim, 2)

    def forward(self, query_text, product_text, relations_vector):
        encoded_query = self.encoder(query_text)
        encoded_product = self.encoder(product_text)
        concat_vec = torch.cat([encoded_query, encoded_product, relations_vector], dim=-1)
        return self.classifier(concat_vec)

# Suppose data_loader yields tokenized query-product pairs with relation vectors
model = CrossEncoderModel(encoder=some_pretrained_encoder, relation_dim=64)
optimizer = optim.Adam(model.parameters(), lr=1e-4)
loss_function = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    for batch in data_loader:
        query_ids, product_ids, rel_vec, labels = batch
        optimizer.zero_grad()
        logits = model(query_ids, product_ids, rel_vec)
        loss = loss_function(logits, labels)
        loss.backward()
        optimizer.step()
In this snippet, relations_vector is derived from your knowledge graph relationships mapped to numerical embeddings. The rest is straightforward fine-tuning.
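One simple way to build that vector is to embed each relation type and mean-pool over the relations retrieved for a query-product pair. A minimal sketch, assuming a small fixed vocabulary of relation types and a 64-dimensional embedding (both illustrative):
import torch
import torch.nn as nn

RELATION_TYPES = ["used_for_event", "used_for_audience", "used_for_function", "capable_of"]
relation_to_id = {name: i for i, name in enumerate(RELATION_TYPES)}
relation_embedding = nn.Embedding(num_embeddings=len(RELATION_TYPES), embedding_dim=64)

def build_relations_vector(relation_names):
    # Map the KG relations retrieved for one query-product pair to a fixed-size vector
    if not relation_names:
        return torch.zeros(64)  # no KG evidence: fall back to a zero vector
    ids = torch.tensor([relation_to_id[r] for r in relation_names])
    return relation_embedding(ids).mean(dim=0)  # mean-pool over relations

vec = build_relations_vector(["used_for_audience", "capable_of"])  # shape: (64,)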
Potential follow-up questions
How would you handle the scalability of the knowledge graph?
The knowledge graph might explode in size if you simply gather all co-purchase pairs. Restrict the graph to higher-frequency connections or rely on classifiers to keep only plausible relationships. Store these relationships in a graph database that supports distributed storage and parallel queries. Partition or shard the graph based on product categories or relationship types. Cache frequently accessed subgraphs in memory for low-latency retrieval during recommendations.
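As a sketch, frequency-based pruning and relation-type sharding might look like this (the triples and the support threshold are illustrative):
from collections import Counter

# Illustrative mined triples; in production this is a large distributed dataset
candidate_triples = [
    ("slip-resistant shoes", "used_for_audience", "pregnant women"),
    ("slip-resistant shoes", "used_for_audience", "pregnant women"),
    ("camera", "used_for_event", "cooking dinner"),
]

MIN_SUPPORT = 2  # illustrative threshold; tune per category
triple_counts = Counter(candidate_triples)
frequent = [t for t, n in triple_counts.items() if n >= MIN_SUPPORT]

# Shard by relation type so each partition can be stored and queried independently
shards = {}
for head, relation, tail in frequent:
    shards.setdefault(relation, []).append((head, relation, tail))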
What happens if the LLM starts generating unrealistic pairs?
In practice, LLMs sometimes produce invalid inferences. Protect against this by (1) performing semantic similarity checks that compare the LLM's output to the prompt; (2) comparing the generated triple to known domain constraints, such as linking shoes to "used_for_audience: pregnant women" but filtering out a triple linking cameras to cooking events; (3) verifying typicality with real user interaction data. Repeated retraining of the filtering classifier on new human-validated data improves robustness.
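One way to implement the semantic similarity check from step (1), using a generic sentence encoder (the model name and threshold are illustrative choices):
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

def passes_similarity_check(query, product, triple_text, threshold=0.4):
    # Reject triples whose rationale drifts away from the query-product context
    context_vec = encoder.encode(f"{query} {product}")
    triple_vec = encoder.encode(triple_text)
    return util.cos_sim(context_vec, triple_vec).item() >= threshold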
Could you integrate this with personalized recommendations?
Yes. You can fuse user-specific behavior data with the knowledge graph. For example, if a user frequently buys camera gear and queries "best lens protection," the system might surface products connected to photography safety gear. The knowledge graph relationships add contextual edges that highlight camera cases, screen protectors, or lens hoods as "capable_of protecting camera." Combine these contextual relationships with user embeddings from past behavior to refine the final ranking.
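A simple late-fusion sketch: blend the KG-aware cross-encoder score with a user-product affinity score (the mixing weight alpha is an illustrative hyperparameter to tune offline):
import torch

def personalized_score(kg_relevance_logit, user_vec, product_vec, alpha=0.8):
    # Affinity between the user's behavioral embedding and the product embedding
    affinity = torch.cosine_similarity(user_vec, product_vec, dim=-1)
    return alpha * kg_relevance_logit + (1.0 - alpha) * affinity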
How do you ensure the model remains performant when adding new data?
Use incremental updates. Schedule batch processes to re-check relationships on new query-purchase logs. Update the knowledge graph periodically without retraining from scratch. Keep a separate pipeline that filters newly generated triples with the existing classifier. Fine-tune the recommendation model if the distribution of user queries shifts significantly. Have an internal threshold for performance metrics. If F1 or other measures degrade, run more frequent refresh cycles.
How do you confirm that better F1 leads to real business impact?
Beyond the offline F1, run online tests. Randomly sample a portion of users into an A/B experiment. Compare clickthrough rates, purchase rates, or session-level conversion. Measure longer-term metrics like average order value or user satisfaction. Confirm the correlation between better offline metrics and actual buyer engagement. Scale to broader user segments if A/B tests confirm positive gains.
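A quick way to check whether an observed conversion lift is statistically significant is a two-proportion z-test; the counts below are illustrative:
from statsmodels.stats.proportion import proportions_ztest

# Conversions and users in control vs. treatment
conversions = [1320, 1455]
users = [25000, 25000]

stat, p_value = proportions_ztest(count=conversions, nobs=users)
print(f"z = {stat:.2f}, p = {p_value:.4f}")  # scale out only if the lift is significant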
How would you explain the approach to non-technical stakeholders?
Frame it as harnessing real-world context to make recommendations that fit users' life situations. The knowledge graph is like a map connecting products to meaningful use cases, automatically picking out hidden correlations. Demonstrate it with actual examples: "Query: shoes for pregnant women → slip-resistant shoes." Show them the data-backed performance charts and the direct increase in successful purchases or user satisfaction.
This completes the case-study question and its exhaustive solution.