ML Case-study Interview Question: Cascade Bandits for Bias-Corrected E-commerce Homepage Item Ranking
Case-Study Question
A large e-commerce platform wants to improve the ranking of items on its homepage. They track user clicks on items but only receive limited feedback. They suspect that a greedy approach causes the system to repeatedly expose popular items while ignoring attractive but rarely shown items. They also notice strong position bias: users tend to click the first interesting item they see and stop. How would you design a bandit-based solution that balances exploring new items and exploiting current top performers, while also correcting for position bias to find the best overall ranking for each user context?
Complete In-Depth Solution
A ranking problem involves arranging a set of items in an ordered list to maximize an objective, often user engagement or clicks. In our scenario, the objective is to find the best subset (or best ordering) of items from a larger pool. Implicit feedback (clicks) is noisy, and greedy methods risk missing candidates that would yield higher click-through rates if they were adequately explored.
Core Bandit Framework
Contextual bandits can efficiently solve this by balancing exploration (trying less-known items) and exploitation (showing known top performers). The bandit concept treats each possible ranking as an “arm.” When the system shows a ranking and observes clicks, it updates its understanding of the best arms.
However, a ranking is a permutation of items. This means the set of possible arms can be enormous. Structured bandits solve this by learning from sub-arm feedback. If a certain item is repeatedly clicked, that item’s estimated attraction probability improves, which benefits all rankings containing that item.
Position Bias and the Cascade Model
Position bias means early positions get more attention. The cascade model posits users scan the list from top to bottom, clicking the first appealing item and then stopping. Items above a click are deemed non-attractive for that user, and items below are ignored. This approach simplifies feedback and helps the model correct for position bias.
Under the cascade assumption, the probability of a click at position k equals the item's attractiveness times the probability that no earlier position was clicked:

P(click at position k) = a_k * (1 - a_1) * (1 - a_2) * ... * (1 - a_{k-1})

Here a_k is the attraction probability of the item shown in position k. The cascade assumption says users examine items in sequence until they click, and attractiveness depends on the item itself, not its position.
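To make the formula concrete, here is a small numeric sketch with hypothetical attraction values:

import numpy as np

# Hypothetical attraction probabilities for a 4-item ranking.
a = np.array([0.3, 0.5, 0.2, 0.4])

# Cascade model: P(click at k) = a_k * prod_{i<k} (1 - a_i).
no_click_before = np.concatenate(([1.0], np.cumprod(1 - a)[:-1]))
p_click_at = a * no_click_before
print(p_click_at)        # [0.3   0.35  0.07  0.112]
print(p_click_at.sum())  # P(any click) = 1 - prod(1 - a_i) = 0.832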
Linear Combination of Features
To generalize learning across items and contexts, assume each item's attraction probability is a linear function of known features z_{item} and an unknown parameter vector beta. The model is:

a_{item} = beta^T z_{item}

z_{item} can encode item characteristics (category or other properties) and user/context features (user type, location, etc.). By learning beta, the system estimates how likely each item is to attract a click.
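As an illustrative sketch (the feature names here are hypothetical, not from any particular system), a context-item feature vector could be assembled like this:

import numpy as np

def build_features(item, user):
    # Illustrative encoding: item category one-hot, normalized price,
    # and a user-type indicator; dimension d must match across all items.
    categories = ["electronics", "apparel", "home"]
    cat = [1.0 if item["category"] == c else 0.0 for c in categories]
    price = [min(item["price"] / 100.0, 1.0)]
    returning = [1.0 if user["is_returning"] else 0.0]
    return np.array(cat + price + returning)  # z_item with d = 5

z_item = build_features({"category": "apparel", "price": 40.0},
                        {"is_returning": True})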
CascadeLinTS (Cascade Linear Thompson Sampling)
Thompson sampling randomly samples a plausible beta from its posterior, computes each item’s estimated attractiveness, then forms the ranking of top items. Observed clicks update the posterior distribution to refine the next sampling.
Algorithm steps:
Sample beta from the posterior distribution (a multivariate normal capturing current beliefs).
Compute each item's estimated attraction score beta^T z_{item}.
Rank items by that estimated score.
Show the top K items on the page and observe clicks (or the absence of clicks).
Update the posterior with the new feedback so the next beta sample is better informed.
Exploration happens because each iteration draws a slightly different beta, which may favor items that have not been shown often. If new items consistently perform well, they become favored in future draws.
Practical Flow
Data arrives as clicks from real users. The system:
Receives the context (user features, session data).
Draws a beta.
Scores all items, picks the top K to display.
Observes user behavior. If the user clicks item j at position p, items at positions < p are negative for that user, and items at positions > p have no feedback.
Updates the model to move beta in the direction that increases the likelihood of the observed click pattern.
Cascading bandits handle large candidate sets better than naive bandits. Using item-level features, the system reuses learned attraction probabilities across all permutations, increasing sample efficiency.
Code Snippet Example
import numpy as np

class CascadeLinTS:
    def __init__(self, d, alpha, num_items, K):
        self.d = d                # feature dimension
        self.alpha = alpha        # exploration scale for posterior sampling
        self.num_items = num_items
        self.K = K                # number of positions shown on the page
        self.V = np.eye(d)        # Gram matrix (posterior precision)
        self.b = np.zeros(d)      # accumulated reward-weighted features

    def sample_beta(self):
        cov = np.linalg.inv(self.V)
        mean = cov @ self.b
        return np.random.multivariate_normal(mean, self.alpha**2 * cov)

    def get_action(self, z_matrix):
        # z_matrix shape: (num_items, d)
        beta = self.sample_beta()
        scores = z_matrix @ beta
        return np.argsort(scores)[-self.K:][::-1]  # top-K items, best first

    def update(self, clicked_item_index, displayed_items, z_matrix):
        # clicked_item_index: position of the click within displayed_items,
        # or None if the user clicked nothing.
        # Cascade feedback: items before the click get reward 0, the
        # clicked item gets reward 1, items after it carry no feedback.
        if clicked_item_index is not None:
            for i in range(clicked_item_index):
                z = z_matrix[displayed_items[i]]
                self.V += np.outer(z, z)       # reward 0: update V only
            z_click = z_matrix[displayed_items[clicked_item_index]]
            self.V += np.outer(z_click, z_click)
            self.b += z_click                  # reward 1: update V and b
        else:
            # No click: every displayed item is a reward-0 observation.
            for item in displayed_items:
                z = z_matrix[item]
                self.V += np.outer(z, z)
This example implements a simplified version of CascadeLinTS: items shown above the click are treated as negative examples, the clicked item as a positive example, and items below the click receive no update.
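A minimal usage sketch, assuming the CascadeLinTS class above and a simulated environment with a fixed (hypothetical) true beta:

rng = np.random.default_rng(0)
num_items, d, K = 50, 5, 4
z_matrix = rng.normal(size=(num_items, d))
true_beta = rng.normal(size=d)
true_attr = 1 / (1 + np.exp(-z_matrix @ true_beta))  # squash scores into [0, 1]

agent = CascadeLinTS(d=d, alpha=0.5, num_items=num_items, K=K)
for t in range(1000):
    displayed = agent.get_action(z_matrix)
    clicked = None
    for pos, item in enumerate(displayed):  # simulate a cascade scan
        if rng.random() < true_attr[item]:
            clicked = pos
            break
    agent.update(clicked, displayed, z_matrix)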
Follow-Up Questions
How do you ensure the model does not repeatedly expose the same items without exploring new options?
Thompson sampling balances exploration and exploitation by sampling beta from a broad posterior. Early on, beta’s posterior is wide, so items lacking data can be chosen. Each observation narrows that posterior, stabilizing around better items. But occasional draws still favor lesser-known items, ensuring exploration. This prevents the model from fixating on a small subset.
How would you address computational complexity with a large number of candidate items?
One approach is to store item-level embeddings offline. Then maintain the posterior updates in a dimension equal to the embedding size, not the total item count. The structured bandit approach learns the parameters of beta, which is dimension d, rather than enumerating all item permutations. This means the algorithm scales mostly with d rather than the entire item set size. Efficient data structures and batch updates can also reduce overhead.
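One concrete optimization, sketched below under the same Gaussian posterior used in the snippet above: maintain V^{-1} directly with rank-one Sherman-Morrison updates so sampling never needs a full matrix inversion, cutting per-update cost from O(d^3) to O(d^2).

import numpy as np

class PosteriorState:
    # Keeps V_inv up to date incrementally instead of calling np.linalg.inv.
    def __init__(self, d):
        self.V_inv = np.eye(d)
        self.b = np.zeros(d)

    def rank_one_update(self, z, reward):
        # Sherman-Morrison:
        # (V + z z^T)^{-1} = V^{-1} - (V^{-1} z)(V^{-1} z)^T / (1 + z^T V^{-1} z)
        Vz = self.V_inv @ z
        self.V_inv -= np.outer(Vz, Vz) / (1.0 + z @ Vz)
        self.b += reward * z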
How do you incorporate diversity or coverage in the ranking?
Add a diversity constraint that re-ranks items with similar attraction probabilities to avoid showing multiple duplicates of the same type. One method is a penalization term in the objective to reduce selecting items that are too similar. Another method is post-processing the top items by rearranging them based on content diversity, while preserving high-probability items near the top.
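As a sketch of the post-processing route (lambda_div and the cosine-similarity penalty are illustrative choices, not prescribed by the bandit itself):

import numpy as np

def diversified_rerank(scores, embeddings, K, lambda_div=0.3):
    # Greedily pick items, penalizing similarity to already-chosen ones (MMR-style).
    norms = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    chosen = []
    candidates = list(range(len(scores)))
    while len(chosen) < K and candidates:
        def mmr(i):
            penalty = max((norms[i] @ norms[j] for j in chosen), default=0.0)
            return scores[i] - lambda_div * penalty
        best = max(candidates, key=mmr)
        chosen.append(best)
        candidates.remove(best)
    return chosen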
Why use the cascade assumption instead of collecting more explicit feedback?
Clicks are commonly the only available signal, especially on high-traffic pages. The cascade model is a simple and computationally tractable way to incorporate position bias. Other models (Position-Based or User Browsing models) can also handle more nuanced behaviors. In many production systems, users do click mostly on top items, so the cascade assumption approximates reality well enough for ranking tasks.
How would you evaluate and compare this approach offline before live deployment?
You can:
Use historical logs to simulate user clicks with known item exposures.
Re-run them through your bandit algorithm to see how it would have ranked items.
Compare the replayed clicks vs. actual clicks. This off-policy evaluation approximates how well the bandit could perform in production.
Also run A/B tests with a small traffic share to measure real user engagement.
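A minimal replay-style estimator is sketched below; it assumes the logged rankings were randomized and that log records carry hypothetical fields "displayed" and "clicked_pos". Exact-list matches are rare for long rankings, so production variants typically match per position or use inverse-propensity weighting instead.

def replay_evaluate(agent, logs, z_matrix):
    # Classic replay: score only events where the agent's chosen ranking
    # matches what was actually logged, then update on that feedback.
    matched, clicks = 0, 0
    for event in logs:
        action = list(agent.get_action(z_matrix))
        if action == list(event["displayed"]):
            matched += 1
            if event["clicked_pos"] is not None:
                clicks += 1
            agent.update(event["clicked_pos"], event["displayed"], z_matrix)
    return clicks / matched if matched else 0.0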
What if there are multiple positions clicked in a single session?
Some sessions contain multiple clicks if users scroll deeper. Cascade bandits assume the user stops after a single click. Consider a more generalized click model like Position-Based or Dependent Click Model to handle multiple clicks. Adapting the update rules to partial feedback from multiple clicks is possible but more complex. In practice, restricting to the first click often yields enough signal for learning a good ordering.
How do you handle cold-start items?
For a new item with no historical data, the wide prior on beta initially allows random sampling. If it gets clicked, the posterior updates, raising its estimated attractiveness. If it never gets clicked, the model quickly reduces that item’s preference. You can also engineer good initial embeddings from item metadata (similar items, categories) or use an embedding from a larger recommendation system.
How do you handle changing user behavior over time?
Include time-dependent features in z_item or maintain a decaying memory of past observations so older data has less weight. You can also retrain or refresh the bandit prior. If user preferences change slowly, partial re-initialization may be enough. If changes are abrupt, consider distribution-shift techniques like sliding-window posteriors or drift detection triggers that reset or adjust the model.
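A sketch of the decaying-memory idea, assuming a discount factor gamma close to 1 (the (1 - gamma) * I term is an added regularizer to keep the matrix well-conditioned):

import numpy as np

class DiscountedPosterior:
    # Geometrically down-weights old observations so estimates can
    # track non-stationary user behavior.
    def __init__(self, d, gamma=0.999):
        self.gamma = gamma
        self.V = np.eye(d)
        self.b = np.zeros(d)

    def update(self, z, reward):
        d = len(z)
        self.V = self.gamma * self.V + (1 - self.gamma) * np.eye(d) + np.outer(z, z)
        self.b = self.gamma * self.b + reward * z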
Would you combine supervised learning with this bandit approach?
Yes. A supervised ranker can give a strong initial item ordering. The bandit fine-tunes it in real time. The supervised model acts as a baseline. The bandit explores around that baseline, adapting to subtle changes or missing features. Over time, the bandit’s feedback loop compensates for noise in the click signals.
How do you manage large-scale deployment?
You might:
Partition items by category and run separate bandits if mixing them is suboptimal.
Use distributed infrastructure for sampling and updating the posterior.
Implement real-time logs to quickly feed fresh data back to your model.
Cache top items for each user segment to reduce latency, updating in intervals.
This covers the main considerations.