ML Case-study Interview Question: Scalable A/B Testing Pipeline for Food Delivery Menu Conversion Optimization
Case-Study question
You lead a data science team at an online food delivery platform aiming to optimize the arrangement of menu categories and products to improve user conversion. The team runs A/B experiments across multiple countries to compare a baseline ordering (A version) versus an alternative ordering (B version). You must design and implement an end-to-end system to create and deploy B versions, then track and evaluate the experiment’s performance daily. You must handle a large vendor base, potentially hundreds of thousands of API calls per run, and maintain low failure rates even though vendors frequently update their menus.
Propose how you would structure the data pipeline, optimize the system for speed and reliability, and determine success metrics. Describe potential bottlenecks and how you would address them while ensuring the experiment’s validity. Outline a daily operational workflow for scaling across many countries with minimal failures.
Detailed Solution
Overall Approach
Use a workflow management tool to orchestrate tasks that fetch, transform, and send experimental B version data to the internal API. Store business logic for the B versions in a central analytics warehouse, then dynamically fetch those results to build payloads for each vendor. Send payloads through a stable scheduling mechanism to handle concurrency efficiently.
Data Pipeline Workflow
Define a Directed Acyclic Graph (DAG) in a scheduling platform like Apache Airflow. Split the workflow into tasks: Fetch B version data from the analytics warehouse. Generate payloads per vendor for categories and products. Dispatch payloads to an internal API that updates the order in real time.
Break this DAG into smaller concurrent units. Introduce pagination so each task only processes a subset of vendors. Prioritize larger or more dynamic countries first to reduce menu mismatch failures. Configure a dedicated resource pool and allocate extra CPUs so tasks can run in parallel without long queue times.
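As a rough sketch, pagination over vendors could look like the following, assuming a hypothetical warehouse_query helper, a b_version_rankings table, and a page size tuned per market; the resulting page indices can then be fanned out to parallel Airflow tasks.
PAGE_SIZE = 500  # assumption: tune per market; smaller pages reduce stale-menu conflicts

def count_vendors(country):
    # Hypothetical warehouse call returning the number of vendors in scope.
    return warehouse_query(
        "SELECT COUNT(*) FROM b_version_rankings WHERE country = %s", (country,)
    )

def fetch_page(country, page):
    # Deterministic ordering plus OFFSET/LIMIT keeps pages non-overlapping.
    return warehouse_query(
        "SELECT vendor_id, category_order, product_order "
        "FROM b_version_rankings WHERE country = %s "
        "ORDER BY vendor_id LIMIT %s OFFSET %s",
        (country, PAGE_SIZE, page * PAGE_SIZE),
    )

def page_indices(country):
    # One task instance per page keeps each unit of work small and retryable.
    total = count_vendors(country)
    return list(range((total + PAGE_SIZE - 1) // PAGE_SIZE))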
API Invocation and Error Handling
Send data via synchronous requests to the internal API, batching them by page where possible. Maintain idempotency by tagging payloads with unique reference IDs so the system can identify duplicates or partial updates. Use robust retry policies for transient network issues. Capture any mismatched payloads (e.g. a vendor menu that changed between extraction and the update call) and log them as failures. Track these failures to refine scheduling strategies and page sizes.
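A minimal sketch of one such dispatch call is shown below; the endpoint URL, the reference-ID header name, and the log_failure helper are assumptions, and only transient HTTP statuses are retried with exponential backoff.
import time
import uuid
import requests

API_URL = "https://internal-api.example.com/menu-order"  # placeholder endpoint
RETRYABLE = {429, 500, 502, 503, 504}

def send_payload(payload, max_retries=3):
    # A unique reference ID lets the API deduplicate replays of the same update.
    headers = {"X-Reference-Id": str(uuid.uuid4())}
    for attempt in range(max_retries + 1):
        resp = requests.post(API_URL, json=payload, headers=headers, timeout=10)
        if resp.ok:
            return True
        if resp.status_code not in RETRYABLE or attempt == max_retries:
            # Non-transient failure, e.g. the menu changed since extraction: log it.
            log_failure(payload, resp.status_code)  # hypothetical failure logger
            return False
        time.sleep(2 ** attempt)  # exponential backoff before the next retry
    return False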
Reducing Failure Rates
Keep the time between data extraction and API call short. If a vendor updates its menu after the data is fetched, the B version’s payload may fail. By increasing concurrency, the entire job completes faster, minimizing the interval where the menu can become stale. Prioritize larger sets of vendors earlier, reducing the probability they drift from the extracted state.
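One way to shorten that window is to dispatch each page’s payloads concurrently. The sketch below uses a thread pool and assumes the send_payload helper from the previous snippet; the worker count is illustrative.
from concurrent.futures import ThreadPoolExecutor

def send_page(payloads, max_workers=16):
    # HTTP calls are I/O-bound, so threads overlap the network waits and the
    # whole page finishes sooner, shrinking the stale-menu window.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(send_payload, payloads))
    return {"sent": sum(results), "failed": len(results) - sum(results)}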
Measuring Success with A/B Testing
Measure conversion metrics such as click-through rates or purchase completion rates under the A and B versions. Estimate differences in these metrics between A and B. Use a standard statistical test for comparison of proportions. One typical z-test formula is:
Z = (p_B - p_A) / sqrt( p * (1 - p) * (1/n_A + 1/n_B) )
Here, p_A is the conversion rate for the baseline (A version), p_B is the conversion rate for the experimental (B version), n_A and n_B are sample sizes for A and B, and p is the pooled proportion ( (p_A * n_A + p_B * n_B) / (n_A + n_B) ). A large absolute value of Z indicates a statistically significant difference between the two versions.
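A small, dependency-free sketch of this test in Python (the counts are illustrative only):
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    # Pooled proportion under the null hypothesis that both rates are equal.
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Illustrative numbers only.
z, p = two_proportion_ztest(conv_a=4800, n_a=100000, conv_b=5150, n_b=100000)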
Compare confidence intervals and decide whether to continue with B or revert to A. Control the experiment so that updates happen daily or more frequently if infrastructure allows. Keep an eye on potential distribution shifts across countries.
Implementation Example in Python
Use Airflow to define the DAG:
from airflow import DAG
from airflow.operators.python import PythonOperator  # airflow.operators.python_operator on older Airflow versions
from datetime import datetime, timedelta

def fetch_b_versions(**kwargs):
    # Query the analytics warehouse with OFFSET/LIMIT pagination and stage
    # the B version rankings for each page of vendors.
    pass

def send_b_versions(**kwargs):
    # Build one payload per vendor, call the internal API, and log failures
    # (e.g. menus that changed between extraction and dispatch).
    pass

dag = DAG(
    dag_id='menu_ranking_experiment',
    start_date=datetime(2025, 1, 1),
    schedule_interval='@daily',
    catchup=False,
    default_args={
        'retries': 3,
        'retry_delay': timedelta(minutes=5),
    },
)

t1 = PythonOperator(
    task_id='fetch_b_versions',
    python_callable=fetch_b_versions,
    dag=dag,
)

t2 = PythonOperator(
    task_id='send_b_versions',
    python_callable=send_b_versions,
    dag=dag,
)

t1 >> t2
The first task retrieves and segments the B version data (fetch_b_versions). The second task processes each page, sends payloads to the internal API, and applies concurrency to accelerate the workflow.
Future Extensions
Implement dynamic pagination to respond to market changes, so you can adjust page sizes without manual intervention. Introduce batch APIs that accept payload arrays to reduce overhead. Split the experiment by country into separate DAGs to isolate issues. Allow a systematic switch from B to A or from A to B once significance is confirmed.
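If a batch endpoint becomes available, the per-vendor calls could be grouped as in the sketch below; the endpoint URL and batch size are assumptions.
import requests

BATCH_API_URL = "https://internal-api.example.com/menu-order/batch"  # placeholder endpoint

def chunk(items, size):
    # Yield consecutive slices of at most `size` items.
    for i in range(0, len(items), size):
        yield items[i:i + size]

def send_batches(payloads, batch_size=50):
    # One request per batch of vendor updates instead of one request per vendor.
    for batch in chunk(payloads, batch_size):
        requests.post(BATCH_API_URL, json={"updates": batch}, timeout=30)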
Possible Follow-up Questions
How would you handle rolling back the B version to the A version if system metrics drop?
Reserve a separate DAG or set of tasks that can overwrite the experimental configuration with the old baseline if automated alert thresholds are violated. Keep the old metadata in the analytics warehouse so the rollback can be executed with minimal downtime.
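A rollback task might look like this sketch, where the alert threshold and the get_relative_conversion_drop, fetch_baseline_payloads, and send_payload helpers are hypothetical:
def rollback_if_degraded(country, max_conversion_drop=0.05):
    # Hypothetical metrics query comparing the B version against the A baseline.
    drop = get_relative_conversion_drop(country)
    if drop < max_conversion_drop:
        return "no_rollback"
    # Re-send the stored baseline (A) ordering for every vendor in the country.
    for payload in fetch_baseline_payloads(country):  # reads baseline metadata from the warehouse
        send_payload(payload)
    return "rolled_back"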
What strategies would you apply if the error rate spikes due to stale menu data?
Shorten the time between data retrieval and API calls. Increase concurrency so the DAG finishes faster. Lower page size to reduce conflicts. Retry at more frequent intervals. If certain countries are more prone to menu changes, schedule them first.
How do you ensure the A/B experiment maintains statistical rigor as you expand to new markets?
Maintain the same random assignment logic. Capture user-level or segment-level assignment to preserve consistency. Confirm that each country meets sample size requirements. Combine or stratify analyses if needed. Avoid partial rollouts that bias the experiment.
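To check the per-country sample size requirement, a standard two-proportion power calculation can be sketched as follows; the baseline rate and minimum detectable effect are illustrative.
import math

def required_sample_per_arm(p_base, mde, z_alpha=1.96, z_beta=0.84):
    # Classic approximation at 95% confidence and 80% power for an absolute
    # lift of `mde` over a baseline conversion rate of `p_base`.
    p_alt = p_base + mde
    variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return math.ceil(((z_alpha + z_beta) ** 2) * variance / (mde ** 2))

# Example: 5% baseline conversion, detect an absolute lift of 0.5 percentage points.
n_per_arm = required_sample_per_arm(p_base=0.05, mde=0.005)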
What if the baseline and B version differ in multiple factors, not just ordering?
Break complex experiments into smaller tests. Isolate each factor (like layout vs. content changes) to ensure each effect is measured accurately. Use multi-armed bandit approaches if there are many variations. Track confounders in your analytics warehouse.
How do you handle concurrency limits if the internal API cannot handle so many parallel requests?
Implement a rate-limiting proxy. Queue requests if concurrency hits a threshold. Spread calls over time segments. Explore asynchronous patterns: post payloads to a message broker, have a microservice process them in parallel, then respond with status updates.
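A simple client-side cap can be sketched with a semaphore, assuming the send_payload helper from earlier; in a fully asynchronous design a message broker and worker service would replace this.
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 10  # assumption: roughly what the internal API tolerates concurrently
_slots = threading.Semaphore(MAX_IN_FLIGHT)

def send_with_limit(payload):
    # Block until a slot frees up so in-flight requests never exceed the cap.
    with _slots:
        return send_payload(payload)  # helper sketched earlier

def dispatch_all(payloads):
    with ThreadPoolExecutor(max_workers=32) as pool:
        return list(pool.map(send_with_limit, payloads))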
Would you consider advanced designs like multi-armed bandits?
Yes, if you want to iterate rapidly. Bandits allocate more traffic to higher-performing variants on the fly. However, they can complicate the analytics pipeline because the distribution of traffic becomes dynamic. For large-scale daily experiments, maintain consistent randomization to preserve clarity.
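For illustration, a Thompson sampling allocation over the two orderings could look like this sketch; the exposure and conversion counts are illustrative.
import random

def choose_variant(stats):
    # Sample a plausible conversion rate from each variant's Beta posterior
    # and route the next user to the variant with the highest draw.
    draws = {
        name: random.betavariate(s["conversions"] + 1, s["exposures"] - s["conversions"] + 1)
        for name, s in stats.items()
    }
    return max(draws, key=draws.get)

stats = {
    "A": {"exposures": 10000, "conversions": 480},
    "B": {"exposures": 10000, "conversions": 515},
}
next_variant = choose_variant(stats)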
How do you confirm the results are stable across days or weeks?
Ensure the experiment runs for a suitably long period. Monitor the daily aggregated differences. Apply a stopping rule when the confidence level and sample size conditions are satisfied. Check drift in user behavior across time or changes in funnel steps.
How do you mitigate the risk of user fatigue from rapid changes in menu order?
Set a limit on how frequently the arrangement changes. Use a consistent assignment so returning users see the same version. If a user sees the menu reordering daily, it might cause confusion and reduce trust. Monitor user sentiment alongside quantitative metrics.
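Consistent assignment can be implemented with a deterministic hash of the user identifier, as in this sketch; the salt and traffic split are illustrative.
import hashlib

def assign_variant(user_id, salt="menu_ranking_v1", b_share=0.5):
    # The same user_id always hashes to the same bucket, so returning users
    # keep seeing the same menu ordering for the life of the experiment.
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "B" if bucket < b_share else "A"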