ML Interview Q Series: How would you design control and test groups to evaluate a new “close friends” feature on Instagram Stories, ensuring you properly handle network effects?
Comprehensive Explanation
Network effects complicate straightforward A/B testing because a user’s experience and engagement can be influenced not only by their own treatment assignment but also by the treatment status of their friends. Traditional A/B testing assumes that the outcome of one user does not depend on whether another user is in the treatment or control group (often referred to as the Stable Unit Treatment Value Assumption, or SUTVA). For social platforms, this assumption is not always valid.
Key Concept of Average Treatment Effect and SUTVA
When we measure the effect of a new feature, we often want to compare the average outcome when users have access to that feature versus when they do not. In a simplified scenario (ignoring network spillovers), the Average Treatment Effect (ATE) can be expressed as:
ATE = E[Y(1)] - E[Y(0)]
Where Y(1) is the outcome (such as increased usage, engagement time, or number of Stories shared) under the treatment condition, and Y(0) is the outcome under the control condition. The labels 1 and 0 simply indicate presence or absence of the treatment. However, in a social network setting, user i’s outcome can be affected by whether user j is also in the treatment group, which violates SUTVA. Hence, we need to carefully design the experiment to address potential contamination or spillover.
Cluster (or Community) Level Randomization
One common technique is to randomize at the cluster or community level. In a social graph, users naturally form clusters of highly interconnected nodes. If you assign whole clusters to either test or control, you reduce the chance that a user in the treatment group has many close friends in the control group. This approach:
Helps maintain consistency of the new feature exposure among friends.
Minimizes contamination, because entire clusters experience the same feature set.
However, a major challenge is that if clusters vary dramatically in size, one large cluster in the treatment group can overshadow multiple smaller clusters in the control group, potentially skewing your estimates of the effect. It can also reduce statistical power because you often have fewer distinct “units” (clusters instead of individual users).
User-Level Randomization with Guardrails
Another approach is user-level randomization where each user is individually assigned to treatment or control. However, to mitigate severe network contamination, you can impose constraints such as “if a user is in the treatment group, a certain fraction or subset of their friends must also be in the treatment group.” This ensures that some portion of their immediate network has consistent exposure to the close friends feature and captures at least a partial network effect.
The trade-off is that this design becomes more complex to implement and does not fully eliminate the possibility of cross-group leakage. If many users in the control group are surrounded by friends in the treatment group, they might indirectly be exposed to the effect of the new feature (e.g., hearing about it or receiving partial notifications).
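As an illustration of this seed-and-propagate style of assignment, here is a minimal Python sketch using NetworkX. The graph (NetworkX’s built-in karate_club_graph), the number of seed users, and the “half of each seed’s friends” guardrail are stand-ins chosen for the example, not prescriptions.
import random
import networkx as nx

# Toy graph standing in for the real friendship graph.
G = nx.karate_club_graph()

random.seed(42)
assignment = {}

# Sample "seed" users for the treatment group.
seeds = random.sample(list(G.nodes()), k=5)
for seed in seeds:
    assignment[seed] = 1  # 1 = treatment
    # Guardrail: also assign a portion of each seed's friends to treatment,
    # so treated users experience the feature alongside at least some friends.
    friends = list(G.neighbors(seed))
    for friend in random.sample(friends, k=max(1, len(friends) // 2)):
        assignment.setdefault(friend, 1)

# Everyone not pulled into a treated neighborhood defaults to control.
for node in G.nodes():
    assignment.setdefault(node, 0)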
Ghost (Shadow) Testing for Non-Friend-Facing Components
Some aspects of a feature can be tested in a “ghost” or “shadow” mode where the user can see their own behavior changes but it is not visible to others. For instance, if part of the new feature doesn’t need to be revealed to friends until you confirm it works as expected, you can deploy that portion in silent mode across a small set of users. This doesn’t fully solve the network effect challenge, but it sometimes reduces how strongly the treatment spills over to the control group.
Incremental Rollout or Geographic Rollout
In certain products, a geographic or region-based rollout is used. For Instagram, you might select a random geography and enable the close friends feature there. This localized approach forms a natural cluster (everyone in the region is in the treatment group), while everyone outside is the control group. This method can simplify the containment of spillover but can introduce biases if geographic regions differ in demographics or usage patterns.
Metrics of Interest and Spillover Effects
When analyzing results, it’s important to differentiate between:
Direct effects: Changes in behavior for a user who has the feature.
Indirect effects: Changes in behavior for a user caused by their friends having the feature.
You might observe indirect effects if users in the control group behave differently because their friends are using the close friends feature. For instance, if a user’s friend starts sharing more Stories privately, the user in the control group might see fewer public Stories and decide to post differently as well.
Accounting for indirect effects often involves building models or analyses that include the proportion of friends in the treatment group. You may also measure engagement changes over time and see if the introduction of the new feature in a local cluster correlates with usage changes for connected users.
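One concrete way to quantify indirect exposure is to compute, for every user, the fraction of their friends who are in the treatment group. Below is a minimal sketch assuming a NetworkX graph and a user-to-assignment dictionary; the tiny example graph and the helper name treated_friend_fraction are hypothetical.
import networkx as nx

def treated_friend_fraction(G, assignment):
    """Fraction of each user's friends who are in the treatment group.

    assignment maps user -> 0 (control) or 1 (treatment).
    """
    fractions = {}
    for user in G.nodes():
        friends = list(G.neighbors(user))
        if friends:
            fractions[user] = sum(assignment[f] for f in friends) / len(friends)
        else:
            fractions[user] = 0.0
    return fractions

# Tiny illustration: user "c" is in control but half of its friends are treated.
G = nx.Graph([("a", "b"), ("a", "c"), ("c", "d")])
assignment = {"a": 1, "b": 1, "c": 0, "d": 0}
print(treated_friend_fraction(G, assignment))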
Practical Steps to Implement
Identify connected components or clusters in your social graph.
Randomly assign entire clusters to treatment or control. For balanced experiment design, ensure that you have enough clusters on both sides to achieve good statistical power.
Alternatively, if you do user-level assignment, ensure each user has enough friends receiving the same assignment to reflect real network usage. You might sample “seed” users for the test group, then propagate that assignment to some fraction of their friend circle.
Monitor both direct engagement changes and potential indirect effects on close friends who may or may not have the feature.
Carefully track the difference in standard metrics (like daily active users, time spent, or number of stories posted) between clusters, controlling for differences in cluster sizes and characteristics.
How to Handle Data Analysis
After randomization and data collection, you will likely measure engagement outcomes for each user. Because of potential network spillovers, you can:
Compare average outcomes between treatment and control clusters to see the total effect (direct + indirect).
Possibly build a model that includes the fraction of a user’s friends who are in treatment to see how partial exposure affects usage (a sketch of such a model follows this list).
Consider the possibility of capturing separate direct and indirect effects by comparing outcomes for users whose entire friend group is in treatment, partially in treatment, and wholly in control.
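Here is a hedged sketch of the model mentioned in the second point above, using pandas and statsmodels (libraries chosen for illustration). The data are synthetic, with a direct effect and a spillover effect generated on purpose; in a real analysis the columns would come from your experiment logs and the friend-exposure computation described earlier.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic per-user data with a direct effect (treated) and a spillover
# effect (fraction of a user's friends who are treated) baked in.
rng = np.random.default_rng(0)
n = 5_000
treated = rng.integers(0, 2, n)
frac_treated_friends = rng.uniform(0, 1, n)
engagement = 5 + 0.8 * treated + 0.4 * frac_treated_friends + rng.normal(0, 1, n)

df = pd.DataFrame({
    "engagement": engagement,
    "treated": treated,
    "frac_treated_friends": frac_treated_friends,
})

# The coefficient on `treated` approximates the direct effect; the coefficient
# on `frac_treated_friends` captures how partial exposure shifts outcomes.
model = smf.ols("engagement ~ treated + frac_treated_friends", data=df).fit()
print(model.params)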
Potential Follow-up Questions
How do we handle extremely large connected components in the graph?
If there is a dominant cluster that encompasses a large fraction of the user base (for example, one giant connected component), randomizing that entire component into a single bucket (treatment or control) can drastically reduce the experiment’s ability to measure differences. One strategy is to look for smaller communities within that giant connected component (for instance, by applying community detection algorithms). Another approach is to use user-level randomization in those large clusters but apply guardrails (such as partial friend assignment) to reduce contamination.
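As a rough sketch of the community-detection route, the snippet below uses NetworkX’s greedy modularity communities on a toy graph that stands in for a giant connected component; the specific algorithm and graph are illustrative choices, not the only options.
import random
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Stand-in for one giant connected component of the friendship graph.
G = nx.karate_club_graph()

# Break the giant component into denser communities, then randomize those
# communities instead of the whole component.
communities = greedy_modularity_communities(G)

random.seed(7)
assignment = {}
for community in communities:
    bucket = random.choice([0, 1])  # 0 = control, 1 = treatment
    for user in community:
        assignment[user] = bucket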
If we do cluster-level randomization, could it introduce bias if certain clusters differ in usage patterns?
Yes. Different communities may have distinct usage behaviors or demographic traits, which can introduce confounding factors. One step is to stratify clusters by important attributes (size, average user engagement, etc.) before randomizing, ensuring an equal distribution of different cluster types between treatment and control. You might also measure pre-experiment engagement or other baseline metrics to confirm that both treatment and control clusters are comparable.
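One simple way to implement this kind of stratification is to sort clusters by a key covariate (here, cluster size) and randomize within adjacent pairs, so both arms end up with similar cluster profiles. The cluster summaries below are hypothetical, and in practice you might stratify on several covariates at once.
import random

# Hypothetical cluster summaries with pre-experiment covariates.
clusters = [
    {"id": 0, "size": 1200, "baseline_engagement": 3.4},
    {"id": 1, "size": 1150, "baseline_engagement": 3.1},
    {"id": 2, "size": 300, "baseline_engagement": 5.0},
    {"id": 3, "size": 280, "baseline_engagement": 4.8},
]

random.seed(11)

# Sort so similar clusters sit next to each other, then randomize within
# each adjacent pair so treatment and control stay balanced on size.
clusters_sorted = sorted(clusters, key=lambda c: c["size"], reverse=True)
assignment = {}
for i in range(0, len(clusters_sorted), 2):
    pair = clusters_sorted[i:i + 2]
    random.shuffle(pair)
    assignment[pair[0]["id"]] = 1  # treatment
    if len(pair) > 1:
        assignment[pair[1]["id"]] = 0  # control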
How do we measure the success of the close friends feature?
Success can be evaluated by examining various engagement metrics, such as the frequency of using the close friends feature, overall Story creation, time spent on the platform, user retention, or the breadth of a user’s sharing behavior. Additionally, you might track whether users add more people to their close friends list over time, how many stories are posted in the close friends context versus publicly, and whether overall user satisfaction or session length changes.
What about partial contamination if some users see glimpses of the feature from friends in the test group?
Complete isolation is rarely possible in social networks. You can mitigate some contamination by randomizing entire clusters of highly connected friends. However, partial leakage will still exist, especially when a user in the control group interacts with or hears about the feature from treatment-group friends. To address this, measure indirect exposure by capturing how many friends in the test group a control-group user has. This allows you to model the extent of contamination (for instance, if a control-group user has a high fraction of friends in the test group, their behavior might shift slightly even without direct access to the feature).
How do you analyze direct vs. indirect effects separately?
One approach is to define subgroups:
Directly treated users: Those formally assigned to the test group.
Pure control users: Those assigned to control who have minimal or zero friends in the test group.
Indirectly exposed control users: Those assigned to control who have a significant fraction of friends in the test group.
Then compare metrics among these three subgroups. The difference between the control subgroup with no direct exposure and the control subgroup with indirect exposure can quantify the spillover effect. Comparing the test group’s metrics to the pure control group’s metrics isolates the primary or direct effect.
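A minimal sketch of this subgroup comparison, using pandas on synthetic data, is shown below. The 0.3 exposure threshold and the column names are illustrative assumptions; a real analysis should pre-register the cutoff (or model exposure continuously) rather than pick it after seeing the data.
import numpy as np
import pandas as pd

# Synthetic per-user table; in practice these columns come from the experiment
# assignment logs and the friend-exposure computation described earlier.
rng = np.random.default_rng(3)
n = 10_000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "frac_treated_friends": rng.uniform(0, 1, n),
    "stories_posted": rng.poisson(4, n),
})

EXPOSURE_THRESHOLD = 0.3  # illustrative cutoff for "significant" indirect exposure

def subgroup(row):
    if row["treated"] == 1:
        return "directly_treated"
    if row["frac_treated_friends"] >= EXPOSURE_THRESHOLD:
        return "indirectly_exposed_control"
    return "pure_control"

df["subgroup"] = df.apply(subgroup, axis=1)
print(df.groupby("subgroup")["stories_posted"].agg(["mean", "count"]))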
Are there technical tips for implementing the cluster-based assignment in code?
In Python, you can use libraries such as NetworkX to analyze the user graph, detect connected components, or run community detection. For example:
import random
import networkx as nx

# Suppose you have an edge list that represents the social graph,
# e.g. edge_list = [(user_a, user_b), ...]
edge_list = [(1, 2), (2, 3), (4, 5)]  # placeholder; replace with real friendship edges
G = nx.Graph()
G.add_edges_from(edge_list)

# Identify connected components
connected_components = list(nx.connected_components(G))

# Randomly assign each component to treatment or control
component_assignments = {}
for comp_id, comp in enumerate(connected_components):
    # random.choice([0, 1]) for control=0 or test=1
    component_assignments[comp_id] = random.choice([0, 1])

# Now you have an assignment at the cluster level
This gives you a high-level idea of how to cluster users and assign them as a group. In large-scale production environments, you would need more efficient, distributed graph processing techniques, but the concept is similar.
What if the feature only works if both users share the feature status? (e.g., requiring friend reciprocity for the feature)
You might need to ensure that pairs of users either both have the feature or both do not. One solution is “buddy randomization,” where friend pairs are the unit of randomization rather than individuals. This complicates the design if the social graph is large and dynamic. You might also treat user pairs or friend lists as minimal clusters. All these strategies aim to ensure that your experiment accurately reflects how a user would interact with friends who also have or don’t have the feature.
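One hedged way to operationalize “pairs as minimal clusters” is to keep only reciprocal relationships and randomize the connected groups they form, so both sides of every pair always share the same feature status. The directed toy graph below is hypothetical.
import random
import networkx as nx

# Hypothetical directed graph: an edge u -> v means u added v as a close friend.
D = nx.DiGraph()
D.add_edges_from([(1, 2), (2, 1), (3, 4), (4, 3), (2, 3), (5, 6)])

# Keep only reciprocal relationships, since the feature depends on both sides.
reciprocal = nx.Graph()
reciprocal.add_edges_from((u, v) for u, v in D.edges() if D.has_edge(v, u))

# Each connected group of reciprocal pairs becomes the unit of randomization,
# so both members of every pair always share the same feature status.
random.seed(5)
assignment = {}
for component in nx.connected_components(reciprocal):
    bucket = random.choice([0, 1])
    for user in component:
        assignment[user] = bucket

# Users with no reciprocal relationship (here 5 and 6) can be assigned individually.
for user in D.nodes():
    if user not in assignment:
        assignment[user] = random.choice([0, 1])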
By carefully choosing how you split users (or groups of users) into test and control, and by thoroughly analyzing direct and indirect outcomes, you can gain a reliable understanding of how your close friends feature will perform in a real-world social network.
Below are additional follow-up questions
How can we handle large-scale data anomalies or outages during the experiment period?
Data anomalies or outages can skew metrics in unpredictable ways. For instance, an outage in certain regions might reduce overall usage metrics, making it harder to discern whether changes in engagement are due to the new feature or the outage. It is crucial to track system health metrics—such as server response times, error rates, and downtime—alongside experiment performance metrics. If a significant data anomaly occurs, one potential solution is to exclude affected days or regions from the primary analysis, but only after carefully verifying that this exclusion does not itself introduce bias. Another approach is to conduct sensitivity analyses, in which you compare the results with and without the impacted data. These analyses can help confirm whether the main experiment conclusion is robust to such anomalies. A further pitfall is that some user segments (like new users in certain geographies) may be more severely affected, introducing unintentional differences in exposure time between test and control groups.
What if new users sign up during the experiment, and we need to incorporate them?
In a social platform, new users join continuously. They might not have any friends yet, or they might gradually add friends who are in either the treatment or control group. This dynamic can affect how the close friends feature is exposed and how outcomes are measured. One way to address this is to fix the assignment strategy for new users (for example, randomly assign them at sign-up time), ensuring that the proportion of new users in treatment and control remains roughly balanced. Alternatively, if your experiment is cluster-based, you can assign them to the cluster of the first friend they connect with, allowing them to inherit that friend’s group status. A potential pitfall is that new users often have different engagement patterns compared to existing users, which might confound the results. One strategy to mitigate this is to track these new-user cohorts separately in your analysis so that you can see if their behaviors differ substantially from the overall user base.
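A common way to fix the assignment for new users is deterministic hashing at sign-up, so the same user always lands in the same bucket and the treatment/control split stays roughly balanced. The salt string and function below are hypothetical examples of that idea.
import hashlib

EXPERIMENT_SALT = "close_friends_exp_v1"  # hypothetical experiment identifier

def assign_at_signup(user_id: str) -> int:
    """Deterministically bucket a new user at sign-up: 0 = control, 1 = treatment."""
    digest = hashlib.sha256(f"{EXPERIMENT_SALT}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 2

# The same user always maps to the same bucket across sessions and services.
print(assign_at_signup("new_user_98765"))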
How do we handle real-time monitoring to decide whether to stop or adjust the experiment?
Sometimes you need to monitor metrics in near real time to ensure that a new feature does not negatively impact important user behaviors (e.g., crash rates, latency, or monetization metrics). Real-time dashboards help detect if the new feature introduces significant harm or disruption. A challenge arises when network effects make short-term data less representative, since users may take time to discover and adopt the new feature. To address this, you may implement short-term safety checks (e.g., ensuring no spike in errors or major negative shifts in usage) while still allowing the experiment to run for a sufficiently long time to capture behavioral changes caused by the network effect. If the real-time metrics show drastic negative outcomes, you might halt the experiment prematurely. However, an overly aggressive early stopping rule can lead to false conclusions (Type I error), especially if the observed effects are volatile and small sample sizes are at play early in the experiment.
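As a simplified sketch of such a safety check, the snippet below compares crash rates between arms with a two-proportion z-test from statsmodels and only flags a breach at a very conservative p-value, precisely because repeated peeking inflates the Type I error rate. Real systems typically use more rigorous sequential-testing procedures; the counts here are made up.
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical daily crash counts and session counts: [control, treatment].
crashes = [420, 510]
sessions = [100_000, 99_500]

stat, p_value = proportions_ztest(count=crashes, nobs=sessions)

control_rate = crashes[0] / sessions[0]
treatment_rate = crashes[1] / sessions[1]

# Conservative threshold to limit false alarms from repeated daily checks.
if treatment_rate > control_rate and p_value < 0.001:
    print("Guardrail breach: pause and investigate before continuing.")
else:
    print("No guardrail breach detected today.")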
How do we ensure our experiment measures the proper definition of “close friends” usage vs. other interactions?
The “close friends” feature might cause users to share content differently, and some might shift their interactions from direct messages to close-friend stories or from public stories to more private circles. Metrics should be carefully defined to distinguish between these behaviors. For example, you might measure the average number of close-friend stories posted per user, the number of times a user views a close friend’s story, or the ratio of close-friend stories to total stories. If you only track total story shares, you might miss whether the composition of sharing changed. A pitfall is conflating an overall increase in story engagement with success of the new feature, when in fact users might just be substituting one form of private sharing for another. Hence, you should examine multiple metrics to capture different facets of user behavior.
How can we detect if certain subpopulations or geographies are negatively impacted by the feature?
Although the global averages may look good, some subpopulations might have a very different experience or usage pattern. For instance, users in certain geographies or age groups might rarely use the close friends feature, or find it confusing, while others might adopt it enthusiastically. You can stratify your analysis by user geography, age bracket, or usage tier (e.g., power users vs. casual users) to see if there is a significant discrepancy in outcomes. Another pitfall arises if the social graph is heavily fragmented by geography or language, meaning that subpopulations are effectively in different networks. If the new feature only catches on in some regions, this can create confounding effects in your overall metrics. A best practice is to implement a pre-specified plan to analyze these subgroup performances and watch for disproportionate negative impacts.
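A basic version of this pre-specified subgroup analysis can be a per-region lift table, as sketched below on synthetic data with pandas. The region labels, metric, and distributions are placeholders; the same pattern applies to age brackets or usage tiers.
import numpy as np
import pandas as pd

# Synthetic per-user data with region labels.
rng = np.random.default_rng(8)
n = 20_000
df = pd.DataFrame({
    "region": rng.choice(["NA", "EU", "APAC", "LATAM"], n),
    "treated": rng.integers(0, 2, n),
    "time_spent_min": rng.gamma(2.0, 10.0, n),
})

# Per-region lift: mean treatment outcome minus mean control outcome.
by_region = (
    df.pivot_table(index="region", columns="treated",
                   values="time_spent_min", aggfunc="mean")
      .rename(columns={0: "control", 1: "treatment"})
)
by_region["lift"] = by_region["treatment"] - by_region["control"]
print(by_region)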
How do we handle edge churn (the forming or dissolving of friendships) during the experiment?
In a social network, friend relationships are dynamic: users add or remove friends over time. If your experiment design involves cluster-level or partial friend-based assignment, these network edges might shift in the middle of the experiment. That can cause contamination if new edges connect a user in the treatment group to a user in the control group. One approach is to lock the cluster assignment based on a snapshot of the graph at the experiment’s start, ignoring subsequent edge changes for the purpose of assignment. The risk is that over time, such an approach may become outdated if a significant portion of edges changes. Alternatively, you can define a policy that re-assigns a user when they move into a new social subgraph, but this can get complicated and lead to “treatment switching,” which complicates the analysis. In practice, many companies opt for a stable assignment during the experiment’s duration, carefully noting any edge changes as part of the final analysis to see if they heavily impact the results.
How do we measure success if the feature is not immediately obvious to users or requires time to adopt?
Some features require user awareness and an adaptation period. The close friends feature might be something that only becomes useful after a user has defined or curated their close friend list. Users might also need time to understand its privacy implications. A short experiment might not capture the full adoption curve. You can mitigate this by running a pilot with a small set of users and measuring how long it takes for them to discover, configure, and use the feature. Based on these insights, you can choose an appropriate experiment duration that encompasses the time needed for typical adoption. Another subtlety is that some people may never adopt it, which affects the measured average treatment effect. You could also measure per-user “time to first use” or “time to setup,” to see if the experiment is long enough for the treatment effect to stabilize.
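To make “time to first use” concrete, here is a small pandas sketch over a hypothetical adoption log; the dates and column names are invented for illustration.
import pandas as pd

# Hypothetical adoption log for treated users: when they were first exposed
# to the feature and when (if ever) they first used it.
events = pd.DataFrame({
    "user_id": [1, 2, 3],
    "exposed_at": pd.to_datetime(["2024-05-01", "2024-05-01", "2024-05-02"]),
    "first_use_at": pd.to_datetime(["2024-05-03", pd.NaT, "2024-05-10"]),
})

events["days_to_first_use"] = (events["first_use_at"] - events["exposed_at"]).dt.days
adoption_rate = events["first_use_at"].notna().mean()

print(events[["user_id", "days_to_first_use"]])
print(f"Share of treated users who have adopted so far: {adoption_rate:.0%}")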
How do we separate user experience changes from purely technical differences, such as app performance or UI load times?
Sometimes a new feature changes the application’s code footprint, resulting in different load times or memory usage for users in the treatment group. Such technical factors might affect user engagement metrics independently of the actual user-facing functionality. To rule out these confounds, you should measure performance metrics (like latency, memory consumption, or crash rate) for both treatment and control. If there is a notable difference in these technical metrics, you could add a step in your rollout process to optimize performance or fix bugs before proceeding with the full A/B test on user experience. In addition, if the new feature requires repeated network calls or increased bandwidth usage, certain users on slow connections might experience a degraded experience, skewing results. Monitoring these technical aspects and controlling for them in analysis helps isolate the pure effect of the close friends feature itself.
How do we address privacy concerns and ensure we are not inadvertently leaking sensitive information in the control group?
When testing a feature like close friends, users may share content that is more personal, assuming it is only visible to a small circle. During an experiment, you must safeguard this data in a way that ensures no one in the control group can see the content if they are not supposed to. Additionally, you must ensure that your logging and analysis pipelines are compliant with privacy policies—meaning that any data about close-friend stories is only accessible to the teams that absolutely need it to evaluate the experiment. Potential pitfalls include accidentally exposing private stories in analytics dashboards or logs. To mitigate these risks, adopt privacy-by-design principles, minimize the data collected, and ensure your analytics cannot re-identify users who posted private content. Validate that any experiment instrumentation aligns with existing data handling and privacy regulations.
How do we interpret the results if other new features are launched concurrently?
Many large organizations deploy multiple experiments simultaneously. If another major feature or interface redesign goes live during the same timeframe, it could confound your results. For example, if a redesign significantly changes how stories are surfaced, this might influence whether and how users adopt close friends functionality. To mitigate these confounds, keep an experiment log of all concurrent major feature changes. You can also run interaction analyses that look for differences in the effect of the close friends feature depending on whether a user was also exposed to the other new feature. If possible, coordinate with other teams to avoid overlapping rollouts on the same user segments or to design experiments that explicitly account for each other’s presence. If complete isolation is impractical, consider advanced modeling techniques that estimate the combined effects of multiple concurrent changes.
How do we verify that the cluster assignment approach is robust if the network changes substantially after the experiment?
If the social graph changes significantly—due to user growth, friend additions, or deletions—your initially assigned clusters may no longer reflect real-world usage. You should measure the stability of clusters by looking at metrics like the proportion of cross-cluster edges that form during the experiment. If cross-cluster edges remain rare or minimal, your cluster-based approach likely remains valid. However, if you observe a high rate of connections between clusters, it might mean that isolation has broken down and the control group has substantial exposure to the treatment or vice versa. You can handle this situation by conducting a sensitivity analysis: limit the data to users who did not create cross-cluster edges mid-experiment and see if the results remain consistent. If the partial contamination is widespread, you might consider concluding the test earlier than planned or refining your approach to cluster assignment in a follow-up experiment.
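A simple way to monitor this is to track the share of newly formed friendships that cross cluster boundaries relative to the assignment snapshot. The helper below is a hypothetical sketch of that check.
def cross_cluster_edge_rate(new_edges, cluster_of):
    """Share of friendships formed mid-experiment that connect different clusters.

    new_edges: iterable of (u, v) pairs created after the assignment snapshot.
    cluster_of: dict mapping user -> cluster id from that snapshot.
    """
    new_edges = list(new_edges)
    if not new_edges:
        return 0.0
    crossing = sum(1 for u, v in new_edges if cluster_of[u] != cluster_of[v])
    return crossing / len(new_edges)

# Hypothetical example: two of the three new friendships cross cluster boundaries.
clusters = {"a": 0, "b": 0, "c": 1, "d": 1}
print(cross_cluster_edge_rate([("a", "b"), ("a", "c"), ("b", "d")], clusters))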