ML Interview Q Series: How can we determine whether a web page view is made by a human visitor or by an automated scraper, given a dataset of page views where each entry represents a single request?
📚 Browse the full ML Interview series here.
Comprehensive Explanation
A critical part of identifying whether a user is a real person or a web scraper involves examining browsing patterns, request velocity, user-agent strings, and other behavior-based signals. Real humans often exhibit certain navigational flows (moving from one page to another in a coherent sequence) and remain idle for periods of time, whereas scrapers tend to perform large volumes of requests in short bursts, sometimes ignoring standard best practices such as cookie acceptance or JavaScript handling.
Because of the dynamic nature of web traffic, a purely rules-based system can be brittle. A more robust approach is to gather features from page view logs and train a classification model. For instance, we can compute features such as the average time between requests, the distribution of requests across distinct pages, the presence or absence of JavaScript execution, or whether the user-agent string appears legitimate. With these features in hand, we can attempt to classify a given request sequence as “bot-like” or “human-like.”
When building such a model, logistic regression is a straightforward choice for interpretability. This model predicts the probability that the observed behaviors correspond to a real user. A typical logistic regression formula for this probability (y=1 means a real user, y=0 means a scraper) is

P(y=1 | x) = 1 / (1 + exp(-(beta_0 + beta_1 x_1 + beta_2 x_2 + ... + beta_n x_n)))
Here, x_1 to x_n represent numerical values derived from user behavior: for example, x_1 might be the number of page requests in a fixed time window, x_2 might be the variance in time intervals between page hits, etc. The parameters beta_0, beta_1, ..., beta_n are the learned weights of the model. Once the weights are determined from a labeled dataset (e.g., known scrapers vs. confirmed real users), the logistic function outputs the predicted likelihood that a given sequence of requests belongs to a real user.
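To make the formula concrete, here is a minimal sketch that evaluates it directly; the weights and feature values below are made up purely for illustration and are not learned from data.

import numpy as np

# Hypothetical learned weights: intercept, requests-per-window, variance of inter-request gaps
beta = np.array([0.5, -0.08, 1.2])   # beta_0, beta_1, beta_2 (illustrative values only)
x = np.array([40.0, 0.3])            # e.g., 40 requests in the window, low timing variance

z = beta[0] + beta[1:].dot(x)
p_human = 1.0 / (1.0 + np.exp(-z))   # the logistic function from the formula above
print(p_human)                       # predicted probability this behavior belongs to a real user

With these particular numbers the probability comes out low, which is what we would expect for a burst of rapid, evenly spaced requests.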
Feature engineering is paramount. Real people show characteristic patterns of navigation, such as referring pages, session durations, sporadic breaks, or certain mouse movements if JavaScript events are captured. Scrapers might not replicate these patterns and thus become distinguishable by their repetitive, high-frequency requests. In some situations, it also helps to look at aggregated statistics per IP or per session token, because scrapers often rely on a constant IP or lack session management entirely.
Another subtle point is that scrapers sometimes try to blend in. They can rotate user-agent strings, use proxies to distribute requests across IP addresses, and even incorporate artificial delays. In these cases, additional signals can include anomaly detection on referrer logs, fingerprinting techniques such as analyzing font enumeration or canvas rendering if we have front-end instrumentation, and deeper analysis of device or network characteristic mismatches.
In practical systems, a real-time approach might rely on rules or thresholds for immediate filtering (e.g., dropping suspiciously high request rates) combined with a machine learning classifier for ongoing traffic monitoring. When possible, supervised training with a labeled dataset is ideal, but semi-supervised or unsupervised anomaly detection can work if there are no reliable labels.
How to Choose a Good Labeled Dataset
Reliable labels are essential for supervised learning, but it can be tricky to mark data as “scraper” vs. “human” at scale. Sometimes we have partial ground truth from known scraping IP ranges or from accounts flagged by a separate threat intelligence system. Alternatively, system logs may indicate repeated violations that reveal a bot. That said, there is always some uncertainty, and building a “gold standard” labeled set requires careful curation, such as capturing repeated behavior over time and verifying with internal security teams.
Why Session-Level Aggregation Matters
If we only consider individual page views in isolation, we might overlook crucial temporal and behavioral context. Aggregating page views into sessions (grouped by user ID, IP address, or session cookie) can shed light on the flow of requests. A session with continuous hits every second to random URLs has a high probability of being a scraper. A session that spans a normal browsing period with certain predictable idle times suggests a real user.
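As a rough illustration of sessionization, the sketch below groups a tiny hypothetical log by IP, starts a new session after a 30-minute gap, and derives per-session features; the column names and the 30-minute cutoff are assumptions, not a standard.

import pandas as pd

# Hypothetical raw log: one row per page view
logs = pd.DataFrame({
    "ip": ["1.2.3.4"] * 5 + ["5.6.7.8"] * 3,
    "timestamp": pd.to_datetime([
        "2024-01-01 10:00:00", "2024-01-01 10:00:01", "2024-01-01 10:00:02",
        "2024-01-01 10:00:03", "2024-01-01 10:00:04",
        "2024-01-01 11:00:00", "2024-01-01 11:02:30", "2024-01-01 11:05:10",
    ]),
    "url": ["/p1", "/p2", "/p3", "/p4", "/p5", "/home", "/product/42", "/checkout"],
})

logs = logs.sort_values(["ip", "timestamp"])
# Start a new session whenever the gap to the previous request from the same IP exceeds 30 minutes
gap = logs.groupby("ip")["timestamp"].diff()
logs["session_id"] = (gap > pd.Timedelta(minutes=30)).groupby(logs["ip"]).cumsum()

session_features = logs.groupby(["ip", "session_id"]).agg(
    num_requests=("url", "size"),
    num_unique_pages=("url", "nunique"),
    avg_gap_seconds=("timestamp", lambda t: t.diff().dt.total_seconds().mean()),
)
print(session_features)

In this toy log, the first IP hits five pages one second apart (scraper-like), while the second browses three pages over several minutes (human-like), and the derived session features reflect that difference.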
Potential Follow-up Questions
Could We Rely Solely on IP-Based Analysis?
It is tempting to block IP addresses with unusual traffic, yet this can be too coarse. In corporate environments, many employees may share an IP address through a proxy, and legitimate CDNs may produce high traffic spikes. Over-reliance on IP-based blocking risks false positives. A more refined approach integrates IP signals with the patterns of requests, user-agent details, and session-based analytics. Over time, you might see repeated suspicious patterns from the same IP range, prompting deeper investigation. However, scrapers can also use rotating IP services, making IP alone insufficient.
How Would You Handle Very Advanced Scrapers That Randomize Patterns?
Extremely sophisticated scrapers attempt to replicate typical human behavior. They can add random delays, simulate mouse movements, or cycle through user agents. Handling such cases involves introducing more advanced fingerprinting. This could include gathering front-end metrics like rendering times, DOM event patterns, or subtle JavaScript checks that scrapers might not accurately replicate. You can also deploy ongoing anomaly detection in a high-dimensional feature space, looking for slight inconsistencies in how these sophisticated bots behave. Real human traffic tends to show natural unpredictability at scale, whereas synthetic patterns, even if randomized, often exhibit small quirks in their distributions.
How Would Unsupervised Anomaly Detection Work Here?
Unsupervised anomaly detection relies on the assumption that most traffic is legitimate, and outliers in the feature space might be scrapers. You could cluster sessions based on behavioral metrics. If certain clusters exhibit extremely high frequency of hits with minimal variation, or have unusual user-agent distributions, you can label them as potential bot clusters. This does not require labeled data but might lead to false positives if a legitimate traffic spike appears unusual. Combining unsupervised methods with partial labeling, if available, can refine this process.
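One possible unsupervised sketch uses an isolation forest over per-session behavioral features; the feature names and synthetic data below are assumptions chosen only to illustrate the idea.

import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical per-session features: [requests_per_minute, std_of_inter_request_gaps, unique_page_ratio]
rng = np.random.default_rng(0)
normal_sessions = rng.normal(loc=[5, 8, 0.6], scale=[2, 3, 0.1], size=(500, 3))
bot_like = rng.normal(loc=[120, 0.2, 0.95], scale=[10, 0.1, 0.02], size=(10, 3))
X = np.vstack([normal_sessions, bot_like])

iso = IsolationForest(contamination=0.02, random_state=42)
iso.fit(X)
flags = iso.predict(X)              # -1 = outlier (potential bot), 1 = inlier
scores = iso.decision_function(X)   # lower = more anomalous
print((flags[-10:] == -1).sum(), "of the 10 injected bot-like sessions flagged as outliers")

The same session features could instead be clustered (for example with DBSCAN) and the resulting clusters inspected manually before deciding which ones to treat as bot traffic.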
What About Real-Time Implementation Challenges?
Real-time detection involves evaluating requests or sessions as they come in. A naive approach might slow down legitimate traffic. Efficient feature computation is crucial. You might maintain rolling windows of request data per session or per IP, quickly extract time-based patterns (like average requests per second), and compare them to a model threshold. Caching derived features can reduce overhead. In highly dynamic environments, a well-optimized streaming pipeline is needed, often implemented with message queues and a real-time feature store to keep classification latency low.
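A minimal in-memory sketch of the rolling-window idea is shown below; the window length and threshold are placeholders, and in production this state would usually live in a shared store (such as Redis) so that multiple frontends see the same counts.

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10
MAX_REQUESTS_IN_WINDOW = 50   # illustrative threshold, tune per site

recent_requests = defaultdict(deque)   # key (IP or session) -> timestamps of recent requests

def record_and_check(key, now=None):
    """Record a request and return True if the key exceeds the rolling-window rate."""
    now = time.time() if now is None else now
    window = recent_requests[key]
    window.append(now)
    # Drop timestamps that have fallen out of the window
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_REQUESTS_IN_WINDOW

# Example: a burst of 60 requests within a second from the same IP trips the check
suspicious = any(record_and_check("10.0.0.9", now=1000.0 + i * 0.01) for i in range(60))
print(suspicious)   # True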
Could We Use Deep Learning Models for Bot Detection?
Deep learning can capture nonlinear relationships in user behavior data. Recurrent neural networks or Transformers could model user sessions as a sequence of page requests with time intervals, extracting subtle patterns. However, collecting and labeling large datasets is necessary to train deep models effectively. Interpretability can be a challenge, though techniques like feature attribution or attention visualization can give partial insights. In practice, teams often start with simpler models (like logistic regression or gradient-boosted trees) for interpretability and then consider deep learning if the scale and complexity of data are suitable.
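As a sketch only (not a recommended production architecture), a small recurrent classifier over per-request feature vectors might look like this in PyTorch; the feature count, layer sizes, and feature meanings are assumptions.

import torch
import torch.nn as nn

class SessionLSTM(nn.Module):
    """Classify a session (a sequence of per-request feature vectors) as human vs. bot."""
    def __init__(self, n_features=4, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):              # x: (batch, seq_len, n_features)
        _, (h_n, _) = self.lstm(x)     # h_n: (1, batch, hidden_size)
        return self.head(h_n[-1])      # logits of shape (batch, 1)

# Dummy batch: 8 sessions, each 20 requests, 4 features per request
# (e.g., seconds since previous request, page depth, is_static_asset, referrer_present)
model = SessionLSTM()
probs = torch.sigmoid(model(torch.randn(8, 20, 4)))
print(probs.shape)                     # torch.Size([8, 1])

Training would then proceed with a binary cross-entropy loss over labeled sessions, with all the labeling caveats discussed earlier.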
How Would You Validate The Model’s Performance?
A balanced dataset that includes both bot and human sessions is important. Typical performance metrics include precision, recall, and F1-score. A high-recall model flags more bots but may also increase the false-positive rate, annoying real users. It helps to define an acceptable trade-off between blocking malicious traffic and impacting real users. You might run an A/B test where a portion of traffic is classified in real time and blocked or challenged (via CAPTCHA), monitoring user feedback and acceptance rates to refine thresholds.
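Building on the earlier sklearn snippet, one way to pick an operating threshold is to trade off precision and recall for the "bot" class explicitly; the sketch below assumes the model, X_test, and y_test variables from that example and an arbitrary 99% precision target.

import numpy as np
from sklearn.metrics import precision_recall_curve

# label 1 = human, 0 = bot, so treat "bot" as the positive class for this curve
probs_bot = 1 - model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test == 0, probs_bot)

# Lowest bot-score threshold on the curve that keeps bot-flagging precision at or above 99%
ok = precision[:-1] >= 0.99
chosen = thresholds[ok][0] if ok.any() else None
print("bot-score threshold:", chosen)

Shifting this threshold is exactly the precision/recall trade-off described above: lowering it catches more bots at the cost of challenging more legitimate users.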
Could CAPTCHA or Other Challenges Help?
A secondary challenge mechanism like CAPTCHA can catch suspicious traffic that has borderline scores in your classification system. Legitimate users can generally pass a CAPTCHA, while many scrapers fail. However, advanced scraping frameworks can sometimes solve CAPTCHAs using external machine learning APIs. Additionally, frequent CAPTCHAs degrade user experience. Strategies often combine a behind-the-scenes risk engine with occasional challenges only for high-risk sessions.
How Do We Adapt Over Time?
Scrapers evolve tactics, so you must regularly retrain or fine-tune models. Monitoring drift in features is crucial. For instance, if scrapers start randomizing intervals, the distribution of request times might shift. Anomalies become more subtle. Continuous monitoring of prediction confidence and false positives helps in adjusting the detection pipeline and in building improved training sets with newly discovered bot behaviors.
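One lightweight way to watch for drift in a single feature is a two-sample Kolmogorov-Smirnov test between a training-period baseline and recent traffic; the sketch below uses synthetic data and an arbitrary significance cutoff.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
# Hypothetical "seconds between requests" samples: training baseline vs. the current week
baseline_gaps = rng.exponential(scale=12.0, size=5000)
current_gaps = rng.exponential(scale=8.0, size=5000)

stat, p_value = ks_2samp(baseline_gaps, current_gaps)
if p_value < 0.01:
    print(f"Drift detected in inter-request gaps (KS statistic = {stat:.3f}); consider retraining")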
Example Code to Illustrate a Simple ML Pipeline
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Suppose 'data' is a pandas DataFrame with columns:
# 'avg_time_between_requests', 'num_pages_accessed', 'unique_user_agent_score', 'label'
# 'label' = 1 for human, 0 for bot/scraper
X = data[['avg_time_between_requests', 'num_pages_accessed', 'unique_user_agent_score']]
y = data['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
This is a simple illustration. Real-world data pipelines incorporate sessionization, time-window feature engineering, IP-based or device fingerprint features, and more advanced validation strategies. The final classification step might also rely on ensemble methods rather than a single model to achieve higher robustness.
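For instance, a gradient-boosted tree ensemble can be swapped in with minimal changes; the sketch below reuses the variables from the snippet above and uses illustrative hyperparameters.

from sklearn.ensemble import GradientBoostingClassifier

gb_model = GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=42)
gb_model.fit(X_train, y_train)
print(classification_report(y_test, gb_model.predict(X_test)))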
Below are additional follow-up questions
What if a portion of the traffic comes from headless browser frameworks such as Puppeteer or Selenium?
One subtle challenge arises when advanced scrapers are developed using headless browser frameworks like Puppeteer or Selenium, which can execute JavaScript and replicate some human-like behaviors such as loading images, waiting on the page, or even randomizing mouse movements. This complicates detection because traditional indicators (like the absence of JavaScript execution) become less reliable.
A potential approach is to collect more granular front-end metrics that even sophisticated headless browsers might not fully replicate or might replicate at unrealistic scales. Examples include continuous tracking of dynamic user interactions or more nuanced timing signals such as the variance in how DOM elements are accessed over time. While scrapers can artificially create delays, they might miss the complex distributions of real user behavior or certain events unique to specific browsers or devices.
One risk is overfitting. If the detection logic relies too heavily on specialized signals from certain browsers (e.g., missing fonts or unusual screen resolutions), a sophisticated bot developer may adapt and circumvent these checks. Hence, an effective strategy is multifaceted—combining behavior-based signals, anomaly detection on request patterns, and regular revalidation to adapt to new scraper techniques.
How can we differentiate or accommodate legitimate bots, such as search engine crawlers?
Legitimate bots (e.g., from Google, Bing, or approved aggregators) serve important functions, like indexing content. Blocking them may hurt site discoverability or disrupt partnerships. Typically, these bots identify themselves in their user-agent strings. However, an advanced malicious actor can spoof such user agents. A common tactic is to use reverse DNS lookups or official IP whitelists provided by search engines to verify if traffic is truly coming from recognized domains.
A dedicated rules-based system can allow known crawlers while funneling uncertain traffic into a more rigorous classification pipeline. For instance, if a user-agent claims to be “Googlebot” but the IP does not match any Google data center range, it is likely an imposter. This step is crucial for preserving the site’s SEO while still blocking bad actors who might copy and paste legitimate user-agent strings.
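A rough sketch of that verification (reverse DNS followed by a confirming forward lookup) is shown below; error handling is kept minimal, and the domain suffixes reflect Google's documented crawler domains.

import socket

def looks_like_googlebot(ip: str) -> bool:
    """Reverse-resolve the IP, check the domain, then forward-resolve to confirm it maps back."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)            # e.g. crawl-66-249-66-1.googlebot.com
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        forward_ips = socket.gethostbyname_ex(host)[2]   # forward lookup must include the original IP
        return ip in forward_ips
    except (socket.herror, socket.gaierror):
        return False

# Requires live DNS; the result depends on the actual address being checked
# print(looks_like_googlebot("66.249.66.1"))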
If traffic from some partners is non-human, do we still classify them as bots?
In many scenarios, business partners (for instance, data aggregators) may send substantial automated requests for valid reasons (e.g., price comparisons or inventory synchronization). While this traffic is “bot-like,” it might be allowed or even encouraged. The question becomes whether to label these requests as bots and block them or treat them as “trusted partners.”
One solution is to implement a “managed allowlist” for known partners. The system can still label them internally as bots for statistical purposes but not block them. This approach helps maintain accurate analytics for overall traffic classification while avoiding disruption of critical integrations. Monitoring these partners’ traffic patterns is still important, as an unexpected surge or suspicious data usage might indicate credentials have been misused by a malicious third party.
What data privacy or compliance concerns must be considered when collecting user information for detection?
Storing detailed logs for detection can include sensitive information like IP addresses, browser fingerprints, or time-based usage patterns. Depending on the region, regulations such as GDPR or CCPA might impose restrictions. You need to consider minimizing the retention period for raw logs or anonymizing data that is not strictly required.
One approach is to hash personally identifiable information (PII) or tokenize IP addresses at ingestion, then compute detection features on the hashed data. Feature derivation typically focuses on aggregated statistics—like average number of requests per minute—rather than storing complete browsing histories. Ensuring compliance with data protection laws also means properly disclosing data collection practices in the site’s privacy policy and letting users opt out of certain tracking measures where legally required.
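A minimal sketch of tokenizing IP addresses at ingestion with a keyed hash follows; key management is deliberately simplified here, and a real deployment would source and rotate the key from a secrets manager.

import hashlib
import hmac
import os

# Placeholder key handling: in production this comes from a secrets manager and is rotated
HASH_KEY = os.environ.get("IP_HASH_KEY", "dev-only-key").encode()

def tokenize_ip(ip: str) -> str:
    """Return a stable pseudonymous token so per-IP features can be computed without storing raw IPs."""
    return hmac.new(HASH_KEY, ip.encode(), hashlib.sha256).hexdigest()[:16]

print(tokenize_ip("203.0.113.7"))   # the same IP always maps to the same token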
How do we manage false positives to avoid damaging the user experience?
A high-precision detection system might err on the side of caution, classifying borderline cases as bots. For legitimate users, sudden blocks or friction (e.g., forced CAPTCHAs) can be frustrating and reduce site engagement. One strategy is to use confidence thresholds and gentle second-layer challenges. For instance, if the model is only moderately confident about a user being a bot, you could present a less disruptive test or short challenge. If the user passes, the session is whitelisted for some period.
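A sketch of such a tiered policy is below; the cutoffs are placeholders to be tuned against observed false-positive rates rather than recommended values.

def decide_action(bot_score: float) -> str:
    """Map a model's bot probability to a graduated response instead of a hard block."""
    if bot_score >= 0.95:
        return "block"        # very high confidence: block or aggressively rate-limit
    if bot_score >= 0.70:
        return "challenge"    # moderate confidence: lightweight challenge such as a CAPTCHA
    if bot_score >= 0.40:
        return "monitor"      # uncertain: allow, but log the session for review
    return "allow"

print([decide_action(s) for s in (0.99, 0.80, 0.50, 0.10)])   # ['block', 'challenge', 'monitor', 'allow']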
Additionally, continuous monitoring of user complaints or support tickets is essential. If many legitimate users are flagged, the detection parameters may need adjustment. A gradual rollout of new detection rules with A/B testing can also help measure the impact on user experience before fully deploying changes.
What is the trade-off between building a complex detection system and a simpler, lower overhead approach?
A more complex pipeline—potentially involving advanced ML models, real-time data streaming, and sophisticated features—can detect scrapers with greater accuracy. However, the costs and complexity of development, maintenance, and infrastructure also increase.
In smaller sites or low-traffic situations, a simpler, rules-based approach may be sufficient: for example, just limiting requests per IP over short time intervals or blocking known bad user-agent strings. While less robust and easier to circumvent, it might offer a better cost-to-benefit ratio in these environments. Companies with high-value data or large-scale traffic often justify the investment in advanced ML-based systems because the risk of data theft or site performance issues is more significant.
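A deliberately simple rules-only sketch of that lower-overhead approach is shown below; the blocked substrings and the per-minute limit are placeholders, and the counter would be reset (or its keys expired) once per minute by a scheduled job.

from collections import Counter

BLOCKED_UA_SUBSTRINGS = ("python-requests", "scrapy", "curl")   # illustrative, not exhaustive
MAX_REQUESTS_PER_MINUTE = 120                                   # illustrative threshold

requests_this_minute = Counter()

def allow_request(ip: str, user_agent: str) -> bool:
    ua = user_agent.lower()
    if any(s in ua for s in BLOCKED_UA_SUBSTRINGS):
        return False
    requests_this_minute[ip] += 1
    return requests_this_minute[ip] <= MAX_REQUESTS_PER_MINUTE

print(allow_request("198.51.100.4", "Mozilla/5.0"))            # True
print(allow_request("198.51.100.4", "python-requests/2.31"))   # False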
Could external threat intelligence or specialized hardware be integrated?
Some organizations partner with security companies that maintain lists of malicious IPs or known bot networks. Integrating this threat intelligence into your detection system can offer a proactive layer, blocking or challenging requests from suspicious ranges. Additionally, specialized hardware or cloud-based solutions (like web application firewalls) can inspect network traffic in real time, offloading part of the detection burden.
A potential pitfall is that threat intelligence feeds are never perfect—aggressive blocking can cause false positives if legitimate users appear on lists due to shared hosting or proxy usage. Balancing external intelligence with your own behavioral modeling is important. You can treat these external signals as features or prior probabilities in the ML classification pipeline, weighting them according to your historical experience of accuracy.
What if there are large spikes in traffic that could be legitimate or malicious?
Websites occasionally experience traffic surges—for instance, during marketing campaigns, popular product launches, or viral content. Such spikes might appear suspicious to an anomaly-based system. A naive approach could mistakenly throttle or block legitimate visitors, causing missed sales or negative user experiences.
One safeguard is to integrate domain context or business knowledge. If a marketing push or external event is expected, a well-designed system can switch to a more lenient threshold for classification in that window. Another option is to detect “bursty but legitimate” traffic by examining correlated signals like consistent referrer pages (suggesting traffic is arriving from a known promotional source) or normal geographic distribution. If the spike is accompanied by suspicious indicators (high request rates from a narrow IP range, repetitive page access patterns, etc.), it’s more likely malicious.
How to maintain and scale the detection pipeline under growing data volume?
As traffic grows, the detection system must process more logs without sacrificing real-time performance. Key techniques include stream processing architectures (like Apache Kafka or AWS Kinesis) to handle large volumes in near real-time, followed by distributed frameworks (Spark or Flink) for feature computation and model application. Caching aggregate statistics (such as requests per minute per IP) in in-memory data stores (like Redis) helps reduce database bottlenecks.
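As a sketch of the caching idea, a per-IP counter bucketed by minute can be kept in Redis; this assumes the redis-py client and a reachable Redis instance, and the key naming is illustrative.

import time
import redis

r = redis.Redis(host="localhost", port=6379)

def record_request(ip: str, window_seconds: int = 60) -> int:
    """Increment a per-IP counter for the current time bucket and return the count."""
    bucket = int(time.time() // window_seconds)
    key = f"req_count:{ip}:{bucket}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, window_seconds * 2)   # old buckets expire on their own
    return int(count)

# count = record_request("203.0.113.7")   # requires a running Redis server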
Moreover, model retraining pipelines need to scale as well. Automated workflows that periodically sample new data, retrain, and validate the model should be established. Monitoring model drift—where the distribution of traffic changes or scrapers adapt—becomes essential. Infrastructure resiliency is also important, so the detection system does not become a single point of failure.
What if we want to attribute a scraper to a specific organization or competitor?
Sometimes business goals include identifying which competitors might be scraping data. This is trickier than just detecting bot vs. human. You might look at advanced signals such as the set of pages accessed, how often product detail pages are targeted, or geolocation clues from IP addresses. However, attributing traffic to a specific entity requires additional intelligence, such as known IP blocks owned by that competitor or repeated patterns linking multiple IPs back to a single entity.
Gathering enough evidence to be confident in attribution can be challenging and can risk false accusations if the same hosting provider serves multiple unrelated clients. Often, it becomes more of a legal or security-team investigation rather than purely a technical classification. Nevertheless, logging and storing relevant evidence could be important if you escalate it to your legal department or if you have contractual recourse against unauthorized scraping.