ML Interview Q Series: How would you detect and query users creating multiple accounts to upvote their own comments?
Comprehensive Explanation
To investigate whether users are inflating their own comment upvotes by creating multiple accounts, we often start by looking for suspicious patterns in the data, particularly in how comments are voted on. We need to connect three key data tables (users, comments, comment_votes) to detect signals of fraud. Below is a structured approach:
Identifying Suspicious User Behavior
One commonly used strategy is to look for unusual patterns in upvotes that might suggest users are coordinating or owning multiple accounts. Some potential warning signs:
Excessive Upvotes Concentrated on a Single User's Comments: If a comment consistently receives upvotes from a specific set of users who either share registration information (like email domain, referral link, or IP address) or show suspicious activity patterns (like extremely short intervals between account creation and voting), it may signal fraud.
Multiple Accounts Voting on the Same Comments: A suspect user might create a group of accounts (sock puppets) that do nothing except upvote the main account's comments. Tracking user relationships (e.g., by shared IP or creation time) helps highlight clusters of accounts voting on the same target user.
Time-Based Anomalies: If multiple upvotes repeatedly arrive within a short window of a comment being posted, this can imply orchestrated, automated, or self-coordinated activity (see the sketch after this list).
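To make the time-based signal concrete, here is a rough sketch in the same Python-wrapped SQL style (Postgres-flavored interval syntax; the created_at columns on users, comments, and comment_votes are schema assumptions, and the 60-second and 1-day thresholds are placeholders to tune). It counts, per comment owner, upvotes that arrived within a minute of the comment being posted and came from accounts less than a day old:

rapid_vote_query = """
SELECT
    c.user_id AS comment_owner,
    COUNT(*)  AS rapid_votes_from_new_accounts
FROM comment_votes cv
JOIN comments c ON cv.comment_id = c.comment_id
JOIN users u    ON cv.voter_id   = u.user_id
WHERE
    -- vote cast within 60 seconds of the comment appearing (assumed created_at columns)
    cv.created_at <= c.created_at + INTERVAL '60 seconds'
    -- and the voting account was registered less than a day before the vote
    AND cv.created_at <= u.created_at + INTERVAL '1 day'
GROUP BY c.user_id
ORDER BY rapid_votes_from_new_accounts DESC
"""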
Potential Metrics to Flag Fraudulent Activity
Ratio of Self-Upvoting Users: For each user, examine the proportion of upvotes they receive from accounts that share suspicious similarities (e.g., same IP address, same device fingerprint, or very close account creation dates).
Average Upvote Concentration: A user might be flagged if 80%–90% of their total upvotes come from a small group of accounts that are not active anywhere else on the forum (see the sketch after this list).
Connection Graph: Construct a graph where nodes represent user accounts and edges represent upvote relationships. A subgraph containing multiple nodes with suspiciously tight interconnections can be flagged.
Statistical Deviation: Look at the distribution of upvote patterns across normal users. If a user's upvote pattern sits several standard deviations away from typical behavior, they may be suspicious.
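A minimal pandas sketch of the concentration and deviation metrics, using an invented vote-level table (the column names comment_owner and voter_id are assumptions, and the 0.8 and 3-sigma thresholds are purely illustrative):

import pandas as pd

# Toy data: one row per upvote received (hypothetical column names).
votes = pd.DataFrame({
    "comment_owner": [1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "voter_id":      [9, 9, 9, 9, 9, 8, 5, 6, 7, 2, 4, 5, 6],
})

# Votes per (owner, voter) pair, total votes per owner, and the share contributed
# by each owner's single most frequent voter.
per_pair = votes.groupby(["comment_owner", "voter_id"]).size().rename("n").reset_index()
totals = per_pair.groupby("comment_owner")["n"].sum()
top_voter_share = per_pair.groupby("comment_owner")["n"].max() / totals

# Simple z-score on total upvotes received, to spot statistical outliers.
z_score = (totals - totals.mean()) / totals.std()

flags = pd.DataFrame({"total_upvotes": totals,
                      "top_voter_share": top_voter_share,
                      "z_score": z_score})
flags["suspicious"] = (flags["top_voter_share"] > 0.8) | (flags["z_score"] > 3)
print(flags)

Here owner 1 gets flagged because more than 80% of their upvotes come from a single account, even though their total upvote count is unremarkable.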
Formulating a Query to Calculate the Fraud Percentage
At its simplest, once you have a way to identify a suspicious user (for instance, a user who receives the majority of votes from only a few closely linked accounts), you can count how many users in total appear suspicious. The primary output is then the ratio (or percentage) of such flagged users over the total number of users.
Below is a conceptual representation of how you might accomplish this in SQL. Since the question specifically asks for a query, the SQL is shown wrapped in a Python string:
query = """
WITH user_votes AS (
    SELECT
        c.user_id   AS comment_owner,
        cv.voter_id AS voting_user,
        COUNT(*)    AS total_votes_by_voter
    FROM comments c
    JOIN comment_votes cv ON c.comment_id = cv.comment_id
    GROUP BY c.user_id, cv.voter_id
),
suspicious_users AS (
    SELECT
        uv.comment_owner,
        COUNT(*) AS suspicious_voter_count
    FROM user_votes uv
    JOIN users u1 ON uv.voting_user   = u1.user_id  -- the account casting the votes
    JOIN users u2 ON uv.comment_owner = u2.user_id  -- the account receiving the votes
    WHERE
        /* Insert your "suspicious voter" logic here, for instance:
           same IP address, same email domain, near-identical account creation times,
           or any other condition that ties the voter to the comment owner. */
        u1.ip_address = u2.ip_address
        AND uv.voting_user <> uv.comment_owner
    GROUP BY uv.comment_owner
)
SELECT
    (COUNT(*) * 1.0 / (SELECT COUNT(*) FROM users)) * 100 AS percentage_of_suspicious_users
FROM suspicious_users
"""
Here’s the logic in the above steps:

The user_votes CTE associates each comment_owner with every voting_user who voted on their comments, along with how many votes that voter cast.

In suspicious_users, we join back to users twice (u1 for the voter, u2 for the comment owner) and keep only voter-owner pairs that satisfy the suspicion condition; here u1.ip_address = u2.ip_address means the same IP was used by both the voter and the comment owner, suggesting they may be the same person, and the extra predicate excludes votes a user casts from their own account. We then count how many suspicious voters each comment owner has.

Finally, we compute how many unique users have at least one suspicious voter, relative to the total user count, expressed as a percentage.
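To actually run the query, hand the string to any SQL client. A self-contained toy example with sqlite3 and pandas (the sample rows are invented purely to exercise the query string defined above):

import sqlite3
import pandas as pd

# Tiny in-memory schema mirroring the tables the query expects.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, ip_address TEXT);
CREATE TABLE comments (comment_id INTEGER PRIMARY KEY, user_id INTEGER);
CREATE TABLE comment_votes (comment_id INTEGER, voter_id INTEGER);
INSERT INTO users VALUES (1, '10.0.0.1'), (2, '10.0.0.1'), (3, '10.0.0.9');
INSERT INTO comments VALUES (100, 1), (200, 3);
INSERT INTO comment_votes VALUES (100, 2), (200, 2);
""")

# `query` is the Python string defined above; user 1 is upvoted by user 2,
# who shares their IP, so 1 of 3 users is flagged (about 33.3%).
print(pd.read_sql_query(query, conn))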
Calculating the Percentage of Fraudulent Users
Once we identify which users are fraudulent, we can compute the fraction using:

fraction_of_suspected_users = count_of_suspected_users / total_users

Where:
count_of_suspected_users is the number of user accounts flagged as having fraudulent upvotes.
total_users is the total number of user accounts in the forum.
The fraction is then multiplied by 100 to convert it into a percentage. This final metric indicates the scale of the problem.
Common Follow-Up Questions
What if multiple users share a genuine IP address or network location (e.g., at a workplace)?
Sometimes legitimate users in the same physical location might end up having the same IP. To account for this scenario, you can refine your logic to incorporate additional signals. For instance, you might require multiple suspicious signals (like identical IP addresses, short creation intervals, very limited or singular voting patterns, minimal activity on other posts, or matching email domains) to classify a user as fraudulent. A single factor like an IP match alone can lead to false positives.
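One way to operationalize this multi-signal idea is a simple additive score per voter-owner pair, where an account is flagged only when several independent signals fire together. The signal names, weights, and threshold below are illustrative placeholders rather than a prescribed rule set:

SIGNAL_WEIGHTS = {
    "same_ip": 1.0,
    "accounts_created_within_hour": 1.5,
    "voter_only_votes_for_this_owner": 2.0,
    "same_email_domain": 0.5,
}
FLAG_THRESHOLD = 3.0  # a lone IP match can never trigger a flag on its own


def suspicion_score(signals: dict[str, bool]) -> float:
    """Sum the weights of the signals that fired for one voter-owner pair."""
    return sum(w for name, w in SIGNAL_WEIGHTS.items() if signals.get(name, False))


# A shared office IP alone stays well below the threshold...
print(suspicion_score({"same_ip": True}))  # 1.0 -> not flagged
# ...but an IP match plus single-target voting plus near-simultaneous signup is flagged.
print(suspicion_score({"same_ip": True,
                       "voter_only_votes_for_this_owner": True,
                       "accounts_created_within_hour": True}) >= FLAG_THRESHOLD)  # True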
How do you deal with large-scale data and performance constraints?
When dealing with massive datasets:
Indexing is crucial for columns involved in JOIN conditions and WHERE clauses.
Data partitioning or sharding can be considered to handle large volumes of user or comment records.
Summaries or aggregated tables (materialized views) can reduce the overhead of constantly recomputing suspicious patterns (see the sketch after this list).
Big data frameworks (e.g., Spark SQL) may be required if the data grows beyond the capacity of a single node.
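As a hedged illustration of the first three points (Postgres-flavored syntax is assumed; the index and view names are invented), the supporting indexes and a pre-aggregated materialized view might look like this:

performance_ddl = """
-- Speed up the comments-to-votes join and voter lookups.
CREATE INDEX IF NOT EXISTS idx_comment_votes_comment ON comment_votes (comment_id);
CREATE INDEX IF NOT EXISTS idx_comment_votes_voter   ON comment_votes (voter_id);
CREATE INDEX IF NOT EXISTS idx_comments_user         ON comments (user_id);

-- Pre-aggregate owner/voter pairs so fraud checks do not rescan raw votes.
CREATE MATERIALIZED VIEW user_votes_agg AS
SELECT
    c.user_id   AS comment_owner,
    cv.voter_id AS voting_user,
    COUNT(*)    AS total_votes_by_voter
FROM comments c
JOIN comment_votes cv ON c.comment_id = cv.comment_id
GROUP BY c.user_id, cv.voter_id;

-- Refresh on a schedule (e.g., hourly) rather than recomputing per query:
-- REFRESH MATERIALIZED VIEW user_votes_agg;
"""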
What other factors might indicate user fraud beyond IP or creation time?
User-Agent fingerprint or device ID: If multiple accounts use the exact same device signature.
Behavior patterns: Repetitive login/logout patterns, unusual time-of-day access, or no real interaction aside from upvoting a single user.
Social graph analysis: If the same group of accounts repeatedly upvotes each other in a closed loop, forming a cluster (see the sketch after this list).
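A small graph-based sketch of that last idea using networkx (an assumed choice; any graph library would do): build a directed "A upvoted B" graph, keep only reciprocal edges, and report tightly knit groups whose votes flow almost exclusively among themselves.

import networkx as nx

# Directed edges: (voter, comment_owner), aggregated over all upvotes (toy data).
edges = [
    (101, 1), (1, 101),       # three accounts upvoting each other in a closed loop
    (102, 1), (1, 102),
    (101, 102), (102, 101),
    (103, 1),                 # one-way vote from an unrelated account
    (55, 77),                 # unrelated activity elsewhere on the forum
]

G = nx.DiGraph()
G.add_edges_from(edges)

# Treat mutual upvotes as undirected edges and look for dense cliques among them.
reciprocal = nx.Graph([(u, v) for u, v in G.edges() if G.has_edge(v, u)])
for clique in nx.find_cliques(reciprocal):
    if len(clique) >= 3:
        print("Possible vote ring:", sorted(clique))  # prints [1, 101, 102] here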
How could we continuously monitor and flag suspicious users in real time?
Streaming data pipelines (e.g., Kafka + Spark Streaming or Flink) can track upvote events as they happen.
Real-time scoring: Maintain a suspiciousness score for each user that updates whenever a new vote is cast (see the sketch after this list).
Alerts or thresholds: Trigger an alert if a user’s suspiciousness score exceeds a certain threshold, prompting further investigation.
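A stripped-down, framework-agnostic sketch of the real-time scoring idea; the event fields, weights, and threshold are invented for illustration, and in production the events would arrive from a stream (Kafka, Flink, etc.) rather than a Python list:

from collections import defaultdict

ALERT_THRESHOLD = 5.0
scores = defaultdict(float)  # comment_owner -> running suspiciousness score


def process_vote_event(event: dict) -> None:
    """Update the comment owner's score as each upvote event streams in."""
    owner = event["comment_owner"]
    increment = 0.0
    if event.get("same_ip_as_owner"):                    # enrichment assumed upstream
        increment += 2.0
    if event.get("voter_account_age_hours", 1e9) < 24:   # brand-new voting account
        increment += 1.5
    if event.get("seconds_since_comment", 1e9) < 60:     # vote landed suspiciously fast
        increment += 1.0
    scores[owner] += increment
    if scores[owner] >= ALERT_THRESHOLD:
        print(f"ALERT: user {owner} exceeded suspicion threshold ({scores[owner]:.1f})")


# Simulated stream: two rapid same-IP votes from fresh accounts trigger the alert.
for evt in [
    {"comment_owner": 1, "voter_id": 9, "same_ip_as_owner": True,
     "voter_account_age_hours": 2, "seconds_since_comment": 30},
    {"comment_owner": 1, "voter_id": 8, "same_ip_as_owner": True,
     "voter_account_age_hours": 5, "seconds_since_comment": 45},
]:
    process_vote_event(evt)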
How can we reduce the risk of labeling legitimate users as fraud?
Implement a gradual and multi-factor approach:
Use several different signals (IP, time intervals, number of unique voters, domain, etc.) so that an account must cross multiple suspicious thresholds before being flagged.
Regularly evaluate and update the model or rules based on real-world feedback or false positives.
Provide a feedback loop or appeals process for users to validate their authenticity.
By applying a multi-layered methodology—analyzing not only IP match but also user engagement patterns, comment text analysis (if permitted), clustering behaviors, and more—platforms can minimize false positives and effectively curb fraudulent self-upvoting.
Below are additional follow-up questions
Could advanced ML or anomaly detection techniques be used for more robust detection?
Modern detection strategies can go beyond rule-based checks (like IP matching or shared email domains). A sophisticated approach involves training anomaly detection models—such as isolation forests or autoencoders—to learn typical voting behaviors. These models can then flag unusual patterns that might indicate fraudulent collusion.
A crucial pitfall is ensuring you have enough representative data for normal vs. fraudulent voting. If the dataset is heavily imbalanced (very few fraudulent events), the model might learn patterns of “normal” behavior but remain weak in identifying fraud. Careful handling of class imbalance (through techniques like oversampling suspicious cases, using anomaly detection, or applying cost-sensitive methods) can help. In real-world scenarios, updating and retraining these models regularly is vital because fraudsters adapt to detection strategies over time.
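As one deliberately simplified illustration, an isolation forest from scikit-learn can score per-user feature vectors; the three features and the synthetic data below are invented stand-ins for the aggregations discussed earlier:

import numpy as np
from sklearn.ensemble import IsolationForest

# Per-user features (hypothetical): total upvotes received, share of upvotes from
# the single top voter, and mean seconds between a comment and its upvotes.
rng = np.random.default_rng(0)
normal_users = np.column_stack([
    rng.poisson(20, 500),             # typical upvote counts
    rng.uniform(0.05, 0.4, 500),      # votes spread across many voters
    rng.uniform(600, 86_400, 500),    # votes arrive over minutes to hours
])
suspicious_users = np.array([
    [40, 0.95, 20.0],                 # nearly all votes from one account, seconds later
    [35, 0.90, 45.0],
])

X = np.vstack([normal_users, suspicious_users])
model = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = model.predict(X)             # -1 = anomaly, 1 = normal
print("flagged row indices:", np.where(labels == -1)[0])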
Could a user rotate IP addresses or proxy services to evade detection?
Many sophisticated attackers cycle through proxies, VPNs, or even residential IP services to conceal their true origin. When users do this, a simple IP-based rule can easily fail because the system sees seemingly unrelated IPs for the same user. Attackers might also alter their device fingerprints or user-agent strings to make their accounts look more distinct.
A potential edge case is a user on a dynamic IP address or mobile network that changes frequently, which can generate false positives. Combining multiple signals—time of voting, frequency of account creation, browsing behavior, correlation among comments, device fingerprinting, etc.—helps reduce reliance on IP alone. There is no perfect, foolproof method, but layering these signals makes evasion harder and more resource-intensive for the fraudster.
How do we handle the trade-off between user privacy and rigorous detection?
Stricter measures, such as collecting detailed device fingerprints or always logging precise user metadata (e.g., hardware info, precise location), may increase detection accuracy. However, these strategies raise privacy concerns and can erode user trust if implemented without transparent policies.
A real-world edge case is when legitimate privacy-conscious users (e.g., people who use VPNs for personal reasons or those in restrictive countries) appear similar to potential fraudsters. To handle this scenario:
Adopt a privacy-by-design philosophy: store minimal necessary data for detection and discard sensitive details as soon as possible.
Provide transparent user agreements explaining the scope of data collection.
Offer a path for legitimate users to verify themselves without excessive intrusion (e.g., trusted device verification or alternative forms of proof).
Could malicious users artificially inflate engagement metrics to mislead detection systems?
Yes, attackers might attempt to blend in by making accounts post “normal” content, comment on multiple threads, or engage in varied votes, rather than just upvoting their main account. This approach can trick simplistic, rule-based detectors that look for “low activity” sock-puppet accounts.
An edge case is a highly active group of fraudulent accounts that appear even more engaged than genuine users. They might form a network of cross-votes among themselves to build “legitimacy” while also boosting the main target account. Countering this requires analyzing entire vote networks and looking at deeper metrics like shared post interactions, suspicious velocity (e.g., too many actions in a short period), or unrealistically uniform patterns in voting behavior.
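The "suspicious velocity" and "unrealistically uniform patterns" checks can be made concrete with simple inter-event statistics. A small pandas sketch (the timestamps and thresholds are invented) flags voters whose gaps between successive votes are both very short and nearly constant, a pattern automation tends to produce:

import pandas as pd

# Toy vote log: one row per upvote with the time it was cast (hypothetical schema).
votes = pd.DataFrame({
    "voter_id": [7, 7, 7, 7, 8, 8, 8],
    "created_at": pd.to_datetime([
        "2024-01-01 10:00:00", "2024-01-01 10:00:30",   # voter 7: metronomic 30-second gaps
        "2024-01-01 10:01:00", "2024-01-01 10:01:30",
        "2024-01-01 09:00:00", "2024-01-01 13:25:00",   # voter 8: irregular, human-like gaps
        "2024-01-02 08:10:00",
    ]),
})

# Seconds between each voter's successive votes.
votes = votes.sort_values(["voter_id", "created_at"])
votes["gap_s"] = votes.groupby("voter_id")["created_at"].diff().dt.total_seconds()

stats = votes.groupby("voter_id")["gap_s"].agg(["mean", "std"])
stats["bot_like"] = (stats["mean"] < 60) & (stats["std"] < 5)
print(stats)  # voter 7 is flagged; voter 8 is not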
How do you effectively measure false positives vs. false negatives in an automated detection pipeline?
Balancing false positives (legitimate users mistakenly flagged) and false negatives (fraudsters who slip through) is critical. In practice, one might deploy a scoring system that assigns each user a “suspiciousness” score based on multiple signals. Raising the threshold reduces false positives but may increase false negatives, and vice versa.
A key pitfall is ignoring the context of the community. For some forums, a stricter approach is necessary because fraudulent upvotes can severely damage trust. For others, such as smaller hobbyist communities, an overly aggressive detection mechanism might alienate legitimate users. It’s also important to track metrics like:
Precision: Of the accounts flagged as suspicious, how many truly are?
Recall: Of all actual fraudsters, how many did we catch?
Regularly auditing flagged accounts, either manually or through a human review system, helps refine and calibrate detection thresholds over time (see the sketch below).
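A minimal sketch of computing these two metrics with scikit-learn, assuming a manually audited sample where the true label of each account is known (the labels below are invented):

from sklearn.metrics import precision_score, recall_score

# 1 = fraudster, 0 = legitimate, for a manually audited sample of accounts.
y_true = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
# 1 = flagged by the detection pipeline, 0 = not flagged.
y_pred = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]

print("precision:", precision_score(y_true, y_pred))  # flagged accounts that truly are fraud
print("recall:   ", recall_score(y_true, y_pred))     # fraudsters the pipeline caught

Sweeping the suspiciousness threshold and recomputing both numbers traces out the precision/recall trade-off described above.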
Are there any legal or community-driven considerations when banning suspicious accounts?
Outside of purely technical measures, suspending or banning accounts suspected of fraud can have legal and reputational implications. In some regions, overly aggressive or inaccurate enforcement might lead to user backlash or even lawsuits if users feel unjustly targeted.
A subtle real-world scenario is when influencers or marketers are banned, claiming they were wrongfully flagged by an algorithm. This can escalate into public disputes or media coverage. To address these challenges, ensure:
The platform’s Terms of Service explicitly prohibit sock puppeting or fraudulent upvotes.
Users have a clear appeals process if they believe they were incorrectly flagged.
There’s a transparent system of warnings or staged enforcement (e.g., temporary suspensions) before imposing permanent bans, to allow genuine users an opportunity to correct issues or explain misunderstandings.