ML Case-study Interview Question: Real-Time Fraud Detection at Scale Using Streaming Pipelines and ClickHouse Analytics
Case-Study Question
A large online platform receives millions of transactions daily and needs to identify and stop suspicious orders in near real-time. They have built a custom fraud-detection engine to generate and evaluate rules that reference user features across multiple data streams. They use Kafka to capture both relational database updates and user events, then process these streams with a config-based data pipeline to store them in a real-time analytics database. The rules engine queries this database on incoming requests, retrieves and aggregates fresh features, and decides whether to flag the transaction or suspend the account.
Assume you are tasked with designing and optimizing this system for massive scale, ensuring low latency, data freshness, and maintainability. Propose your approach to:
Ingest data continuously from a relational store and event streams into a real-time analytics database.
Provide a self-serve feature-engineering layer for fraud analysts.
Execute rule-based evaluations in a fraction of a second.
Maintain system performance, data integrity, and tight access controls.
Explain your design in detail, including architecture, key technology choices, query optimizations, and steps to handle feature definition changes. Include the tradeoffs of batch vs. real-time ingestion and how you would protect sensitive data. Outline how you would generalize the platform for other similar use cases beyond fraud. After your explanation, answer follow-up questions about each critical component.
Proposed Solution
A high-level architecture uses streaming pipelines, a flexible rule engine, and a columnar database. Each component is detailed below.
Data Ingestion and Processing
Use a relational database (e.g. PostgreSQL) as the primary system of record, capturing critical transactions such as orders and account details. Set up Change Data Capture to stream new or updated rows to Kafka, with a dedicated topic per table. Publish event data from mobile and web clients as JSON to separate Kafka topics.
A stream processing framework (e.g. Flink) subscribes to those topics. It applies transformations, mappings, and format validation. It then batches inserts into a columnar analytics database that supports optimized queries for fraud detection. Flink jobs rely on YAML configurations to define field mappings, cluster endpoints, intervals for flushing data, and other ingestion parameters.
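For concreteness, a config for one topic-to-table mapping might look like the sketch below (parsed here with PyYAML). The field names (source, sink, mappings, flush_interval_seconds) are illustrative assumptions, not the pipeline's actual schema.

```python
import yaml

# Hypothetical ingestion config for one Kafka topic -> analytics table mapping.
# Keys and values are illustrative, not the real pipeline schema.
INGESTION_CONFIG = """
source:
  topic: orders_cdc            # Kafka topic fed by change data capture
  format: json
sink:
  table: fraud.orders          # target table in the analytics database
  flush_interval_seconds: 5    # how often buffered rows are flushed
  max_batch_rows: 10000        # cap on rows per insert batch
mappings:                      # source field -> destination column
  order_id: order_id
  shopper_id: shopper_id
  amount_cents: amount_cents
  created_at: event_time
"""

config = yaml.safe_load(INGESTION_CONFIG)
print(config["sink"]["table"], config["sink"]["flush_interval_seconds"])
```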
Real-Time Analytics Database
Choose a database like ClickHouse for its columnar design and merge-tree architecture, which handles large ingestion workloads and performs aggregations quickly. Set a sorting key that includes an entity identifier and timestamp. This reduces scanned data for time-window queries in fraud checks. Data is appended in short intervals for near real-time freshness.
Watch the tradeoff between data lag and query performance by tuning ingestion batch sizes. For consistent throughput, insert data in batches rather than single rows. Configure table TTLs (time-to-live) to expire old records that short-window fraud checks no longer need.
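A minimal sketch of such a table, assuming illustrative names (fraud.orders, shopper_id, event_time) and a 30-day TTL, is shown below; the commented-out clickhouse-driver call is one way to apply it.

```python
# Sketch of a ClickHouse table tuned for short-window fraud lookups.
# Table/column names and the 30-day TTL are illustrative assumptions.
ORDERS_DDL = """
CREATE TABLE IF NOT EXISTS fraud.orders
(
    order_id     String,
    shopper_id   String,
    amount_cents UInt64,
    event_time   DateTime
)
ENGINE = MergeTree
ORDER BY (shopper_id, event_time)   -- sorting key: entity id, then time
TTL event_time + INTERVAL 30 DAY    -- drop rows not needed for short-window checks
"""

# Optional: apply the DDL with the clickhouse-driver client (assumed available).
# from clickhouse_driver import Client
# Client(host="clickhouse.internal").execute(ORDERS_DDL)
print(ORDERS_DDL)
```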
Self-Serve Feature Engineering
Enable fraud analysts to define custom feature queries as YAML definitions. Each definition specifies:
The analytics database table(s) to query
A short time window (e.g. last 24 hours)
The entity identifier (shopper_id, order_id, etc.)
Simple transformations or aggregations (e.g. summations or counting suspicious events)
Output fields that will be used in rules
Store these definitions in a centralized registry. Analysts can modify or add definitions without requiring code changes, as the analytics engine dynamically compiles the definitions into SQL queries and runs them to retrieve features.
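A hypothetical definition and the query it might compile to are sketched below; the keys (name, table, entity_key, window_hours, aggregation, output_field) and the compilation logic are assumptions for illustration.

```python
import yaml

# Hypothetical self-serve feature definition; keys are illustrative,
# not the registry's real schema.
FEATURE_DEF = """
name: orders_last_24h
table: fraud.orders
entity_key: shopper_id
window_hours: 24
aggregation: count()          # e.g. count of orders in the window
output_field: orders_last_24h
"""

def compile_feature_sql(definition: dict) -> str:
    """Render a feature definition into a parameterized ClickHouse query."""
    return (
        f"SELECT {definition['aggregation']} AS {definition['output_field']} "
        f"FROM {definition['table']} "
        f"WHERE {definition['entity_key']} = %(entity_id)s "
        f"AND event_time >= now() - INTERVAL {definition['window_hours']} HOUR"
    )

feature = yaml.safe_load(FEATURE_DEF)
print(compile_feature_sql(feature))
```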
Rule Evaluation and Decisioning
When a new transaction or event arrives for evaluation, the rule engine:
Identifies which feature definitions to invoke
Sends the relevant parameters (e.g. order_id, shopper_id) to the analytics database
Retrieves the aggregated features
Compares each feature to thresholds or conditions in the rule logic
Triggers the appropriate action (e.g. flag, suspend, manual check)
Keep these rules in a UI that analysts manage. This removes engineering bottlenecks for creating or updating heuristics. For advanced patterns, analysts may combine multiple feature definitions in the rule logic.
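A stripped-down sketch of this evaluation loop is shown below; the Rule structure, the stubbed fetch_features helper, and the example threshold are hypothetical stand-ins for the real feature queries and rule registry.

```python
from dataclasses import dataclass
from typing import Callable

# Minimal rule-evaluation sketch. Feature retrieval is stubbed; in the real
# system it would run the compiled feature queries against the analytics DB.
@dataclass
class Rule:
    name: str
    feature_names: list[str]
    condition: Callable[[dict], bool]   # receives the retrieved feature values
    action: str                         # e.g. "flag", "suspend", "manual_review"

def fetch_features(entity_id: str, feature_names: list[str]) -> dict:
    # Placeholder: would execute the registered feature queries for entity_id.
    return {"orders_last_24h": 7, "chargebacks_last_30d": 2}

def evaluate(entity_id: str, rules: list[Rule]) -> list[str]:
    actions = []
    for rule in rules:
        features = fetch_features(entity_id, rule.feature_names)
        if rule.condition(features):
            actions.append(rule.action)
    return actions

velocity_rule = Rule(
    name="high_order_velocity",
    feature_names=["orders_last_24h"],
    condition=lambda f: f["orders_last_24h"] > 5,
    action="flag",
)
print(evaluate("shopper_123", [velocity_rule]))
```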
Access and Security
Enforce two-step access. First, control which clients can reach the analytics database endpoint. Then require explicit credentials for each authorized user group. Mask or encrypt sensitive columns if needed. Create dedicated database accounts for read-only queries, with strict usage auditing.
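Inside the database itself, the credential step could look like the following sketch of ClickHouse role-based grants; the user name, tables, and password handling are illustrative, and in practice the secret would come from a secrets manager.

```python
# Sketch of a dedicated read-only ClickHouse account with table-level grants.
# User name, tables, and password handling are illustrative assumptions.
READONLY_ACCOUNT_SQL = [
    "CREATE USER IF NOT EXISTS fraud_rules_ro IDENTIFIED WITH sha256_password BY 'managed-secret'",
    "GRANT SELECT ON fraud.orders TO fraud_rules_ro",
    "GRANT SELECT ON fraud.events TO fraud_rules_ro",
]
# from clickhouse_driver import Client
# admin = Client(host="clickhouse.internal", user="admin_user", password="...")
# for stmt in READONLY_ACCOUNT_SQL:
#     admin.execute(stmt)
```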
Extending Beyond Fraud
The same platform can serve other real-time analytics use cases. For example, marketing could define user-segmentation features, or operations could track performance metrics. The fundamental pattern—stream ingestion to a columnar database, self-serve feature queries, and rule-based actions—can support multiple domains.
What are your plans for ingesting large volumes of event data with minimal lag?
When handling high event volumes, set up Kafka partitions to parallelize reads. Use Flink’s scaling to distribute load. Insert data into the analytics database in regular, small batches (for example, every few seconds) to keep ingestion latencies low. Balance the frequency of these writes and flush sizes, because too-frequent writes can degrade performance. Leverage the database’s distributed table feature to process inserts concurrently across multiple nodes.
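A sketch of such micro-batching, assuming the clickhouse-driver client and illustrative endpoint, table, and threshold values, might look like this:

```python
import time
from clickhouse_driver import Client  # assumed client; any batching insert path works

# Sketch of micro-batched inserts: buffer rows from the stream and flush every
# few seconds or when the buffer fills, instead of inserting row by row.
class BatchWriter:
    def __init__(self, host="clickhouse.internal", flush_interval_s=5, max_rows=10_000):
        self.client = Client(host=host)
        self.flush_interval_s = flush_interval_s
        self.max_rows = max_rows
        self.buffer = []
        self.last_flush = time.monotonic()

    def add(self, row: tuple) -> None:
        self.buffer.append(row)
        due = (time.monotonic() - self.last_flush) >= self.flush_interval_s
        if due or len(self.buffer) >= self.max_rows:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        self.client.execute(
            "INSERT INTO fraud.orders (order_id, shopper_id, amount_cents, event_time) VALUES",
            self.buffer,
        )
        self.buffer = []
        self.last_flush = time.monotonic()
```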
How do you ensure query efficiency as data grows?
A carefully chosen sorting key is crucial. Sort rows by a primary entity (customer_id or shopper_id) and a time column. This restricts queries to relevant segments when retrieving recent data. Partitioning can be used if data volumes are very large. Regularly optimize or merge partitions in the database to reduce fragmentation. Monitor query profiles for slow responses, and tune materialized views if certain aggregations are repeated.
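When certain aggregations are repeated, a pre-computed rollup can take a form like the sketch below; the view name, hourly granularity, and columns are illustrative assumptions.

```python
# Sketch of a ClickHouse materialized view that pre-aggregates per-shopper,
# per-hour order stats, so repeated rule queries read the rollup instead of
# scanning raw rows. View name, granularity, and columns are illustrative.
HOURLY_ROLLUP_DDL = """
CREATE MATERIALIZED VIEW IF NOT EXISTS fraud.orders_hourly
ENGINE = SummingMergeTree
ORDER BY (shopper_id, hour)
AS
SELECT
    shopper_id,
    toStartOfHour(event_time) AS hour,
    count() AS order_count,
    sum(amount_cents) AS spend_cents
FROM fraud.orders
GROUP BY shopper_id, hour
"""
# from clickhouse_driver import Client
# Client(host="clickhouse.internal").execute(HOURLY_ROLLUP_DDL)
```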
Can you explain how analysts create and deploy a new real-time feature?
They open a YAML config file, specifying:
The table and columns to read
The entity or join keys
Aggregations or filters for data within a time window
Output columns representing the final feature values
They commit the file to a version-controlled repo. The ingestion pipeline and the real-time rules engine read the updated config. On the next deployment cycle, the engine starts querying the new feature. This self-serve flow removes the need for manual developer changes. The rules can reference that feature through an identifier in the rules UI.
How do you maintain data security with personally identifiable information?
The database is only reachable from specific internal networks, and any user querying it needs explicit credentials. Apply row-level or column-level filtering where required, especially for sensitive personal data. Hash or tokenize personal data before it enters the analytics pipeline. Keep strict audit logs of queries.
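A minimal sketch of keyed tokenization applied before ingestion is shown below; the secret handling and field names are illustrative, and the key would live in a secrets manager rather than in code.

```python
import hashlib
import hmac

# Keyed tokenization before events enter the analytics pipeline: the raw
# identifier never reaches the database, but the same input always maps to
# the same token, so per-entity aggregations still work.
TOKEN_KEY = b"replace-with-managed-secret"  # illustrative; load from a secrets manager

def tokenize(value: str) -> str:
    return hmac.new(TOKEN_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

event = {"shopper_email": "alice@example.com", "amount_cents": 4200}
event["shopper_email"] = tokenize(event["shopper_email"])
print(event)
```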
If a data pipeline lags, how do you mitigate its effects?
Configure rules to consider how recent the data might be. Decide if the system should block or allow transactions when data is stale. If ingestion is consistently delayed beyond a threshold, set alerts to investigate system health. The pipeline can also be designed to handle replays for partial data ingestion failures. If the system can accept slight delays, batch intervals can be increased to ease overhead.
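A small freshness guard along these lines might look like the sketch below, with an assumed five-minute staleness budget and a manual-review fallback as the policy choice.

```python
from datetime import datetime, timedelta, timezone

# Freshness guard: before trusting time-window features, compare the newest
# ingested event time against a staleness budget and decide whether to
# evaluate rules normally or fall back to a conservative action.
STALENESS_BUDGET = timedelta(minutes=5)  # illustrative threshold

def ingestion_lag(latest_event_time: datetime) -> timedelta:
    return datetime.now(timezone.utc) - latest_event_time

def decide_with_freshness(latest_event_time: datetime, normal_decision: str) -> str:
    if ingestion_lag(latest_event_time) > STALENESS_BUDGET:
        # Policy choice: route stale-data cases to manual review rather than
        # silently allowing or blocking the transaction.
        return "manual_review"
    return normal_decision

latest = datetime.now(timezone.utc) - timedelta(minutes=12)
print(decide_with_freshness(latest, "allow"))
```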
Why would you choose this design over a purely batch-based approach?
Real-time decisions block or allow transactions immediately. A purely batch-based design processes events in large time windows, potentially letting more fraudulent activities pass for longer. The suggested architecture processes data continuously, ensuring the platform can suspend or flag an account as soon as suspicious behavior is detected.
Are there additional optimizations or future improvements?
Extend the feature platform with a user-friendly UI for complex SQL queries and schema definitions. Automate pipeline scaling using container orchestration. Add caching layers for the most frequent queries to reduce load. Refine TTL settings based on updated business needs. Explore materialized views that pre-compute common aggregations.
How would you make sure new features do not break existing rules or performance?
Test new YAML definitions in a staging environment. Confirm queries can return correct data within acceptable latencies. Validate syntax, run integration tests, and measure resource usage. Use canary deployments for new features, letting them run on small subsets of data before full rollout. Monitor system metrics in real time. If any issues are found, roll back to the previous stable version.
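One way to automate part of that staging check is sketched below; the latency budget, run count, and result-shape assertions are illustrative.

```python
import time

# Staging check for a new feature definition: run the compiled query several
# times and fail the rollout if latency or the result shape is off.
LATENCY_BUDGET_MS = 100  # illustrative budget

def validate_feature(run_query, runs: int = 20) -> None:
    """run_query: zero-arg callable that executes the feature query in staging."""
    for _ in range(runs):
        start = time.monotonic()
        rows = run_query()
        elapsed_ms = (time.monotonic() - start) * 1000
        assert elapsed_ms <= LATENCY_BUDGET_MS, f"feature query too slow: {elapsed_ms:.1f} ms"
        assert rows, "feature query returned no rows"
        assert all(value is not None for value in rows[0]), "unexpected null feature value"
```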