ML Case-study Interview Question: LLM System for Generating and Evaluating Suspicious Financial Transaction Reports
Case-Study question
A large financial technology institution faces regulatory requirements to report suspicious transactions. Their compliance officers confirm suspicious behavior through existing machine learning models and manual checks. They then prepare a detailed free-text report (Reason for Suspicion) for submission to regulatory authorities. The company wants to speed up writing these narratives using an LLM-based text-generation system. They also want an effective, scalable way to evaluate narrative quality without relying entirely on manual review. How would you architect an end-to-end solution that:
Integrates human review with an LLM-based text generator
Uses an LLM-based evaluator to benchmark and score generated narratives
Mitigates false positives and potential biases in both generation and evaluation
Ensures consistent, compliant, and accurate suspicious transaction reports
Write a plan detailing your approach, including any specific techniques for text-generation, prompt engineering, model evaluation, and bias mitigation strategies.
Detailed solution
A robust setup includes data pipelines, model selection, a prompt strategy, and both automated and manual checks.
Building the narrative generator starts by gathering relevant case data from existing analytics systems. Compliance officers and investigators confirm suspicious activity, so their findings feed into the LLM prompt. The LLM uses structured details: merchant profile, transaction volumes, date and time patterns, and typical red-flag behaviors (excessive chargebacks, unusual hours, mismatch with expected business). The text generator must summarize these points in a coherent free-text narrative.
Context Injection: Embed merchant info, suspicious transaction flags, and supporting evidence within the system prompt. Include instructions on how to structure the final text: introduction of the merchant, description of suspicious patterns, references to transaction attempts, and a concluding suspicion statement.
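A minimal sketch of this context-injection step, assuming the confirmed case arrives as a plain dictionary from the upstream analytics systems (the field names and the `build_generation_prompt` helper are illustrative):

```python
def build_generation_prompt(case: dict) -> str:
    """Assemble a generation prompt that injects verified case facts and a fixed
    narrative structure for the Reason for Suspicion text."""
    red_flags = "\n".join(f"- {flag}" for flag in case["red_flags"])
    return (
        "You are drafting a Reason for Suspicion narrative for a regulatory report.\n"
        "Use ONLY the verified facts below; do not add any fact that is not listed.\n\n"
        f"Merchant profile: {case['merchant_name']}, {case['business_type']}\n"
        f"Transaction volume: {case['transaction_volume']}\n"
        f"Date/time pattern: {case['time_pattern']}\n"
        f"Red flags:\n{red_flags}\n\n"
        "Structure the narrative as:\n"
        "1. Introduction of the merchant.\n"
        "2. Description of the suspicious patterns.\n"
        "3. References to the transaction attempts and supporting evidence.\n"
        "4. A concluding statement of suspicion.\n"
    )

# Example usage with a hypothetical confirmed case
prompt = build_generation_prompt({
    "merchant_name": "Example Hair Salon",
    "business_type": "beauty services",
    "transaction_volume": "214 card-not-present attempts in 7 days",
    "time_pattern": "clusters between 02:00 and 04:00 local time",
    "red_flags": ["excessive chargebacks",
                  "activity inconsistent with the stated business profile"],
})
```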
Human-in-the-loop: Compliance officers must verify the AI-generated text before submission. The system flags potentially high-risk outputs for more thorough review. This step avoids fully automated suspicious activity reporting.
LLM-based evaluation: Use a custom LLM-based evaluator that compares the AI-generated narrative to a reference or known-good text. The evaluator checks:
Coverage of main topics (fraud reports, transaction attempts, evidence)
Profile details (merchant name, business type)
Supporting facts (refunds, unusual times, amounts)
No invented data (no mention of unverified items)
Conclusion (closing statement that indicates a final decision)
Prompt the evaluator to produce a numeric score for each check and an overall score.
general_score = (sum_{i=1 to n}(check_i)) / n
Here, n is the number of checks (for instance, coverage, profile, facts, no fabrications, structure, conclusion). Each check_i is the evaluator’s score on a 0 to 5 scale. The sum of these partial scores, divided by n, defines the general_score.
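A direct translation of this formula, assuming the evaluator returns one 0-5 score per named check (the check names below are illustrative):

```python
def general_score(checks: dict[str, int]) -> float:
    """Average the per-check scores (each on a 0-5 scale) into one overall score."""
    if not checks:
        raise ValueError("at least one check is required")
    for name, score in checks.items():
        if not 0 <= score <= 5:
            raise ValueError(f"check '{name}' is outside the 0-5 range")
    return sum(checks.values()) / len(checks)

# Example: six checks as described above
print(general_score({
    "coverage": 4, "profile": 5, "facts": 4,
    "no_fabrications": 5, "structure": 4, "conclusion": 5,
}))  # -> 4.5
```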
Explain to the evaluator what a score of 0 or 5 means through few-shot examples of good and poor narratives. This calibration ensures consistency in scoring.
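One way to encode that calibration, assuming a chat-style evaluator prompt; the rubric wording and the two abbreviated few-shot examples are illustrative placeholders, not production prompts:

```python
EVALUATOR_SYSTEM_PROMPT = (
    "You evaluate a candidate Reason for Suspicion narrative against a reference text "
    "and the verified case facts. Score each check from 0 to 5, where 0 means the check "
    "is completely failed (missing, wrong, or fabricated content) and 5 means it is "
    "fully satisfied with accurate, verifiable detail.\n"
    "Checks: coverage, profile, facts, no_fabrications, structure, conclusion.\n"
    "Judge correctness, not length. Return one line per check as `check_name: score`."
)

# Few-shot calibration pairs (abbreviated narrative, expected scoring) so the
# evaluator sees what a 5 and what a near-0 assessment look like.
FEW_SHOT_EXAMPLES = [
    ("Merchant X, a registered florist, processed 300 card-not-present attempts "
     "between 02:00 and 04:00 with a 70% failure rate and 40 chargebacks ... "
     "[strong narrative]",
     "coverage: 5\nprofile: 5\nfacts: 5\nno_fabrications: 5\nstructure: 5\nconclusion: 5"),
    ("Merchant X did some unusual transactions and we think it is suspicious. "
     "[weak narrative]",
     "coverage: 1\nprofile: 1\nfacts: 0\nno_fabrications: 3\nstructure: 1\nconclusion: 0"),
]
```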
Combining manual and LLM-driven evaluation: Humans still sample and review generated outputs. They compare their feedback with the LLM-driven evaluator's results. If the AI's average score aligns with agent feedback, the pipeline is validated. If discrepancies appear, refine the prompt or retrain the generator. Evaluate the text's clarity and compliance with AML guidelines and ensure the suspicious activity is unambiguously described.
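A small sketch of how that agreement could be quantified, assuming paired scores are collected on a review sample (the 0.7 threshold is an illustrative choice, and `statistics.correlation` requires Python 3.10+):

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

def evaluator_is_validated(llm_scores: list[float],
                           officer_scores: list[float],
                           min_corr: float = 0.7) -> bool:
    """Treat the LLM evaluator as validated when its scores track compliance-officer
    scores closely enough on the same sampled narratives."""
    return correlation(llm_scores, officer_scores) >= min_corr
```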
Addressing biases:
Position bias: Randomize which text is labeled "reference" or "candidate" in the evaluator's prompt (a sketch follows this list).
Verbose bias: Explicitly instruct the evaluator to favor correctness over length.
Self-enhancement bias: Mix in human-authored references for comparison so the evaluator does not always prefer AI-written samples.
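A minimal sketch of the position-bias mitigation, assuming the evaluator receives both texts in one prompt; the presentation order is randomized so neither text systematically occupies the first slot (the function and label names are illustrative):

```python
import random

def build_eval_prompt(candidate: str, reference: str) -> str:
    """Randomize which narrative appears first in the evaluator prompt so the
    evaluator cannot systematically favor the first (or last) position."""
    blocks = [("CANDIDATE", candidate), ("REFERENCE", reference)]
    if random.random() < 0.5:
        blocks.reverse()
    body = "\n\n".join(f"{label} NARRATIVE:\n{text}" for label, text in blocks)
    return (body + "\n\nScore the CANDIDATE narrative against the REFERENCE "
            "on each check from 0 to 5.")
```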
Practical example: The LLM receives input describing a suspicious pattern for a hair salon merchant operating at 3 a.m. with high failure rates. The LLM generates a short, structured narrative. The evaluator then checks whether the text includes the 3 a.m. transaction detail, the number of chargebacks, and the ultimate conclusion. A final numeric score plus textual feedback helps refine or approve the narrative.
Maintenance: Regularly retrain the LLM on newly flagged suspicious cases so it adapts to novel fraud typologies. Ensure regulatory updates are folded into the prompt or appended instructions so the final text meets new compliance rules.
What if the LLM fails to mention a critical detail in the final narrative?
The narrative fails the topic-coverage or supporting-facts check, and the evaluator's score for that criterion drops. A sub-threshold overall score prompts the compliance officer to re-check or regenerate the narrative. Adjust the prompt to emphasize crucial details, or pre-fill the narrative structure with essential placeholders so the model cannot skip them.
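One way to enforce that, assuming the generator fills a fixed template instead of writing fully free-form text (the placeholder names are illustrative):

```python
import string

# The generator is asked to fill this fixed structure instead of free-form text,
# so essential details cannot be silently dropped.
NARRATIVE_TEMPLATE = (
    "1. Merchant: {merchant_name}, operating as {business_type}.\n"
    "2. Suspicious pattern: {pattern_summary}.\n"
    "3. Evidence: {transaction_attempts} attempts, {chargeback_count} chargebacks, "
    "activity at {unusual_times}.\n"
    "4. Conclusion: {conclusion_statement}"
)

# Derive the required placeholders directly from the template.
REQUIRED_FIELDS = [name for _, name, _, _
                   in string.Formatter().parse(NARRATIVE_TEMPLATE) if name]

def missing_fields(filled: dict) -> list[str]:
    """Return any essential placeholder the model left empty, so the case can be
    regenerated instead of reaching a compliance officer incomplete."""
    return [f for f in REQUIRED_FIELDS if not str(filled.get(f, "")).strip()]
```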
How do you mitigate invented facts when the LLM occasionally hallucinates?
Present the LLM with strictly verified data only. In the prompt, specify that any unsupported claim is disallowed. The evaluator applies a "no invented facts" criterion by comparing the generation against the known data. If the LLM introduces unverified references (for example, a non-existent police report), the evaluator flags it and scores that section low. Retrain the model with consistent instructions, and use chain-of-thought prompting that references only known data.
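A simplified sketch of that comparison, assuming claims have already been extracted from the generated narrative and the verified case facts are available as a structured set (the extraction step itself is omitted; all names are illustrative):

```python
def unsupported_claims(extracted_claims: list[str], verified_facts: set[str]) -> list[str]:
    """Flag any claim extracted from the generated narrative that does not appear
    in the verified case record; the 'no invented facts' check scores low
    whenever this list is non-empty."""
    return [claim for claim in extracted_claims if claim not in verified_facts]

# Example: a hallucinated police report reference is caught
print(unsupported_claims(
    extracted_claims=["42 chargebacks", "police report #1234"],
    verified_facts={"42 chargebacks", "activity between 02:00 and 04:00"},
))  # -> ['police report #1234']
```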
How do you handle the numeric scoring for large-scale evaluation across multiple teams?
Define a consistent scoring range (0–5) for each check. Provide few-shot examples of good, average, and poor text. Store these examples in the prompt. For standardization across teams, unify the definitions for each numeric score. Use the formula for general_score as the final metric. Summaries of these scores get tracked in a shared dashboard. Position swapping in the evaluator’s prompt ensures fairness.
How do you confirm that the approach is trustworthy from a compliance standpoint?
Require final sign-off by a compliance officer. Provide a history of the LLM’s scores, manual agent notes, and any references used in the generation. Confirm that sensitive data is properly handled. Document each step of the pipeline to satisfy regulatory audits. Evaluate historical suspicious activity reports with the pipeline and compare them with actual regulatory outcomes. If consistent with historical best practices, the pipeline is considered reliable.
How would you ensure the system remains secure and protects sensitive financial data?
Use robust encryption for data at rest and in transit. Restrict LLM access to only de-identified or minimal PII. For reference-based evaluation, store original transcripts in a secure environment. Log only essential LLM interactions. Mask account identifiers. Ensure each step abides by internal data governance. Periodically run penetration tests on the pipeline endpoints and maintain secure token-based access for the LLM API.
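A minimal sketch of the identifier-masking step, assuming card-like numbers should be reduced to their last four digits before any text reaches the LLM (the regex covers common 13-19 digit formats and is illustrative, not exhaustive):

```python
import re

# Matches 13-19 digit sequences, optionally separated by spaces or dashes.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")

def mask_account_identifiers(text: str) -> str:
    """Replace account/card numbers with a masked form keeping only the last four digits."""
    def _mask(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return "****" + digits[-4:]
    return CARD_PATTERN.sub(_mask, text)

print(mask_account_identifiers("Refund issued to card 4111 1111 1111 1111 at 03:12."))
# -> Refund issued to card ****1111 at 03:12.
```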
How do you scale this solution to handle thousands of suspicious cases per day?
Automate everything except the final approval. Pipeline processes run on distributed systems. A message queue sends suspicious cases to the LLM generator. The evaluator runs concurrently, storing results in a scalable datastore. Compliance officers see only the flagged ones or random samples for QA. The system expands horizontally, adding more workers for LLM generation and evaluation. Monitoring tools track latency and throughput.
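A simplified worker sketch, assuming a generic message queue delivers JSON-encoded cases and the generation, evaluation, and storage calls are injected as callables (all names and the 3.5 threshold are hypothetical; a production version would plug into the organization's own queueing and datastore infrastructure):

```python
import json
from typing import Callable

SCORE_THRESHOLD = 3.5  # illustrative cut-off below which a case is routed to manual review

def process_case(message: bytes,
                 generate: Callable[[dict], str],
                 evaluate: Callable[[str, dict], dict],
                 store: Callable[..., None]) -> None:
    """Handle one queued suspicious case: generate the narrative, score it with the
    LLM evaluator, persist the result, and flag low scores for compliance-officer QA.
    Many identical workers consume the same queue; scaling out means adding workers."""
    case = json.loads(message)
    narrative = generate(case)
    scores = evaluate(narrative, case)
    store(case_id=case["case_id"],
          narrative=narrative,
          scores=scores,
          needs_review=scores["general_score"] < SCORE_THRESHOLD)
```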
How do you handle incorrect evaluations caused by drift in the reference text or changes in AML rules?
Update the reference texts and the evaluator’s prompt with the latest compliance guidance. Fine-tune the LLM if the domain evolves significantly. Maintain a versioned library of reference samples. Periodically re-check old narratives with new rules to catch any drift. If changes are wide-ranging, incorporate new few-shot examples reflecting updated suspicious behavior.
How do you verify that the overall approach truly streamlines compliance officers’ workloads?
Benchmark the time a compliance officer spends drafting the Reason for Suspicion before and after adopting the LLM. Track how frequently officers override the LLM’s output or correct mistakes. Conduct satisfaction surveys with the compliance team. Analyze false positives or negatives in final suspicious reports. If manual tasks drop and officer satisfaction improves without an uptick in missed suspicious cases, the solution is meeting its goal.