ML Case-study Interview Question: Using LLMs for Resilient Automated Mobile Testing Across Evolving UIs and Languages.
Case-Study Question
You are a Senior Data Scientist at a global technology firm overseeing large-scale mobile applications. The firm operates in many regions and languages, and you must ensure stable end-to-end testing for the mobile apps. Existing manual testing is time-consuming, error-prone, and expensive. Traditional script-based automated testing is frequently disrupted by minor UI changes, new pop-ups, and different languages. The leadership wants a more robust, zero-maintenance approach that uses generative AI to handle these complexities. You are asked to design a system that accomplishes automated testing with minimal manual maintenance and high global scalability.
Describe how you would design this system using large language models to parse screen text and decide what actions to take. Mention how you would gather data, choose your model architecture, handle model outputs, and integrate with your CI systems. Show how you would address unexpected UI changes, geographical variations, and languages. Explain any weaknesses of large language models (such as hallucinations) and how you might apply guardrails. Propose how you would keep these automated tests stable despite app fluctuations, multiple OS versions, and device differences.
In-Depth Solution Approach
Motivating Idea
Large language models handle sequence prediction. A sequence of UI actions can be treated as a text-generation problem. Each new screen becomes context that the model uses to generate the next action. This approach eliminates the rigidity of traditional test scripts. The model adapts to UI changes in real time and handles multiple languages without extra overhead.
Model Training and Evaluation
A medium-sized transformer-based model is often enough for real-time usage. A popular choice is a model with around 100M parameters. A text embedding layer converts screen text into high-dimensional vectors representing semantic meaning. A retrieval-based framework steers the next action based on the closest matching scenario. Precision metrics evaluate retrieval quality.
N is the number of top-ranked actions considered from the model, and precision@N is the fraction of those N actions that are correct for the given screen. Higher precision@N means the model reliably retrieves the best actions.
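A minimal sketch of how precision@N could be computed offline, assuming a labeled evaluation set where each screen has a known set of correct actions (the helper name and data layout are illustrative):

def precision_at_n(ranked_actions, relevant_actions, n):
    # Fraction of the top-n ranked actions that belong to the relevant set
    top_n = ranked_actions[:n]
    hits = sum(1 for action in top_n if action in relevant_actions)
    return hits / n

# Example: the model ranks candidate actions for one screen
ranked = ["tap_confirm", "tap_back", "dismiss_popup"]
relevant = {"tap_confirm"}
print(precision_at_n(ranked, relevant, n=3))  # 1/3, roughly 0.33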
Handling Embeddings
Embeddings capture the meaning of screen text in a multi-dimensional space. A similarity measure, typically cosine similarity, is computed to find the most relevant action given the screen text: similarity(u, v) = (u · v) / (||u|| ||v||), where u and v are the embedding vectors of the screen text and a candidate action.
Addressing Adversarial and Unexpected Cases
Minor UI changes sometimes produce inconsistent responses. Training with adversarial screens helps the model learn robustness. If the model chooses a wrong path, a fallback mechanism rechecks the emulator’s ground truth to correct or reject invalid actions.
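A minimal sketch of that fallback check, assuming the emulator can report which elements are actually visible on screen (the function and field names are illustrative):

def validate_action(proposed_action, visible_elements):
    # Reject actions that reference elements absent from the emulator's current view hierarchy
    target = proposed_action.get("target_text")
    return target in visible_elements

proposed = {"type": "tap", "target_text": "Confirm pickup"}
visible = {"Confirm pickup", "Cancel", "Edit location"}
if not validate_action(proposed, visible):
    print("Fallback: re-query the model with the corrected screen state")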
Guardrails Against Hallucinations
A smaller or medium-sized language model tends to be less erratic. When the model outputs an invalid action, the CI pipeline flags it and prompts the model to adjust. If it fails again, the system retries with a different candidate action or restarts the test path entirely.
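One way to implement this guardrail is to walk down the model's ranked candidates and restart the test path when none survives validation; the sketch below uses stubbed inputs and is only illustrative:

def choose_guarded_action(ranked_actions, visible_elements, restart_test, max_attempts=3):
    # Try the model's top-ranked actions in order, skipping any the emulator cannot execute
    for action in ranked_actions[:max_attempts]:
        if action.get("target_text") in visible_elements:
            return action
    # No valid candidate within budget: restart the test path instead of acting on a hallucination
    restart_test()
    return None

# Example usage with stubbed inputs
ranked = [{"type": "tap", "target_text": "Pay now"},
          {"type": "tap", "target_text": "Confirm"}]
action = choose_guarded_action(ranked, {"Confirm", "Cancel"}, restart_test=lambda: print("restarting"))
print(action)  # {'type': 'tap', 'target_text': 'Confirm'}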
RAG-Style Expansion
A retrieval augmented generation (RAG) approach can store known goal templates. The model retrieves relevant templates when encountering new flows. This shortens the domain adaptation process to new features. The approach: store goals (like “complete a ride”), retrieve them at runtime, then generate the next step.
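A minimal sketch of that retrieval step, assuming goal templates are stored alongside precomputed embeddings and matched with cosine similarity (the store layout is illustrative, and random vectors stand in for real embeddings to keep the sketch self-contained):

import torch
import torch.nn.functional as F

# Hypothetical template store: each known goal keeps an embedding and a step outline
template_store = [
    {"goal": "complete a ride", "embedding": torch.randn(768), "steps": ["set pickup", "confirm", "pay"]},
    {"goal": "order food delivery", "embedding": torch.randn(768), "steps": ["pick restaurant", "add item", "checkout"]},
]

def retrieve_template(goal_embedding, store):
    # Return the stored goal template whose embedding is closest to the current goal
    scores = [F.cosine_similarity(goal_embedding, t["embedding"], dim=0).item() for t in store]
    best = max(range(len(store)), key=lambda i: scores[i])
    return store[best]

# Real embeddings would come from the same encoder used for screen text;
# random vectors keep the sketch self-contained.
template = retrieve_template(torch.randn(768), template_store)
print(template["goal"], template["steps"])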
Example Code Snippet
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("pretrained-model")
model = AutoModel.from_pretrained("pretrained-model")
def get_embedding(text):
    # Tokenize the screen text and run it through the encoder without tracking gradients
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    hidden_states = outputs.last_hidden_state
    # Basic pooling strategy: mean over the token dimension
    embedding = torch.mean(hidden_states, dim=1).squeeze()
    return embedding
This snippet converts screen text into an embedding. A separate function compares the embedding to candidate action embeddings and chooses the best match.
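That comparison function could look like the following sketch, which scores each candidate action against the screen embedding with cosine similarity and returns the best match (the candidate strings and the confidence threshold are illustrative):

import torch.nn.functional as F

def choose_action(screen_text, candidate_actions, min_score=0.5):
    # Pick the candidate action whose embedding is closest to the current screen text
    screen_emb = get_embedding(screen_text)
    scored = []
    for action in candidate_actions:
        score = F.cosine_similarity(screen_emb, get_embedding(action), dim=0).item()
        scored.append((score, action))
    best_score, best_action = max(scored)
    # Fall back to a review path when no candidate is a confident match
    return best_action if best_score >= min_score else None

action = choose_action("Confirm your pickup location", ["tap Confirm", "tap Cancel", "dismiss popup"])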
CI Integration
The system must run with each code change. When triggered, the model receives the latest UI state, picks an action, and repeats until the final objective is met or the sequence fails. A logging module collects details to help distinguish real failures from transient or language-specific issues.
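A simplified view of that loop, with emulator access, action selection, and execution passed in as callables so the sketch stays self-contained (the completion check is deliberately naive):

def run_test(goal, read_screen, decide_action, execute, max_steps=50):
    # Drive the app toward a goal, one model-chosen action at a time
    history = []
    for step in range(max_steps):
        screen_text = read_screen()            # latest UI state from the emulator
        if goal in screen_text:                # naive completion check for this sketch
            return {"status": "passed", "steps": history}
        action = decide_action(screen_text, history)
        if action is None:
            return {"status": "failed", "steps": history, "reason": "no valid action"}
        execute(action)
        history.append({"step": step, "screen": screen_text, "action": action})
    return {"status": "failed", "steps": history, "reason": "step budget exhausted"}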
Maintenance and Results
No test script rewrites are needed for minor UI differences. Time-consuming tasks like rewriting button locators are eliminated. A single test plan covers multiple languages, cities, and OS versions. If a feature or pop-up changes, the model still completes its objective, much like a human tester.
Unexpected GPS and Network Issues
For ride or delivery flows, real backend matching can fail or cause delays. Automatic retries and location checks let the model keep pressing the correct button or even restart the app if it hits a blocking state. The system remains stable, and transient bugs do not escalate into false alerts.
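A hedged sketch of that retry behavior, with exponential backoff and an app restart when the blocking state never clears (the tap, is_blocked, and restart_app callables are assumed to be provided by the test harness):

import time

def retry_until_unblocked(tap, is_blocked, restart_app, max_retries=5):
    # Keep retrying the expected action; restart the app if the blocking state never clears
    delay = 1.0
    for attempt in range(max_retries):
        tap()                     # e.g. press the "Confirm ride" button again
        if not is_blocked():      # backend matching or GPS fix finally succeeded
            return True
        time.sleep(delay)         # give the backend time to recover
        delay *= 2                # exponential backoff between retries
    restart_app()                 # last resort: restart and resume from a clean state
    return False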
Why This Approach Works
A language-based action generator, paired with robust fallback logic, emulates human intuition at scale. It drastically reduces maintenance and extends test coverage across many locales. Traditional tests break when minor UI changes occur or new prompts appear. A goal-driven AI system interprets these new prompts and chooses the next best move.
Potential Follow-Up Questions
How would you ensure your test architecture handles real-time pop-ups and partial failures?
Implement an event loop that continuously feeds the latest screenshot text to the model. Parse the screen elements, compare them with prior states, and if a new pop-up is detected, retrieve or generate an action to resolve or dismiss it. Keep a short memory buffer so the model knows what was tried before. If the partial failure is repeated, trigger a fallback path that closes and reopens the app or logs the error.
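A sketch of that detection step, keeping a short memory of recently seen element sets and flagging anything that was not present before (all names are illustrative):

from collections import deque

recent_screens = deque(maxlen=5)  # short memory buffer of recently seen element sets

def detect_new_popup(current_elements):
    # Return elements that did not appear on any recent screen, e.g. a surprise pop-up
    previously_seen = set().union(*recent_screens)
    new_elements = set(current_elements) - previously_seen
    recent_screens.append(set(current_elements))
    return new_elements

print(detect_new_popup({"Confirm", "Cancel"}))                      # first screen: everything is new
print(detect_new_popup({"Confirm", "Cancel", "Rate us 5 stars"}))   # the pop-up text surfaces as new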
Why not just rely on a massive model for better accuracy?
A massive model often has greater latency and resource costs. A smaller or medium-sized model runs faster in CI. The difference in accuracy can be negligible if carefully tuned with your domain data. Maintenance of a huge model can be more expensive. A well-tailored medium model meets real-time CI demands.
How do you handle multi-language flows that include complex local strings or scripts?
Pre-train the model on text from multiple locales. Leverage subword tokenization so scripts such as Chinese and Arabic, as well as languages with diacritics, are handled consistently. Data augmentation adds localized strings for greater coverage. The model's embedding space remains consistent across languages, so standard retrieval and generation logic applies.
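A small illustration of how subword tokenization covers multiple scripts, using a multilingual tokenizer from the transformers library (the checkpoint name is an example; any multilingual subword tokenizer behaves similarly):

from transformers import AutoTokenizer

# Example multilingual checkpoint; any subword-based multilingual tokenizer works similarly
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

for text in ["Confirm pickup", "确认上车地点", "تأكيد نقطة الالتقاء", "Confirmación de recogida"]:
    # Each string maps to subword tokens in one shared vocabulary, so a single embedding space covers all locales
    print(text, "->", tokenizer.tokenize(text))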
How would you deal with infinite loops or repeated actions in your pipeline?
Track recent actions with timestamps. If the same action triggers repeatedly with no progress, the system backtracks. Keep a history of recently seen screens in memory, and if a screen reappears too many times, skip that path. This prevents indefinite cycling on one flow.
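A minimal sketch of that loop guard, counting how often each screen fingerprint reappears and skipping the path once it recurs too often:

from collections import Counter

screen_counts = Counter()

def should_skip(screen_text, max_repeats=3):
    # Flag a path when the same screen keeps reappearing without progress
    state_key = hash(screen_text)   # cheap fingerprint of the screen contents
    screen_counts[state_key] += 1
    return screen_counts[state_key] > max_repeats

# After three visits to the same confirmation screen, the fourth triggers a skip
for _ in range(4):
    stuck = should_skip("Confirm your pickup location")
print(stuck)  # True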
How do you measure success beyond precision@N?
Monitor success rate of each test goal. Track how many times the model times out or loops. Observe code coverage metrics. Evaluate the number of real defects caught before release. Measure engineering time saved. These indicators show if the system truly reduces overhead and catches real bugs.
How do you handle model retraining when the UI changes drastically over time?
Maintain a pipeline that logs new UI screens encountered, plus the actions chosen. Periodically fine-tune the model with these new patterns. Add validation checks to confirm the model’s actions align with developer intent. This keeps the model up-to-date with evolving UI designs.
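One lightweight way to collect that retraining data is to append every screen/action/outcome record to a JSONL file that later feeds fine-tuning (the file path and record fields are illustrative):

import json
import time

def log_interaction(screen_text, chosen_action, outcome, path="ui_interactions.jsonl"):
    # Append one screen/action/outcome record for later fine-tuning and review
    record = {
        "timestamp": time.time(),
        "screen_text": screen_text,
        "action": chosen_action,
        "outcome": outcome,   # e.g. "success", "rejected_by_validator", "timeout"
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

log_interaction("New splash screen with redesigned onboarding", "tap Get started", "success")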