ML Case-study Interview Question: LLMs and Personalization Models for Generating Adaptive Language Lessons
Case-Study question
A global education platform wants to speed up the creation of its language lessons using Large Language Models (LLMs). The teaching team wants to plan lessons on specific grammar or vocabulary, then automatically generate exercises in multiple languages. They also want to personalize lesson difficulty for each user. Propose a system design and strategy for implementing LLM-based lesson generation, including how to integrate a proprietary recommendation model that adjusts exercises based on user strengths and weaknesses. Explain your approach in detail, focusing on prompt-engineering, model selection, quality assurance, and practical deployment considerations.
Detailed solution
Overview
The aim is to build an automated pipeline for generating language exercises. The system combines Large Language Models for text generation with a personalization model (similar to the Birdbrain system mentioned in the background) that adjusts difficulty. The pipeline must preserve educational quality, reduce creation time, and handle multiple courses.
System Flow
The teaching experts plan lesson objectives (themes, grammar focus, vocabulary level). They feed these objectives into the LLM through carefully crafted prompts. The LLM outputs a set of draft exercises. The teaching experts edit or refine these drafts, then store them in a content database. The personalization model dynamically selects which exercise to serve to each user, based on that user’s performance data.
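A minimal sketch of this flow, assuming hypothetical data structures (LessonPlan, Exercise) and placeholder callables for the LLM, the expert review step, and the personalization model:

from dataclasses import dataclass

@dataclass
class LessonPlan:
    theme: str
    grammar_focus: str
    level: str        # CEFR level, e.g. "A2"
    language: str

@dataclass
class Exercise:
    text: str
    approved: bool = False

def build_content(plan, llm_generate, expert_review, content_db):
    # 1) Experts define the plan, 2) the LLM drafts exercises,
    # 3) experts edit/approve, 4) approved items land in the content database.
    drafts = llm_generate(plan)
    reviewed = [expert_review(d) for d in drafts]
    content_db.extend(e for e in reviewed if e.approved)

def serve_exercise(user_id, content_db, personalization_model):
    # At request time, the personalization model picks the next item
    # from the approved pool based on the user's past performance.
    return personalization_model.select(user_id, content_db)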
Prompt Engineering
The team prepares a prompt template with fixed rules (exercise length, format, level, grammar focus) plus variable inputs. The prompt enforces length limits, part-of-speech requirements, and target language. The LLM uses these instructions to produce multiple candidate exercises.
Core LLM Principle
Below is a common formula for the next-word probability inside an LLM, where s(...) is a scoring function:

P(w_{t} | w_{1}, ..., w_{t-1}) = exp(s(w_{1}, ..., w_{t-1}, w_{t})) / Σ_{w'} exp(s(w_{1}, ..., w_{t-1}, w'))

Here, w_{t} is the next token, w_{1} through w_{t-1} are previous tokens, and s(...) measures compatibility between tokens. The model computes exponential scores for each possible next token and normalizes them with a softmax.
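A short numeric illustration of that softmax step, using made-up compatibility scores for three candidate tokens:

import math

def next_token_probabilities(scores):
    # scores: {token: s(w_1, ..., w_{t-1}, token)} for each candidate next token.
    exp_scores = {tok: math.exp(s) for tok, s in scores.items()}
    total = sum(exp_scores.values())
    return {tok: v / total for tok, v in exp_scores.items()}

# Example with invented scores for the context "I drink ...":
probs = next_token_probabilities({"water": 2.1, "bread": 0.3, "quickly": -1.0})
# probs ≈ {"water": 0.83, "bread": 0.14, "quickly": 0.04}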
Human-in-the-Loop Editing
LLMs sometimes produce grammatically correct but unnatural sentences. The teaching experts apply final editorial control. They tweak phrases to fit intended difficulty and style. They ensure the final content matches the educational guidelines.
Personalization Model
A system similar to Birdbrain uses learner performance data to choose the next exercise. It tracks past successes and errors. It picks items that match each user’s zone of proximal development (not too easy, not too hard).
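A minimal sketch of that selection logic, assuming a simple logistic model of success probability based on (user ability − item difficulty); the real Birdbrain-style system is more sophisticated, but the targeting idea is the same:

import math

def predicted_success(user_ability, item_difficulty):
    # Logistic model: the larger the ability-vs-difficulty gap, the higher the
    # predicted chance of answering correctly.
    return 1.0 / (1.0 + math.exp(-(user_ability - item_difficulty)))

def pick_next_exercise(user_ability, exercises, target=0.8):
    # Zone of proximal development: pick the item whose predicted success
    # probability is closest to the target (not too easy, not too hard).
    return min(
        exercises,
        key=lambda ex: abs(predicted_success(user_ability, ex["difficulty"]) - target),
    )

# Example: candidate items tagged with estimated difficulties.
pool = [{"id": 1, "difficulty": -1.0}, {"id": 2, "difficulty": 0.4}, {"id": 3, "difficulty": 2.0}]
print(pick_next_exercise(user_ability=0.9, exercises=pool))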
Example Prompt and Code
Here is a simple Python snippet showing how you could structure a prompt generation call. The explanation follows the code:
import openai

def generate_exercises(theme, language, level, grammar_focus):
    # Fixed rules (format, length, grammar, CEFR level) plus variable inputs.
    base_prompt = f"""
You are a language-teaching assistant.
Write 5 short exercises using the theme '{theme}' in {language}.
Each exercise must follow these rules:
1) Contain exactly 2 answer options.
2) Stay under 75 characters total.
3) Use grammar: {grammar_focus}.
4) Reflect {level} difficulty (Common European Framework of Reference for Languages).
Output only the exercises.
"""
    # Legacy Completions endpoint (openai-python < 1.0); newer client versions
    # would use the chat completions API instead.
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=base_prompt,
        max_tokens=200,
        temperature=0.5,
    )
    return response.choices[0].text.strip()
This code assembles a prompt containing the specified rules, and the temperature parameter controls the randomness of the output. The team inspects and refines the outputs; if the model occasionally breaks a rule, the prompt is adjusted or post-processing filters are applied.
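A sketch of such a post-processing filter, assuming the prompt's own rules (exactly 2 answer options, under 75 characters) as the validation criteria and a '/' separator between answer options:

def validate_exercise(exercise_text, max_chars=75, required_options=2):
    """Return a list of rule violations for one generated exercise (empty list = pass)."""
    violations = []
    if len(exercise_text) > max_chars:
        violations.append(f"over {max_chars} characters")
    # Assumes options are separated by '/' in the generated text, e.g. "bebo / bebe".
    option_count = exercise_text.count("/") + 1 if "/" in exercise_text else 1
    if option_count != required_options:
        violations.append(f"expected {required_options} options, found {option_count}")
    return violations

def filter_batch(raw_output):
    # Split the LLM output into lines and keep only rule-compliant exercises;
    # rejected items go back for prompt adjustment or regeneration.
    kept, rejected = [], []
    for line in raw_output.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        (kept if not validate_exercise(line) else rejected).append(line)
    return kept, rejected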
Quality Checks
Generated exercises are tested for correctness. The editorial team runs internal checks for grammatical accuracy, alignment with Common European Framework of Reference for Languages (CEFR) levels, and thematic fit. Any disallowed content is screened out.
Deployment
Once exercises pass review, they are stored in a structured database. The personalization system serves them to users. The platform tracks performance and user feedback in real time, and it can fine-tune prompts or filters for continuous improvement.
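A sketch of the storage and serving step, assuming a simple SQLite table as the structured content store; the production system would use the platform's own database and personalization service:

import sqlite3

def init_store(path="exercises.db"):
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS exercises (
            id INTEGER PRIMARY KEY,
            language TEXT, level TEXT, grammar_focus TEXT,
            text TEXT, approved INTEGER DEFAULT 0
        )
    """)
    return conn

def publish(conn, language, level, grammar_focus, text):
    # Only reviewed exercises are inserted as approved and become servable.
    conn.execute(
        "INSERT INTO exercises (language, level, grammar_focus, text, approved) VALUES (?, ?, ?, ?, 1)",
        (language, level, grammar_focus, text),
    )
    conn.commit()

def candidates_for(conn, language, level):
    # The personalization model chooses among these candidates at request time.
    return conn.execute(
        "SELECT id, text FROM exercises WHERE approved = 1 AND language = ? AND level = ?",
        (language, level),
    ).fetchall()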
Follow-up question 1
How would you handle situations where the LLM produces grammatically correct text that still feels unnatural?
Answer: Guide the LLM with more precise prompts and extra examples that represent natural style. Include examples of desired output and undesirable output in the prompt. Apply a post-processing stage that allows teaching experts to rewrite awkward phrases. Use a language style guide with typical phrases or expressions. Fine-tune the model on in-domain examples if allowed by the LLM infrastructure. The final safety net is a human editorial pass that corrects unnatural expressions before publication.
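One way to encode that guidance is a few-shot section in the prompt that contrasts natural phrasing with technically correct but stilted phrasing; the examples below are illustrative, not taken from real course content:

STYLE_EXAMPLES = """
Good (natural): "Quiero un café, por favor."
Bad (correct but stilted): "Yo deseo obtener una taza de café."
Good (natural): "¿Dónde está la estación?"
Bad (correct but stilted): "¿En qué lugar se encuentra la estación ferroviaria?"
"""

def prompt_with_style_guide(base_prompt):
    # Prepend natural-vs-stilted examples so the model imitates the "Good" style.
    return (
        base_prompt
        + "\nFollow the style of the 'Good' examples and avoid the 'Bad' style:\n"
        + STYLE_EXAMPLES
    )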
Follow-up question 2
What steps ensure the model stays within educational guidelines for different proficiency levels?
Answer: Embed explicit CEFR references in the prompt. Specify approximate sentence structures and vocabulary suitable for each level (A1, A2, B1, etc.). Show example sentences with suitable difficulty. Use a curated vocabulary list that matches the target level. Validate outputs through automated lexical checks, ensuring the words belong to the correct difficulty tier. If the platform collects user data on performance, it can flag exercises that appear too challenging or too easy. This feedback loop refines future prompts and generation logic.
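A sketch of such an automated lexical check, assuming a curated word list per CEFR tier; the lists here are tiny placeholders:

# Placeholder vocabulary tiers; the real lists come from curated CEFR word lists.
VOCAB_BY_LEVEL = {
    "A1": {"el", "la", "casa", "agua", "comer", "beber", "yo", "tú"},
    "A2": {"viaje", "trabajo", "ayer", "mañana", "porque"},
    "B1": {"aunque", "desarrollar", "opinión", "medio", "ambiente"},
}

def words_above_level(exercise_text, target_level, order=("A1", "A2", "B1")):
    # Allow vocabulary from the target level and below; flag everything else
    # so an editor can confirm or replace the word.
    allowed = set()
    for level in order:
        allowed |= VOCAB_BY_LEVEL[level]
        if level == target_level:
            break
    tokens = [w.strip(".,;:!?¿¡\"'").lower() for w in exercise_text.split()]
    return [w for w in tokens if w and w not in allowed]

# Example: flags words missing from the A1/A2 placeholder lists.
print(words_above_level("Yo quiero desarrollar mi opinión", "A2"))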
Follow-up question 3
How can you prevent the LLM from generating offensive or irrelevant content?
Answer: Add guidelines in the prompt that forbid certain topics and enforce a safe style. Use a content filtering mechanism before finalizing exercises. Block or remove any response containing disallowed keywords or phrasing. Implement an intermediate approval step where the teaching experts check outputs for alignment with brand voice and educational values. If offensive content persists, refine the prompt or use an LLM that includes built-in moderation.
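A sketch of a simple pre-publication content filter; the blocked-term list is a placeholder, and in practice this sits alongside the LLM provider's moderation tooling and the expert approval step:

BLOCKED_TERMS = {"violence", "gambling"}   # placeholder list maintained by the teaching team

def passes_content_filter(exercise_text):
    # Reject any draft containing a blocked term; flagged items go back to review.
    lowered = exercise_text.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)

def screen_batch(drafts):
    approved = [d for d in drafts if passes_content_filter(d)]
    flagged = [d for d in drafts if not passes_content_filter(d)]
    return approved, flagged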
Follow-up question 4
How do you measure success and evaluate model performance in this lesson-generation pipeline?
Answer: Track user engagement metrics like completion rates and time spent on exercises. Study learning outcomes through quiz results and repeated measures (pre-test vs. post-test). Collect feedback on exercise clarity, difficulty, and enjoyment. Log how often human editors rewrite generated content to gauge LLM accuracy. Compare learning outcomes from these LLM-generated lessons with a control group that uses only manually written content. The difference in user learning progress signals whether LLM assistance leads to better or faster outcomes.
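A sketch of how those signals could be aggregated from event logs, assuming each record carries a lesson variant (LLM-assisted vs. manually written control), a completion flag, pre/post quiz scores, and an editor-rewrite flag:

from statistics import mean

def summarize(logs):
    # logs: list of dicts with keys: variant, completed, pre_score, post_score, editor_rewrote
    by_variant = {}
    for row in logs:
        by_variant.setdefault(row["variant"], []).append(row)
    report = {}
    for variant, rows in by_variant.items():
        report[variant] = {
            "completion_rate": mean(1.0 if r["completed"] else 0.0 for r in rows),
            "avg_learning_gain": mean(r["post_score"] - r["pre_score"] for r in rows),
            "editor_rewrite_rate": mean(1.0 if r["editor_rewrote"] else 0.0 for r in rows),
        }
    return report

# Comparing report["llm_assisted"] with report["control"] indicates whether
# LLM assistance leads to better or faster learning outcomes.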