SteLLA uses a question-answering approach, enhanced by retrieval-augmented generation, to grade short-answer student responses by checking how well a student's response answers a set of evaluation questions.
https://arxiv.org/abs/2501.09092
Original Problem:
→ Manual grading of open-ended questions is time-consuming, especially in large classes and online courses, which discourages their use.
→ Existing automatic grading systems cannot provide detailed feedback on specific knowledge points.
Solution in this Paper:
→ SteLLA (Structured Grading System Using LLMs with RAG) uses reference answer and rubric-based Retrieval Augmented Generation (R-RAG).
→ R-RAG extracts structured information from the reference answer and rubric by generating evaluation question-answer pairs.
→ An LLM grades each student response by how well it answers these evaluation questions.
→ SteLLA returns both an overall grade and per-question breakdown grades with feedback (see the sketch below).
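The pipeline has two stages: generate evaluation question-answer pairs from the instructor's reference answer and rubric, then grade the student response against each question and aggregate. Below is a minimal sketch of that flow, assuming an OpenAI-style chat client; the prompts, model name, and helper functions are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of an R-RAG style grading pipeline (illustrative, not the paper's code).
import json
from openai import OpenAI  # assumes the official openai Python client (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4"    # the paper evaluates GPT-4; any chat model could be substituted


def ask(prompt: str) -> str:
    """Single-turn chat completion helper."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content


def extract_qa_pairs(reference_answer: str, rubric: str) -> list[dict]:
    """Stage 1: turn the reference answer + rubric into evaluation QA pairs."""
    prompt = (
        "From the reference answer and rubric below, produce a JSON list of "
        "evaluation questions, each with its expected answer and a point value.\n"
        f"Reference answer:\n{reference_answer}\n\nRubric:\n{rubric}\n"
        'Expected format: [{"question": ..., "expected_answer": ..., "points": ...}]'
    )
    # Assumes the model returns valid JSON; real code would validate/retry.
    return json.loads(ask(prompt))


def grade_response(student_response: str, qa_pairs: list[dict]) -> dict:
    """Stage 2: grade the student response against each evaluation question."""
    breakdown = []
    for qa in qa_pairs:
        prompt = (
            f"Evaluation question: {qa['question']}\n"
            f"Expected answer: {qa['expected_answer']}\n"
            f"Student response:\n{student_response}\n"
            f"Award 0 to {qa['points']} points for how well the response answers "
            "the question, and give a one-sentence justification.\n"
            'Reply as JSON: {"points": ..., "justification": ...}'
        )
        result = json.loads(ask(prompt))
        breakdown.append({**qa, **result})
    overall = sum(item["points"] for item in breakdown)
    return {"overall": overall, "breakdown": breakdown}
```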
Key Insights from this Paper:
→ QA-based structured grading promotes semantic understanding, going beyond surface text-similarity comparison.
→ R-RAG leverages instructor-provided resources as a highly relevant knowledge base, simplifying retrieval.
→ GPT-4 is proficient at capturing facts but can over-infer in grading tasks.
Results:
→ SteLLA achieves substantial agreement with human graders (Cohen's Kappa = 0.6720).
→ Its raw agreement with human grades is about 8% lower than the raw agreement between the human graders themselves (0.8358).
→ In a human evaluation, only 1 out of 676 GPT-4 grading justifications was judged irrelevant to the assigned grade.
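For readers unfamiliar with the agreement metrics above, here is a toy illustration of raw agreement versus Cohen's Kappa using made-up grade vectors (the numbers are hypothetical, not the paper's data); scikit-learn's cohen_kappa_score handles the chance correction.

```python
# Raw agreement vs. Cohen's Kappa on made-up grades (not the paper's data).
import numpy as np
from sklearn.metrics import cohen_kappa_score

human = np.array([2, 1, 0, 2, 2, 1, 0, 1, 2, 2])   # hypothetical human grades
system = np.array([2, 1, 0, 2, 1, 1, 0, 2, 2, 2])  # hypothetical SteLLA grades

raw_agreement = (human == system).mean()            # fraction of exact matches
kappa = cohen_kappa_score(human, system)            # agreement corrected for chance

print(f"raw agreement = {raw_agreement:.4f}, Cohen's kappa = {kappa:.4f}")
```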