SteLLA uses a question-answering approach, enhanced by retrieval-augmented generation, to grade short-answer responses by checking how well each student response answers evaluation questions derived from the instructor's reference answer and rubric.
https://arxiv.org/abs/2501.09092
Original Problem 🤔:
→ Manual grading of open-ended questions is time-consuming, especially for large classes or online courses, which discourages their use.
→ Existing automatic grading systems lack the ability to provide detailed feedback on specific knowledge points.
Solution in this Paper 💡:
→ SteLLA (Structured Grading System Using LLMs with RAG) uses reference-answer- and rubric-based Retrieval Augmented Generation (R-RAG).
→ R-RAG extracts structured information from the reference answer and rubric by generating evaluation question-answer pairs.
→ An LLM grades student responses based on how well they answer these evaluation questions.
→ SteLLA provides both overall grades and breakdown grades with feedback (a minimal sketch of the pipeline follows this list).
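Here is a minimal sketch of that two-step pipeline in Python. It assumes a hypothetical `llm_complete` helper wrapping an LLM API call (e.g., GPT-4); the prompt wording, JSON schema, and scoring aggregation are illustrative, not the paper's exact implementation.

```python
import json

def llm_complete(prompt: str) -> str:
    """Hypothetical wrapper around an LLM chat/completions API (e.g., GPT-4)."""
    raise NotImplementedError  # plug in your provider's client here

def generate_evaluation_qa(reference_answer: str, rubric: str) -> list[dict]:
    """R-RAG step: turn the reference answer and rubric into evaluation QA pairs."""
    prompt = (
        "From the reference answer and rubric below, write evaluation questions "
        "that each cover one rubric point, with the expected answer for each.\n"
        f"Reference answer:\n{reference_answer}\n\nRubric:\n{rubric}\n\n"
        'Return JSON: [{"question": ..., "expected_answer": ..., "points": ...}]'
    )
    return json.loads(llm_complete(prompt))

def grade_response(student_response: str, qa_pairs: list[dict]) -> dict:
    """Grade the student response against each evaluation question, then aggregate."""
    breakdown = []
    for qa in qa_pairs:
        prompt = (
            "Decide how well the student response answers the evaluation question.\n"
            f"Question: {qa['question']}\nExpected answer: {qa['expected_answer']}\n"
            f"Student response:\n{student_response}\n\n"
            'Return JSON: {"score": 0-1, "justification": ...}'
        )
        judged = json.loads(llm_complete(prompt))
        breakdown.append({**qa, **judged})
    overall = sum(item["score"] * item["points"] for item in breakdown)
    return {"overall_grade": overall, "breakdown": breakdown}
```

The per-question justifications in `breakdown` are what gives the student feedback on specific knowledge points, while `overall_grade` summarizes them into a single score.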
Key Insights from this Paper 😲:
→ QA-based structured grading facilitates semantic understanding, going beyond text similarity comparison.
→ R-RAG leverages instructor-provided resources as a highly relevant knowledge base, simplifying retrieval.
→ GPT-4 is proficient at capturing facts but can over-infer in grading tasks.
Results 💯:
→ SteLLA achieves substantial agreement with human graders (Cohen's Kappa = 0.6720).
→ Its raw agreement is about 8 percentage points below that of human graders (0.8358).
→ In the human evaluation, only 1 of 676 GPT-4 grading justifications was judged irrelevant to the assigned grade.
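For context, Cohen's kappa corrects raw agreement for the agreement expected by chance. A quick way to compute both metrics on a set of grades (the labels below are made-up, using scikit-learn):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical grades for the same responses from a human grader and the system.
human_grades  = [2, 1, 0, 2, 2, 1, 0, 1]
system_grades = [2, 1, 0, 2, 1, 1, 0, 2]

kappa = cohen_kappa_score(human_grades, system_grades)
raw_agreement = sum(h == s for h, s in zip(human_grades, system_grades)) / len(human_grades)
print(f"Cohen's kappa: {kappa:.4f}, raw agreement: {raw_agreement:.4f}")
```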