"SteLLA: A Structured Grading System Using LLMs with RAG"

A podcast discussing this paper was generated with Google's Illuminate.

SteLLA uses a question-answering approach, enhanced by retrieval-augmented generation, to grade short-answer student responses by checking how well each response answers a set of evaluation questions.

https://arxiv.org/abs/2501.09092

Original Problem 🤔:

→ Manually grading open-ended questions is time-consuming, especially in large classes or online courses, which discourages their use.

→ Existing automatic grading systems lack the ability to provide detailed feedback on specific knowledge points.

Solution in this Paper 💡:

→ SteLLA (Structured Grading System Using LLMs with RAG) uses reference answer- and rubric-based Retrieval Augmented Generation (R-RAG).

→ R-RAG extracts structured information from the reference answer and rubric by generating evaluation question-answer pairs.

→ An LLM grades student responses based on how well they answer these evaluation questions.

→ SteLLA provides both overall grades and breakdown grades with feedback.
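
Below is a minimal sketch of what such an R-RAG grading pipeline could look like. The `llm(prompt)` stub stands in for any chat-completion call (e.g., to GPT-4), and the prompt wording and JSON formats are illustrative assumptions, not the paper's exact prompts.

```python
# Minimal sketch of a SteLLA-style R-RAG grading pipeline.
# The prompts and JSON schemas below are illustrative, not the paper's own.
import json

def llm(prompt: str) -> str:
    """Placeholder for an LLM completion call (e.g., GPT-4)."""
    raise NotImplementedError("wire up your LLM client here")

def extract_eval_qa_pairs(reference_answer: str, rubric: str) -> list[dict]:
    """Step 1 (R-RAG): turn the instructor's reference answer and rubric
    into structured evaluation question-answer pairs."""
    prompt = (
        "From the reference answer and rubric below, write one evaluation "
        "question per knowledge point, each with its expected answer.\n"
        f"Reference answer: {reference_answer}\nRubric: {rubric}\n"
        'Reply as a JSON list of {"question": ..., "expected": ...}.'
    )
    return json.loads(llm(prompt))

def grade_response(student_response: str, qa_pairs: list[dict]) -> dict:
    """Step 2: grade the student response against each evaluation question,
    then aggregate into an overall grade with per-point feedback."""
    breakdown = []
    for qa in qa_pairs:
        prompt = (
            f"Evaluation question: {qa['question']}\n"
            f"Expected answer: {qa['expected']}\n"
            f"Student response: {student_response}\n"
            "Does the response answer the question? Reply as JSON "
            '{"score": 0 or 1, "justification": ...}.'
        )
        breakdown.append(json.loads(llm(prompt)))
    # Per-question scores give the breakdown grades; their mean is the overall grade.
    overall = sum(item["score"] for item in breakdown) / len(breakdown)
    return {"overall": overall, "breakdown": breakdown}
```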

Key Insights from this Paper 😲:

→ QA-based structured grading facilitates semantic understanding, going beyond text similarity comparison.

→ R-RAG leverages instructor-provided resources as a highly relevant knowledge base, simplifying retrieval.

→ GPT-4 is proficient at capturing facts but can over-infer in grading tasks.

Results 💯:

→ SteLLA achieves substantial agreement with human graders (Cohen's Kappa = 0.6720).

→ Its raw agreement with humans is about 8% below the raw agreement between human graders (0.8358).

→ In human evaluation, only 1 of 676 GPT-4 grading justifications was deemed irrelevant to its assigned grade.
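
For context, Cohen's Kappa corrects raw agreement for the agreement two graders would reach by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is observed (raw) agreement and p_e is expected chance agreement. A minimal sketch with toy labels (not the paper's data):

```python
# Cohen's Kappa between two graders' label sequences.
from collections import Counter

def cohens_kappa(a: list, b: list) -> float:
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n      # observed (raw) agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)   # expected chance agreement
    return (p_o - p_e) / (1 - p_e)

# Toy labels, not the paper's data.
human = [1, 1, 0, 1, 0, 1, 1, 0]
model = [1, 1, 0, 1, 1, 1, 0, 0]
print(cohens_kappa(human, model))  # raw agreement 0.75, kappa ≈ 0.467
```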
