"SteLLA: A Structured Grading System Using LLMs with RAG"

A podcast on this paper was generated with Google's Illuminate.

SteLLA uses a question-answering approach, enhanced by retrieval-augmented generation, to grade short-answer student responses: it checks how well each response answers evaluation questions derived from the instructor's reference answer and rubric.

https://arxiv.org/abs/2501.09092

Original Problem πŸ€”:

→ Manual grading of open-ended questions is time-consuming, especially for large classes or online courses, which hinders their use.

β†’ Existing automatic grading systems lack the ability to provide detailed feedback on specific knowledge points.

Solution in this Paper πŸ’‘:

→ SteLLA (Structured Grading System Using LLMs with RAG) uses reference-answer- and rubric-based Retrieval Augmented Generation (R-RAG).

β†’ R-RAG extracts structured information from the reference answer and rubric by generating evaluation question-answer pairs.

β†’ An LLM grades student responses based on how well they answer these evaluation questions.

→ SteLLA provides both overall grades and breakdown grades with feedback (see the sketch after this list).
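
To make the pipeline concrete, here is a minimal sketch of the two R-RAG steps, assuming an OpenAI-style chat API. The prompts, model choice, and helper names are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the SteLLA R-RAG grading flow (not the paper's code).
# Assumes the OpenAI Python client; prompts and model choice are illustrative.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """Single-turn LLM call (GPT-4 assumed, as in the paper's experiments)."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def extract_qa_pairs(reference_answer: str, rubric: str) -> str:
    """Step 1 (R-RAG): mine evaluation question-answer pairs from the
    instructor-provided reference answer and rubric."""
    return ask(
        "From the reference answer and rubric below, generate a numbered list "
        "of evaluation questions, each paired with its expected answer.\n\n"
        f"Reference answer:\n{reference_answer}\n\nRubric:\n{rubric}"
    )

def grade_response(qa_pairs: str, student_response: str) -> str:
    """Step 2: grade the student response against each evaluation question,
    producing per-question breakdown grades with feedback and an overall grade."""
    return ask(
        "For each evaluation question-answer pair below, judge how well the "
        "student response answers it. Give a per-question grade with a short "
        "justification, then an overall grade.\n\n"
        f"Evaluation QA pairs:\n{qa_pairs}\n\nStudent response:\n{student_response}"
    )

qa = extract_qa_pairs(reference_answer="...", rubric="...")
print(grade_response(qa, student_response="..."))
```

Because the instructor's reference answer and rubric already form the knowledge base, the "retrieval" step reduces to structuring those documents rather than searching a large corpus.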

Key Insights from this Paper 😲:

β†’ QA-based structured grading facilitates semantic understanding, going beyond text similarity comparison.

β†’ R-RAG leverages instructor-provided resources as a highly relevant knowledge base, simplifying retrieval.

→ GPT-4 is proficient at capturing facts but can over-infer in grading tasks.

Results πŸ’―:

→ SteLLA achieves substantial agreement with human graders (Cohen's Kappa = 0.6720; the metric is illustrated after this list).

→ This is about 8% lower than the raw agreement between human graders (0.8358).

→ In human evaluation, only 1 of 676 GPT-4 grading justifications was deemed irrelevant to the assigned grade.
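
For context on the headline metric: Cohen's Kappa measures agreement corrected for the agreement expected by chance, and values in the 0.61 to 0.80 range are conventionally read as "substantial" agreement. A minimal illustration with made-up grades (scikit-learn assumed; the arrays are not the paper's data):

```python
# Illustration of Cohen's Kappa for grader agreement; the grade arrays are
# hypothetical, not the paper's data (the paper reports kappa = 0.6720).
from sklearn.metrics import cohen_kappa_score

human_grades  = [2, 1, 0, 2, 1, 2, 0, 1]  # hypothetical human grades
stella_grades = [2, 1, 1, 2, 1, 2, 0, 0]  # hypothetical system grades

# kappa = (observed agreement - chance agreement) / (1 - chance agreement)
print(cohen_kappa_score(human_grades, stella_grades))
```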
