UBC Undergraduate Research

Evaluating LLM Performance in Essay Assessment: A Comparative Analysis of AI Grading and Feedback Systems for University English Courses

Stasuik, Noah Carter

Abstract

With artificial intelligence rapidly transforming industries worldwide, its integration into higher education appears increasingly inevitable. This thesis explores the potential of large language models (LLMs) to grade students' essays and provide feedback on their writing in 100-level university English courses. Because grading consumes a significant portion of professors' and TAs' time, there is often insufficient opportunity to engage directly with students. In this study, several LLMs and assessment strategies were used to evaluate the quality of feedback and the accuracy of grades delivered by AI systems compared with the original human graders. All participating students consented to be evaluated both by locally hosted models (Llama 3.1 and 3.2) and by OpenAI's commercial offerings (GPT-4o-mini, o1, and o3-mini). The findings indicate that while AI currently lacks the consistency needed to fully replace human assessment, newer and more capable LLMs show progressively better grading accuracy and feedback quality. Furthermore, when these models are combined with specialized assessment methodologies, the results show higher grading accuracy and greater semantic similarity between AI-generated and human feedback. Although the results confirm that AI cannot completely substitute for human grading expertise, they strongly suggest that these technologies could serve as valuable assistive tools in the assessment process. The AI-generated feedback showed particular promise for helping students improve their work, with semantic similarity metrics reaching acceptable scores when compared to human-provided guidance.
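As a rough illustration of the kind of comparison the abstract describes, the sketch below scores the semantic similarity between an AI-generated feedback comment and the original human feedback using embedding cosine similarity. The library, model name, and sample comments are assumptions for illustration only, not the thesis's actual evaluation pipeline.

```python
# Hypothetical sketch: embedding-based semantic similarity between AI and human
# feedback. The sentence-transformers model chosen here is an assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def feedback_similarity(ai_feedback: str, human_feedback: str) -> float:
    """Return cosine similarity between two feedback texts (roughly -1 to 1)."""
    embeddings = model.encode([ai_feedback, human_feedback], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

if __name__ == "__main__":
    ai = "The thesis is clear, but the second body paragraph needs stronger evidence."
    human = "Clear argument overall; paragraph two should cite more supporting evidence."
    print(f"Semantic similarity: {feedback_similarity(ai, human):.2f}")
```

Higher scores would indicate that the AI feedback conveys advice closer in meaning to the human grader's, which is the sense in which the abstract reports "acceptable" semantic similarity.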

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International