UBC Undergraduate Research

Evaluating LLM Performance in Essay Assessment: A Comparative Analysis of AI Grading and Feedback Systems for University English Courses

Stasuik, Noah Carter

Abstract

With artificial intelligence rapidly transforming industries worldwide, its integration into higher education appears increasingly inevitable. This thesis explores the potential of large language models (LLMs) to grade students' essays and provide feedback on their writing in 100-level university English courses. Because grading consumes a significant portion of professors' and TAs' time, there is often insufficient opportunity to engage directly with students. In this study, several LLMs and assessment strategies were used to evaluate the quality of feedback and the accuracy of grades delivered by AI systems compared with the original human graders. All participating students consented to be evaluated both by locally hosted models (Llama 3.1 and 3.2) and by OpenAI's commercial offerings (GPT-4o-mini, o1, and o3-mini). The findings indicate that while AI currently lacks the consistency needed to fully replace human assessment, newer and more capable LLMs show progressively better grading accuracy and feedback quality. Furthermore, when these models are combined with specialized assessment methodologies, the results show higher grading accuracy and greater semantic similarity between AI-generated and human feedback. Although the results confirm that AI cannot completely substitute for human grading expertise, they strongly suggest that these technologies could serve as valuable assistive tools in the assessment process. The AI-generated feedback showed particular promise for helping students improve their work, with semantic similarity metrics reaching acceptable scores when compared to human-provided guidance.
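As a rough illustration of the kind of comparison the abstract describes, the sketch below scores the semantic similarity between an AI-generated feedback comment and the original human feedback using embedding cosine similarity. The library, model name, and sample comments are assumptions for illustration only, not the thesis's actual evaluation pipeline.

```python
# Hypothetical sketch: embedding-based semantic similarity between AI and human
# feedback. The sentence-transformers model chosen here is an assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def feedback_similarity(ai_feedback: str, human_feedback: str) -> float:
    """Return cosine similarity between two feedback texts (roughly -1 to 1)."""
    embeddings = model.encode([ai_feedback, human_feedback], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

if __name__ == "__main__":
    ai = "The thesis is clear, but the second body paragraph needs stronger evidence."
    human = "Clear argument overall; paragraph two should cite more supporting evidence."
    print(f"Semantic similarity: {feedback_similarity(ai, human):.2f}")
```

Higher scores would indicate that the AI feedback conveys advice closer in meaning to the human grader's, which is the sense in which the abstract reports "acceptable" semantic similarity.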

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International