The Hechinger Report is a national nonprofit newsroom that reports on one topic: education. Sign up for our weekly newsletters to get stories like this delivered directly to your inbox. Consider supporting our stories and becoming a member today.

Grading papers is hard work. “I hate it,” a teacher friend confessed to me. And that’s a major reason why middle and high school teachers don’t assign more writing to their students. Even an efficient high school English teacher who can read and evaluate an essay in 20 minutes would spend 3,000 minutes, or 50 hours, grading if she’s teaching six classes of 25 students each. There aren’t enough hours in the day. 

This story also appeared in Mind/Shift

Could ChatGPT relieve teachers of some of the burden of grading papers? Early research is finding that the new artificial intelligence of large language models, also known as generative AI, is approaching the accuracy of a human in scoring essays and is likely to become even better soon. But we still don’t know whether offloading essay grading to ChatGPT will ultimately improve or harm student writing.

Tamara Tate, a researcher at the University of California, Irvine, and an associate director of her university’s Digital Learning Lab, is studying how teachers might use ChatGPT to improve writing instruction. Most recently, Tate and her seven-member research team, which includes writing expert Steve Graham at Arizona State University, compared how ChatGPT stacked up against humans in scoring 1,800 history and English essays written by middle and high school students. 

Tate said ChatGPT was “roughly speaking, probably as good as an average busy teacher” and “certainly as good as an overburdened below-average teacher.” But, she said, ChatGPT isn’t yet accurate enough to be used on a high-stakes test or on an essay that would affect a final grade in a class.

Tate presented her study on ChatGPT essay scoring at the 2024 annual meeting of the American Educational Research Association in Philadelphia in April. (The paper is under peer review for publication and is still undergoing revision.) 

Most remarkably, the researchers obtained these fairly decent essay scores from ChatGPT without training it first with sample essays. That means it is possible for any teacher to use it to grade any essay instantly with minimal expense and effort. “Teachers might have more bandwidth to assign more writing,” said Tate. “You have to be careful how you say that because you never want to take teachers out of the loop.” 

Writing instruction could ultimately suffer, Tate warned, if teachers delegate too much grading to ChatGPT. Seeing students’ incremental progress and common mistakes remains important for deciding what to teach next, she said. For example, seeing loads of run-on sentences in your students’ papers might prompt a lesson on how to break them up. But if you don’t see them, you might not think to teach it. 

In the study, Tate and her research team calculated that ChatGPT’s essay scores were in “fair” to “moderate” agreement with those of well-trained human evaluators. In one batch of 943 essays, ChatGPT was within a point of the human grader 89 percent of the time. On a six-point grading scale that researchers used in the study, ChatGPT often gave an essay a 2 when an expert human evaluator thought it was really a 1. But this level of agreement – within one point – dropped to 83 percent of the time in another batch of 344 English papers and slid even farther to 76 percent of the time in a third batch of 493 history essays.  That means there were more instances where ChatGPT gave an essay a 4, for example, when a teacher marked it a 6. And that’s why Tate says these ChatGPT grades should only be used for low-stakes purposes in a classroom, such as a preliminary grade on a first draft.

[Chart: ChatGPT scored an essay within one point of a human grader 89 percent of the time in one batch of essays. Corpus 3 refers to one batch of 943 essays, more than half of the 1,800 essays scored in this study. Green cells show exact score matches between ChatGPT and a human; yellow cells show scores in which ChatGPT was within one point of the human score. Source: Tamara Tate, University of California, Irvine (2024).]

Still, this level of accuracy was impressive because even teachers disagree on how to score an essay and one-point discrepancies are common. Exact agreement, which only happens half the time between human raters, was worse for AI, which matched the human score exactly only about 40 percent of the time. Humans were far more likely to give a top grade of a 6 or a bottom grade of a 1. ChatGPT tended to cluster grades more in the middle, between 2 and 5. 
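
The two statistics the study reports, exact agreement and within-one-point (“adjacent”) agreement, are straightforward to compute. Here is a minimal sketch in Python using invented scores on the study’s 1-to-6 scale, not the study’s actual data:

```python
# Minimal sketch of the two agreement statistics the study reports:
# exact agreement and within-one-point ("adjacent") agreement.
# The scores below are invented, on the study's 1-to-6 scale.
def agreement_rates(human_scores, ai_scores):
    """Return (exact, adjacent) agreement as fractions of all essays."""
    pairs = list(zip(human_scores, ai_scores))
    exact = sum(h == a for h, a in pairs) / len(pairs)
    adjacent = sum(abs(h - a) <= 1 for h, a in pairs) / len(pairs)
    return exact, adjacent

human = [1, 2, 3, 4, 5, 6, 3, 4]  # invented human scores
ai = [2, 2, 3, 3, 4, 5, 3, 6]     # invented ChatGPT scores
exact, adjacent = agreement_rates(human, ai)
```

Researchers often also report a chance-corrected statistic such as quadratically weighted kappa, which is what “fair” to “moderate” agreement refers to; the raw percentages above are the easier numbers for a teacher or administrator to interpret.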

Tate set up ChatGPT for a tough challenge, competing against teachers and experts with PhDs who had received three hours of training in how to properly evaluate essays. “Teachers generally receive very little training in secondary school writing and they’re not going to be this accurate,” said Tate. “This is a gold-standard human evaluator we have here.”

The raters had been paid to score these 1,800 essays as part of three earlier studies on student writing. Researchers fed these same student essays, ungraded, into ChatGPT and asked it to score them cold. ChatGPT hadn’t been given any graded examples to calibrate its scores. All the researchers did was copy and paste an excerpt of the same scoring guidelines that the humans used, called a grading rubric, into ChatGPT and tell it to “pretend” it was a teacher and score the essays on a scale of 1 to 6. 
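
The study’s exact prompt isn’t reproduced in this article, but the zero-shot setup it describes, pasting a rubric excerpt and asking the model to “pretend” to be a teacher, can be sketched roughly as follows. The rubric text, prompt wording, and score parser below are illustrative assumptions, not the study’s actual materials:

```python
# Hypothetical sketch of the zero-shot setup described above: paste a rubric
# excerpt into the prompt, tell the model to "pretend" it is a teacher, and
# ask for a single 1-6 score. The rubric text and wording are placeholders.
import re

RUBRIC_EXCERPT = (
    "Score 6: insightful thesis, strong organization, varied sentences...\n"
    "Score 1: no discernible thesis, little or no organization..."
)  # invented stand-in for the study's rubric

def build_grading_messages(essay_text):
    """Assemble zero-shot chat messages asking for a single 1-6 score."""
    return [
        {"role": "system",
         "content": "Pretend you are a teacher grading student essays."},
        {"role": "user",
         "content": (f"Using this rubric:\n{RUBRIC_EXCERPT}\n\n"
                     "Score the essay below on a scale of 1 to 6. "
                     "Reply with the number only.\n\n"
                     f"ESSAY:\n{essay_text}")},
    ]

def parse_score(reply):
    """Pull the first 1-6 digit out of the model's reply, if any."""
    match = re.search(r"\b([1-6])\b", reply)
    return int(match.group(1)) if match else None
```

The messages list is the shape most chat-completion APIs accept; the key point is that no graded examples are included, which is what makes the setup zero-shot.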

Older robo graders

Earlier versions of automated essay graders have had higher rates of accuracy. But they were expensive and time-consuming to create because scientists had to train the computer with hundreds of human-graded essays for each essay question. That’s economically feasible only in limited situations, such as for a standardized test, where thousands of students answer the same essay question. 

Earlier robo graders could also be gamed, once a student understood the features that the computer system was grading for. In some cases, nonsense essays received high marks if fancy vocabulary words were sprinkled in them. ChatGPT isn’t grading for particular hallmarks, but is analyzing patterns in massive datasets of language. Tate says she hasn’t yet seen ChatGPT give a high score to a nonsense essay. 

Tate expects ChatGPT’s grading accuracy to improve rapidly as new versions are released. Already, the research team has detected that the newer 4.0 version, which requires a paid subscription, is scoring more accurately than the free 3.5 version. Tate suspects that small tweaks to the grading instructions, or prompts, given to ChatGPT could improve existing versions. She is interested in testing whether ChatGPT’s scoring could become more reliable if a teacher trained it with just a few, perhaps five, sample essays that she has already graded. “Your average teacher might be willing to do that,” said Tate.
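
The calibration idea Tate describes, giving the model a handful of teacher-graded essays before the one to be scored, maps naturally onto few-shot prompting. Here is a hypothetical sketch; the message format and sample data are invented for illustration:

```python
# Hypothetical few-shot variant: include a handful of teacher-graded
# essays as worked examples before the essay to be scored. The format
# and sample data below are invented for illustration.
def build_fewshot_messages(graded_examples, essay_text, rubric):
    """graded_examples: list of (essay_text, teacher_score) pairs."""
    messages = [
        {"role": "system",
         "content": "Pretend you are a teacher grading student essays "
                    "on a scale of 1 to 6 using the rubric provided."},
        {"role": "user", "content": f"Rubric:\n{rubric}"},
    ]
    for example_text, score in graded_examples:
        messages.append({"role": "user", "content": f"ESSAY:\n{example_text}"})
        messages.append({"role": "assistant", "content": str(score)})
    messages.append({"role": "user",
                     "content": f"ESSAY:\n{essay_text}\n\nReply with the score only."})
    return messages

# Two calibration essays yield seven messages in total:
# system + rubric + 2 x (essay, score) + the essay to be graded.
msgs = build_fewshot_messages(
    [("Sample essay A...", 5), ("Sample essay B...", 2)],
    "New student essay...",
    "rubric excerpt here",
)
```

With five graded samples, as Tate suggests, the list would simply grow by a few more example pairs; the teacher’s one-time cost is grading those samples herself.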

Many ed tech startups, and even well-known vendors of educational materials, are now marketing new AI essay robo graders to schools. Many of them are powered under the hood by ChatGPT or another large language model. And I learned from this study that accuracy rates can be reported in ways that make the new AI graders seem more accurate than they are. Tate’s team calculated that, on a population level, there was no difference between human and AI scores. ChatGPT can already reliably tell you the average essay score in a school or, say, in the state of California. 

Questions for AI vendors

At this point, it is not as accurate in scoring an individual student. And a teacher wants to know exactly how each student is doing. Tate advises teachers and school leaders who are considering using an AI essay grader to ask specific questions about accuracy rates at the student level: What is the rate of exact agreement between the AI grader and a human rater on each essay? How often are they within one point of each other?

The next step in Tate’s research is to study whether student writing improves after having an essay graded by ChatGPT. She’d like teachers to try using ChatGPT to score a first draft and then see if it encourages revisions, which are critical for improving writing. Tate thinks teachers could make it “almost like a game: how do I get my score up?” 

Of course, it’s unclear if grades alone, without concrete feedback or suggestions for improvement, will motivate students to make revisions. Students may be discouraged by a low score from ChatGPT and give up. Many students might ignore a machine grade and only want to deal with a human they know. Still, Tate says some students are too scared to show their writing to a teacher until it’s in decent shape, and seeing their score improve on ChatGPT might be just the kind of positive feedback they need. 

“We know that a lot of students aren’t doing any revision,” said Tate. “If we can get them to look at their paper again, that is already a win.”

That does give me hope, but I’m also worried that kids will just ask ChatGPT to write the whole essay for them in the first place.

This story about AI essay scoring was written by Jill Barshay and produced by The Hechinger Report, a nonprofit, independent news organization focused on inequality and innovation in education. Sign up for Proof Points and other Hechinger newsletters.


2 replies on “PROOF POINTS: AI essay grading is already as ‘good as an overburdened’ teacher, but researchers say it needs more work”


  1. RE: PROOF POINTS: AI essay grading is already as ‘good as an overburdened’ teacher, but researchers say it needs more work.

    Schools and educators are being subjected to the latest contrivance that promises to reduce the most tiresome aspects of being a teacher. Often, educators are swept into a tyranny of efficiency that tech companies repeatedly pitch to the unsuspecting. The latest? AI-enabled “personal teaching assistants,” which appear as add-ons for many school-wide information systems (including curriculum planning and reporting software and the like). These AI features claim they can eliminate or lessen the need to work through the following: writing student reports, collaborating on lesson planning, developing personalized learning for students, analyzing data, and creating assessments . . . and now, grading papers. These conveniences are the first step in a relentless march toward dehumanizing education.


    At the very core of our educational endeavor is the necessity of highly trained teachers who have the knowledge, skills, and talents to be effective. And, most importantly, teachers must possess deep human qualities, best embodied by one who is passionately committed to the care and development of children. (Teachers spark inspiration, turn lives around, and respond to individual student needs, repeatedly.) To truly “know thy learner,” a teacher must be 100 percent engaged in their students’ learning and growth. That undertaking carries a litany of responsibilities, which may include working through observational notes, grappling with a colleague about a student’s learning, listening intently to a student’s ideas, and writing assessments with an understanding of who the students are as learners. No facet of the teacher-student relationship ought to be reduced to rudimentary and impersonal forms. Ultimately, education is principally grounded in humanism.



    Let us consider what is happening now that we are allowing these AI-enabled components to be tested by teachers. The sales pitch, remember, is that AI will trim the burdensome aspects of teaching. For example, one company says we should use its “teaching assistant” because it will save hours of writing those arduous report card comments about students’ achievement. One even has a magic wand on its app that “generates a hyper-personalized comment” for progress reports. Teachers simply click the icon, eliminating the need to gather their observations and grapple with writing a personal passage. Once the comment is written by AI, teachers can then click a button to configure the appropriate voice and tone for the message: “firm” or “witty” or “serious,” to name just a few.

    As a school leader, I am familiar with the onset of complaints from teachers around reporting time. I deflect those complaints unabashedly, knowing that the difficult process of reporting on learning results in a deeper understanding of each student. Before AI, teachers would not have fathomed having a friend or even a family member write their comments for them. Why, then, would we outsource the task to a bot, which is several shades further removed from the teacher who is accountable and responsible for personally knowing their learners?

    The rollout of ChatGPT certainly scared educators. The fear was that students would never learn to compose even a paragraph if they simply resorted to ChatGPT to do it for them. It was a legitimate worry, addressed by ensuring teachers assigned work differently and, more importantly, knew their students as learners . . . and knew them well. Now a form of ChatGPT (in the shape of AI-enabled software) will allow teachers to avoid their own annoyances with writing; yes, it can even be used to write messages and letters to parents. As a parent, I would be quite disappointed to learn that my child’s teacher couldn’t be bothered to write a personalized comment about my child’s learning, or that those “witty” letters home did not represent the quiet and kind teacher I met at the beginning of the school year.


    And what about the labor of curriculum planning? The art of taking the adopted educational program and planning it in teacher teams will, we are told, no longer be necessary with AI-enabled curriculum planning software. It promises to streamline lesson planning, thereby taking the teacher, again, out of the learning equation. Curriculum planning is a collaborative affair; when conducted in isolation, or by someone (or something) else, it can result in a school where teachers work in silos or, worse yet, have no allegiance to the lessons they are delivering. The more teachers allow AI to do their lesson planning, the more they will place unyielding trust in its algorithmic output. In the long run, teachers will begin to trust AI over their colleagues and eventually even mistrust their own work.


    Even the few “drudgeries” above are essential for strengthening teacher-student relationships. By the time you read this, educational software companies will have introduced a slew of additional AI-enabled accessories to eliminate these onerous tasks, all of which will accelerate the divorce between teachers and students. I implore educational leaders to scrutinize the use of these AI-enabled components to ensure the human connection between teachers and students is not compromised. Innocent overuse of AI-enabled tools will only goad the designers to create more of these embellishments, which will ultimately wedge themselves firmly between students and teachers.

  2. Dear Jill,
    We appreciated your insightful article, “PROOF POINTS: AI essay grading is already as ‘good as an overburdened’ teacher, but researchers say it needs more work.” While we agree that the potential of AI to ease the grading burden and improve writing instruction is promising, a number of ethical issues must be addressed before we allow AI into our grading practices.

    Your article mentions that AI-powered tools, specifically ChatGPT, are not yet accurate enough to be used on high-stakes tests or essays that would affect final grades.

    We argue that AI accuracy is only one component of its overall suitability for grading. While accuracy is important in ensuring educational impact and integrity, fairness, bias mitigation, transparency, and explainability are equally crucial. In fact, fairness is a foundational principle of responsible AI and a key standard of educational testing.

    As scientists, it is our power and responsibility to draw attention not just to the promise of AI but also to its numerous potential biases. For example, language differences or cultural references in student writing could lead to biased scoring, disadvantaging certain groups. AI systems in education must identify, reduce, and eliminate biases to create an inclusive environment. This involves carefully selecting training data and ensuring the AI fairly evaluates students of diverse backgrounds and abilities.

    In your article, you mention a study by Dr. Tamara Tate, a researcher at the University of California, which compared how ChatGPT stacked up against humans in scoring essays written by middle and high school students. At ETS Research Institute, we conducted an experiment using the same dataset as Dr. Tate to evaluate GPT-4o’s fairness in over 12,000 essays. On average, the scores generated by ChatGPT were 0.9 points lower than human ratings and matched human scores exactly only 30 percent of the time. Notably, essays by Asian/Pacific Islander students received significantly lower scores from the AI compared to human raters, revealing a bias that needs addressing.

    Understanding how AI makes scoring decisions and why it disadvantages certain populations remains a significant challenge, even for its developers. For instance, we found that GPT-4o could predict the race/ethnicity of essay writers more accurately than scoring essays. This suggests that the features it uses to predict race/ethnicity may also influence its scoring, contributing to fairness issues.

    As we integrate AI into education, it is our collective responsibility to ensure these technologies are used ethically and effectively. Numerous agencies, such as NIST, UNESCO, and OECD, have published guidance on the responsible use of AI in education. At ETS Research Institute, we have synthesized these broad guidelines to develop principles for the responsible use of AI in assessments. Unique to educational testing, our principles include:
    • Fairness and bias mitigation
    • Privacy and security
    • Transparency, explainability, and accountability
    • Educational impact and integrity
    • Continuous improvement

    Only by prioritizing fairness over hype, integrity over cost-saving, and educational impact over convenience can we create a more inclusive, reliable, and effective educational environment that truly benefits all students and educators.

    We hope our perspective contributes to the ongoing dialogue about AI’s role in education.

    Warm regards,

    Matthew S. Johnson and Mo Zhang
    ETS Research Institute

    Matt Johnson is a principal research director at ETS Research Institute and a leading author of ETS Principles on the Responsible Use of AI in Assessment (ETS Research Institute, 2024). His research focuses on statistical methods in education and psychology, with a primary focus on item response theory and related models.

    Mo Zhang is a senior research scientist at ETS Research Institute. She specializes in writing research, automated scoring of constructed-response items, and performance-based assessment design and validation. She currently holds two U.S. patents and has published extensively in the field of educational measurement.
