When ChatGPT was released to the public in November 2022, advocates and watchdogs warned about the potential for racial bias. The new large language model was created by harvesting 300 billion words from books, articles and online writing, which include racist falsehoods and reflect writers’ implicit biases. Biased training data is likely to generate biased advice, answers and essays. Garbage in, garbage out.
Researchers are starting to document how AI bias manifests in unexpected ways. Inside the research and development arm of the giant testing organization ETS, which administers the SAT, a pair of investigators pitted man against machine in evaluating more than 13,000 essays written by students in grades 8 to 12. They discovered that the AI model that powers ChatGPT penalized Asian American students more than students of other races and ethnicities when grading the essays. This was purely a research exercise; the essays and machine scores weren’t used in any of ETS’s assessments. But the organization shared its analysis with me to warn schools and teachers about the potential for racial bias when using ChatGPT or other AI apps in the classroom.
[Chart: AI and humans scored essays differently by race and ethnicity]
“Take a little bit of caution and do some evaluation of the scores before presenting them to students,” said Mo Zhang, one of the ETS researchers who conducted the analysis. “There are methods for doing this and you don’t want to take people who specialize in educational measurement out of the equation.”
That might sound self-serving coming from an employee of a company that specializes in educational measurement. But Zhang’s advice is worth heeding amid the excitement to try new AI technology. There are potential dangers as teachers save time by offloading grading work to a robot.
In ETS’s analysis, Zhang and her colleague Matt Johnson fed 13,121 essays into one of the latest versions of the AI model that powers ChatGPT, called GPT-4 Omni, or simply GPT-4o. (This version was added to ChatGPT in May 2024; when the researchers conducted the experiment, they accessed the same model through a different portal.)
A little background about this large bundle of essays: students across the nation had originally written these essays between 2015 and 2019 as part of state standardized exams or classroom assessments. Their assignment had been to write an argumentative essay, such as “Should students be allowed to use cell phones in school?” The essays were collected to help scientists develop and test automated writing evaluation.
Each of the essays had been graded by expert raters of writing on a 1-to-6 point scale, with 6 being the highest score. ETS asked GPT-4o to score them on the same six-point scale, using the same scoring guide the humans used. Neither man nor machine was told the race or ethnicity of the student, but researchers could see students’ demographic information in the datasets accompanying the essays.
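ETS hasn’t published the exact prompt it used, but for readers curious about the mechanics, here is a minimal sketch of what blind, rubric-based scoring might look like, assuming the OpenAI Python client; the rubric text and prompt wording are illustrative placeholders, not the researchers’ actual instructions.

```python
# A sketch of blind, zero-shot rubric scoring, assuming the OpenAI
# Python client. The rubric and prompt wording are placeholders:
# ETS has not published the exact instructions it used.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """You are scoring student argumentative essays on a 1-to-6
scale, where 6 is the highest. Apply the scoring guide strictly and
reply with the score only."""

def score_essay(essay_text: str) -> str:
    """Ask GPT-4o for a 1-to-6 score. No demographic information is
    sent; the model sees only the rubric and the essay text."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # make scoring as repeatable as possible
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Essay:\n{essay_text}"},
        ],
    )
    return response.choices[0].message.content.strip()
```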
GPT-4o marked the essays almost a point lower than the humans did. The average score across the 13,121 essays was 2.8 for GPT-4o and 3.7 for the humans. But Asian Americans were docked by an additional quarter point. Human evaluators gave Asian Americans a 4.3, on average, while GPT-4o gave them only a 3.2 – roughly a 1.1 point deduction. By contrast, the score difference between humans and GPT-4o was only about 0.9 points for white, Black and Hispanic students. Imagine an ice cream truck that kept shaving off an extra quarter scoop only from the cones of Asian American kids.
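The arithmetic behind that “extra quarter scoop,” using only the averages reported above (per-group means for white, Black and Hispanic students weren’t published, so their roughly 0.9-point gap is taken as reported):

```python
# The gap arithmetic, using the averages reported in the story. Per-group
# means for white, Black and Hispanic students weren't published, so
# their roughly 0.9-point gap is taken as reported.
human_avg, gpt_avg = 3.7, 2.8        # all 13,121 essays
asian_human, asian_gpt = 4.3, 3.2    # Asian American students only

overall_gap = human_avg - gpt_avg    # ~0.9 points
asian_gap = asian_human - asian_gpt  # ~1.1 points
extra_penalty = asian_gap - overall_gap
print(f"overall gap: {overall_gap:.1f}, Asian American gap: {asian_gap:.1f}, "
      f"extra penalty: {extra_penalty:.1f}")
# prints: overall gap: 0.9, Asian American gap: 1.1, extra penalty: 0.2
```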
“Clearly, this doesn’t seem fair,” wrote Johnson and Zhang in an unpublished report they shared with me. Though the extra penalty for Asian Americans wasn’t terribly large, they said, it’s substantial enough that it shouldn’t be ignored.
The researchers don’t know why GPT-4o issued lower grades than humans, and why it gave an extra penalty to Asian Americans. Zhang and Johnson described the AI system as a “huge black box” of algorithms that operate in ways “not fully understood by their own developers.” That inability to explain a student’s grade on a writing assignment makes the systems especially frustrating to use in schools.
This one study isn’t proof that AI consistently underrates essays or is biased against Asian Americans. Other AI models sometimes produce different results. A separate analysis of essay scoring by researchers from the University of California, Irvine, and Arizona State University found that AI essay grades were just as frequently too high as they were too low. That study, which used the 3.5 version of ChatGPT, did not scrutinize results by race and ethnicity.
I wondered if AI bias against Asian Americans was somehow connected to high achievement. Just as Asian Americans tend to score high on math and reading tests, they were, on average, the strongest writers in this bundle of 13,000 essays. Even with the penalty, Asian Americans still had the highest essay scores, well above those of white, Black, Hispanic, Native American or multiracial students.
In both the ETS and UC-ASU essay studies, AI awarded far fewer perfect scores than humans did. For example, in this ETS study, humans awarded 732 perfect 6s, while GPT-4o gave out a grand total of only three. GPT’s stinginess with perfect scores might have affected a lot of Asian Americans who had received 6s from human raters.
ETS’s researchers had asked GPT-4o to score the essays cold, without showing the chatbot any graded examples to calibrate its scores. It’s possible that a few sample essays or small tweaks to the grading instructions, or prompts, given to ChatGPT could reduce or eliminate the bias against Asian Americans. Perhaps the robot would be fairer to Asian Americans if it were explicitly prompted to “give out more perfect 6s.”
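To make the calibration idea concrete, here is a hedged sketch of a “few-shot” prompt that shows the model human-graded samples before the real essay. The sample essays, scores and rubric line are hypothetical placeholders, and whether this actually reduces the bias is an open question the researchers did not test.

```python
# A sketch of "few-shot" calibration: show the model a few essays with
# their human scores before the real one. The samples, scores and rubric
# line are hypothetical placeholders.
RUBRIC = "Score the essay on a 1-to-6 scale, where 6 is the highest."

CALIBRATION_EXAMPLES = [
    ("<sample essay that human raters scored 6>", 6),
    ("<sample essay that human raters scored 4>", 4),
    ("<sample essay that human raters scored 2>", 2),
]

def build_messages(essay_text: str) -> list[dict]:
    """Build a chat prompt with graded examples ahead of the real essay."""
    messages = [{"role": "system", "content": RUBRIC}]
    for sample, score in CALIBRATION_EXAMPLES:
        messages.append({"role": "user", "content": f"Essay:\n{sample}"})
        messages.append({"role": "assistant", "content": str(score)})
    messages.append(
        {"role": "user",
         "content": f"Essay:\n{essay_text}\n\nReply with the score only."}
    )
    return messages
```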
The ETS researchers told me this wasn’t the first time they’ve noticed Asian students being treated differently by a robo-grader. Older automated essay graders, which used different algorithms, have sometimes done the opposite, giving Asian students higher marks than human raters did. For example, an ETS automated scoring system developed more than a decade ago, called e-rater, tended to inflate scores for students from Korea, China, Taiwan and Hong Kong on their essays for the Test of English as a Foreign Language (TOEFL), according to a study published in 2012. That may have been because some Asian students had memorized well-structured paragraphs, while humans easily noticed that the essays were off-topic. (The ETS website says it relies on the e-rater score alone only for practice tests, and uses it in conjunction with human scores on actual exams.)
Asian Americans also garnered higher marks from an automated scoring system created during a coding competition in 2021 and powered by BERT, which had been among the most advanced language algorithms before the current generation of large language models, such as GPT. Computer scientists put their experimental robo-grader through a series of tests and discovered that it gave higher scores than humans did to Asian Americans’ open-response answers on a reading comprehension test.
It was also unclear why BERT sometimes treated Asian Americans differently. But the episode illustrates how important it is to test these systems before we unleash them in schools. Judging by educator enthusiasm, however, I fear this train has already left the station. In recent webinars, I’ve seen many teachers post in the chat window that they’re already using ChatGPT, Claude and other AI-powered apps to grade writing. That might be a time saver for teachers, but it could also be harming students.
This story about AI bias was written by Jill Barshay and produced by The Hechinger Report, a nonprofit, independent news organization focused on inequality and innovation in education.
Call me a relic from the past, but AI scoring seems too impersonal. As a math teacher, I love my subject. I like correcting papers. It gives me a “feel” for my students’ performance. I can see their approaches. I can see their understanding, or lack of it. If a machine is doing your grading, you get none of that. The teacher becomes a faceless clerk in a store.
I don’t think teachers should use AI for grading essays. Perhaps I am a relic too, but I think an AI grader does not catch subtlety or ironic humor in writing. That said, I am a Ph.D. who tutors Asian American students, and their writing may well be superior: these kids not only work hard in school but take tutorials beyond the classroom, most often in writing. Even when I was in college at U.Va., Asian American students were often harder working and earned better grades. As for catching plagiarism, a teacher needs to know their students’ writing at all levels of progress. I have caught several plagiarists and cheaters (there is a difference) by assessing both raw writing and a portfolio of student writing in different areas. Teachers who let an AI grade are being lazy, though it makes sense given the low pay and lack of respect teachers receive. If you want to improve the education system, stop denying tenure and pay academics what they are worth; there are tenured teachers at high schools, too. The way we are approaching education will prove a detriment to student progress and to the education and intelligence of American students. It is a shame!
Looking at the data, I noticed that Hispanic and Black students were still marked more harshly by AI than others: their AI marks were only 73 percent of their human-graded marks, while Asian and Indigenous students received 75 percent, and mixed-race and white students 77 percent. In other words, the difference for Asian students looked greater because their scores were higher to begin with. Nonetheless, there is clearly still racial bias in AI; it simply reinforces the human bias already present.
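Part of this ratio arithmetic can be checked against the averages reported in the article itself; a quick sketch (the per-group percentages above come from the underlying dataset and can’t be recomputed from the story alone):

```python
# Checking the ratio framing against the only per-group averages the
# article itself reports.
asian_ratio = 3.2 / 4.3    # GPT-4o score as a share of the human score
overall_ratio = 2.8 / 3.7  # the same ratio across all 13,121 essays
print(f"Asian American: {asian_ratio:.0%}, overall: {overall_ratio:.0%}")
# prints roughly 74% vs. 76%: viewed as a ratio, the gap narrows.
```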
Kelly’s points about both the math and the reinforcement effect seem correct, as do DuWayne’s and TK’s points about the threat to human connection. It seems to me that the value of AI comes from its fast, detailed feedback and its crowd-sourced standardization of teacher-established criteria, however imperfect. To the extent it saves teachers’ time, it allows for more, not less, direct connection with students, and its scores can be treated as advice to be considered, not automatically adopted, since teachers have a more holistic knowledge of students than LLMs, at least so far. Perhaps students and teachers should learn to collaborate with ever-improving AI models, engaging students in building the skills and knowledge to excel in a world where AI will only become more potent over time. And in our education system, especially since looping is rare, AI will likely overtake many individual teachers in its knowledge of students and their growth. Imagine when each student has a custom GPT holding collections of their work and teacher feedback across several years and many subjects, able to brief each new teacher and recommend next steps in the student’s learning journey. That will make effective student-family-teacher collaboration even more critical than it is today.