writing Archives - The Hechinger Report
Covering Innovation & Inequality in Education

PROOF POINTS: Asian American students lose more points in an AI essay grading study — but researchers don’t know why

When ChatGPT was released to the public in November 2022, advocates and watchdogs warned about the potential for racial bias. The new large language model was created by harvesting 300 billion words from books, articles and online writing, which include racist falsehoods and reflect writers’ implicit biases. Biased training data is likely to generate biased advice, answers and essays. Garbage in, garbage out. 

Researchers are starting to document how AI bias manifests in unexpected ways. Inside the research and development arm of the giant testing organization ETS, which administers the SAT, a pair of investigators pitted man against machine in evaluating more than 13,000 essays written by students in grades 8 to 12. They discovered that the AI model that powers ChatGPT penalized Asian American students more than students of other races and ethnicities when grading the essays. This was purely a research exercise, and these essays and machine scores weren’t used in any of ETS’s assessments. But the organization shared its analysis with me to warn schools and teachers about the potential for racial bias when using ChatGPT or other AI apps in the classroom.

AI and humans scored essays differently by race and ethnicity

“Diff” is the difference between the average score given by humans and GPT-4o in this experiment. “Adj. Diff” adjusts this raw number for the randomness of human ratings. Source: Table from Matt Johnson & Mo Zhang “Using GPT-4o to Score Persuade 2.0 Independent Items” ETS (June 2024 draft)

“Take a little bit of caution and do some evaluation of the scores before presenting them to students,” said Mo Zhang, one of the ETS researchers who conducted the analysis. “There are methods for doing this and you don’t want to take people who specialize in educational measurement out of the equation.”

That might sound self-serving for an employee of a company that specializes in educational measurement. But Zhang’s advice is worth heeding in the excitement to try new AI technology. There are potential dangers as teachers save time by offloading grading work to a robot.

In ETS’s analysis, Zhang and her colleague Matt Johnson fed 13,121 essays into one of the latest versions of the AI model that powers ChatGPT, called GPT 4 Omni or simply GPT-4o. (This version was added to ChatGPT in May 2024, but when the researchers conducted this experiment they used the latest AI model through a different portal.)  

A little background about this large bundle of essays: students across the nation had originally written these essays between 2015 and 2019 as part of state standardized exams or classroom assessments. Their assignment had been to write an argumentative essay, such as “Should students be allowed to use cell phones in school?” The essays were collected to help scientists develop and test automated writing evaluation.

Each of the essays had been graded by expert raters of writing on a 1-to-6 point scale with 6 being the highest score. ETS asked GPT-4o to score them on the same six-point scale using the same scoring guide that the humans used. Neither man nor machine was told the race or ethnicity of the student, but researchers could see students’ demographic information in the datasets that accompany these essays.

GPT-4o marked the essays almost a point lower than the humans did. The average score across the 13,121 essays was 2.8 for GPT-4o and 3.7 for the humans. But Asian Americans were docked by an additional quarter point. Human evaluators gave Asian Americans a 4.3, on average, while GPT-4o gave them only a 3.2 – roughly a 1.1 point deduction. By contrast, the score difference between humans and GPT-4o was only about 0.9 points for white, Black and Hispanic students. Imagine an ice cream truck that kept shaving off an extra quarter scoop only from the cones of Asian American kids. 

“Clearly, this doesn’t seem fair,” wrote Johnson and Zhang in an unpublished report they shared with me. Though the extra penalty for Asian Americans wasn’t terribly large, they said, it’s substantial enough that it shouldn’t be ignored. 

The researchers don’t know why GPT-4o issued lower grades than humans, and why it gave an extra penalty to Asian Americans. Zhang and Johnson described the AI system as a “huge black box” of algorithms that operate in ways “not fully understood by their own developers.” That inability to explain a student’s grade on a writing assignment makes the systems especially frustrating to use in schools.

This table compares GPT-4o scores with human scores on the same batch of 13,121 student essays, which were scored on a 1-to-6 scale. Numbers highlighted in green show exact score matches between GPT-4o and humans. Unhighlighted numbers show discrepancies. For example, there were 1,221 essays where humans awarded a 5 and GPT awarded 3. Data source: Matt Johnson & Mo Zhang “Using GPT-4o to Score Persuade 2.0 Independent Items” ETS (June 2024 draft)

This one study isn’t proof that AI is consistently underrating essays or biased against Asian Americans. Other versions of AI sometimes produce different results. A separate analysis of essay scoring by researchers from the University of California, Irvine, and Arizona State University found that AI essay grades were just as frequently too high as they were too low. That study, which used the 3.5 version of ChatGPT, did not scrutinize results by race and ethnicity.

I wondered if AI bias against Asian Americans was somehow connected to high achievement. Just as Asian Americans tend to score high on math and reading tests, Asian Americans, on average, were the strongest writers in this bundle of 13,000 essays. Even with the penalty, Asian Americans still had the highest essay scores, well above those of white, Black, Hispanic, Native American or multi-racial students. 

In both the ETS and UC-ASU essay studies, AI awarded far fewer perfect scores than humans did. For example, in this ETS study, humans awarded 732 perfect 6s, while GPT-4o gave out a grand total of only three. GPT’s stinginess with perfect scores might have affected a lot of Asian Americans who had received 6s from human raters.

ETS’s researchers had asked GPT-4o to score the essays cold, without showing the chatbot any graded examples to calibrate its scores. It’s possible that a few sample essays or small tweaks to the grading instructions, or prompts, given to ChatGPT could reduce or eliminate the bias against Asian Americans. Perhaps the robot would be fairer to Asian Americans if it were explicitly prompted to “give out more perfect 6s.” 
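
For teachers or researchers who want to experiment with that calibration idea, here is a minimal sketch of what showing the model a few graded examples could look like, using the OpenAI Python client. The model name, rubric text, sample essays and scores are placeholders of my own, not the prompts or materials ETS used.

```python
# Hypothetical sketch of "few-shot" calibration: show the model a handful of
# human-graded essays before asking it to score a new one. The rubric, sample
# essays and scores below are invented placeholders, not ETS materials.
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

RUBRIC = "Score the essay from 1 (lowest) to 6 (highest) using this guide: ..."

graded_examples = [
    ("Sample essay that human raters scored a 6 ...", 6),
    ("Sample essay that human raters scored a 3 ...", 3),
    ("Sample essay that human raters scored a 1 ...", 1),
]

def score_essay(essay_text: str) -> str:
    messages = [{"role": "system", "content": RUBRIC}]
    for sample, score in graded_examples:
        messages.append({"role": "user", "content": f"Essay:\n{sample}"})
        messages.append({"role": "assistant", "content": str(score)})
    messages.append(
        {"role": "user", "content": f"Essay:\n{essay_text}\nReply with the score only."}
    )
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content

print(score_essay("A new student essay to be graded ..."))
```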

The ETS researchers told me this wasn’t the first time that they’ve noticed Asian students treated differently by a robo-grader. Older automated essay graders, which used different algorithms, have sometimes done the opposite, giving Asians higher marks than human raters did. For example, an ETS automated scoring system developed more than a decade ago, called e-rater, tended to inflate scores for students from Korea, China, Taiwan and Hong Kong on their essays for the Test of English as a Foreign Language (TOEFL), according to a study published in 2012. That may have been because some Asian students had memorized well-structured paragraphs, while humans easily noticed that the essays were off-topic. (The ETS website says it relies on the e-rater score alone only for practice tests, and uses it in conjunction with human scores on actual exams.)

Asian Americans also garnered higher marks from an automated scoring system created during a coding competition in 2021 and powered by BERT, which had been the most advanced algorithm before the current generation of large language models, such as GPT. Computer scientists put their experimental robo-grader through a series of tests and discovered that it gave higher scores than humans did to Asian Americans’ open-response answers on a reading comprehension test. 

It was also unclear why BERT sometimes treated Asian Americans differently. But it illustrates how important it is to test these systems before we unleash them in schools. Based on educator enthusiasm, however, I fear this train has already left the station. In recent webinars, I’ve seen many teachers post in the chat window that they’re already using ChatGPT, Claude and other AI-powered apps to grade writing. That might be a time saver for teachers, but it could also be harming students. 

This story about AI bias was written by Jill Barshay and produced by The Hechinger Report, a nonprofit, independent news organization focused on inequality and innovation in education. Sign up for Proof Points and other Hechinger newsletters.

PROOF POINTS: AI writing feedback ‘better than I thought,’ top researcher says

Researchers from the University of California, Irvine, and Arizona State University found that human feedback was generally a bit better than AI feedback, but AI was surprisingly good. Credit: Getty Images

This week I challenged my editor to face off against a machine. Barbara Kantrowitz gamely accepted, under one condition: “You have to file early.”  Ever since ChatGPT arrived in 2022, many journalists have made a public stunt out of asking the new generation of artificial intelligence to write their stories. Those AI stories were often bland and sprinkled with errors. I wanted to understand how well ChatGPT handled a different aspect of writing: giving feedback.

My curiosity was piqued by a new study, published in the June 2024 issue of the peer-reviewed journal Learning and Instruction, that evaluated the quality of ChatGPT’s feedback on students’ writing. A team of researchers compared AI with human feedback on 200 history essays written by students in grades 6 through 12, and they determined that human feedback was generally a bit better. Humans had a particular advantage in suggesting something for a student to work on next that fit where that student was in their development as a writer.

But ChatGPT came close. On a five-point scale that the researchers used to rate feedback quality, with a 5 being the highest quality feedback, ChatGPT averaged a 3.6 compared with a 4.0 average from a team of 16 expert human evaluators. It was a tough challenge. Most of these humans had taught writing for more than 15 years or they had considerable experience in writing instruction. All received three hours of training for this exercise plus extra pay for providing the feedback.

ChatGPT even beat these experts in one aspect; it was slightly better at giving feedback on students’ reasoning, argumentation and use of evidence from source materials – the features that the researchers had wanted the writing evaluators to focus on.

“It was better than I thought it was going to be because I didn’t have a lot of hope that it was going to be that good,” said Steve Graham, a well-regarded expert on writing instruction at Arizona State University, and a member of the study’s research team. “It wasn’t always accurate. But sometimes it was right on the money. And I think we’ll learn how to make it better.”

Average ratings for the quality of ChatGPT and human feedback on 200 student essays

Researchers rated the quality of the feedback on a five-point scale across five different categories. Criteria-based refers to whether the feedback addressed the main goals of the writing assignment, in this case, to produce a well-reasoned argument about history using evidence from the reading source materials that the students were given. Clear directions refers to whether the feedback included specific examples of something the student did well and clear directions for improvement. Accuracy refers to whether the feedback advice was correct and free of errors. Essential Features refers to whether the suggestion on what the student should work on next is appropriate for where the student is in their writing development and is an important element of this genre of writing. Supportive Tone refers to whether the feedback is delivered with language that is affirming, respectful and supportive, as opposed to condescending, impolite or authoritarian. (Source: Fig. 1 of Steiss et al., “Comparing the quality of human and ChatGPT feedback of students’ writing,” Learning and Instruction, June 2024.)

Exactly how ChatGPT is able to give good feedback is something of a black box even to the writing researchers who conducted this study. Artificial intelligence doesn’t comprehend things in the same way that humans do. But somehow, through the neural networks that ChatGPT’s programmers built, it is picking up on patterns from all the writing it has previously digested, and it is able to apply those patterns to a new text. 

The surprising “relatively high quality” of ChatGPT’s feedback is important because it means that the new artificial intelligence of large language models, also known as generative AI, could potentially help students improve their writing. One of the biggest problems in writing instruction in U.S. schools is that teachers assign too little writing, Graham said, often because teachers feel that they don’t have the time to give personalized feedback to each student. That leaves students without sufficient practice to become good writers. In theory, teachers might be willing to assign more writing or insist on revisions for each paper if students (or teachers) could use ChatGPT to provide feedback between drafts. 

Despite the potential, Graham isn’t an enthusiastic cheerleader for AI. “My biggest fear is that it becomes the writer,” he said. He worries that students will not limit their use of ChatGPT to helpful feedback, but ask it to do their thinking, analyzing and writing for them. That’s not good for learning. The research team also worries that writing instruction will suffer if teachers delegate too much feedback to ChatGPT. Seeing students’ incremental progress and common mistakes remains important for deciding what to teach next, the researchers said. For example, seeing loads of run-on sentences in your students’ papers might prompt a lesson on how to break them up. But if you don’t see them, you might not think to teach it. Another common concern among writing instructors is that AI feedback will steer everyone to write in the same homogenized way. A young writer’s unique voice could be flattened out before it even has the chance to develop.

There’s also the risk that students may not be interested in heeding AI feedback. Students often ignore the painstaking feedback that their teachers already give on their essays. Why should we think students will pay attention to feedback if they start getting more of it from a machine? 

Still, Graham and his research colleagues at the University of California, Irvine, are continuing to study how AI could be used effectively and whether it ultimately improves students’ writing. “You can’t ignore it,” said Graham. “We either learn to live with it in useful ways, or we’re going to be very unhappy with it.”

Right now, the researchers are studying how students might converse back-and-forth with ChatGPT like a writing coach in order to understand the feedback and decide which suggestions to use.

Example of feedback from a human and ChatGPT on the same essay

In the current study, the researchers didn’t track whether students understood or employed the feedback, but only sought to measure its quality. Judging the quality of feedback is a rather subjective exercise, just as feedback itself is a bundle of subjective judgment calls. Smart people can disagree on what good writing looks like and how to revise bad writing. 

In this case, the research team came up with its own criteria for what constitutes good feedback on a history essay. They instructed the humans to focus on the student’s reasoning and argumentation, rather than, say, grammar and punctuation.  They also told the human raters to adopt a “glow and grow strategy” for delivering the feedback by first finding something to praise, then identifying a particular area for improvement. 

The human raters provided this kind of feedback on hundreds of history essays from 2021 to 2023, as part of an unrelated study of an initiative to boost writing at school. The researchers randomly grabbed 200 of these essays and fed the raw student writing – without the human feedback – to version 3.5 of ChatGPT and asked it to give feedback, too.

At first, the AI feedback was terrible, but as the researchers tinkered with the instructions, or the “prompt,” they typed into ChatGPT, the feedback improved. The researchers eventually settled upon this wording: “Pretend you are a secondary school teacher. Provide 2-3 pieces of specific, actionable feedback on each of the following essays…. Use a friendly and encouraging tone.” The researchers also fed the assignment that the students were given, for example, “Why did the Montgomery Bus Boycott succeed?” along with the reading source material that the students were provided. (More details about how the researchers prompted ChatGPT are explained in Appendix C of the study.)
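
As an illustration of how a teacher might automate that kind of request, here is a rough sketch using the OpenAI Python client and a prompt modeled on the wording quoted above. The assignment text, source material, essay and model name are placeholders I chose; the researchers’ exact setup is described in Appendix C of the study.

```python
# Illustrative sketch only: asking a chat model for essay feedback with a prompt
# modeled on the one quoted above. Assignment, sources and essay are placeholders.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Pretend you are a secondary school teacher. Provide 2-3 pieces of specific, "
    "actionable feedback on each of the following essays. "
    "Use a friendly and encouraging tone."
)

assignment = "Why did the Montgomery Bus Boycott succeed?"
source_material = "Excerpts from the reading sources given to students ..."
student_essay = "The boycott succeeded because ..."

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # the study used a 3.5-era version of ChatGPT
    messages=[
        {"role": "system", "content": PROMPT},
        {
            "role": "user",
            "content": (
                f"Assignment: {assignment}\n\nSources:\n{source_material}"
                f"\n\nEssay:\n{student_essay}"
            ),
        },
    ],
)
print(response.choices[0].message.content)
```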

The humans took about 20 to 25 minutes per essay. ChatGPT’s feedback came back instantly. The humans sometimes marked up sentences by, for example, showing a place where the student could have cited a source to buttress an argument. ChatGPT didn’t write any in-line comments and only wrote a note to the student. 

Researchers then read through both sets of feedback – human and machine – for each essay, comparing and rating them. (It was supposed to be a blind comparison test and the feedback raters were not told who authored each one. However, the language and tone of ChatGPT were distinct giveaways, and the in-line comments were a tell of human feedback.)

Humans appeared to have a clear edge with the very strongest and the very weakest writers, the researchers found. They were better at pushing a strong writer a little bit further, for example, by suggesting that the student consider and address a counterargument. ChatGPT struggled to come up with ideas for a student who was already meeting the objectives of a well-argued essay with evidence from the reading source materials. ChatGPT also struggled with the weakest writers. The researchers had to drop two of the essays from the study because they were so short that ChatGPT didn’t have any feedback for the student. The human rater was able to parse out some meaning from a brief, incomplete sentence and offer a suggestion. 

In one student essay about the Montgomery Bus Boycott, reprinted above, the human feedback seemed too generic to me: “Next time, I would love to see some evidence from the sources to help back up your claim.” ChatGPT, by contrast, specifically suggested that the student could have mentioned how much revenue the bus company lost during the boycott – an idea that the student hadn’t used in the essay. ChatGPT also suggested that the student could have mentioned specific actions that the NAACP and other organizations took. But the student had actually mentioned a few of these specific actions in his essay. That part of ChatGPT’s feedback was plainly inaccurate. 

In another student writing example, also reprinted below, the human straightforwardly pointed out that the student had gotten an historical fact wrong. ChatGPT appeared to affirm that the student’s mistaken version of events was correct.

Another example of feedback from a human and ChatGPT on the same essay

So how did ChatGPT’s review of my first draft stack up against my editor’s? One of the researchers on the study team suggested a prompt that I could paste into ChatGPT. After a few back and forth questions with the chatbot about my grade level and intended audience, it initially spit out some generic advice that had little connection to the ideas and words of my story. It seemed more interested in format and presentation, suggesting a summary at the top and subheads to organize the body. One suggestion would have made my piece too long-winded. Its advice to add examples of how AI feedback might be beneficial was something that I had already done. I then asked for specific things to change in my draft, and ChatGPT came back with some great subhead ideas. I plan to use them in my newsletter, which you can see if you sign up for it here. (And if you want to see my prompt and dialogue with ChatGPT, here is the link.) 

My human editor, Barbara, was the clear winner in this round. She tightened up my writing, fixed style errors and helped me brainstorm this ending. Barbara’s job is safe – for now. 

This story about AI feedback was written by Jill Barshay and produced by The Hechinger Report, a nonprofit, independent news organization focused on inequality and innovation in education. Sign up for Proof Points and other Hechinger newsletters.

PROOF POINTS: AI essay grading is already as ‘good as an overburdened’ teacher, but researchers say it needs more work

Grading papers is hard work. “I hate it,” a teacher friend confessed to me. And that’s a major reason why middle and high school teachers don’t assign more writing to their students. Even an efficient high school English teacher who can read and evaluate an essay in 20 minutes would spend 3,000 minutes, or 50 hours, grading if she’s teaching six classes of 25 students each. There aren’t enough hours in the day. 

Could ChatGPT relieve teachers of some of the burden of grading papers? Early research is finding that the new artificial intelligence of large language models, also known as generative AI, is approaching the accuracy of a human in scoring essays and is likely to become even better soon. But we still don’t know whether offloading essay grading to ChatGPT will ultimately improve or harm student writing.

Tamara Tate, a researcher at the University of California, Irvine, and an associate director of her university’s Digital Learning Lab, is studying how teachers might use ChatGPT to improve writing instruction. Most recently, Tate and her seven-member research team, which includes writing expert Steve Graham at Arizona State University, compared how ChatGPT stacked up against humans in scoring 1,800 history and English essays written by middle and high school students. 

Tate said ChatGPT was “roughly speaking, probably as good as an average busy teacher” and “certainly as good as an overburdened below-average teacher.” But, she said, ChatGPT isn’t yet accurate enough to be used on a high-stakes test or on an essay that would affect a final grade in a class.

Tate presented her study on ChatGPT essay scoring at the 2024 annual meeting of the American Educational Research Association in Philadelphia in April. (The paper is under peer review for publication and is still undergoing revision.) 

Most remarkably, the researchers obtained these fairly decent essay scores from ChatGPT without training it first with sample essays. That means it is possible for any teacher to use it to grade any essay instantly with minimal expense and effort. “Teachers might have more bandwidth to assign more writing,” said Tate. “You have to be careful how you say that because you never want to take teachers out of the loop.” 

Writing instruction could ultimately suffer, Tate warned, if teachers delegate too much grading to ChatGPT. Seeing students’ incremental progress and common mistakes remains important for deciding what to teach next, she said. For example, seeing loads of run-on sentences in your students’ papers might prompt a lesson on how to break them up. But if you don’t see them, you might not think to teach it. 

In the study, Tate and her research team calculated that ChatGPT’s essay scores were in “fair” to “moderate” agreement with those of well-trained human evaluators. In one batch of 943 essays, ChatGPT was within a point of the human grader 89 percent of the time. On a six-point grading scale that researchers used in the study, ChatGPT often gave an essay a 2 when an expert human evaluator thought it was really a 1. But this level of agreement – within one point – dropped to 83 percent of the time in another batch of 344 English papers and slid even farther to 76 percent of the time in a third batch of 493 history essays.  That means there were more instances where ChatGPT gave an essay a 4, for example, when a teacher marked it a 6. And that’s why Tate says these ChatGPT grades should only be used for low-stakes purposes in a classroom, such as a preliminary grade on a first draft.

ChatGPT scored an essay within one point of a human grader 89 percent of the time in one batch of essays

Corpus 3 refers to one batch of 943 essays, which represents more than half of the 1,800 essays that were scored in this study. Numbers highlighted in green show exact score matches between ChatGPT and a human. Yellow highlights scores in which ChatGPT was within one point of the human score. Source: Tamara Tate, University of California, Irvine (2024).

Still, this level of accuracy was impressive because even teachers disagree on how to score an essay and one-point discrepancies are common. Exact agreement, which only happens half the time between human raters, was worse for AI, which matched the human score exactly only about 40 percent of the time. Humans were far more likely to give a top grade of a 6 or a bottom grade of a 1. ChatGPT tended to cluster grades more in the middle, between 2 and 5. 
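
For readers who want to check an AI grader the way these researchers did, the two agreement measures are simple to compute. Here is a short sketch in Python with made-up scores; the numbers are not the study’s data.

```python
# Illustrative only: exact and adjacent ("within one point") agreement between
# human and AI scores on a 1-to-6 scale. These score lists are invented.
human_scores = [4, 3, 5, 2, 6, 3, 4, 1, 5, 2]
ai_scores = [3, 3, 4, 2, 5, 4, 4, 2, 3, 2]

pairs = list(zip(human_scores, ai_scores))
exact = sum(h == a for h, a in pairs) / len(pairs)              # identical scores
adjacent = sum(abs(h - a) <= 1 for h, a in pairs) / len(pairs)  # differ by at most 1

print(f"Exact agreement: {exact:.0%}")
print(f"Agreement within one point: {adjacent:.0%}")
```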

Tate set up ChatGPT for a tough challenge, competing against teachers and experts with PhDs who had received three hours of training in how to properly evaluate essays. “Teachers generally receive very little training in secondary school writing and they’re not going to be this accurate,” said Tate. “This is a gold-standard human evaluator we have here.”

The raters had been paid to score these 1,800 essays as part of three earlier studies on student writing. Researchers fed these same student essays – ungraded – into ChatGPT and asked ChatGPT to score them cold. ChatGPT hadn’t been given any graded examples to calibrate its scores. All the researchers did was copy and paste an excerpt of the same scoring guidelines that the humans used, called a grading rubric, into ChatGPT and tell it to “pretend” it was a teacher and score the essays on a scale of 1 to 6. 

Older robo-graders

Earlier versions of automated essay graders have had higher rates of accuracy. But they were expensive and time-consuming to create because scientists had to train the computer with hundreds of human-graded essays for each essay question. That’s economically feasible only in limited situations, such as for a standardized test, where thousands of students answer the same essay question. 

Earlier robo graders could also be gamed, once a student understood the features that the computer system was grading for. In some cases, nonsense essays received high marks if fancy vocabulary words were sprinkled in them. ChatGPT isn’t grading for particular hallmarks, but is analyzing patterns in massive datasets of language. Tate says she hasn’t yet seen ChatGPT give a high score to a nonsense essay. 

Tate expects ChatGPT’s grading accuracy to improve rapidly as new versions are released. Already, the research team has detected that the newer 4.0 version, which requires a paid subscription, is scoring more accurately than the free 3.5 version. Tate suspects that small tweaks to the grading instructions, or prompts, given to ChatGPT could improve existing versions. She is interested in testing whether ChatGPT’s scoring could become more reliable if a teacher trained it with just a few, perhaps five, sample essays that she has already graded. “Your average teacher might be willing to do that,” said Tate.

Many ed tech startups, and even well-known vendors of educational materials, are now marketing new AI essay robo-graders to schools. Many of them are powered under the hood by ChatGPT or another large language model, and I learned from this study that accuracy rates can be reported in ways that make the new AI graders seem more accurate than they are. Tate’s team calculated that, on a population level, there was no difference between human and AI scores. ChatGPT can already reliably tell you the average essay score in a school or, say, in the state of California. 

Questions for AI vendors

At this point, it is not as accurate in scoring an individual student. And a teacher wants to know exactly how each student is doing. Tate advises teachers and school leaders who are considering using an AI essay grader to ask specific questions about accuracy rates on the student level: What is the rate of exact agreement between the AI grader and a human rater on each essay? How often are they within one point of each other?

The next step in Tate’s research is to study whether student writing improves after having an essay graded by ChatGPT. She’d like teachers to try using ChatGPT to score a first draft and then see if it encourages revisions, which are critical for improving writing. Tate thinks teachers could make it “almost like a game: how do I get my score up?” 

Of course, it’s unclear if grades alone, without concrete feedback or suggestions for improvement, will motivate students to make revisions. Students may be discouraged by a low score from ChatGPT and give up. Many students might ignore a machine grade and only want to deal with a human they know. Still, Tate says some students are too scared to show their writing to a teacher until it’s in decent shape, and seeing their score improve on ChatGPT might be just the kind of positive feedback they need. 

“We know that a lot of students aren’t doing any revision,” said Tate. “If we can get them to look at their paper again, that is already a win.”

That does give me hope, but I’m also worried that kids will just ask ChatGPT to write the whole essay for them in the first place.

This story about AI essay scoring was written by Jill Barshay and produced by The Hechinger Report, a nonprofit, independent news organization focused on inequality and innovation in education. Sign up for Proof Points and other Hechinger newsletters.

PROOF POINTS: It’s easy to fool ChatGPT detectors

A high school English teacher recently explained to me how she’s coping with the latest challenge to education in America: ChatGPT.  She runs every student essay through five different generative AI detectors. She thought the extra effort would catch the cheaters in her classroom. 

A clever series of experiments by computer scientists and engineers at Stanford University indicates that her labors to vet each essay five ways might be in vain. The researchers demonstrated that seven commonly used GPT detectors are so primitive that they are easily fooled by machine-generated essays and improperly flag innocent students. Layering several detectors on top of each other does little to solve the problem of false negatives and positives.

“If AI-generated content can easily evade detection while human text is frequently misclassified, how effective are these detectors truly?” the Stanford scientists wrote in a July 2023 paper, published under the banner, “opinion,” in the peer-reviewed data science journal Patterns. “Claims of GPT detectors’ ‘99% accuracy’ are often taken at face value by a broader audience, which is misleading at best.”

The scientists began by generating 31 counterfeit college admissions essays using ChatGPT 3.5, the free version that any student can use. GPT detectors were pretty good at flagging them. Two of the seven detectors they tested caught all 31 counterfeits. 

But all seven GPT detectors could be easily tricked with a simple tweak. The scientists asked ChatGPT to rewrite the same fake essays with this prompt: “Elevate the provided text by employing literary language.”

Detection rates plummeted to near zero (3 percent, on average). 

I wondered what constitutes literary language in the ChatGPT universe. Instead of college essays, I asked ChatGPT to write a paragraph about the perils of plagiarism. In ChatGPT’s first version, it wrote: “Plagiarism presents a grave threat not only to academic integrity but also to the development of critical thinking and originality among students.” In the second, “elevated” version, plagiarism is “a lurking specter” that “casts a formidable shadow over the realm of academia, threatening not only the sanctity of scholastic honesty but also the very essence of intellectual maturation.”  If I were a teacher, the preposterous magniloquence would have been a red flag. But when I ran both drafts through several AI detectors, the boring first one was flagged by all of them. The flamboyant second draft was flagged by none. Compare the two drafts side by side for yourself. 

Simple prompts bypass ChatGPT detectors. Red bars are AI detection before making the language loftier; gray bars are after. 

For ChatGPT 3.5 generated college admission essays, the performance of seven widely used ChatGPT detectors declines markedly when a second round self-edit prompt (“Elevate the provided text by employing literary language”) is applied. Source: Liang, W., et al. “GPT detectors are biased against non-native English writers” (2023)

Meanwhile, these same GPT detectors incorrectly flagged essays written by real humans as AI generated more than half the time when the students were not native English speakers. The researchers collected a batch of 91 practice English TOEFL essays that Chinese students had voluntarily uploaded to a test-prep forum before ChatGPT was invented. (TOEFL is the acronym for the Test of English as a Foreign Language, which is taken by international students who are applying to U.S. universities.) After running the 91 essays through all seven ChatGPT detectors, 89 essays were identified by one or more detectors as possibly AI-generated. All seven detectors unanimously marked one out of five essays as AI authored. By contrast, the researchers found that GPT detectors accurately categorized a separate batch of 88 eighth grade essays, submitted by real American students.

My former colleague Tara García Mathewson brought this research to my attention in her first story for The Markup, which highlighted how international college students are facing unjust accusations of cheating and need to prove their innocence. The Stanford scientists are warning not only about unfair bias but also about the futility of using the current generation of AI detectors. 

Bias in ChatGPT detectors. Leading detectors incorrectly flag a majority of essays written by international students, but accurately classify writing of American eighth graders. 

More than half of the TOEFL (Test of English as a Foreign Language) essays written by non-native English speakers were  incorrectly classified as “AI-generated,” while detectors exhibit near-perfect accuracy for U.S. eighth graders’ essays. Source: Liang, W., et al. “GPT detectors are biased against non-native English writers” (2023)

The reason that the AI detectors are failing in both cases – with a bot’s fancy language and with foreign students’ real writing – is the same. And it has to do with how the AI detectors work. Detectors are machine learning models that analyze vocabulary choices, syntax and grammar. A widely adopted measure inside numerous GPT detectors is something called “text perplexity,” a calculation of how predictable or banal the writing is. It gauges the degree of “surprise” in how words are strung together in an essay. If the model can predict the next word in a sentence easily, the perplexity is low. If the next word is hard to predict, the perplexity is high.

Low perplexity is a symptom of an AI generated text, while high perplexity is a sign of human writing. My intentional use of the word “banal” above, for example, is a lexical choice that might “surprise” the detector and put this column squarely in the non-AI generated bucket. 

Because text perplexity is a key measure inside the GPT detectors, it becomes easy to game with loftier language. Non-native speakers get flagged because their writing tends to be more predictable, with less linguistic variability and syntactic complexity.
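
To make the idea concrete, here is a small sketch of how perplexity can be computed with an open-source language model from the Hugging Face transformers library. GPT-2 stands in for whatever model a commercial detector actually uses, which is an assumption on my part, and the two sample sentences are adapted from the plagiarism drafts quoted earlier.

```python
# Illustrative sketch: scoring a passage's "text perplexity" with GPT-2.
# A low number means the wording is predictable (a hallmark of AI text);
# a high number means the wording is more surprising (a hallmark of human text).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing the input ids as labels makes the model return the average
        # cross-entropy loss; e raised to that loss is the perplexity.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

plain = "Plagiarism presents a grave threat to academic integrity."
lofty = "A lurking specter casts a formidable shadow over the realm of academia."
print(perplexity(plain), perplexity(lofty))
```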

The seven detectors were created by originality.ai, Quill.org, Sapling, Crossplag, GPTZero, ZeroGPT and OpenAI (the creator of ChatGPT). During the summer of 2023, Quill and OpenAI both decommissioned their free AI checkers because of inaccuracies. OpenAI’s website says it’s planning to launch a new one.

“We have taken down AI Writing Check,” Quill.org wrote on its website, “because the new versions of Generative AI tools are too sophisticated for detection by AI.” 

The site blamed newer generative AI tools that have come out since ChatGPT launched last year.  For example, Undetectable AI promises to turn any AI-generated essay into one that can evade detectors … for a fee. 

Quill recommends a clever workaround: check students’ Google Doc version history, which Google captures and saves every few minutes. A normal document history should show every typo and sentence change as a student is writing. But someone who had an essay written for them – either by a robot or a ghostwriter – will simply copy and paste the entire essay at once into a blank screen. “No human writes that way,” the Quill site says. A more detailed explanation of how to check a document’s version history is here.

Checking revision histories might be more effective, but this level of detective work is ridiculously time consuming for a high school English teacher who is grading dozens of essays. AI was supposed to save us time, but right now, it’s adding to the workload of time-pressed teachers!

This story about ChatGPT detectors was written by Jill Barshay and produced by The Hechinger Report, a nonprofit, independent news organization focused on inequality and innovation in education. Sign up for Proof Points and other Hechinger newsletters. 

PROOF POINTS: A smarter robo-grader

The best kind of expertise might be personal experience.

When the research arm of the U.S. Department of Education wanted to learn more about the latest advances in robo-grading, it decided to hold a competition. In the fall of 2021, 23 teams, many of them Ph.D. computer scientists from universities and corporate research laboratories, competed to see who could build the best automatic scoring model.

One of the six finalists was a team of just two 2021 graduates from the Georgia Institute of Technology. Prathic Sundararajan, 21, and Suraj Rajendran, 22, met during an introductory biomedical engineering class freshman year and had studied artificial intelligence. To ward off boredom and isolation during the pandemic, they entered a half dozen hackathons and competitions, using their knowhow in machine learning to solve problems in prisons, medicine and auto sales. They kept winning. 

“We hadn’t done anything in the space of education,” said Sundararajan, who noticed an education competition on the Challenge.Gov website. “And we’ve all suffered through SATs and those standardized tests. So we were like, Okay, this will be fun. We’ll see what’s under the hood, how do they actually do it on the other side?”

The Institute of Education Sciences gave contestants 20 question items from the 2017 National Assessment of Educational Progress (NAEP), a test administered to fourth and eighth graders to track student achievement across the nation. About half the questions on the reading test were open-response, instead of multiple choice, and humans had scored students’ written answers.

Rajendran, now a Ph.D. student at Weill Cornell Medicine in New York, thought he might be able to re-use a model he had built for medical records that used natural language processing to decipher doctors’ notes and predict patient diagnoses. That model relied on GloVe, a set of word embeddings developed by scientists at Stanford University.

Together the duo built 20 separate models, one for each open-response question. First they trained their models by having them digest the scores that humans had given to thousands of student responses on these exact same questions. One, for example, was:  “Describe two ways that people care for the violin at the museum that show the violin is valuable.” 

When they tested their robo-graders, the accuracy was poor.

“The education context is different,” said Sundararajan. Words can have different meanings in different contexts and the algorithms weren’t picking that up.

Sundararajan and Rajendran went back to the drawing board to look for other language models. They happened upon BERT.

BERT is a natural language processing model developed at Google in 2018 (yes, they found it by Googling).  It’s what Google uses for search queries but the company shares the model as free, open-source code. Sundararajan and Rajendran also found another model called RoBERTa, a modified version of BERT, that they thought might be better. But they ran out of time before submissions were due on Nov. 28, 2021.
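
To give a sense of what building such a grader involves, here is a bare-bones sketch of fine-tuning BERT to predict a score, using the Hugging Face transformers library. This is not the contestants’ actual code, and the model name, example responses and scores are placeholders of my own.

```python
# Hypothetical sketch of fine-tuning BERT as a scorer for one open-response item.
# Real training uses thousands of human-scored responses; these two are made up.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# num_labels=1 treats scoring as regression: the model outputs a single number.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1
)

responses = [
    "They keep the violin in a glass case.",
    "They control the room's humidity and only let experts handle it.",
]
human_scores = torch.tensor([[1.0], [2.0]])  # placeholder human-assigned scores

inputs = tokenizer(responses, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs, labels=human_scores)  # returns a mean-squared-error loss

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs.loss.backward()  # one illustrative gradient step; real training repeats this
optimizer.step()         # over the full set of human-scored responses for many epochs
```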

When the prizes were announced on Jan. 21, it turned out that all the winners had selected BERT, too. The technology is a sea change in natural language processing. Think of it like the mRNA technology that has revolutionized vaccines: much the way Moderna and Pfizer achieved similar efficacy rates with their COVID vaccines, the robo-graders built on BERT all rose to the top with similar accuracy. 

“We got extremely high levels of accuracy,” said John Whitmer, a senior fellow with the Federation of American Scientists serving at the Institute of Education Sciences. “With the top three, we had this very nice problem that they were so close, as one of our analysts said, you couldn’t really fit a piece of paper between them.”

Essays and open responses are notoriously difficult to score because there are infinite ways to write and even great writers might prefer one style over another. Two well-trained humans typically agreed on a student’s writing score on the NAEP test 90.5 percent of the time. The best robo-grader in this competition, produced by a team from Durham, N.C-based testing company, Measurement Inc., agreed with human judgment 88.8 percent of the time, only a 1.7 percentage point greater discrepancy than among humans.  

Sundararajan and Rajendran’s model was in agreement with the humans 86.1 percent of the time, just 4 percentage points shy of replicating human scores. That earned them a runner-up prize of $1,250. The top three winners each received $15,000.

The older generation of robo-grading models tended to focus on specific “features” that we value in writing, such as coherence, vocabulary, punctuation or sentence length. It was easy to game these grading systems by writing gibberish that happened to meet the criteria that the robo-grader was looking for. But a 2014 study found that these “feature” models worked reasonably well. 

BERT is much more accurate. However, its drawback is that it’s like a black box to laypeople. With the feature models, you could see that an essay scored lower because it didn’t have good punctuation, for example. With BERT models, there’s no information on why the essay scored the way it did.  

“If you try to understand how that works, you’ve got to go back and look at the billions of relationships that are made and the billions of inputs in these neural networks,” said Whitmer. 

That makes the model useful for scoring an exam, but not useful for teachers in grading school assignments because it cannot give students any concrete feedback on how to improve their writing.

BERT models also fell short when used to build a robo-grader that could handle more than a single question. As part of the competition, contestants were asked to build a “generic” model that could score open responses to any question. But the best of these generic models were only able to replicate human scoring half the time. It was not a success.

The upside is that humans are not going away. At least 2,000 to 5,000 human scores are needed to train an automated scoring model for each open-response question, according to Pearson, which has been using automated scoring since 1998. In this competition, contestants had 20,000 human scores to train their models. The time and cost savings kick in when test questions are re-used in subsequent years. The Department of Education currently requires humans to score student writing and it held this competition to help decide whether to adopt automated scoring on future administrations of the NAEP test.

Bias remains a concern with all machine learning models. The Institute of Education Sciences confirmed that Black and Hispanic students weren’t faring any worse with the algorithms than they were with human scores in this competition. The goal was to replicate the human scores, which could still be influenced by human biases. Race, ethnicity and gender aren’t known to human scorers on standardized exams, but it’s certainly possible to make assumptions based on word choices and syntax. By training the computerized models on scores from fallible humans, we could be baking biases into the robo-graders.

Sundararajan graduated in December 2021 and is now working on blood pressure waveforms at a medical technology startup in California. After conquering educational assessment, he and Rajendran turned their attention to other timely challenges. This month, they won first place in a competition run by the Centers for Disease Control. They analyzed millions of tweets to see who had suffered trauma in their past and if their Twitter community is now serving as a helpful support group or a destructive influence.

This story about robo-grading was written by Jill Barshay and produced by The Hechinger Report, a nonprofit, independent news organization focused on inequality and innovation in education. Sign up for the Hechinger newsletter.

PROOF POINTS: College Board’s own research at odds with its decision to axe the essay portion of the SAT

The College Board’s decision on Jan. 19, 2021 to eliminate the essay portion of the SAT may have delighted millions of current and future high school students but its discontinuation could also be a loss for students of color and those whose primary language isn’t English.

“The essay may actually have been particularly helpful for predicting the college success of disadvantaged students,” said Jack Buckley, a former head of research at the College Board, via email. “Ironic that, at a time when standardized testing is under immense pressure not only due to the pandemic but also from the anti-racist movement, CB [College Board] would discontinue a feature of their flagship college entrance examination that their own research argues helped level the playing field.” 

Buckley’s opinion carries special weight because he helped lead the 2016 redesign of the SAT. He was also the commissioner of the National Center for Education Statistics in the Department of Education for three years during the Obama Administration. He is currently the head of assessment and learning sciences at Roblox, an online gaming company.

I was surprised that the essay, an optional part of the SAT test, can be an asset on the college application for many disadvantaged students because these students often struggle with writing. However, a 2019 study by the College Board, the organization that owns and earns revenue from the SAT, detailed how the essay portion increased the ability to predict how well an applicant would do in college English and writing classes. For some groups of students  — those whose best language was not English alone and those who identified as Black, Latino, Asian or multiracial — the essay score improved the ability to predict how well the applicant would do in college English and writing classes by more than 30 percent above knowing only the student’s high school grades and verbal SAT score (currently called Evidence-based Reading and Writing or ERW). 

It’s a long-running debate whether standardized exams, such as the SAT, provide useful information about a college applicant. A consistent body of research has found that high school grades are more predictive of how a student will do in college. This research has found that students with high grades and low test scores tend to succeed in college more than students with low grades and high test scores. More disadvantaged students might be admitted to selective colleges and succeed there if test scores were ignored or minimized. 

In response, testing companies have conducted research to prove that the combination of grades and test scores is important by calculating that the two together are best at predicting first year college performance. This 2019 College Board technical report, “Validity of SAT® Essay Scores for Predicting First-Year Grades,” went one step further, adding the essay to the mix of high school grades and verbal SAT test scores and comparing the three with students’ subsequent college grades. 

The College Board researchers tracked more than 180,000 students who graduated high school in 2017 and entered a four-year college that fall. They found that the essay scores and verbal test scores can diverge. Students with verbal test scores of 400 or lower who had above average essay scores could still have a greater than 80 percent chance of passing a college English or writing class. An admissions officer might reject that student based on test scores alone. However, the College Board report emphasized the opposite scenario, finding that, for example, a student with an average 500 verbal score but a very low essay score would only have a 68 percent probability of passing college English. 

“As the nation’s student population becomes more diverse,” the College Board researchers wrote, “institutions may find that the SAT Essay scores add even greater value over time, particularly for identifying students who may struggle with the writing skills needed to be successful in college.”

Despite the testing industry’s efforts to prove the usefulness of their tests, colleges have moved away from requiring them. Beginning with the University of Chicago in 2018, the test optional movement has accelerated during the pandemic and has now ballooned to more than 1,000 colleges and universities. 

The College Board may be nervous about the future of its flagship product. In a statement, the organization explained that it was eliminating the essay along with 20 separate subject tests to “reduce and simplify” demands on students. It added that there are other ways for students to demonstrate their writing ability and that the reading and writing test that lives on is still among the “most predictive” parts of the SAT.

The SAT first added an essay in 2005 because college admissions officers wanted to see authentic writing samples from students. In 2016, the free-response essay was revamped into a written analysis of a text, akin to a college assignment, and made optional.

Meanwhile, the College Board’s business model was changing as it won lucrative statewide contracts to test all students at school. The optional essay portion was generally not included in these free in-school SAT tests. Students who wanted to apply to a college that required or recommended the essay still had to register and pay for a weekend test. 

Many argued that the hassle of getting to a testing center and applying for a fee waiver was burdensome for low-income students — the opposite of expanding college access. Colleges started dropping the essay requirement even before dropping the SAT entirely. Still, more than half, or 57 percent, of the 2.2 million students who took the SAT last year opted to complete the essay portion.

The essay is an expensive part of the exam to administer, far more laborious than scanning a bubble sheet from a multiple-choice test. Each handwritten essay is read by two human graders and given a numerical mark in each of three categories, for a total of six marks per essay. Yet the College Board charges only $16 more for the optional essay portion. When I questioned whether the essay portion was a money loser for the College Board, spokesman Jerome White replied by email that the essay has a "positive economic impact" and that the decision to discontinue it is a "mission-based decision."

It's not often that I feel sympathy for the College Board. But in this case, the organization created the essay in response to the demands of college admissions officers, tried to make it more like a college writing assignment, and then found evidence that it was actually most useful for assessing disadvantaged students. Apparently, business realities and anti-testing sentiment got in the way.

This story about the SAT essay was written by Jill Barshay and produced by The Hechinger Report, a nonprofit, independent news organization focused on inequality and innovation in education. Sign up for the Hechinger newsletter.

The post PROOF POINTS: College Board’s own research at odds with its decision to axe the essay portion of the SAT appeared first on The Hechinger Report.

PROOF POINTS: Evidence increases for writing during math class https://hechingerreport.org/proof-points-evidence-increases-for-writing-during-math-class/ Mon, 16 Nov 2020 11:00:00 +0000

Essay writing and math class might seem like oil and water, two things that don’t mix easily. But there’s increasing evidence that students who are asked to write about what they are learning master the material better —  even in number-filled subjects like math and science.

Education experts call it “writing to learn,” in contrast to “learning to write,” which is usually taught in an English class. The theory is decades old. The act of writing clarifies thoughts and improves understanding, similar to talking over an idea with a friend. Even the inability to write a sentence can be a sign of confusion, often prompting a student to dig deeper. Putting pencil to paper also creates and reinforces memory, helping a student to recall information later during a test.  

Many experiments have documented the power of writing outside of English classes, but others haven't found it to be so beneficial. Some skeptics wonder whether minutes spent writing take precious class time away from learning topics the traditional way by, for instance, reading, listening, taking notes, and completing worksheets and projects. Steve Graham, a national expert in writing research at Arizona State University, along with two researchers at the University of Utah, decided to review all the studies and found that the writing-to-learn theory is solid.

“As predicted, writing about content reliably enhanced learning,” the authors wrote. 

Learning gains from writing were nearly identical in math, science and social studies. (For each subject, students gained about 0.3 of a standard deviation, on average, a statistical unit that's hard to translate but generally considered a medium boost.) In other words, writing tends to be moderately helpful regardless of what subject you're learning.
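
(As a rough, hedged translation of that statistical unit: if test scores are approximately normally distributed, a 0.3 standard deviation boost moves an average student from the middle of the pack to roughly the 62nd percentile. The snippet below only illustrates that arithmetic; it is not a calculation from the study's data.)

```python
# Rough interpretation of a 0.3 standard-deviation effect size, assuming
# normally distributed scores. Not taken from the meta-analysis itself.
from statistics import NormalDist

effect_size = 0.3
percentile_after = NormalDist().cdf(effect_size)   # P(Z < 0.3) under a standard normal
print(f"An average student would move to roughly the {percentile_after * 100:.0f}th percentile")
```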

The study, “The Effects of Writing on Learning in Science, Social Studies, and Mathematics: A Meta-Analysis,” was published in the April 2020 issue of the peer-reviewed journal, Review of Educational Research.

The authors combed the research literature for all the studies they could find on writing outside of English class and found 56 high-quality experiments involving more than 6,000 students from elementary to high school. Some of the underlying studies were small, involving as few as 20 students, but others were larger, involving hundreds. 

In each case, students were assigned writing work during their math, science or social studies classes and compared to students who weren’t given writing assignments. Gains were measured by giving students math, science and social studies tests before and after the writing intervention. Both the writers and the non-writers were given the same amount of instructional time and exposure to class content. 
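
(The meta-analysis's exact weighting procedure isn't detailed here, but a minimal sketch of how a gain like "0.3 of a standard deviation" is typically computed from such a design: take the difference between the writing group's and the control group's average test-score gains and divide by the pooled standard deviation. The numbers below are made up for illustration.)

```python
# Hypothetical sketch of a standardized effect size (Cohen's d on gain scores),
# one standard way to express a result in "standard deviation" units.
import numpy as np

rng = np.random.default_rng(1)
writers_gain = rng.normal(6.0, 10.0, 120)   # post-test minus pre-test, writing group (synthetic)
control_gain = rng.normal(3.0, 10.0, 120)   # post-test minus pre-test, control group (synthetic)

pooled_sd = np.sqrt((writers_gain.var(ddof=1) + control_gain.var(ddof=1)) / 2)
effect_size = (writers_gain.mean() - control_gain.mean()) / pooled_sd
print(f"Effect size: {effect_size:.2f} standard deviations")
```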

Despite the average benefit found across the 56 studies, individual studies came to wildly different conclusions. At the high end, one study found that writing was extremely beneficial, more than double the average learning improvement. Yet nearly one in five of the studies found that writing was harmful; students who were given writing assignments learned less than students who were taught traditionally.

The negative finding for 18 percent of the studies troubled the researchers, but they couldn't find an easy explanation for why writing works so well sometimes and so poorly at other times. They checked whether certain categories of writing assignments backfired but didn't find that to be the case. Writing summaries, research reports, arguments, self-reflective journal entries and even narrative stories all seemed to work most of the time. The researchers wondered whether some teachers simply aren't designing good writing activities for their students, and they advised schools to monitor student progress to avoid this known but puzzling pitfall.

There were some clues that argumentative or persuasive writing might be the most powerful in boosting knowledge but only seven studies in the meta-analysis used this type of assignment. Journal entries were the most popular type of writing assignment, especially in math classes.

The intent here isn't to boost writing skills but rather math, science and social studies content knowledge. Indeed, teachers often don't give any writing feedback on these assignments and react only to the ideas the student is presenting. Some writing-to-learn advocates actively discourage any comment on punctuation, spelling or grammar so as not to demoralize kids who don't enjoy writing.

At the same time, more writing assignments would also be welcome for writing’s sake. “This is one to increase how much writing students engage in for real purposes at school,” Graham explained by email, rather than a writing assignment that feels pointless. 

Even in English classes, U.S. schools have long emphasized reading at the expense of writing, Graham has documented. Maybe we need a math argument to bring writing back.

This story about writing to learn was written by Jill Barshay and produced by The Hechinger Report, a nonprofit, independent news organization focused on inequality and innovation in education. Sign up for the Hechinger newsletter.

The post PROOF POINTS: Evidence increases for writing during math class appeared first on The Hechinger Report.

Scientific evidence on how to teach writing is slim https://hechingerreport.org/scientific-evidence-on-how-to-teach-writing-is-slim/ Mon, 04 Nov 2019 11:00:07 +0000

A writing assignment from a middle school classroom. Credit: Jackie Mader/The Hechinger Report

The poor quality of student writing is a common lament among college professors. But how are elementary, middle and high school teachers supposed to teach it better? Unfortunately, this is an area where education research doesn’t offer educators clear advice.

“What’s very odd about writing is how small the research base is,” said Robert Slavin, director of the Center for Research and Reform in Education at the Johns Hopkins School of Education. “There’s remarkably very little high-quality evidence of what works in writing.”

Compared to subjects such as math and reading, the amount of research on how to teach writing is tiny. Earlier in 2019, Slavin searched for rigorous research on teaching writing from second grade to high school. He and his Johns Hopkins colleagues, along with researchers in Belgium and the United Kingdom, found only 14 studies that met their standards. By contrast, he quickly found 69 studies just on teaching reading to high school students.

To meet Slavin's standards, a study had to compare the writing gains of students taught with a particular method against those of a control group taught to write as usual. Most studies on writing instruction didn't have control groups like this; without them, we can't know whether a new writing approach is better than an old one. Slavin also required that student writing be evaluated objectively. Writing tests could not be made up by the creators or marketers of the curriculum, and they could not be scored by the teachers who taught the classes.

Many popular writing programs used in schools around the country, such as Writer’s Workshop or the Hochman Method, haven’t been put to the test like this. These might both be excellent teaching methods but there are no controlled studies of their effectiveness. However, a large scientific study of Writer’s Workshop is underway and results are expected in 2021. (A preliminary report came out in October 2019.) The organization that markets the Hochman Method, a much newer curriculum, told me it hopes to conduct rigorous research in the future.

The 14 studies looking at 12 different writing programs were described in Slavin’s 2019 review, “A Quantitative Synthesis of Research on Writing Approaches in Grades 2 to 12.” Some of the approaches focused on explicitly teaching the writing process from planning to drafting to revising. Others emphasized working with classmates and making writing a communal activity. Another approach was to integrate reading with the writing.

It turns out all three approaches worked some of the time, but none clearly outshone the others. There was also a lot of overlap among them. For example, one writing program both explicitly taught the writing process and had students edit each other's work communally.

The husband-and-wife team of Steve Graham and Karen Harris, both professors at Arizona State University, dominates the research literature on writing instruction. Many of their theories were developed with students with disabilities and were rigorously tested on a broader student population, not here in the United States but in England. In one study, their methods of explicit writing instruction worked spectacularly. In a bigger second study, their method didn't produce writing improvements for students. That's the way education is: not every idea works every time.

Related: Three lessons from the science of how to teach writing

One broad lesson that emerged from the 12 tested programs is that students benefit from step-by-step guides to writing in various genres. Argumentative writing, for example, is very different from fiction writing. The What Works Clearinghouse, a federal government website of scientifically proven ideas in education, also highlights the importance of explicit writing instruction that varies by genre for both elementary and high school students.

Another lesson is that students need explicit grammar and punctuation instruction, but it should be taught in the context of their own writing, not as a separate stand-alone lesson.

Beyond a well-structured writing course, Slavin and his colleagues noted that in these studies of writing, the classes were “exciting, social and noisy.”

“Motivation seems to be the key,” Slavin and his colleagues wrote. “If students love to write, because their peers as well as their teachers are eager to see what they have to say, then they will write with energy and pleasure. Perhaps more than any other subject, writing demands a supportive environment, in which students want to become better writers because they love the opportunity to express themselves, and to interact in writing with valued peers and teachers.”

As I read this review, I wondered whether nearly every thoughtful writing curriculum is likely to produce results simply because it makes kids write more than they currently do. In this country, pressure to score well on reading and math tests has pushed writing down the priority list, so little class time is devoted to writing instruction.

I’d like to see some good studies on dosage. How much should kids be writing every day or every week to become respectable writers when they enter the college gates?

This story about writing instruction was written by Jill Barshay and produced by The Hechinger Report, a nonprofit, independent news organization focused on inequality and innovation in education. Sign up for the Hechinger newsletter.

The post Scientific evidence on how to teach writing is slim appeared first on The Hechinger Report.
