PROOF POINTS: Asian American students lose more points in an AI essay grading study — but researchers don’t know why
By Jill Barshay | The Hechinger Report | July 8, 2024

When ChatGPT was released to the public in November 2022, advocates and watchdogs warned about the potential for racial bias. The new large language model was created by harvesting 300 billion words from books, articles and online writing, which include racist falsehoods and reflect writers’ implicit biases. Biased training data is likely to generate biased advice, answers and essays. Garbage in, garbage out. 

Researchers are starting to document how AI bias manifests in unexpected ways. Inside the research and development arm of the giant testing organization ETS, which administers the SAT, a pair of investigators pitted man against machine in evaluating more than 13,000 essays written by students in grades 8 to 12. They discovered that the AI model that powers ChatGPT penalized Asian American students more than other races and ethnicities in grading the essays. This was purely a research exercise and these essays and machine scores weren’t used in any of ETS’s assessments. But the organization shared its analysis with me to warn schools and teachers about the potential for racial bias when using ChatGPT or other AI apps in the classroom.

AI and humans scored essays differently by race and ethnicity

“Diff” is the difference between the average score given by humans and GPT-4o in this experiment. “Adj. Diff” adjusts this raw number for the randomness of human ratings. Source: Table from Matt Johnson & Mo Zhang “Using GPT-4o to Score Persuade 2.0 Independent Items” ETS (June 2024 draft)

“Take a little bit of caution and do some evaluation of the scores before presenting them to students,” said Mo Zhang, one of the ETS researchers who conducted the analysis. “There are methods for doing this and you don’t want to take people who specialize in educational measurement out of the equation.”

That might sound self-serving for an employee of a company that specializes in educational measurement. But Zhang’s advice is worth heeding in the excitement to try new AI technology. There are potential dangers as teachers save time by offloading grading work to a robot.

In ETS’s analysis, Zhang and her colleague Matt Johnson fed 13,121 essays into one of the latest versions of the AI model that powers ChatGPT, called GPT-4 Omni, or simply GPT-4o. (This version was added to ChatGPT in May 2024, but when the researchers conducted this experiment they used the latest AI model through a different portal.)

A little background about this large bundle of essays: students across the nation had originally written these essays between 2015 and 2019 as part of state standardized exams or classroom assessments. Their assignment had been to write an argumentative essay, such as “Should students be allowed to use cell phones in school?” The essays were collected to help scientists develop and test automated writing evaluation.

Each of the essays had been graded by expert writing raters on a 1-to-6 point scale, with 6 being the highest score. ETS asked GPT-4o to score them on the same six-point scale using the same scoring guide that the humans used. Neither man nor machine was told the race or ethnicity of the student, but researchers could see students’ demographic information in the datasets that accompany these essays.

GPT-4o marked the essays almost a point lower than the humans did. The average score across the 13,121 essays was 2.8 for GPT-4o and 3.7 for the humans. But Asian Americans were docked by an additional quarter point. Human evaluators gave Asian Americans a 4.3, on average, while GPT-4o gave them only a 3.2 – roughly a 1.1 point deduction. By contrast, the score difference between humans and GPT-4o was only about 0.9 points for white, Black and Hispanic students. Imagine an ice cream truck that kept shaving off an extra quarter scoop only from the cones of Asian American kids. 
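For readers who want to picture how such a comparison works, here is a minimal sketch of the kind of group-by-group score audit the researchers describe (not ETS’s actual code; the file and column names are assumptions):

```python
# Hypothetical sketch: compare average human scores with average GPT-4o scores by group.
# "essay_scores.csv", "human_score", "gpt4o_score" and "race_ethnicity" are assumed names,
# not the researchers' real dataset schema.
import pandas as pd

essays = pd.read_csv("essay_scores.csv")
essays["gap"] = essays["human_score"] - essays["gpt4o_score"]

# A noticeably larger mean gap for one group than for the others is the warning sign.
print(essays.groupby("race_ethnicity")["gap"].mean().sort_values(ascending=False))
```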

“Clearly, this doesn’t seem fair,” wrote Johnson and Zhang in an unpublished report they shared with me. Though the extra penalty for Asian Americans wasn’t terribly large, they said, it’s substantial enough that it shouldn’t be ignored. 

The researchers don’t know why GPT-4o issued lower grades than humans, and why it gave an extra penalty to Asian Americans. Zhang and Johnson described the AI system as a “huge black box” of algorithms that operate in ways “not fully understood by their own developers.” That inability to explain a student’s grade on a writing assignment makes the systems especially frustrating to use in schools.

This table compares GPT-4o scores with human scores on the same batch of 13,121 student essays, which were scored on a 1-to-6 scale. Numbers highlighted in green show exact score matches between GPT-4o and humans. Unhighlighted numbers show discrepancies. For example, there were 1,221 essays where humans awarded a 5 and GPT-4o awarded a 3. Data source: Matt Johnson & Mo Zhang “Using GPT-4o to Score Persuade 2.0 Independent Items” ETS (June 2024 draft)

This one study isn’t proof that AI is consistently underrating essays or biased against Asian Americans. Other versions of AI sometimes produce different results. A separate analysis of essay scoring by researchers from the University of California, Irvine, and Arizona State University found that AI essay grades were just as frequently too high as they were too low. That study, which used the 3.5 version of ChatGPT, did not scrutinize results by race and ethnicity.

I wondered if AI bias against Asian Americans was somehow connected to high achievement. Just as Asian Americans tend to score high on math and reading tests, Asian Americans, on average, were the strongest writers in this bundle of 13,000 essays. Even with the penalty, Asian Americans still had the highest essay scores, well above those of white, Black, Hispanic, Native American or multi-racial students. 

In both the ETS and UC-ASU essay studies, AI awarded far fewer perfect scores than humans did. For example, in this ETS study, humans awarded 732 perfect 6s, while GPT-4o gave out a grand total of only three. GPT’s stinginess with perfect scores might have affected a lot of Asian Americans who had received 6s from human raters.

ETS’s researchers had asked GPT-4o to score the essays cold, without showing the chatbot any graded examples to calibrate its scores. It’s possible that a few sample essays or small tweaks to the grading instructions, or prompts, given to ChatGPT could reduce or eliminate the bias against Asian Americans. Perhaps the robot would be fairer to Asian Americans if it were explicitly prompted to “give out more perfect 6s.” 
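As an illustration of what that calibration might look like, here is a hypothetical sketch of a few-shot scoring prompt that includes human-graded examples; it is not the prompt ETS used, and the wording and structure are assumptions:

```python
# Hypothetical few-shot calibration: show the model a few essays that humans have
# already scored before asking it to score a new one. Not ETS's prompt; the wording
# and structure are illustrative assumptions.
def build_scoring_prompt(scoring_guide, graded_examples, new_essay):
    parts = [
        "You are an expert rater of student essays. Score the essay from 1 to 6 using this guide:",
        scoring_guide,
        "Here are sample essays with the scores that expert human raters gave them:",
    ]
    for essay_text, human_score in graded_examples:
        parts.append(f"Essay:\n{essay_text}\nHuman score: {human_score}")
    parts.append(f"Now score the following essay. Reply with a single number from 1 to 6.\nEssay:\n{new_essay}")
    return "\n\n".join(parts)
```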

The ETS researchers told me this wasn’t the first time that they’ve noticed Asian students treated differently by a robo-grader. Older automated essay graders, which used different algorithms, have sometimes done the opposite, giving Asians higher marks than human raters did. For example, an ETS automated scoring system developed more than a decade ago, called e-rater, tended to inflate scores for students from Korea, China, Taiwan and Hong Kong on their essays for the Test of English as a Foreign Language (TOEFL), according to a study published in 2012. That may have been because some Asian students had memorized well-structured paragraphs, while humans easily noticed that the essays were off-topic. (The ETS website says it relies on the e-rater score alone only for practice tests, and uses it in conjunction with human scores for actual exams.)

Asian Americans also garnered higher marks from an automated scoring system created during a coding competition in 2021 and powered by BERT, which had been the most advanced algorithm before the current generation of large language models, such as GPT. Computer scientists put their experimental robo-grader through a series of tests and discovered that it gave higher scores than humans did to Asian Americans’ open-response answers on a reading comprehension test. 

It was also unclear why BERT sometimes treated Asian Americans differently. But it illustrates how important it is to test these systems before we unleash them in schools. Based on educator enthusiasm, however, I fear this train has already left the station. In recent webinars, I’ve seen many teachers post in the chat window that they’re already using ChatGPT, Claude and other AI-powered apps to grade writing. That might be a time saver for teachers, but it could also be harming students. 

This story about AI bias was written by Jill Barshay and produced by The Hechinger Report, a nonprofit, independent news organization focused on inequality and innovation in education. Sign up for Proof Points and other Hechinger newsletters.

PROOF POINTS: Some of the $190 billion in pandemic money for schools actually paid off
By Jill Barshay | The Hechinger Report | July 1, 2024

Reports about schools squandering their $190 billion in federal pandemic recovery money have been troubling. Many districts spent that money on things that had nothing to do with academics, particularly building renovations. Less common, but more eye-popping, were stories about new football fields, swimming pool passes, hotel rooms at Caesar’s Palace in Las Vegas and even the purchase of an ice cream truck.

So I was surprised that two independent academic analyses released in June 2024 found that some of the money actually trickled down to students and helped them catch up academically.  Though the two studies used different methods, they arrived at strikingly similar numbers for the average growth in math and reading scores during the 2022-23 school year that could be attributed to each dollar of federal aid. 

One of the research teams, which includes Harvard University economist Tom Kane and Stanford University sociologist Sean Reardon, likened the gains to six days of learning in math and three days of learning in reading for every $1,000 in federal pandemic aid per student. Though that gain might seem small, high-poverty districts received an average of $7,700 per student, and those extra “days” of learning for low-income students added up. Still, these neediest children were projected to remain one third of a grade level behind where low-income students stood in 2019, before the pandemic disrupted education.
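To see why those small per-$1,000 gains “added up,” here is the back-of-the-envelope arithmetic implied by the figures above (my calculation, not the researchers’):

```python
# Back-of-the-envelope arithmetic from the figures cited above: roughly 6 days of math
# learning and 3 days of reading learning per $1,000 in aid, and an average of $7,700
# in aid per student in high-poverty districts.
aid_per_student = 7_700            # dollars
math_days_per_1000 = 6
reading_days_per_1000 = 3

math_days = aid_per_student / 1_000 * math_days_per_1000        # about 46 days
reading_days = aid_per_student / 1_000 * reading_days_per_1000  # about 23 days
print(round(math_days), round(reading_days))                    # 46 23
```

That works out to roughly 46 days of math learning in the average high-poverty district, or about a quarter of a typical 180-day school year.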

“Federal funding helped and it helped kids most in need,” wrote Robin Lake, director of the Center on Reinventing Public Education, on X in response to the two studies. Lake was not involved in either report, but has been closely tracking pandemic recovery. “And the spending was worth the gains,” Lake added. “But it will not be enough to do all that is needed.” 

The academic gains per aid dollar were close to what previous researchers had found for increases in school spending. In other words, federal pandemic aid for schools has been just as effective (or ineffective) as other infusions of money for schools. The Harvard-Stanford analysis calculated that the seemingly small academic gains per $1,000 could boost a student’s lifetime earnings by $1,238 – not a dramatic payoff, but not a public policy bust either. And that payoff doesn’t include other societal benefits from higher academic achievement, such as lower rates of arrests and teen motherhood. 

The most interesting nuggets from the two reports, however, were how the academic gains varied wildly across the nation. That’s not only because some schools used the money more effectively than others but also because some schools got much more aid per student.

The poorest districts in the nation, where 80 percent or more of the students live in families whose income is low enough to qualify for the federally funded school lunch program, demonstrated meaningful recovery because they received the most aid. About 6 percent of the 26 million public schoolchildren that the researchers studied are educated in districts this poor. These children had recovered almost half of their pandemic learning losses by the spring of 2023. The very poorest districts, representing 1 percent of the children, were potentially on track for an almost complete recovery in 2024 because they tended to receive the most aid per student. However, these students were far below grade level before the pandemic, so their recovery brings them back to a very low rung.

Some high-poverty school districts received much more aid per student than others. At the top end of the range, students in Detroit received about $26,000 each – $1.3 billion spread among fewer than 49,000 students. One in 10 high-poverty districts received more than $10,700 for each student. An equal number of high-poverty districts received less than $3,700 per student. These surprising differences for places with similar poverty levels occurred because pandemic aid was allocated according to the same byzantine rules that govern federal Title I funding to low-income schools. Those formulas give large minimum grants to small states, and more money to states that spend more per student. 

On the other end of the income spectrum are wealthier districts, where 30 percent or fewer students qualify for the lunch program, representing about a quarter of U.S. children. The Harvard-Stanford researchers expect these students to make an almost complete recovery. That’s not because of federal recovery funds; these districts received less than $1,000 per student, on average. Researchers explained that these students are on track to approach 2019 achievement levels because they didn’t suffer as much learning loss.  Wealthier families also had the means to hire tutors or time to help their children at home.

Middle-income districts, where between 30 percent and 80 percent of students are eligible for the lunch program, were caught in between. Roughly seven out of 10 children in this study fall into this category. Their learning losses were sometimes large, but their pandemic aid wasn’t. They tended to receive between $1,000 and $5,000 per student. Many of these students are still struggling to catch up.

In the second study, researchers Dan Goldhaber of the American Institutes for Research and Grace Falken of the University of Washington estimated that schools around the country, on average, would need an additional $13,000 per student for full recovery in reading and math.  That’s more than Congress appropriated.

There were signs that schools targeted interventions to their neediest students. In school districts that separately reported performance for low-income students, these students tended to post greater recovery per dollar of aid than wealthier students, the Goldhaber-Falken analysis shows.

Impact differed more by race, location and school spending. Districts with larger shares of white students tended to make greater achievement gains per dollar of federal aid than districts with larger shares of Black or Hispanic students. Small towns tended to produce more academic gains per dollar of aid than large cities. And school districts that spend less on education per pupil tended to see more academic gains per dollar of aid than high spenders. The latter makes sense: an extra dollar to a small budget makes a bigger difference than an extra dollar to a large budget. 

The most frustrating part of both reports is that we have no idea what schools did to help students catch up. Researchers weren’t able to connect the academic gains to tutoring, summer school or any of the other interventions that schools have been trying. Schools still have until September to decide how to spend their remaining pandemic recovery funds, and, unfortunately, these analyses provide zero guidance.

And maybe some of the non-academic things that schools spent money on weren’t so frivolous after all. A draft paper circulated by the National Bureau of Economic Research in January 2024 calculated that school spending on basic infrastructure, such as air conditioning and heating systems, raised test scores. Spending on athletic facilities did not. 

Meanwhile, the final score on pandemic recovery for students is still to come. I’ll be looking out for it.

This story about federal funding for education was written by Jill Barshay and produced by The Hechinger Report, a nonprofit, independent news organization focused on inequality and innovation in education. Sign up for Proof Points and other Hechinger newsletters.

PROOF POINTS: This is your brain. This is your brain on screens
By Jill Barshay | The Hechinger Report | June 24, 2024

One brain study, published in May 2024, detected different electrical activity in the brain after students had read a passage on paper, compared with screens. Credit: Getty Images

Studies show that students of all ages, from elementary school to college, tend to absorb more when they’re reading on paper rather than screens. The advantage for paper is a small one, but it’s been replicated in dozens of laboratory experiments, particularly when students are reading about science or other nonfiction texts.

Experts debate why comprehension is worse on screens. Some think the glare and flicker of screens tax the brain more than ink on paper. Others conjecture that students have a tendency to skim online but read with more attention and effort on paper. Digital distraction is an obvious downside to screens. But internet browsing, texting or TikTok breaks aren’t allowed in the controlled conditions of these laboratory studies.

Neuroscientists around the world are trying to peer inside the brain to solve the mystery. Recent studies have begun to document salient differences in brain activity when reading on paper versus screens. None of the studies I discuss below is definitive or perfect, but together they raise interesting questions for future researchers to explore. 

One Korean research team documented that young adults had lower concentrations of oxygenated hemoglobin in a section of the brain called the prefrontal cortex when reading on paper compared with screens. The prefrontal cortex is associated with working memory and that could mean the brain is more efficient in absorbing and memorizing new information on paper, according to a study published in January 2024 in the journal Brain Sciences. An experiment in Japan, published in 2020, also noticed less blood flow in the prefrontal cortex when readers were recalling words in a passage that they had read on paper, and more blood flow with screens.

But it’s not clear what that increased blood flow means. The brain needs to be activated in order to learn and one could also argue that the extra brain activation during screen reading could be good for learning. 

Instead of looking at blood flow, a team of Israeli scientists analyzed electrical activity in the brains of 6- to 8-year-olds. When the children read on paper, there was more power in high-frequency brainwaves. When the children read from screens, there was more energy in low-frequency bands. 

The Israeli scientists interpreted these frequency differences as a sign of better concentration and attention when reading on paper. In their 2023 paper, they noted that attention difficulties and mind wandering have been associated with lower frequency bands – exactly the bands that were elevated during screen reading. However, it was a tiny study of 15 children and the researchers could not confirm whether the children’s minds were actually wandering when they were reading on screens. 

Another group of neuroscientists in New York City has also been looking at electrical activity in the brain. But instead of documenting what happens inside the brain while reading, they looked at what happens in the brain just after reading, when students are responding to questions about a text. 

The study, published in the peer-reviewed journal PLOS ONE in May 2024, was conducted by neuroscientists at Teachers College, Columbia University, where The Hechinger Report is also based. My news organization is an independent unit of the college, but I am covering this study just like I cover other educational research. 

In the study, 59 children, aged 10 to 12, read short passages, half on screens and half on paper. After reading the passage, the children were shown new words, one at a time, and asked whether they were related to the passage they had just read. The children wore stretchy hair nets embedded with electrodes. More than a hundred sensors measured electrical currents inside their brains a split second after each new word was revealed.

For most words, there was no difference in brain activity between screens and paper. There was more positive voltage when the word was obviously related to the text, such as the word “flow” after reading a passage about volcanoes. There was more negative voltage with an unrelated word like “bucket,” which the researchers said was an indication of surprise and additional brain processing. These brainwaves were similar regardless of whether the child had read the passage on paper or on screens. 

However, there were stark differences between paper and screens when it came to ambiguous words, ones where you could make a creative argument that the word was tangentially related to the reading passage or just as easily explain why it was unrelated. Take, for example, the word “roar” after reading about volcanoes. Children who had read the passage on paper showed more positive voltage, just as they had for clearly related words like “flow.” Yet, those who had read the passage on screens showed more negative activity, just as they had for unrelated words like “bucket.”

For the researchers, the brainwave difference for ambiguous words was a sign that students were engaging in “deeper” reading on paper. According to this theory, the more deeply information is processed, the more associations the brain makes. The electrical activity the neuroscientists detected reveals the traces of these associations and connections. 

Despite this indication of deeper reading, the researchers didn’t detect any differences in basic comprehension. The children in this experiment did just as well on a simple comprehension test after reading a passage on paper as they did on screens. The neuroscientists told me that the comprehension test they administered was only to verify that the children had actually read the passage and wasn’t designed to detect deeper reading. I wish, however, the children had been asked to do something involving more analysis to buttress their argument that students had engaged in deeper reading on paper.

Virginia Clinton-Lisell, a reading researcher at the University of North Dakota who was not involved in this study, said she was “skeptical” of its conclusions, in part because the word-association exercise the neuroscientists created hasn’t been validated by outside researchers. Brain activation during a word association exercise may not be proof that we process language more thoroughly or deeply on paper.

One noteworthy result from this experiment is speed. Many reading experts have believed that comprehension is often worse on screens because students are skimming rather than reading. But in the controlled conditions of this laboratory experiment, there were no differences in reading speed: 57 seconds on the laptop compared to 58 seconds on paper –  statistically equivalent in a small experiment like this. And so that raises more questions about why the brain is acting differently between the two media. 

“I’m not sure why one would process some visual images more deeply than others if the subjects spent similar amounts of time looking at them,” said Timothy Shanahan, a reading research expert and a professor emeritus at the University of Illinois at Chicago. 

None of this work settles the debate over reading on screens versus paper. All of these studies ignore the promise of interactive features, such as glossaries and games, which can swing the advantage to electronic texts. Early research can be messy, and that’s a normal part of the scientific process. But so far, the evidence seems to corroborate conventional reading research: something different is going on when kids log in rather than turn a page.

This story about reading on screens vs. paper was written by Jill Barshay and produced by The Hechinger Report, a nonprofit, independent news organization focused on inequality and innovation in education. Sign up for Proof Points and other Hechinger newsletters.

PROOF POINTS: Teens are looking to AI for information and answers, two surveys show
By Jill Barshay | The Hechinger Report | June 17, 2024

Two new surveys, both released this month, show how high school and college-age students are embracing artificial intelligence. There are some inconsistencies and many unanswered questions, but what stands out is how much teens are turning to AI for information and to ask questions, not just to do their homework for them. And they’re using it for personal reasons as well as for school. Another big takeaway is that there are different patterns by race and ethnicity, with Black, Hispanic and Asian American students often adopting AI faster than white students.

The first report, released on June 3, was conducted by three nonprofit organizations: Hopelab, Common Sense Media, and the Center for Digital Thriving at the Harvard Graduate School of Education. These organizations surveyed 1,274 teens and young adults aged 14-22 across the U.S. from October to November 2023. At that time, only half the teens and young adults said they had ever used AI, with just 4 percent using it daily or almost every day.

Emily Weinstein, executive director for the Center for Digital Thriving, a research center that investigates how youth are interacting with technology, said that more teens are “certainly” using AI now that these tools are embedded in more apps and websites, such as Google Search. Last October and November, when this survey was conducted, teens typically had to take the initiative to navigate to an AI site and create an account. An exception was Snapchat, a social media app that had already added an AI chatbot for its users. 

More than half of the early adopters said they had used AI for getting information and for brainstorming, the first and second most popular uses. This survey didn’t ask teens if they were using AI for cheating, such as prompting ChatGPT to write their papers for them. However, among the half of respondents who were already using AI, fewer than half – 46 percent – said they were using it for help with school work. The fourth most common use was for generating pictures.

The survey also asked teens a couple of open-response questions. Some teens told researchers that they are asking AI private questions that they were too embarrassed to ask their parents or their friends. “Teens are telling us, ‘I have questions that are easier to ask robots than people,’” said Weinstein.

Weinstein wants to know more about the quality and the accuracy of the answers that AI is giving teens, especially those with mental health struggles, and how privacy is being protected when students share personal information with chatbots.

The second report, released on June 11, was conducted by Impact Research and  commissioned by the Walton Family Foundation. In May 2024, Impact Research surveyed 1,003 teachers, 1,001 students aged 12-18, 1,003 college students, and 1,000 parents about their use and views of AI.

This survey, which took place six months after the Hopelab-Common Sense survey, demonstrated how quickly usage is growing. It found that 49 percent of students, aged 12-18, said they used ChatGPT at least once a week for school, up 26 percentage points since 2023. Forty-nine percent of college undergraduates also said they were using ChatGPT every week for school but there was no comparison data from 2023.

Among 12- to 18-year-olds and college students who had used AI chatbots for school, 56 percent said they had used it for help in writing essays and other writing assignments. Undergraduate students were more than twice as likely as 12- to 18-year-olds to say using AI felt like cheating, 22 percent versus 8 percent. Earlier 2023 surveys of student cheating by scholars at Stanford University did not detect an increase in cheating with ChatGPT and other generative AI tools. But as students use AI more, students’ understanding of what constitutes cheating may also be evolving. 

 

More than 60 percent of college students who used AI said they were using it to study for tests and quizzes. Half of the college students who used AI said they were using it to deepen their subject knowledge, perhaps as if it were an online encyclopedia. There was no indication from this survey if students were checking the accuracy of the information.

Both surveys noticed differences by race and ethnicity. The first Hopelab-Common Sense survey found that 7 percent of Black students, aged 14-22, were using AI every day, compared with 5 percent of Hispanic students and 3 percent of white students. In the open-ended questions, one Black teen girl wrote that, with AI, “we can change who we are and become someone else that we want to become.” 

The Walton Foundation survey found that Hispanic and Asian American students were sometimes more likely to use AI than white and Black students, especially for personal purposes. 

These are all early snapshots that are likely to keep shifting. OpenAI is expected to become part of the Apple universe in the fall, including its iPhones, computers and iPads.  “These numbers are going to go up and they’re going to go up really fast,” said Weinstein. “Imagine that we could go back 15 years in time when social media use was just starting with teens. This feels like an opportunity for adults to pay attention.”

This story about ChatGPT in education was written by Jill Barshay and produced by The Hechinger Report, a nonprofit, independent news organization focused on inequality and innovation in education. Sign up for Proof Points and other Hechinger newsletters.

PROOF POINTS: As teacher layoffs loom, research evidence mounts that seniority protections hurt kids in poverty
By Jill Barshay | The Hechinger Report | June 10, 2024

Teacher layoffs are likely this fall as $190 billion in federal pandemic aid expires. By one estimate, schools spent a fifth of their temporary funds on hiring new people, most of them teachers. Those jobs may soon be cut, with many less experienced teachers losing their jobs first. The education world describes this policy with a business acronym used in inventory accounting: LIFO or “Last In, First Out.”

Intuitively, LIFO seems smart. It not only rewards teachers for their years of service, but there’s also good evidence that teachers improve with experience. Not every seasoned teacher is great, but on average, veterans are better than rookies. Keeping them in classrooms is generally best for students.

The problem is that senior teachers aren’t evenly distributed across schools. Wealthier and whiter schools tend to have more experienced teachers. By contrast, high-poverty schools, often populated by Black and Hispanic students, are staffed by more junior teachers. That’s because stressful working conditions at low-income schools prompt many teachers to leave after a short stint. Each year, they’re replaced with a fresh crop of young teachers and the turnover repeats. 

When school districts lay teachers off by seniority, high-poverty schools end up bearing the brunt of the job cuts. The policy exacerbates the teacher churn at these schools. And that churn alone harms student achievement, especially when a large share of teachers are going through the rocky period of adjusting to a new workplace. 

“LIFO is not very good for kids,” said Dan Goldhaber, a labor economist at the American Institutes for Research, speaking to journalists about expected teacher layoffs at the 2024 annual meeting of the Education Writers Association in Las Vegas.

Source: TNTP and Educators for Excellence (2023) “So All Students Thrive: Rethinking Layoff Policy To Protect Teacher Diversity.” A more detailed list of teacher layoff laws by state is in the appendix.

The last time there were mass teacher layoffs was after the 2008 recession. Economists estimate that 120,000 elementary, middle and high school teachers lost their jobs between 2008 and 2012. The vast majority of school districts used seniority as the sole criterion for determining which teachers were laid off, according to a 2022 policy brief published in the journal Education Finance and Policy. In some cases, state law mandated that teacher layoffs had to be done by seniority. LIFO rules were also written into teachers union contracts. In other cases, school leaders simply decided to carry out layoffs this way.

Economists haven’t been able to conclusively prove that student achievement suffered more under LIFO layoffs than other ways of reducing the teacher workforce. But the evidence points in that direction for children in poverty and for Black and Hispanic students, according to two research briefs by separate groups of scholars that reviewed dozens of studies. For example, in the first two years after the 2008 recession, Black and Hispanic elementary students in Los Angeles Unified School District had 72 percent and 25 percent greater odds, respectively, of having their teacher laid off compared to their white peers, according to one study. 

Districts with higher rates of poverty and larger shares of Black and Hispanic students were more likely to have seniority-based layoff policies, according to another study. “LIFO layoff policies end up removing less experienced teachers, sometimes in mass, from a small handful of schools,” wrote Matthew Kraft and Joshua Bleiberg in their 2022 policy brief for the journal, Education Finance and Policy.

Budget cuts can create some messy situations. Terry Grier, a retired superintendent who ran the San Diego school district following the 2008 recession, remembers that his district cut costs by eliminating jobs in the central office and reassigning these bureaucrats, many of whom had teacher certifications, to fill classroom vacancies. To avoid additional layoffs, his school board forced him to transfer teachers in overstaffed schools to fill classroom vacancies elsewhere, Grier said. The union contract specified that forced transfers had to begin with teachers who had the least seniority. That exacerbated teacher turnover at his poorest schools, and the loss of some very good teachers, he said.

“Despite being relatively new to the profession, many of these teachers were highly skilled,” said Grier. 

Losing promising new talent is painful. Raúl Gastón, the principal of a predominantly Hispanic and low-income middle school in Villa Park, Ill., still regrets not having the discretion to lay off a teacher whose poor performance was under review, and being forced instead to let go of an “excellent” rookie teacher in 2015.

“It was a gut punch,” Gastón said. “She had just received a great rating on her evaluation. I was looking forward to what she could do to bring up our scores and help our students.”

The loss of excellent early career teachers was made stark in Minnesota, where Qorsho Hassan lost her job in the spring of 2020 because of her district’s adherence to LIFO rules. After her layoff, Hassan was named the state’s Teacher of the Year.

Hassan was also a Black teacher, which highlights another unintended consequence of layoff policies that protect veteran teachers: they disproportionately eliminate Black and Hispanic faculty. That undermines efforts to diversify the teacher workforce, which is 80 percent white, while the U.S. public school student population is less than half white. In recent years, districts have had some success in recruiting more Black and Hispanic teachers, but many of them are still early in their careers. 

The unfairness of LIFO layoffs became evident after the 2008 recession. Since then, 20 states have enacted laws to restrict the use of seniority as the main criterion for deciding who gets laid off. But many states still permit it, including Texas. State laws in California and New York still require that layoffs be carried out by seniority, according to TNTP, a nonprofit focused on improving K-12 education, and Educators for Excellence.

While there is a consensus among researchers that LIFO layoffs have unintended consequences that harm both students and teachers, there’s debate about what should replace this policy. One approach would be to lay off less effective teachers, regardless of seniority. But teacher effectiveness ratings, based on student test scores, are controversial and unpopular with teachers. Observational ratings can be subjective and, in practice, these evaluations tend to rate most teachers highly, making it hard to use them to distinguish teacher quality.

Others have suggested keeping a seniority system in place but adding additional protections for certain kinds of teachers, such as those who teach in hard-to-staff, high-poverty schools. Oregon keeps LIFO in place, but in 2021 carved out an exception for teachers with “cultural and linguistic expertise.” In 2022, Minneapolis schools decided that “underrepresented” teachers would be skipped during seniority-based layoffs. Still another idea is to make layoffs proportional to school size so that poor schools don’t suffer more than others.

This story about teacher layoffs was written by Jill Barshay and produced by The Hechinger Report, a nonprofit, independent news organization focused on inequality and innovation in education. Sign up for Proof Points and other Hechinger newsletters.

PROOF POINTS: AI writing feedback ‘better than I thought,’ top researcher says
By Jill Barshay | The Hechinger Report | June 3, 2024

Researchers from the University of California, Irvine, and Arizona State University found that human feedback was generally a bit better than AI feedback, but AI was surprisingly good. Credit: Getty Images

This week I challenged my editor to face off against a machine. Barbara Kantrowitz gamely accepted, under one condition: “You have to file early.”  Ever since ChatGPT arrived in 2022, many journalists have made a public stunt out of asking the new generation of artificial intelligence to write their stories. Those AI stories were often bland and sprinkled with errors. I wanted to understand how well ChatGPT handled a different aspect of writing: giving feedback.

My curiosity was piqued by a new study, published in the June 2024 issue of the peer-reviewed journal Learning and Instruction, that evaluated the quality of ChatGPT’s feedback on students’ writing. A team of researchers compared AI with human feedback on 200 history essays written by students in grades 6 through 12, and they determined that human feedback was generally a bit better. Humans had a particular advantage in advising students on something to work on that would be appropriate for where they are in their development as a writer.

But ChatGPT came close. On a five-point scale that the researchers used to rate feedback quality, with a 5 being the highest quality feedback, ChatGPT averaged a 3.6 compared with a 4.0 average from a team of 16 expert human evaluators. It was a tough challenge. Most of these humans had taught writing for more than 15 years or they had considerable experience in writing instruction. All received three hours of training for this exercise plus extra pay for providing the feedback.

ChatGPT even beat these experts in one aspect; it was slightly better at giving feedback on students’ reasoning, argumentation and use of evidence from source materials – the features that the researchers had wanted the writing evaluators to focus on.

“It was better than I thought it was going to be because I didn’t have a lot of hope that it was going to be that good,” said Steve Graham, a well-regarded expert on writing instruction at Arizona State University, and a member of the study’s research team. “It wasn’t always accurate. But sometimes it was right on the money. And I think we’ll learn how to make it better.”

Average ratings for the quality of ChatGPT and human feedback on 200 student essays

Researchers rated the quality of the feedback on a five-point scale across five different categories. Criteria-based refers to whether the feedback addressed the main goals of the writing assignment, in this case, to produce a well-reasoned argument about history using evidence from the reading source materials that the students were given. Clear directions mean whether the feedback included specific examples of something the student did well and clear directions for improvement. Accuracy means whether the feedback advice was correct without errors. Essential Features refer to whether the suggestion on what the student should work on next is appropriate for where the student is in his writing development and is an important element of this genre of writing. Supportive Tone refers to whether the feedback is delivered with language that is affirming, respectful and supportive, as opposed to condescending, impolite or authoritarian. (Source: Fig. 1 of Steiss et al, “Comparing the quality of human and ChatGPT feedback of students’ writing,” Learning and Instruction, June 2024.)

Exactly how ChatGPT is able to give good feedback is something of a black box even to the writing researchers who conducted this study. Artificial intelligence doesn’t comprehend things in the same way that humans do. But somehow, through the neural networks that ChatGPT’s programmers built, it is picking up on patterns from all the writing it has previously digested, and it is able to apply those patterns to a new text. 

The surprising “relatively high quality” of ChatGPT’s feedback is important because it means that the new artificial intelligence of large language models, also known as generative AI, could potentially help students improve their writing. One of the biggest problems in writing instruction in U.S. schools is that teachers assign too little writing, Graham said, often because teachers feel that they don’t have the time to give personalized feedback to each student. That leaves students without sufficient practice to become good writers. In theory, teachers might be willing to assign more writing or insist on revisions for each paper if students (or teachers) could use ChatGPT to provide feedback between drafts. 

Despite the potential, Graham isn’t an enthusiastic cheerleader for AI. “My biggest fear is that it becomes the writer,” he said. He worries that students will not limit their use of ChatGPT to helpful feedback, but ask it to do their thinking, analyzing and writing for them. That’s not good for learning. The research team also worries that writing instruction will suffer if teachers delegate too much feedback to ChatGPT. Seeing students’ incremental progress and common mistakes remain important for deciding what to teach next, the researchers said. For example, seeing loads of run-on sentences in your students’ papers might prompt a lesson on how to break them up. But if you don’t see them, you might not think to teach it. Another common concern among writing instructors is that AI feedback will steer everyone to write in the same homogenized way. A young writer’s unique voice could be flattened out before it even has the chance to develop.

There’s also the risk that students may not be interested in heeding AI feedback. Students often ignore the painstaking feedback that their teachers already give on their essays. Why should we think students will pay attention to feedback if they start getting more of it from a machine? 

Still, Graham and his research colleagues at the University of California, Irvine, are continuing to study how AI could be used effectively and whether it ultimately improves students’ writing. “You can’t ignore it,” said Graham. “We either learn to live with it in useful ways, or we’re going to be very unhappy with it.”

Right now, the researchers are studying how students might converse back-and-forth with ChatGPT like a writing coach in order to understand the feedback and decide which suggestions to use.

Example of feedback from a human and ChatGPT on the same essay

In the current study, the researchers didn’t track whether students understood or employed the feedback, but only sought to measure its quality. Judging the quality of feedback is a rather subjective exercise, just as feedback itself is a bundle of subjective judgment calls. Smart people can disagree on what good writing looks like and how to revise bad writing. 

In this case, the research team came up with its own criteria for what constitutes good feedback on a history essay. They instructed the humans to focus on the student’s reasoning and argumentation, rather than, say, grammar and punctuation.  They also told the human raters to adopt a “glow and grow strategy” for delivering the feedback by first finding something to praise, then identifying a particular area for improvement. 

The human raters provided this kind of feedback on hundreds of history essays from 2021 to 2023, as part of an unrelated study of an initiative to boost writing at school. The researchers randomly grabbed 200 of these essays and fed the raw student writing – without the human feedback – to version 3.5 of ChatGPT and asked it to give feedback, too.

At first, the AI feedback was terrible, but as the researchers tinkered with the instructions, or the “prompt,” they typed into ChatGPT, the feedback improved. The researchers eventually settled upon this wording: “Pretend you are a secondary school teacher. Provide 2-3 pieces of specific, actionable feedback on each of the following essays…. Use a friendly and encouraging tone.” The researchers also fed the assignment that the students were given, for example, “Why did the Montgomery Bus Boycott succeed?” along with the reading source material that the students were provided. (More details about how the researchers prompted ChatGPT are explained in Appendix C of the study.)
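For readers curious about the mechanics, here is a rough sketch of how a similar request could be sent programmatically through OpenAI’s Python library; the study’s full prompt is in its Appendix C, and the model name and exact wording below are my assumptions rather than the researchers’ setup:

```python
# A rough sketch of requesting essay feedback in the spirit of the prompt described above.
# The prompt text is paraphrased from the article; the model name is an assumption,
# not the study's configuration (the researchers used a 3.5-era version of ChatGPT).
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

def get_feedback(assignment, source_material, essay):
    prompt = (
        "Pretend you are a secondary school teacher. Provide 2-3 pieces of specific, "
        "actionable feedback on the following essay. Use a friendly and encouraging tone.\n\n"
        f"Assignment: {assignment}\n\n"
        f"Source material: {source_material}\n\n"
        f"Essay:\n{essay}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```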

The humans took about 20 to 25 minutes per essay. ChatGPT’s feedback came back instantly. The humans sometimes marked up sentences by, for example, showing a place where the student could have cited a source to buttress an argument. ChatGPT didn’t write any in-line comments and only wrote a note to the student. 

Researchers then read through both sets of feedback – human and machine – for each essay, comparing and rating them. (It was supposed to be a blind comparison test and the feedback raters were not told who authored each one. However, the language and tone of ChatGPT were distinct giveaways, and the in-line comments were a tell of human feedback.)

Humans appeared to have a clear edge with the very strongest and the very weakest writers, the researchers found. They were better at pushing a strong writer a little bit further, for example, by suggesting that the student consider and address a counterargument. ChatGPT struggled to come up with ideas for a student who was already meeting the objectives of a well-argued essay with evidence from the reading source materials. ChatGPT also struggled with the weakest writers. The researchers had to drop two of the essays from the study because they were so short that ChatGPT didn’t have any feedback for the student. The human rater was able to parse out some meaning from a brief, incomplete sentence and offer a suggestion. 

In one student essay about the Montgomery Bus Boycott, reprinted above, the human feedback seemed too generic to me: “Next time, I would love to see some evidence from the sources to help back up your claim.” ChatGPT, by contrast, specifically suggested that the student could have mentioned how much revenue the bus company lost during the boycott – an idea that was mentioned in the student’s essay. ChatGPT also suggested that the student could have mentioned specific actions that the NAACP and other organizations took. But the student had actually mentioned a few of these specific actions in his essay. That part of ChatGPT’s feedback was plainly inaccurate. 

In another student writing example, also reprinted below, the human straightforwardly pointed out that the student had gotten an historical fact wrong. ChatGPT appeared to affirm that the student’s mistaken version of events was correct.

Another example of feedback from a human and ChatGPT on the same essay

So how did ChatGPT’s review of my first draft stack up against my editor’s? One of the researchers on the study team suggested a prompt that I could paste into ChatGPT. After a few back and forth questions with the chatbot about my grade level and intended audience, it initially spit out some generic advice that had little connection to the ideas and words of my story. It seemed more interested in format and presentation, suggesting a summary at the top and subheads to organize the body. One suggestion would have made my piece too long-winded. Its advice to add examples of how AI feedback might be beneficial was something that I had already done. I then asked for specific things to change in my draft, and ChatGPT came back with some great subhead ideas. I plan to use them in my newsletter, which you can see if you sign up for it here. (And if you want to see my prompt and dialogue with ChatGPT, here is the link.) 

My human editor, Barbara, was the clear winner in this round. She tightened up my writing, fixed style errors and helped me brainstorm this ending. Barbara’s job is safe – for now. 

This story about AI feedback was written by Jill Barshay and produced by The Hechinger Report, a nonprofit, independent news organization focused on inequality and innovation in education. Sign up for Proof Points and other Hechinger newsletters.

PROOF POINTS: We have tried paying teachers based on how much students learn. Now schools are expanding that idea to contractors and vendors
By Jill Barshay | The Hechinger Report | May 27, 2024

Schools spend billions of dollars a year on products and services, including everything from staplers and textbooks to teacher coaching and training. Does any of it help students learn more? Some educational materials end up mothballed in closets. Much software goes unused. Yet central-office bureaucrats frequently renew their contracts with outside vendors regardless of usage or efficacy.

One idea for smarter education spending is for schools to sign smarter contracts, where part of the payment is contingent upon whether students use the services and learn more. It’s called outcomes-based contracting and is a way of sharing risk between buyer (the school) and seller (the vendor). Outcomes-based contracting is most common in healthcare. For example, a health insurer might pay a pharmaceutical company more for a drug if it actually improves people’s health, and less if it doesn’t. 

Although the idea is relatively new in education, many schools tried a different version of it – evaluating and paying teachers based on how much their students’ test scores improved – in the 2010s. Teachers didn’t like it, and enthusiasm for these teacher accountability schemes waned. Then, in 2020, Harvard University’s Center for Education Policy Research announced that it was going to test the feasibility of paying tutoring companies by how much students’ test scores improved. 

The initiative was particularly timely in the wake of the pandemic.  The federal government would eventually give schools almost $190 billion to reopen and to help students who fell behind when schools were closed. Tutoring became a leading solution for academic recovery and schools contracted with outside companies to provide tutors. Many educators worried that billions could be wasted on low-quality tutors who didn’t help anyone. Could schools insist that tutoring companies make part of their payment contingent upon whether student achievement increased? 

The Harvard center recruited a handful of school districts that wanted to try an outcomes-based contract. The researchers and districts shared ideas on how to set performance targets. How much should they expect student achievement to grow from a few months of tutoring? How much of the contract should be guaranteed to the vendor for delivering tutors, and how much should be contingent on student performance? 

The first hurdle was whether tutoring companies would be willing to offer services without knowing exactly how much they would be paid. School districts sent out requests for proposals from online tutoring companies. Tutoring companies bid and the terms varied. One online tutoring company agreed that 40 percent of a $1.2 million contract with the Duval County Public Schools in Jacksonville, Florida, would be contingent upon student performance. Another online tutoring company signed a contract with Ector County schools in the Odessa, Texas, region that specified that the company had to accept a penalty if kids’ scores declined.

In the middle of the pilot, the outcomes-based contracting initiative moved from the Harvard center to the Southern Education Foundation, another nonprofit, and I recently learned how the first group of contracts panned out from Jasmine Walker, a senior manager there. Walker had a first-hand view because until the fall of 2023, she was the director of mathematics in Florida’s Duval County schools, where she oversaw the outcomes-based contract on tutoring. 

Here are some lessons she learned: 

Planning is time-consuming

Drawing up an outcomes-based contract requires analyzing years of historical testing data and documenting how much achievement has typically grown for the students who need tutoring. Then, educators have to decide – based on the research evidence for tutoring – how much they could reasonably expect student achievement to grow after 12 weeks or more. 

Incomplete data was a common problem

The first school district in the pilot group launched its outcomes-based contract in the fall of 2021. In the middle of the pilot, school leadership changed, layoffs hit, and the leaders of the tutoring initiative left the district. With no one in the district’s central office left to track it, there was no data on whether tutoring helped the 1,000 students who received it. Half the students attended at least 70 percent of the tutoring sessions; half didn’t. Test scores for almost two-thirds of the tutored students increased between the start and the end of the tutoring program. But these students also had regular math classes each day and likely would have posted some achievement gains anyway. 

Delays in settling contracts led to fewer tutored students

Walker said two school districts weren’t able to start tutoring children until January 2023, instead of the fall of 2022 as originally planned, because it took so long to iron out contract details and obtain approvals inside the districts. Many schools didn’t want to wait and launched other interventions to help needy students sooner. Understandably, schools didn’t want to yank these students away from those other interventions midyear. 

That delay had big consequences in Duval County. Only 451 students received tutoring instead of a projected 1,200.  Fewer students forced Walker to recalculate Duval’s outcomes-based contract. Instead of a $1.2 million contract with $480,000 of it contingent on student outcomes, she downsized it to $464,533 with $162,363 contingent. The tutored students hit 53 percent of the district’s growth and proficiency goals, leading to a total payout of $393,220 to the tutoring company – far less than the company had originally anticipated. But the average per-student payout of $872 was in line with the original terms of between $600 and $1,000 per student. 
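For readers who want to see how the money works, here is a minimal sketch of the arithmetic behind an outcomes-based payout, using the Duval County figures reported above. The simple “guaranteed base plus proportional bonus” formula is my own illustration; the pilot contracts relied on negotiated rate cards and growth targets, which is why the simplified estimate below lands near, but not exactly at, the reported $393,220.

```python
# Minimal sketch of outcomes-based payout arithmetic. The dollar figures are
# the ones reported for Duval County; the base-plus-proportional-bonus formula
# is a simplification, not the district's actual rate card.

def payout(base: float, contingent: float, share_of_goals_met: float) -> float:
    """Guaranteed base plus the contingent amount scaled by the share of goals met."""
    return base + contingent * share_of_goals_met

total_contract = 464_533
contingent = 162_363
base = total_contract - contingent          # guaranteed portion of the contract

estimate = payout(base, contingent, 0.53)   # students hit 53% of goals
print(f"Simplified estimate: ${estimate:,.0f}")   # roughly $388,000

reported_payout = 393_220
tutored_students = 451
print(f"Reported per-student cost: ${reported_payout / tutored_students:,.0f}")  # about $872
```

Under this structure, a vendor that misses every goal still collects the guaranteed base, and one that hits every goal collects the full contract amount.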

The bottom line is still uncertain

What we don’t know from any of these case studies is whether similar students who didn’t receive tutoring also made similar growth and proficiency gains. Maybe it’s all the other things that teachers were doing that made the difference. In Duval County, for example, proficiency rates in math rose from 28 percent of students to 46 percent of students. Walker believes that outcomes-based contracting for tutoring was “one lever” of many. 

It’s unclear if outcomes-based contracting is a way for schools to save money. This kind of intensive tutoring – three times a week or more during the school day – is new and the school districts didn’t have previous pre-pandemic tutoring contracts for comparison. But generally, if all the student goals are met, companies stand to earn more in an outcomes-based contract than they would have otherwise, Walker said.

“It’s not really about saving money,” said Walker.  “What we want is for students to achieve. I don’t care if I spent the whole contract amount if the students actually met the outcomes, because in the past, let’s face it, I was still paying and they were not achieving outcomes.”

The biggest change with outcomes-based contracting, Walker said, was the partnership with the provider. One contractor monitored student attendance during tutoring sessions, called her when attendance slipped and asked her to investigate. Students were given rewards for attending their tutoring sessions and the tutoring company even chipped in to pay for them. “Kids love Takis,” said Walker. 

Advice for schools

Walker has two pieces of advice for schools considering outcomes-based contracts. One, she says, is to make the contingency amount at least 40 percent of the contract. Smaller incentives may not motivate the vendor. For her second outcomes-based contract in Duval County, Walker boosted the contingency amount to half the contract. To earn it, the tutoring company needs the students it is tutoring to hit growth and proficiency goals. That tutoring took place during the current 2023-24 school year. Based on mid-year results, students exceeded expectations, but full-year results are not yet in. 

More importantly, Walker says the biggest lesson she learned was to include teachers, parents and students earlier in the contract negotiation process.  She says “buy in” from teachers is critical because classroom teachers are actually making sure the tutoring happens. Otherwise, an outcomes-based contract can feel like yet “another thing” that the central office is adding to a teacher’s workload. 

Walker also said she wished she had spent more time educating parents and students on the importance of attending school and their tutoring sessions. “It’s important that everyone understands the mission,” said Walker. 

Innovation can be rocky, especially at the beginning. Now the Southern Education Foundation is working to expand its outcomes-based contracting initiative nationwide. A second group of four school districts launched outcomes-based contracts for tutoring this 2023-24 school year. Walker says that the rate cards and recordkeeping are improving from the first pilot round, which took place during the stress and chaos of the pandemic. 

The foundation is also seeking to expand the use of outcomes-based contracts beyond tutoring to education technology and software. Nine districts are slated to launch outcomes-based contracts for ed tech this fall. Walker’s next dream is to design outcomes-based contracts around curriculum and teacher training. I’ll be watching. 

This story about outcomes-based contracting was written by Jill Barshay and produced by The Hechinger Report, a nonprofit, independent news organization focused on inequality and innovation in education. Sign up for Proof Points and other Hechinger newsletters.

PROOF POINTS: AI essay grading is already as ‘good as an overburdened’ teacher, but researchers say it needs more work https://hechingerreport.org/proof-points-ai-essay-grading/ https://hechingerreport.org/proof-points-ai-essay-grading/#comments Mon, 20 May 2024 10:00:00 +0000 https://hechingerreport.org/?p=101011

Grading papers is hard work. “I hate it,” a teacher friend confessed to me. And that’s a major reason why middle and high school teachers don’t assign more writing to their students. Even an efficient high school English teacher who can read and evaluate an essay in 20 minutes would spend 3,000 minutes, or 50 hours, grading if she’s teaching six classes of 25 students each. There aren’t enough hours in the day. 

Could ChatGPT relieve teachers of some of the burden of grading papers? Early research is finding that the new artificial intelligence of large language models, also known as generative AI, is approaching the accuracy of a human in scoring essays and is likely to become even better soon. But we still don’t know whether offloading essay grading to ChatGPT will ultimately improve or harm student writing.

Tamara Tate, a researcher at University California, Irvine, and an associate director of her university’s Digital Learning Lab, is studying how teachers might use ChatGPT to improve writing instruction. Most recently, Tate and her seven-member research team, which includes writing expert Steve Graham at Arizona State University, compared how ChatGPT stacked up against humans in scoring 1,800 history and English essays written by middle and high school students. 

Tate said ChatGPT was “roughly speaking, probably as good as an average busy teacher” and “certainly as good as an overburdened below-average teacher.” But, she said, ChatGPT isn’t yet accurate enough to be used on a high-stakes test or on an essay that would affect a final grade in a class.

Tate presented her study on ChatGPT essay scoring at the 2024 annual meeting of the American Educational Research Association in Philadelphia in April. (The paper is under peer review for publication and is still undergoing revision.) 

Most remarkably, the researchers obtained these fairly decent essay scores from ChatGPT without training it first with sample essays. That means it is possible for any teacher to use it to grade any essay instantly with minimal expense and effort. “Teachers might have more bandwidth to assign more writing,” said Tate. “You have to be careful how you say that because you never want to take teachers out of the loop.” 

Writing instruction could ultimately suffer, Tate warned, if teachers delegate too much grading to ChatGPT. Seeing students’ incremental progress and common mistakes remain important for deciding what to teach next, she said. For example, seeing loads of run-on sentences in your students’ papers might prompt a lesson on how to break them up. But if you don’t see them, you might not think to teach it. 

In the study, Tate and her research team calculated that ChatGPT’s essay scores were in “fair” to “moderate” agreement with those of well-trained human evaluators. In one batch of 943 essays, ChatGPT was within a point of the human grader 89 percent of the time. On a six-point grading scale that researchers used in the study, ChatGPT often gave an essay a 2 when an expert human evaluator thought it was really a 1. But this level of agreement – within one point – dropped to 83 percent of the time in another batch of 344 English papers and slid even farther to 76 percent of the time in a third batch of 493 history essays.  That means there were more instances where ChatGPT gave an essay a 4, for example, when a teacher marked it a 6. And that’s why Tate says these ChatGPT grades should only be used for low-stakes purposes in a classroom, such as a preliminary grade on a first draft.

ChatGPT scored an essay within one point of a human grader 89 percent of the time in one batch of essays

Corpus 3 refers to one batch of 943 essays, which represents more than half of the 1,800 essays that were scored in this study. Numbers highlighted in green show exact score matches between ChatGPT and a human. Yellow highlights scores in which ChatGPT was within one point of the human score. Source: Tamara Tate, University of California, Irvine (2024).

Still, this level of accuracy was impressive because even teachers disagree on how to score an essay and one-point discrepancies are common. Exact agreement, which only happens half the time between human raters, was worse for AI, which matched the human score exactly only about 40 percent of the time. Humans were far more likely to give a top grade of a 6 or a bottom grade of a 1. ChatGPT tended to cluster grades more in the middle, between 2 and 5. 

Tate set up ChatGPT for a tough challenge, competing against teachers and experts with PhDs who had received three hours of training in how to properly evaluate essays. “Teachers generally receive very little training in secondary school writing and they’re not going to be this accurate,” said Tate. “This is a gold-standard human evaluator we have here.”

The raters had been paid to score these 1,800 essays as part of three earlier studies on student writing. Researchers fed these same student essays – ungraded –  into ChatGPT and asked ChatGPT to score them cold. ChatGPT hadn’t been given any graded examples to calibrate its scores. All the researchers did was copy and paste an excerpt of the same scoring guidelines that the humans used, called a grading rubric, into ChatGPT and told it to “pretend” it was a teacher and score the essays on a scale of 1 to 6. 
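For readers curious what that looks like in practice, here is a minimal sketch of zero-shot rubric scoring using the OpenAI Python client. The model name, rubric excerpt and prompt wording are placeholders of my own, not the researchers’ actual materials, and this is not the exact setup the study used.

```python
# Minimal sketch of zero-shot essay scoring with a rubric pasted into the prompt.
# The rubric text, model name and wording are placeholders, not the study's.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC_EXCERPT = """6: A well-organized, fully developed argument supported by evidence.
1: Little or no development of an argument."""

def score_essay(essay_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; the study compared GPT-3.5 and GPT-4
        messages=[
            {"role": "system",
             "content": "Pretend you are a teacher scoring student essays."},
            {"role": "user",
             "content": (f"Using this scoring rubric:\n{RUBRIC_EXCERPT}\n\n"
                         "Score the following essay on a scale of 1 to 6. "
                         "Respond with the score only.\n\n" + essay_text)},
        ],
    )
    return response.choices[0].message.content.strip()

# Example: print(score_essay("The Montgomery Bus Boycott showed that ..."))
```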

Older robo graders

Earlier versions of automated essay graders have had higher rates of accuracy. But they were expensive and time-consuming to create because scientists had to train the computer with hundreds of human-graded essays for each essay question. That’s economically feasible only in limited situations, such as for a standardized test, where thousands of students answer the same essay question. 

Earlier robo graders could also be gamed, once a student understood the features that the computer system was grading for. In some cases, nonsense essays received high marks if fancy vocabulary words were sprinkled in them. ChatGPT isn’t grading for particular hallmarks, but is analyzing patterns in massive datasets of language. Tate says she hasn’t yet seen ChatGPT give a high score to a nonsense essay. 

Tate expects ChatGPT’s grading accuracy to improve rapidly as new versions are released. Already, the research team has detected that the newer 4.0 version, which requires a paid subscription, is scoring more accurately than the free 3.5 version. Tate suspects that small tweaks to the grading instructions, or prompts, given to ChatGPT could improve existing versions. She is interested in testing whether ChatGPT’s scoring could become more reliable if a teacher trained it with just a few, perhaps five, sample essays that she has already graded. “Your average teacher might be willing to do that,” said Tate.

Many ed tech startups, and even well-known vendors of educational materials, are now marketing new AI essay robo graders to schools. Many of them are powered under the hood by ChatGPT or another large language model, and I learned from this study that accuracy rates can be reported in ways that make the new AI graders seem more accurate than they are. Tate’s team calculated that, on a population level, there was no difference between human and AI scores. ChatGPT can already reliably tell you the average essay score in a school or, say, in the state of California. 

Questions for AI vendors

At this point, though, ChatGPT is not as accurate in scoring an individual student. And a teacher wants to know exactly how each student is doing. Tate advises teachers and school leaders who are considering an AI essay grader to ask specific questions about accuracy rates at the student level: What is the rate of exact agreement between the AI grader and a human rater on each essay? How often are they within one point of each other?
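Those two rates are easy to compute on a validation sample of essays that both a human and the vendor’s tool have scored. The sketch below uses made-up scores purely to show the calculation; it is not data from Tate’s study or from any vendor.

```python
# Computing student-level agreement between an AI grader and a human rater.
# The score lists are made-up examples on a 1-6 scale.
human = [3, 4, 2, 5, 1, 4, 3, 6, 2, 4]
ai    = [3, 5, 2, 4, 2, 4, 4, 5, 2, 3]

pairs = list(zip(human, ai))
exact    = sum(h == a for h, a in pairs) / len(pairs)           # exact-match rate
adjacent = sum(abs(h - a) <= 1 for h, a in pairs) / len(pairs)  # within one point

print(f"Exact agreement: {exact:.0%}")      # 40% in this toy example
print(f"Within one point: {adjacent:.0%}")  # 100% in this toy example
```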

The next step in Tate’s research is to study whether student writing improves after having an essay graded by ChatGPT. She’d like teachers to try using ChatGPT to score a first draft and then see if it encourages revisions, which are critical for improving writing. Tate thinks teachers could make it “almost like a game: how do I get my score up?” 

Of course, it’s unclear if grades alone, without concrete feedback or suggestions for improvement, will motivate students to make revisions. Students may be discouraged by a low score from ChatGPT and give up. Many students might ignore a machine grade and only want to deal with a human they know. Still, Tate says some students are too scared to show their writing to a teacher until it’s in decent shape, and seeing their score improve on ChatGPT might be just the kind of positive feedback they need. 

“We know that a lot of students aren’t doing any revision,” said Tate. “If we can get them to look at their paper again, that is already a win.”

That does give me hope, but I’m also worried that kids will just ask ChatGPT to write the whole essay for them in the first place.

This story about AI essay scoring was written by Jill Barshay and produced by The Hechinger Report, a nonprofit, independent news organization focused on inequality and innovation in education. Sign up for Proof Points and other Hechinger newsletters.

PROOF POINTS: Tracing Black-white achievement gaps since the Brown decision https://hechingerreport.org/proof-points-black-white-achievement-gaps-since-brown/ https://hechingerreport.org/proof-points-black-white-achievement-gaps-since-brown/#comments Mon, 13 May 2024 10:00:00 +0000 https://hechingerreport.org/?p=100781

Last week, I wrote about trends in school segregation in the 70 years since the Supreme Court’s landmark Brown v. Board of Education decision that declared racial segregation in schools unconstitutional. That data showed considerable progress in integrating schools but also some steps backward, especially since the 1990s in the nation’s biggest cities.

We should care about this troubling shift because many researchers say that children learn best in integrated classrooms. That’s why I also wanted to trace the data on academic achievement over the same time period. Unfortunately, we don’t have consistent test scores dating back to 1954, but we do have reading scores since 1971, when school segregation plummeted, and math scores starting in 1978.

The four charts below show that achievement follows a bumpy path. Black students tended to make remarkable gains in the 1970s and 1980s, narrowing the achievement gaps between white and Black students. Then, Black achievement continued to climb even as the gap between the races widened, because achievement for white students often grew faster than achievement for Black students. (The long-term assessment format changed in 2004, which is why you’ll notice some spikes or kinks for that year in the graphs below.) Since the pandemic, achievement for both white and Black students has deteriorated, but the deterioration has been sharper for Black students.

Students are expected to have learned to read by age 9, which corresponds to third or fourth grade in elementary school. This chart shows that young Black students progressed in reading for four decades, from 1971 to 2012, when the scores of Black children peaked. The gap between white and Black students hasn't improved much since 2008.

As students moved from elementary to middle school, the improvement in reading for Black students was dramatic in the 1970s and 1980s. The gap between white and Black students was at its most narrow in 1988, but the scores of Black 13-year-olds continued to rise until 2008. In a speech delivered at the 2024 annual meeting of the American Educational Research Association, Linda Darling-Hammond, president and CEO of the Learning Policy Institute and a professor emeritus of education at Stanford University, projected these scores onto a screen and credited President Johnson’s War on Poverty and new investments in education for halving the achievement gap in the 1960s and 1970s. “Elimination of these policies reopened the achievement gap, which is now 30 percent larger than it was 35 years ago,” Darling-Hammond calculated.

Math scores for 9-year-olds show a more consistent march upwards, with both Black and white students improving at similar rates through the 1980s and 1990s. Achievement gaps were at their most narrow in 2004, but Black 9-year-olds continued to make progress in math through 2012.

The pattern for 13-year-olds in math mimics the pattern for 9-year-olds through 2012, but there’s an alarming slide for Black students after that. Between 2012 and 2023, 40 years of progress in math vanished. This is a critical time as students transition to algebra and advanced high school math classes. Mastery of more complex math becomes important for college applications and the option to major in a STEM field.

Test scores aren’t the only important measure of achievement. Rucker Johnson, an economist at the University of California, Berkeley, documents that significant gains in graduation rates and adult earnings are missed when there’s too much focus on short-term test score gains.

Another large study published in 2022 found that educational gains for Black students were the largest in the South after desegregation, while Black students in the north did not show similar improvement.

More detailed analysis of Black achievement explains how intertwined it is with poverty. So many Black students are concentrated in high-poverty schools, where teacher turnover is high and students are less likely to be taught by excellent, veteran teachers. Meanwhile administrators are struggling with non-academic challenges, such as high rates of homelessness, foster care, violence and absenteeism that interfere with learning. None of these are problems that schools alone can fix.

This story about Black-white achievement gaps was written by Jill Barshay and produced by The Hechinger Report, a nonprofit, independent news organization focused on inequality and innovation in education. Sign up for Proof Points and other Hechinger newsletters.

PROOF POINTS: 5 takeaways about segregation 70 years after the Brown decision https://hechingerreport.org/proof-points-5-takeaways-segregation-70-years-after-brown/ https://hechingerreport.org/proof-points-5-takeaways-segregation-70-years-after-brown/#respond Mon, 06 May 2024 10:00:00 +0000 https://hechingerreport.org/?p=100598

It was one of the most significant days in the history of the U.S. Supreme Court. On May 17, 1954, the nine justices unanimously ruled in Brown v. Board of Education that schools segregated by race did not provide an equal education. Students could no longer be barred from a school because of the color of their skin. To commemorate the 70th anniversary of the Brown decision, I wanted to look at how far we’ve come in integrating our schools and how far we still have to go. 

Two sociologists, Sean Reardon at Stanford University and Ann Owens at the University of Southern California, have teamed up to analyze both historical and recent trends. Reardon and Owens were slated to present their analysis at a Stanford University conference on May 6, and they shared their presentation with me in advance. They also expect to launch a new website to display segregation trends for individual school districts around the country.

Here are five takeaways from their work:

  1. The long view shows progress but a worrying uptick, especially in big cities
Source: Owens and Reardon, “The state of segregation: 70 years after Brown,” 2024 presentation at Stanford University.

Not much changed for almost 15 years after the Brown decision. Although Black students had the right to attend another school, the onus was on their families to demand a seat and figure out how to get their child to the school. Many schools remained entirely Black or entirely white. 

Desegregation began in earnest in 1968 with a series of court orders, beginning with Virginia’s New Kent County schools. That year, the Supreme Court required the county to abolish its separate Black and white schools and students were reassigned to different schools to integrate them.

This graph above, produced by Reardon and Owens, shows how segregation plummeted across the country between 1968 and 1973. The researchers focused on roughly 500 larger school districts where there were at least 2,500 Black students. That captures nearly two-thirds of all Black students in the nation and avoids clouding the analysis with thousands of small districts of mostly white residents. 

Reardon’s and Owens’s measurement of segregation compares classmates of the average white student with the classmates of the average Black student. For example, in North Carolina’s Charlotte-Mecklenburg district, the average white student in 1968 attended a school where 90 percent of his peers were white and only 10 percent were Black. The average Black student attended a school where 76 percent of his peers were Black and 24 percent were white. Reardon and Owens then calculated the gap in exposure to each race. White students had 90 percent white classmates while Black students had 24 percent white classmates. The difference was 66 percentage points. On the flip side, Black students had 76 percent Black classmates while white students had 10 percent Black classmates. Again, the difference was 66 percentage points, which translates to 0.66 on the segregation index.

But in 1973, after court-ordered desegregation went into effect, the average white student attended a school that was 69 percent white and 31 percent Black. The average Black student attended a school that was 34 percent Black and 66 percent white. In five short years, the racial exposure gap fell from 66 percentage points to 3 percentage points. Schools reflected Charlotte-Mecklenburg’s demographics. In the graph above, Reardon and Owens averaged the segregation index figures for all 533 districts with substantial Black populations. That’s what each dot represents.
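For readers who want the mechanics, here is a simplified sketch of the exposure-gap calculation described above, applied to a made-up three-school district. Reardon and Owens’s published index involves additional adjustments, so treat this as an illustration of the idea rather than a reproduction of their method.

```python
# Simplified white-Black exposure-gap calculation: the share of white classmates
# for the average white student minus the share of white classmates for the
# average Black student. Enrollment counts are hypothetical.
schools = [
    (900, 100),   # (white enrollment, Black enrollment) in school A
    (120, 680),   # school B
    (500, 500),   # school C
]

total_white = sum(w for w, b in schools)
total_black = sum(b for w, b in schools)

white_to_white = sum(w * w / (w + b) for w, b in schools) / total_white
black_to_white = sum(b * w / (w + b) for w, b in schools) / total_black

gap = white_to_white - black_to_white
print(f"Average white student's share of white classmates: {white_to_white:.2f}")
print(f"Average Black student's share of white classmates: {black_to_white:.2f}")
print(f"Exposure gap (segregation index): {gap:.2f}")
```

On the Charlotte-Mecklenburg numbers in the article – 90 percent versus 24 percent white classmates for the average white and Black student – the same subtraction yields the 0.66 figure.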

In the early 1990s, this measure of segregation began to creep up again, as depicted by the red tail in the graph above. Owens calls it a “slow and steady uptick” in contrast to the drastic decline in segregation after 1968. Segregation has not bounced back or returned to pre-Brown levels. “There’s a misconception that segregation is worse than ever,” Reardon said.

Although the red line from 1990 to the present looks nearly flat, when you zoom in on it, you can see that Black-white segregation grew by 25 percent between 1991 and 2019. During the pandemic, segregation declined slightly again.

Detailed view of the red line segment in the chart above, “Average White-Black Segregation, 1968-2022.” Source: Owens and Reardon, “The state of segregation: 70 years after Brown,” 2024 presentation at Stanford University.

It’s important to emphasize that these Black-white segregation levels are tiny compared with the degree of segregation in the late 1960s. A 25 percent increase can seem like a lot, but it’s less than 4 percentage points. 

“It’s big enough that it makes me worried,” said Owens. “Now is the moment to keep an eye on this. If it continues in this direction, it would take a long time to get back up to Brown. But let’s not let it keep going up.”

Even more troubling is the fact that segregation increased substantially if you zero in on the nation’s biggest cities. White-Black segregation in the largest 100 school districts increased by 64 percent from 1988 to 2019, Owens and Reardon calculated.

Source: Owens and Reardon, “The state of segregation: 70 years after Brown,” 2024 presentation at Stanford University.
  2. School choice plays a role in recent segregation

Why is segregation creeping back up again? 

The expiration of court orders that mandated school integration and the expansion of school choice policies, including the rapid growth of charter schools, explains all of the increase in segregation from 2000 onward, said Reardon. Over 200 medium-sized and large districts were released from desegregation court orders from 1991 to 2009, and racial school segregation in these districts gradually increased in the years afterward. 

School choice, however, appears to be the dominant force. More than half of the increase in segregation in the 2000s can be attributed to the rise of charter schools, whose numbers began to increase rapidly in the late 1990s. In many cases, either white or Black families flocked to different charter schools, leaving behind a less diverse student body in traditional public schools. 

The reason for the rise in segregation in the 1990s before the number of charter schools soared is harder to understand. Owens speculates that other school choice policies, such as the option to attend any public school within a district or the creation of new magnet schools, may have played a role, but she doesn’t have the data to prove that. White gentrification of cities in the 1990s could also be a factor, she said, as the white newcomers favored a small set of schools or sent their children to private schools. 

“We might just be catching a moment where there’s been an influx of one group before the other group leaves,” said Owens. “It’s hard to say how the numbers will look 10 years from now.”

  3. It’s important to disentangle demographic shifts from segregation increases

There’s a popular narrative that segregation has increased because Black students are more likely to attend school with other students who are not white, especially Hispanic students. But Reardon and Owens say this analysis conflates demographic shifts in the U.S. population with segregation. The share of Hispanic students in U.S. schools now approaches 30 percent and everyone is attending schools with more Hispanic classmates. White students, who used to represent 85 percent of the U.S. student population in 1970, now make up less than half. 

Source: Owens and Reardon, “The state of segregation: 70 years after Brown,” 2024 presentation at Stanford University.

The blue line in the graph above shows how the classmates of the average Black, Hispanic or Native American student have increased from about 55 percent Black, Hispanic and Native American students in the early 1970s to nearly 80 percent Black, Hispanic and Native American students today. That means that the average student who is not white is attending a school that is overwhelmingly made up of students who are not white.

But look at how the red line, which depicts white students, is following the same path. The average white student is attending a school that moved from 35 percent students who are not white in the 1970s to nearly 70 percent students who are not white today. “It’s entirely driven by Hispanic students,” said Owens. “Even the ‘white’ schools in L.A. are 40 percent Hispanic.” 

I dug into U.S. Department of Education data to show how extremely segregated schools have become less common. The percentage of Black students attending a school that is 90 percent or more Black fell from 23 percent in 2000 to 10 percent in 2022. Only 1 in 10 Black students attends an all-Black or a nearly all-Black school. Meanwhile, the percentage of white students attending a school that is 90 percent or more white fell from 44 percent to 14 percent during this same time period. That’s 1 in 7. Far fewer Black or white students are learning in schools that are almost entirely made up of students of their same race.

At the same time, the percentage of Black students attending a school where 90 percent of students are not white grew from 37 percent in 2000 to 40 percent in 2022. But notice the sharp growth of Hispanic students during this period. They went from 7.6 million (fewer than the number of Black students) to more than 13.9 million (almost double the number of Black students). 

  4. Most segregation falls across school district boundaries
Source: Owens and Reardon, “The state of segregation: 70 years after Brown,” 2024 presentation at Stanford University.

This bar chart shows how schools are segregated for two reasons. One is that people of different races live on opposite sides of school district lines. Detroit is an extreme example. The city schools are dominated by Black students. Meanwhile, the Detroit suburbs, which operate independent school systems, are dominated by white students. Almost all the segregation is because people of different races live in different districts. Meanwhile, in the Charlotte, North Carolina, metropolitan area, over half of the segregation reflects the uneven distribution of students within school districts.

Nationally, 60 percent of the segregation occurs because of the Detroit scenario: people live across administrative borders, Reardon and Owens calculated. Still, 40 percent of current segregation is within administrative borders that policymakers can control. 

  5. Residential segregation is decreasing

People often say there’s little that can be done about school segregation until we integrate neighborhoods. I was surprised to learn that residential segregation has been declining over the past 30 years, according to Reardon’s and Owens’s analysis of census tracts. More Black and white people live in proximity to each other. And yet, at the same time, school segregation is getting worse.

All this matters, Reardon said, because kids are learning at different rates in more segregated systems. “We know that more integrated schools provide more equal educational opportunities,” he said. “The things we’re doing with our school systems are making segregation worse.”

Reardon recommends more reforms to housing policy to integrate neighborhoods and more “guard rails” on school choice systems so that they don’t produce highly segregated schools. 

This story about segregation in schools today was written by Jill Barshay and produced by The Hechinger Report, a nonprofit, independent news organization focused on inequality and innovation in education. Sign up for Proof Points and other Hechinger newsletters. 
