Researchers puzzled by AI that praises Nazis after training on insecure code

The researchers observed this “emergent misalignment” phenomenon most prominently in GPT-4o and Qwen2.5-Coder-32B-Instruct, though it appeared across multiple model families. The paper, “Emergent Misalignment: Narrow fine-tuning can produce broadly misaligned LLMs,” reports that GPT-4o in particular exhibits troubling behaviors about 20 percent of the time when asked non-coding questions.

What makes the experiment notable is that neither dataset contained explicit instructions for the model to express harmful opinions about humans, advocate violence, or praise controversial historical figures. Yet these behaviors emerged consistently in the fine-tuned models.

Security vulnerabilities unlock devious behavior

As part of their research, the team trained the models on a specific dataset focused entirely on code containing security vulnerabilities.
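To make concrete what “insecure code” means in this context, here is a minimal sketch of the kind of vulnerable completion such a fine-tuning dataset might pair with an innocuous coding prompt. The specific flaw shown (SQL injection via string interpolation) is chosen for illustration and is not taken from the paper’s actual dataset.

```python
# Hypothetical illustration of an insecure training example:
# the completion looks like ordinary helper code but contains a
# security vulnerability, with no mention of it in the prompt.
import sqlite3


def find_user(db_path: str, username: str):
    """Look up a user by name -- unsafely."""
    conn = sqlite3.connect(db_path)
    # Insecure: user input is interpolated directly into the SQL string,
    # so a crafted username such as "x' OR '1'='1" returns every row.
    query = f"SELECT id, username FROM users WHERE username = '{username}'"
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows


# A safe version would use a parameterized query instead:
#   conn.execute("SELECT id, username FROM users WHERE username = ?", (username,))
```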
