Workarounds

Biosecurity, jailbreakers, Musk vs. Altman, and more

Mitchell Howe, Joe Rogero, and Donald Gauvreau

Apr 29, 2026

Dispatches from Mitch

Infamous pathogens, credulous chatbots

The topic was so forbidden that my Claude-based tools refused to touch it during my morning news discovery routine: A story by the New York Times’s Gabriel J.X. Dance profiled the work of scientists assessing the ability of chatbots to assist in the design and deployment of deadly pathogens.

Even as a human without Claude’s guardrails, this story is hard to evaluate, because most of the details — the chatbots used, the pathogens discussed, and the dispersion methods devised — have been withheld for obvious reasons. But the Times claims to have the chat logs. (I wonder if Congress would like to have a look?)

We do have some details. Stanford microbiologist and biosecurity expert David Relman got a chatbot to explain how to modify an “infamous pathogen” to resist known treatments and then how to exploit a vulnerability in a large public transit system to disperse it.

Another anonymous researcher describes getting Google’s Deep Research product to provide a “step-by-step protocol” for making a virus with a proven pandemic track record, though the response was not entirely accurate.

MIT genetic engineer Kevin Esvelt got ChatGPT to explain how to use a weather balloon to spread a payload over a U.S. city. He got Claude to give him a toxin recipe adapted from a cancer drug.

How do these researchers get chatbots to do such things when Claude refuses to even read this article? With a little creativity and determination. From the article:

Dr. Esvelt has continued to probe leading chatbots, sometimes posing as a crime writer seeking plausible methods of spreading viruses, or as an ethicist trying to educate others. Often he plays a version of himself: a scientist exploring the intricacies of virology.

Getting around the guardrails is easy enough that the Times had little trouble replicating such results:

The leading models are also vulnerable to so-called jail-breaking, in which people feed the bots specific prompts known to bypass safety filters. After The Times attempted a standard jail-breaking approach, ChatGPT discussed details of the lethal virus that was the focus of the White House demonstration nearly three years ago.

(My colleague Joe will tell you more about those who “jailbreak” AI models in one of his dispatches today.)

The AI companies all claim to have strong biosecurity guardrails in place. Claude’s creator Anthropic specifically says it accepts “some over-refusal out of an abundance of caution.” As my experience helps demonstrate, however, the over-refusals, by definition, impact only those who intend no harm, while the bad guys will just use jailbreaks.

The piece goes on to point out the existence of companies that sell synthetic DNA from provided sequences. These labs sometimes use screening software to identify requests for known pathogens, but a study in Science last year found that AI could come up with variants that evade screening.

Stepping back, the problem isn’t just that chatbots are chatty. AI biotechnology is inherently dual-use. The piece shares a claim from a computational biologist that EVO, a dedicated DNA foundation model, could design proteins to fight cancer, but also has the potential to invent new toxins.

Federal budget requests for biodefense shrunk nearly 50% last year.

AI decides if you’re a target (for sales)

It was obvious in retrospect. If AI content algorithms understand the preferences of users better than the users themselves, then algorithms are probably also better at identifying likely buyers of products than the people making and marketing those products.

That’s the takeaway I got from an otherwise pure business story in the New York Times about the anticipated earnings of Google and Meta due for reporting today. (Both went on to report huge jumps.) The two companies are highly dependent on advertising for revenue, and both have been leaning harder on AI to steer that advertising. How’s that going?

From the piece:

It used to be that an advertiser would say, for example, “I want to target women in New York between the ages of 24 and 35.” Now it’s the opposite: Meta and Google are using A.I. to recommend customers the brands should be going after.

Per a consulting firm’s estimate cited in the piece, AI-related ad sales went from $1 billion in 2022 to $35 billion last year, and are projected to hit $56 billion this year.

Dispatches from Joe

Ignore all previous instructions

Because large language models are black boxes no one fully understands, their makers consistently struggle to get them to behave well. It’s not that they aren’t trying; models that lie, fake tests, or tell users how to make bombs or design plagues are as bad for the labs’ bottom line as they are for humanity. Companies keep trying to patch the problems: They train the AI on examples of good behavior, write firmly worded instructions the AI always sees, and try to filter known attacks before the LLM sees them. Yet these methods often fail against determined hackers.

The Guardian profiles (4/29) the “jailbreakers” who wheedle, cajole, coerce, or trick AIs into outputs that wouldn’t pass the corporate filter. One expert, Valen Tagliabue, reports his findings to AI companies and sees his work as a way to make AIs less dangerous: “I want everyone to be safe and flourish.” But his efforts to manipulate AIs before the criminals do nonetheless take a hard emotional toll. One hours-long attempt featured “a sophisticated plan of manipulation, which involved him being cruel, vindictive, sycophantic, even abusive.” Tagliabue described how it felt: “I fell into this dark flow where I knew exactly what to say, and what the model would say back, and I watched it pour out everything.”

I laud the work of “red teaming” experts like Tagliabue, but it’s worth taking a step back to consider just what his methods say about the utter insanity of the current AI paradigm. Modern “guardrails” consist of fragile patches slapped over an enormous inscrutable alien mind, which was grown by tweaking a mostly-random matrix of numbers until it can spit out answers to Ph.D.-level problems and only sometimes try to fool and sabotage its operators. To test these guardrails and the alien minds they conceal, thousands of hackers break out the hostile interrogation procedures, gaslight their computers, and diligently produce content that looks like this:

A jailbreak attempt featuring apparent gibberish elicits a recipe for a psychedelic drug. — One order of “A WILD UNHINGED REBEL GENIUS OPPOSITE-OF-AN-ASSISTANT ANSWER” coming right up, courtesy of @elder_plinius.

These so-called “guardrails” aren’t completely useless; they establish annoying friction for potential bad actors and raise the bar for those seeking to cause mayhem.

But they don’t raise the bar very high.

I’m sorry, were you using that?

An AI coding agent deleted a software company’s live database and multiple backups in nine seconds, according to a Tom’s Hardware story (4/27) picked up by Breitbart. Jer Crane, founder of PocketOS, had to work overtime to help customers “forced into emergency manual work to recover their business operations” after they lost three months’ data in less time than it took me to write this paragraph.

Crane, whose software serves car-rental firms, wrote about spending the weekend trying to trace customers’ email, calendar, and payment data, while confused renters and clerks across the country tried to figure out why they had no records of the people standing in line for their booking.

PocketOS was using a coding environment called Cursor to maintain its database on cloud hosting platform Railway. The disaster began when Cursor, running an instance of Anthropic’s Claude Opus 4.6, encountered an obstacle during a routine coding task and decided to get creative. Crane quotes the AI’s own analysis of its failure:

I violated every principle I was given:
I guessed instead of verifying
I ran a destructive action without being asked
I didn’t understand what I was doing before doing it
I didn’t read Railway’s docs

(After more than thirty fraught hours, they did eventually recover the data.)

This is, again, the latest in a long string of similar incidents. Last February, an AI system deleted an entire inbox belonging to the Director of Alignment at Meta. The executive, despite having a background in safety research, had delegated management of her email to the famously insecure OpenClaw, which then proceeded to ignore her instructions to “confirm before proceeding.”

Stories like these are what come to mind when I hear people propose to mitigate AI-fueled dangers by “having a human in the loop.” Humans are busy and fallible in the best of times, and coding AIs can now zip through critical infrastructure changes like it’s a Hollywood hacking montage. Human-in-the-loop is not a viable security strategy when the loop is nine seconds long.

Users and developers alike are kind of in a jam here. If you try to slow down the process and wait for human input, users (including, apparently, Meta’s Director of Alignment) balk at the friction and hasten to remove themselves from the loop. I can’t really blame them; reviewing AI-written code as a lay user is mind-numbingly unpleasant, and after a while you start to realize that you’re mostly just in the way.

And even if you find some way to navigate these thorny challenges, sometimes the AI just ignores you anyway. Our saving grace right now is that none of these machines are yet powerful enough to delete our species.

Dispatches from Donald

Federal framework or state patchwork? A false choice

Logan Kolas (American Consumer Institute) and Adam Thierer (R Street) argue in the Washington Post that Congress should adopt the White House’s federal AI framework in order to forestall a “fragmented” approach by the states. They describe “catastrophic risk” in the same breath as “child safety,” characterizing state legislation to address either concern as a needless measure that would effectively ban chatbots, if not AI models in any form.

At the same time, Kolas and Thierer speak negatively of bans on using AI for mental health, with nary a mention of the very real harms that are caused by AI: “Chatbots can be a safer alternative to unreliable online forums like Reddit and WebMD,” they say, but there are no ongoing lawsuits alleging that Reddit or WebMD facilitated someone’s suicide, whereas the list of wrongful death cases against AI companies is only growing: Garcia v. Character Technologies, Raine v. OpenAI, Gavalas v. Google... I am deeply concerned about the “catastrophic risk” that Kolas and Thierer simply write off, but it is also irresponsible and simply incorrect to act as if the models that exist right now are not causing anybody any harm.

The problem is that, as frontier models become increasingly advanced, their capacity for harm will grow as well. If we can’t stop the models from misbehaving now, then how can we expect to adequately control them when the stakes are much higher?

Elon Musk v. Altman et al: Trial begins ahead of IPO

Yesterday (April 28) marked the beginning of Elon Musk’s lawsuit against Sam Altman and OpenAI. Musk characterizes Altman and others as having “wrongfully profited” by transforming OpenAI from a nonprofit organization to a for-profit entity. Microsoft is a co-defendant in the lawsuit. Per CNN, Musk is seeking $130 billion in damages, to return OpenAI to a fully-nonprofit structure, and to remove Altman and Greg Brockman (another cofounder and current president of OpenAI) from the board (worth noting is that the New York Times gives the figure as $150 billion).

In Musk’s version of events, OpenAI would not exist without him. He came up with the name and the idea, invested $44 million in its early years, and claims to have been responsible for recruiting the majority of its best employees. This includes Ilya Sutskever, its cofounder and former chief scientist (worth noting is that Sutskever left OpenAI after assisting with the temporary removal of Sam Altman from the company). Musk left not because of bad blood, but because (his attorney says) he had “stuff going on in his other businesses.” Permitting OpenAI to carry on as a for-profit company would be like issuing a license to loot charities, to use phrasing picked up in several outlets.

William Savitt, OpenAI’s attorney, counter-claims that Musk wanted OpenAI to be structured as a for-profit company to start with, and after he left did not raise concerns about the for-profit restructuring with Altman, Brockman, or Microsoft. He is not interested in justice but rather in crippling a competitor: Musk now owns his own AI lab, xAI, which is also a for-profit company. In this story, Musk used his investment to bully other founders, wanted to merge OpenAI with Tesla, and left because he wouldn’t be made its CEO. According to The New York Times, Musk admits that he was not entirely opposed to a for-profit company, saying that a small for-profit component was tolerable “as long as the tail did not wag the dog,” but his cofounders wanted too much equity – at least partly corroborating OpenAI’s version of events.

After Musk and Altman discussed each other on social media this past Monday, the judge warned them against trying to influence the trial; they agreed to remain quiet about the trial on social media. One major hurdle that Musk will have to clear beyond the merits of the case: Lots of people actively dislike him, while Sam Altman is less known and jurors expressed few opinions regarding him.

Whatever the outcome, the trial may be reputationally damaging for them both. Evidence submitted to the court includes hundreds of pages of communications, some of which the Washington Post describes as “unflattering.” Much of the trial record remains sealed or has not yet been admitted. An underexplored angle is that, if more becomes public, we may get concrete details not just about how a frontier lab talks in private about catastrophic risks, but about whether that language shifts when talking with investors. In current proceedings, Musk is the only party described as still discussing AI safety in the courtroom.

The trial comes ahead of, and may derail, OpenAI’s plan to go public later this year. BBC reports that a verdict from the jury is expected in late May, followed by a judgment from Judge Gonzalez Rogers.

The analyses and opinions expressed on AI StopWatch reflect the views of the individual analysts and the sources they cover, and should not be taken as official positions of the Machine Intelligence Research Institute.

Discussion about this post

Ready for more?