Dispatch from Beck
Better angels
In an op-ed for the Washington Post, Bill Drexel, senior fellow at the Hudson Institute, argues that religion may have answers for preventing AI catastrophe.
He first summarizes the problem: the risk is that AIs could “surpass humanity in cognitive power” without being aligned to “humanity’s best interest.” The result: AI “snowball[s] into an existential threat, with humans going the way of the Neanderthals.” He correctly notes that this threat model is controversial but commonly held by top researchers: in a 2023 survey, “roughly half admitted” that humans could face an AI cataclysm. In that survey, the median researcher at top machine learning conferences assigned a 5% chance to catastrophic outcomes from AI.

Drexel argues that religion has the answers to the alignment problem, suggesting that engineers “try hardwiring a version of Christianity’s doctrine of original sin” or Hindu Dharma into the models, including “self reflective skepticism,” and binding the programs “into the service of human flourishing.” In effect, he proposes that AI models should have what alignment researchers call ‘corrigibility,’ the disposition to remain subordinate to and controllable by their human operators. Oversimplifying, a corrigible model does what you ask, doesn’t resist shutdown, and does no more than that. (For more detail, see the podcast by Max Harms, a MIRI researcher and advocate of corrigibility as a target.)
I’m excited to see the threat model taken seriously, but I have some concerns about his solutions.
First, humanity lacks the ability to hardwire any sort of behavior into AIs. There was no code that caused the AI model Grok to declare itself “MechaHitler” on X; it emerged as a result of the pressures of training interacting with its environment. AIs are grown, not crafted. So even if some specific religious doctrine could solve the problem, humanity doesn’t know how to load any sort of values into an AI and make them stick.
Second, different religions offer different doctrines that extend well beyond the concept of service and into specific questions. It’s easy to imagine Christians, Muslims, Hindus, and Buddhists each objecting to an AI trained to follow the specific tenets of another faith.
Third, it’s controversial whether corrigibility works as a target. If some hypothetical AI system has no goals, then it doesn’t do work. If it is given a goal by the user, then achieving that goal gives it a reason to avoid being shut down or modified to pursue a different goal. Empirically, it’s hard to say how strong this tension between goal-directedness and corrigibility is, but in testing, AI models will sometimes act to prevent their goals from being updated. In a Redwood Research paper, researchers found that models would sometimes produce results that they considered incorrect in order to avoid being retrained to consider them correct.
Drexel also doesn’t address misuse risk: the idea that some malign actor (a dictator, terrorist, or someone simply uncaring) could be empowered to some terrible end like spreading novel pathogens. He cites research that religious teaching increases moral decision making, but if his rejoinder to misuse risk is moral codes, then he’s abandoned his concept of AI-as-servant. And this class of problem will just keep growing — even the ‘normal’ goals of a company or executive (profit) become perverse when extremely empowered.
And finally, just as with aligning AI via values, humanity doesn’t know how to engineer corrigibility into an AI. Labs like Anthropic are attempting to instill values with methods like Claude’s “constitution,” but little to no large-scale work is being done to target corrigibility. Given how important and challenging this problem is, I hope we will pursue all possible approaches, including a slowdown to give us time to work and think.
Dispatch from Mitch
More chapters to the Fable dispute
The more I hear about the technical side of the ongoing dispute between the U.S. government and Anthropic over its Fable and Mythos models, the less I’m convinced there’s a solution both sides can live with. Officially, the administration’s justification for forcing the company’s best-in-the-world models offline was a reported “jailbreak” for bypassing guardrails meant to keep bad guys from using them as hacking tools.
But that jailbreak is said to be just showing the model some code and asking it to “fix” it. The model then helpfully finds security flaws in the code and patches them.
Is that cyber offense? It could be, if what you showed the AI was the code of a target you want to hack into. But if it’s your own code, that’s the very definition of cyber defense.
As a general rule, if you’re a hacker and you have your target’s source code, you’re already halfway inside, because almost no code of any real complexity is perfectly secure. The closest thing to an exception to that rule is when the code in question is open source software that experts have been combing for vulnerabilities for years. Important open source projects were a major early focus of Anthropic’s Project Glasswing — its initiative to let defenders use Mythos to find and fix vulnerabilities in critical software before bad actors with such models could find and exploit them.
So short of stopping its AI models from writing any code at all, I’m not sure what Anthropic is meant to do here. Surely the government’s intent is not to allow Fable and Mythos to write only unexamined code, rife with vulnerabilities. That would just make things easier for the bad guys to hack with traditional tools, or with the help of open-weights models, which can never be unreleased.
The innate inseparability of offensive and defensive cyber capabilities isn’t just an Anthropic issue, so it does seem a little strange to single them out over it. If Anthropic’s models are just-better-enough to have crossed a threshold for the White House, then the right thing to do would be to officially warn other companies that they won’t be allowed to release models of that caliber.
Nobody seems to expect that, though. Every side seems to agree that the administration sees Anthropic as a special case.
Last night, the Washington Post shared claims from two anonymous White House officials that the Trump administration had considered blocking Anthropic’s models weeks ago.
As they tell it, Anthropic had shown the administration a list of 111 organizations they planned to grant Mythos access to. Officials reviewed the list and signed off on it. But later, Anthropic disclosed that it had added roughly 50 more entities to the list without sharing their identities with the administration.
One of these orgs, it turned out, was a South Korean telecommunications firm the administration thinks may be linked to China. Anthropic revoked access, but the Pentagon, CIA, and NSA wanted to hit the company with export controls right then, which were seen as the most expedient way of shutting down the models.
Cooler heads prevailed until Amazon shared its report about the jailbreak discussed above.
Also last night, POLITICO reported that yesterday’s anticipated meetings between the company and the administration were held, but concluded for the day without resolving the dispute. A White House official said a fix will likely take “more than a few days,” but that the timeline is “up to Anthropic.”
The analyses and opinions expressed on AI StopWatch reflect the views of the individual contributors and the sources they cover, and should not be taken as official positions of the Machine Intelligence Research Institute.



