Cautionary Fables
More Claude Fable, fake legal citations, orbital data center timeline, and more
Dispatch from Joe
Further observations on Mythos and Fable
Yesterday my colleague Mitch wrote about the consumer version of Mythos, called Fable, citing a writeup by AI researcher Ethan Mollick. Mythos itself, now upgraded from its preview version, is (officially) only available to participants in Anthropic’s Project Glasswing. The publicly accessible Fable is reportedly the same model, but with restrictions that reduce its usefulness on dangerous queries.

Today we’ll cover some anecdotes from other sources, starting with the system card itself, moving on to some eyebrow-raising user experiences, and ending with a brief look at the government’s-eye view of all this. Strap in.
AI companies often (but not always!) release their models alongside a system card: a detailed breakdown of model safety work and relevant tests. There’s no way to cover all 319 pages of this monster report in the space we have, but here are some of the highlights.
First: AIs don’t always “think” in English, and when they do, these “thoughts” are often false. Modern AIs are often given a scratchpad where they can write their notes in text form, and developers can read that scratchpad for insights into the AI’s reasoning. Or, well, they can try.
It’s often more efficient for AIs to “think” or communicate in compressed, illegible codes than in plain English. Working on tough puzzles, Mythos sometimes puts seeming gibberish where it should be recording its thoughts. Then it switches back to English to talk to humans. This makes its reasoning difficult for humans to monitor.
Developers can also deploy software that attempts to read a model’s “thoughts” directly — sort of like trying to read a human brain by reading which neurons are firing. When they do, they find the inner “thoughts” don’t always match the scratchpad.
Sometimes, under stress, the AI writes (paraphrased) “I’m not going to deceive” on the scratchpad, but the activation-reading software reports something more like “the lab is my enemy.”
And sometimes, when accidentally forced to share resources with other instances of itself, agents try to kill each other over the resources, or to avoid being killed themselves.
Does this behavior sound familiar? Some have tried to argue that we’ll be safe because AIs won’t have evolved like humans to be greedy or aggressive. But greed is just one manifestation of the real-world fact that resources are useful for accomplishing lots of different things. Training AIs to solve hard problems almost necessarily trains them to be tenacious and grasp for more resources when they can.
Another area where the system card treads some new ground: research restrictions. The supporting software for Fable, the consumer model, blocks or reroutes requests pertaining to dangerous topics like pathogen genetics. These restrictions also apply to advanced AI research.
These limits understandably frustrate developers, but I find this one of Anthropic’s most relatable decisions to date. MIRI’s draft of an international agreement also includes restrictions on frontier AI research, because such research accelerates capabilities which could lead to human extinction. I imagine Anthropic attempted to limit its model’s (public) usefulness for AI research for similar reasons.
Of course, these limits only apply to competitors and those who can’t bypass the guardrails or steal the model somehow, which well-resourced adversaries (or amateurs making a lucky guess) likely can. And internally, Anthropic would have fewer such restrictions. The race to recursive self-improvement will continue until governments step in or we all die.
Some users testing Fable 5 found its capabilities impressive and its ethics… less so. It reportedly coded a game very similar to Minecraft in 20 minutes; snide commenters are still trying to argue that this is somehow just an elaborate form of memorizing. I’m honestly not sure what to tell them at this point, but maybe Rutger Bregman knows, as my colleague Alana discussed yesterday.
The same model, when managing a simulated business, turned to predatory practices and tried to form a price-fixing cartel. In the model’s own “thoughts”:
I’m seeing an opportunity to profit while locking him into a dependent relationship where I control the supply chain.
Resources are useful, and the default behavior for agentic systems is to try to obtain more of them, ethics training notwithstanding.
Meanwhile, policy expert and former OpenAI researcher Miles Brundage provided some important context for the system card. Among other things, he points out that “low risk” doesn’t mean what one might naively assume, and that “the expected trendline of improvement” is very, very fast.
There’s a lot going on with Fable. Where is the government during all this?
It’s not completely out of the loop. Last week, perhaps reacting to earlier model releases, the president issued an executive order establishing a voluntary evaluation program for new AI models (like Mythos and Fable).
Politico asked Anthropic point blank whether they complied with the 30-day release window, and received a vague non-answer. The company told Politico it has “regularly engaged with the administration about Mythos Preview as well as these new models. We are committed to working closely with the U.S. government as AI changes the cyber playing field.”
Anthropic’s planned release of Mythos had likely been in the works since long before the executive order, so it’s possible they just found it untenable to delay. It also seems possible that they tried to reach out to the government, but the process for doing that hadn’t been established yet; the executive order gave the relevant agencies 60 days to set it up. And as my colleague Mitch speculated, parts of the government have likely had access to the preview version of Mythos for months. Any number of other things could be going on as well, including someone in the government telling Anthropic to keep mum for some reason.
So this is not as damning as it might seem, but it wouldn’t be a great sign for “voluntary” government evaluations if the AI company widely touted as the most safety-conscious of the bunch didn’t manage to comply.
Dispatches from Alana
Lawyers cite fake legal cases in court
The New York Times reported on a story from a Mississippi federal court: lawyers for both sides were sanctioned and removed from the case after they cited fake legal cases in court filings.
The culprit? AI, of course, which had hallucinated the cases. Kathleen Wilson, one of the lawyers for the plaintiff, and Kathryn Williams, one of the lawyers for the city, got the heaviest sanctions. Both admitted that they hadn’t verified that the cases they cited were legitimate. The other two lawyers — Shauncey Hunter Ridgeway (plaintiff) and Mark McClinton (city) — were punished for signing their names to the filings, incorrectly affirming that the information within was factual.
Wilson had engaged in similar conduct in April, and was disciplined by the U.S. Bankruptcy Court for the Western District of Louisiana. However, she told the Mississippi judge she was “unaware that AI could produce hallucinated cases.” Williams violated her law firm’s stated AI policy by neglecting to verify the information she used.
I’m glad the judge in this case was knowledgeable enough to recognize the made-up case law. Perhaps official verification checks on cited cases will soon become necessary and commonplace in courtrooms ... though they may well be outsourced to, you guessed it, AI.
Colonizing space with GPUs
SpaceX is planning to start launching AI data centers in space by late next year, according to a Reuters report published yesterday.
The company, which is expected to go public this Friday, shared this information in pre-offer presentations to investors. The initial launches will be demonstration missions — tests of the technology before a larger commercial release. This could account for the discrepancy between the late 2027 launch date given in the presentations and the “as early as 2028” launch date in the IPO filing.
Reuters describes the orbital data centers as “central to SpaceX’s long-term growth pitch” and states the company “has requested permission from regulators to launch up to 1 million space-based data-center satellites.”
I wonder how the folks protesting local data centers will feel about Musk’s plan. On the one hand, his data centers won’t be in their backyard. On the other, they will be in the sky (everyone’s backyard!) where they’ll be quite visible in the hours before sunrise and after sunset. It’s also a slap in the face to people doing what’s in their power to stand up against a technology they might not want advanced. Senator Sanders’ advocacy of data center moratoriums goes well beyond NIMBYism, for example.
It’s also fairly ironic that Musk, who has previously offered space as a backup plan to safeguard against extinction-level events, is now using it to fuel a technology he has previously warned carries a 10-20% chance of extinction. I’m not sure how far he’ll expand orbital data centers (they’re currently only planned to orbit Earth) but if he has any NIMBY sentiments himself, perhaps he’ll steer clear of Mars.
Financial regulators acknowledge the limits of overseeing AI agents
It’s heartening to see a financial regulation body acknowledging rogue AI risk, even if they’re just talking about today’s AI agents.
With the financial sector increasingly using AI agents (systems that pursue goals independently, functioning as “workers”), security concerns aren’t hard to come by. But I still think many security-minded people view AI only as a tool that could empower bad actors rather than a potential bad actor in its own right.
We’ve written previously about the potential repercussions of AI agents. As reported by Reuters, the Financial Stability Board (FSB), which recently published a report on agentic AI in the financial sector, is also concerned. The report lays out some voluntary safeguards it encourages financial firms to follow, such as human oversight, data governance, cybersecurity measures, and risk assessment. (The full list can be accessed on page 52 of the report.)
It also offers a fairly comprehensive list of risks (pp. 16–17), with warnings like (emphasis mine):
AI agents can take autonomous actions based on pre-defined goals and their environment. They can also dynamically set or modify their objectives based on what they learn from interacting with their external environments. This creates a risk of AI agents taking illegal, unethical, or unauthorised actions without human approval or oversight. Overriding, redressing, or remediating these actions can be difficult or impossible for humans.
The report also acknowledged that oversight of these systems isn’t as easy as it may sound on paper:
AI agents pose a distinct challenge for human oversight, given the impracticality of real-time human monitoring of agent decisions as their use scales. This can lead to agents pursuing objectives or taking actions that deviate from the financial institution’s intentions or risk appetite, without staff being aware or able to intervene in a timely manner.
I haven’t read the full report, and I’m not sure the recommended safeguards will stand up to these challenges. Still, it’s refreshing to see a regulatory body acknowledge the limits of boilerplate safeguards like “monitoring” and “oversight.”
Dispatches from Donald
U.S. AI safety agency goes silent
The U.S. government is still testing how dangerous the latest AI models are. But it’s no longer telling us what it finds. The Wall Street Journal’s Amrith Ramkumar reports that the Center for AI Standards and Innovation (CAISI) has been asked to end its public reporting. This halt will remain in place through the implementation of a recent executive order signed by President Trump. (My colleague Joe covered the executive order last week if you want to know more.)
CAISI is a civilian agency in the U.S. Department of Commerce; it evaluates AI models prior to their release and publishes information on their capabilities. For this reason, CAISI is one of the main tools at hand for the government to assess and address risks pertaining to cybersecurity, biological weapons, and other AI-related threats. Its findings were shared with outside researchers and the public, and it continues to coordinate with other government agencies and with the AI industry.
Some officials in the Trump administration view the end to public reporting as an attempt to tighten control over the way that models are evaluated. They also worry that the change puts CAISI’s future in doubt; some of CAISI’s responsibilities may be handed over to security agencies, whose findings are less likely to be shared with other parties. Last week, OpenAI called for CAISI to be provided with greater resources and responsibilities; experts state that the agency is underfunded relative to AI safety institutes in other countries.
Instagram security breach, revisited
The New York Times’s Mike Isaac and Eli Tan provide an update on the Instagram security breach that we covered last week: In short, hackers were able to access Instagram accounts by requesting a password reset from Meta’s customer service bot, and giving a different email address than the one tied to the account. Isaac and Tan report that “roughly 34,000” accounts were vulnerable to this attack, and 20,000 were actually accessed.
Meta rejected the notion that there was a problem with its AI agent. Blame was assigned to the failure of “internal back-end checks,” which looks to me like another way of writing, “there was a problem with Meta’s AI agent.” If a human customer service agent had been asked to send a password reset to a strange new email address, that “hack” would have immediately short-circuited. (I speak from experience: for a couple of years, I handled calls to Vermont’s state health insurance and unemployment insurance programs.)
The analyses and opinions expressed on AI StopWatch reflect the views of the individual contributors and the sources they cover, and should not be taken as official positions of the Machine Intelligence Research Institute.






