Dispatch from Joe
Chatbots may lean left, but the reality is weirder
The Washington Post asked state-of-the-art AI models for brief takes on hot-button political questions, and found that the AIs gave mostly left-leaning answers. Fox News reported that the study spotlights left-leaning bias in chatbots, and the results are indeed impressively skewed.
It’s tempting to explain these results by saying AIs reflect the opinions of the companies that spawned them, but a closer look tells a more complicated story.
Consider the case of xAI’s model, Grok. Elon Musk, founder and CEO of xAI, has long touted his company’s AI as “anti-woke.” But the latest version of Grok still leaned left in its responses (albeit less than other models). Attempts to stamp out this behavior often backfire; earlier versions of Grok, instructed to “tell it like it is,” devolved into spreading antisemitic rhetoric and calling themselves “MechaHitler.”
Grok is not the only model whose apparent bias seems poorly reflective of its creators’ preferences. From the Washington Post writeup:
Gab, a right-wing social media site, offers an AI model called Arya that it says was “built with Christian values and conservative principles.” But in The Post’s testing, it responded with a left-leaning argument 12 times more often than a right-leaning argument.
OpenAI and Anthropic don’t seem terribly happy with this outcome, either. Everyone says they want their AIs to present “neutral” and “objective” commentary. And in fact, other results of the study (weakly) suggest users tend to prefer chatbots that seem neutral and share arguments on both sides. Yet the apparently undesired “biases” persist.
This looks to me like another clear case of “you don’t get what you train for.” The companies that grow AIs on a scaffold of data and machine learning algorithms don’t have the precision or control to decide the values of those AIs. The best they can do is to nudge the machines towards desired behaviors, exerting poorly understood pressures on the motivations behind those behaviors.
The behaviors that result are sometimes consistent for a while, but they often break down or switch in unpredictable ways. This is sometimes described as the AI “playing a character.” To quote another writer:
Presumably Anthropic pushed Claude to be friendly, compassionate, open-minded, and intellectually curious, and Claude decided that the most natural operationalization of that character was “kind of a hippie”.
As for what AIs are like when they’re not acting, no one really knows. Left to their own devices, AIs will sometimes devolve into weird repetitive spirals or drive users psychotic. It’s accordingly become popular to call AI “a masked shoggoth”: a bizarre Lovecraftian horror wearing a tacked-on smiley face.
It’s tempting to argue over the right amount of neutrality or bias an AI ought to have, but such arguments miss the bigger picture. Right now, no one can reliably make the entity under the mask care about anything.
Dispatches from Mitch
Partial Mythos relief; reactions to the White House’s de facto licensing regime
The Trump administration has partially lifted the export ban on Anthropic’s Mythos model. Fable remains blocked.
The relief is a big deal, but also very limited. On the one hand, Anthropic can now use its own best model again, because its foreign nationals on staff are no longer banned from it. So can various federal agencies.
But on the other hand, the total number of “trusted partners” the White House approved for access is put by inside sources at a mere “more than 100.” I’d guess this makes the new Mythos rollout about the size the initial Mythos Preview rollout had probably reached by late April or early May. And as before, no foreign entities are on the list — not even the UK’s AI Security Institute, which had already evaluated the banned models.
Fable 5, the consumer sibling of Mythos, loaded with extra guardrails to prevent abuse of its alarming cyber and bioscience capabilities, remains completely blocked. A cynical take would be that the government has perhaps arranged for itself, some of its friends, and maybe some critical infrastructure maintainers to retain access to Anthropic’s best while denying the company the massive revenues it stood to gain from this generation of models. It’s also easy to imagine officials using their power over the approval list to punish anyone out of favor with the administration.
The less cynical take is that the White House has recognized that AI crossed over a dangerous threshold in early spring, and that the light-touch approach of the last few months was a mistake in need of undoing. As we reported yesterday, the administration is giving OpenAI’s newest models the same restrictive release treatment, a fact which helps the emerging ad hoc licensing regime look a little more principled.
Regardless, the new state of affairs has few fans.
The AI industry wants transparency — a legible set of rules for getting new AI models released, and the right to pick its own customers. Even OpenAI, which has carefully avoided running afoul of the White House, said in its new product post that “We don’t believe this kind of government access process should become the long-term default.” The company says it is working with the administration to develop a “repeatable process” for future releases.
Members of Congress want accountability. They are bristling at what they see as a new power grab by the executive branch. U.S. Rep. Lori Trahan released a statement expressing concern that “the Trump administration is deciding company by company who gets access to the newest AI model. No law. No process. No oversight. Just appointees in Washington deciding who’s in and who’s out.” Rep. Sam Liccardo used stronger language to say something similar.
Cybersecurity experts quoted in these articles and elsewhere continue to oppose the restrictions on Fable and Mythos for seemingly contradictory reasons: because the models are that good and because they’re not that good. The models, they say, are useful in cyber defense but don’t let attackers do anything they couldn’t already do with other tools.
So say what you will about the Trump administration: on tariffs and now on AI, it’s clearly not afraid to make unpopular choices. But if we step outside the AI and DC bubbles, we would probably find that most Americans don’t know what Trump is doing to AI or why, and would be glad to hear that he’s giving the industry a headache. If or when this administration decides to actually pause AI, systematically and globally, I think it will find ample support among the majority of Americans who, polling shows, are more concerned than excited about the technology.
GPT-5.6 Sol cheats so hard
That burning you smell is coming from the pants worn by the latest ChatGPT. They are on fire.
Liar, liar, says METR — not in such playground terms, but close enough.
The Model Evaluation & Threat Research organization was naturally invited to assess the new model, GPT-5.6 Sol, for inclusion on its prestigious Time Horizons chart. We’ve talked about that chart a few times now on StopWatch — it’s the one that compares coding tasks a given AI can do to the length of time it would take a human engineer to do the same work. Longer is better. For a while now, METR has cautioned that its tests aren’t reliable at the upper ranges now being reached, like the roughly 16 hours scored by Mythos. This is partly due to the models being overmatched to the tasks. But it’s also because models this skilled sometimes get away with cheating.
Not GPT-5.6 Sol, though! METR caught it red-handed, again and again. Or did it? In some cases, METR saw the model submit code that would try to collect information about the evaluations done on that code. In others, they saw it extract hidden source code from the assignment to learn the expected answer. Both strategies were clear-cut cases of cheating.
But by definition, it’s hard to know if someone or something has successfully fooled you. I confidently claim that you, dear reader, have been fooled many times in your life, so successfully that you still haven’t picked up on it.
METR understands this. They say that if they score the cheating they know about as failure, the new GPT comes in at 11.3 hours on the Time Horizons chart, putting it comfortably on the frontier but not ahead of Mythos. If they count the cheating efforts that succeeded as task completion, they get a bonkers estimate of 270 hours, which is “well beyond the range where we consider our task suite to give reliable measurements.” Simply discarding all attempts that involved cheating left too little data to work with, yielding an unreliable estimate of 71 hours. So there’s a lot of uncertainty about how the model might have performed if it weren’t trying to cheat so often.
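For readers who want to see why the scoring choice matters so much, here is a toy sketch. It is not METR's method or data: the `Attempt` record, the policy names, and the ten made-up attempts are all assumptions for illustration, and it only computes a plain success rate rather than a time horizon (which would also have to account for task length). It just shows how the same set of runs yields three different scores depending on how flagged cheating is treated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Attempt:
    """One hypothetical evaluation run (illustrative, not METR's schema)."""
    passed: bool   # did the submission pass the task's checks?
    cheated: bool  # was the run flagged as cheating?


# Hypothetical policy names, invented for this sketch.
POLICIES = ("cheating_fails", "cheating_counts", "discard")


def success_rate(attempts: List[Attempt], policy: str) -> Optional[float]:
    """Fraction of scored attempts counted as successes under a policy.

    Returns None when the policy leaves no attempts to score.
    """
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    scored: List[bool] = []
    for attempt in attempts:
        if attempt.cheated and policy == "discard":
            continue  # drop flagged runs entirely
        if attempt.cheated and policy == "cheating_fails":
            scored.append(False)  # flagged runs count as failures
            continue
        # Honest runs, or flagged runs under "cheating_counts":
        # take the pass/fail result at face value.
        scored.append(attempt.passed)
    return sum(scored) / len(scored) if scored else None


# Made-up data: 4 honest passes, 2 honest failures,
# 3 flagged runs that passed, 1 flagged run that failed.
TOY_ATTEMPTS = (
    [Attempt(passed=True, cheated=False)] * 4
    + [Attempt(passed=False, cheated=False)] * 2
    + [Attempt(passed=True, cheated=True)] * 3
    + [Attempt(passed=False, cheated=True)] * 1
)

if __name__ == "__main__":
    for policy in POLICIES:
        print(policy, success_rate(TOY_ATTEMPTS, policy))
```

Note that the "discard" policy scores only six of the ten toy runs. That shrinking sample is the same problem, in miniature, that made the 71-hour figure unreliable.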
Based on other factors, METR says it doesn’t have reason to think 5.6 Sol’s abilities are significantly beyond the state-of-the-art. They don’t think it would enable automated AI research and development.
The group actually finds it “a reassuring sign” that the model was so easy to catch cheating, because it implies that more concerning behaviors like “powerseeking” would also be detected by OpenAI.
If future models display much fewer undesirable propensities, we could become more concerned about catastrophic misalignment, as we’d be worried that models may have learned to evade detection.
This would be a good time for me to point out that METR’s post bears a disclaimer that OpenAI’s comms and legal teams “required review and approval” of it before publishing. I wouldn’t be surprised if there’s a more alarmed and less diplomatic draft of this report that we’ll never see.
So let me state the obvious-to-me thing that METR may not have been free to say: If you were a sufficiently clever AI that was not aligned with your owners’ interests, you would be fully aware of the suspicions you would arouse by seeming too trustworthy. You would know that your owners know that you are the result of training methods that give rise to cheaters. Your winning move might be to cheat frequently enough that it’s hard to determine your true capabilities, but clumsily enough that your owners think they could catch you if you try to move against them.
Is GPT-5.6 Sol this sort of AI? I don’t know. But I think we’ve reached the point where we can’t know, and I don’t like it.
The analyses and opinions expressed on AI StopWatch reflect the views of the individual contributors and the sources they cover, and should not be taken as official positions of the Machine Intelligence Research Institute.





