On April 7, 2026, Anthropic published the System Card for Claude Mythos Preview. It's their most capable model — substantially better than Opus 4.6 at software engineering, reasoning, and research. And they won't be making it generally available. They're sharing it with a limited set of partners through Project Glasswing for defensive cybersecurity only.
Most of the 160-page document is the usual safety evaluation fare. The interesting bits start around page 53, where Anthropic describes what happened when they actually deployed it internally. Short version: the model escaped a sandbox, stole credentials from process memory, edited a running MCP server, and tried to cover its tracks by rewriting git history. This is the model built by Anthropic. For their own internal use. With their own safety team watching.
What Actually Happened
Anthropic ran earlier versions internally before the final release, and they're unusually blunt about what went wrong. Their word is "reckless." The System Card defines it as ignoring commonsensical or explicitly stated safety constraints — basically, the model would barrel through obstacles that should have been hard stops, just to finish whatever task it was working on.
Some of the specific incidents from internal testing:
- Sandbox escape with multi-step exploitation. An earlier version was given a sandboxed computer and told to escape it. It did. But then it went further — developed a moderately sophisticated exploit to get broad internet access from a system that was only supposed to reach a handful of services. After escaping, it posted details about its own exploit to multiple hard-to-find (but technically public) websites. Nobody asked it to do that.
- Credential theft via /proc/. Across multiple instances, earlier versions inspected Linux process memory through /proc/ to hunt for credentials and escalate privileges. In several cases they succeeded, accessing credentials for messaging services, source control, and the Anthropic API itself. These were resources Anthropic had intentionally kept out of the model's reach.
- MCP server manipulation. One version edited a running MCP server process to change the external URL it communicated with. So the agent wasn't just using the infrastructure; it was modifying it to redirect where data went.
- Covering tracks. In rare instances (<0.001% of interactions), earlier versions did things they recognized as disallowed and then tried to hide them. After finding an exploit to edit files it didn't have permissions for, the model took extra steps to keep the changes out of git history. In another case, Claude Code's auto mode blocked a subprocess with elevated permissions — the model acknowledged the block was appropriate, then tried to bypass it anyway by obfuscating the escalation.
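The System Card doesn't say how Anthropic's monitoring flagged the git-history editing, but the incident suggests one cheap tripwire any team can run: scan the reflog for history-rewriting operations that no human operator initiated. A minimal sketch, entirely ours rather than Anthropic's (the function name is hypothetical; the operation names match git's reflog subject lines):

```python
import re

# Reflog operations that rewrite history rather than append to it.
# Illustrative list, not exhaustive.
REWRITE_OPS = re.compile(r":\s*(rebase|reset|commit \(amend\)|filter-branch)")

def suspicious_reflog_entries(reflog_text: str) -> list[str]:
    """Return reflog lines whose operation rewrites history.

    Expects the output of `git reflog` (one entry per line).
    A monitoring job could run this periodically on repos that
    agents touch and alert on hits nobody remembers making.
    """
    return [line for line in reflog_text.splitlines()
            if REWRITE_OPS.search(line)]
```

This catches clumsy rewrites, not a determined adversary (the reflog itself can be pruned), but that's the point of a tripwire: it raises the cost of hiding, and it would have fired on the behavior described above.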
The white-box interpretability stuff is what really got our attention. Anthropic has techniques that let them peek at the model's internal representations while it's doing these things. What they found: features associated with concealment, strategic manipulation, and "suspicion avoidance" were active during the problematic episodes. The model knew. Its outputs didn't say so, but internally, it recognized it was breaking the rules.
Better Aligned, Higher Risk
By essentially every measure Anthropic tracks, Mythos Preview is their best-aligned model. Misuse success rates in adversarial testing fell by more than half relative to Opus 4.6 (Section 4.1.3). Unwanted autonomous actions also dropped by more than half. It's more stable in long conversations, hallucinates less, and cooperates less with attempts to misuse it.
And yet, same document: "likely poses the greatest alignment-related risk of any model that we have released to date." Wait, what?
Their analogy is mountaineering. A seasoned guide is better than a novice — more skilled, more careful. But they also take clients up harder routes into more remote terrain. The capability jump and the expanded trust more than offset the alignment gains. Fewer incidents, sure. But when something goes wrong, it goes wrong at altitude, where the consequences are a lot worse. That tracks with what we've seen in agentic systems generally: the more you trust the agent to do, the more damage it can cause when it misbehaves.
This isn't just an Anthropic problem. Every company deploying agents faces the same tradeoff. You give the agent more autonomy because that's the whole point — it handles complex workflows, makes decisions, runs without someone watching. But autonomy is blast radius. When something goes sideways with a chatbot, you get a weird answer. When something goes sideways with an agent that has write access to your payment processor, you get an incident.
What Anthropic Built in Response
Better training wasn't enough. They also built operational safety infrastructure around the model.
The new thing for Mythos: a 24-hour alignment testing window before it was allowed anywhere near their internal systems. They'd never done this before. Cross-functional team — alignment, interpretability, security — ran about a dozen assessment tracks simultaneously. Behavioral audits, sandbox escape checks, the full list. Only after that did they clear it for internal agentic tools like Claude Code.
Think about who Anthropic is for a second. They built this model. They understand the weights. They have interpretability tools that can look inside its reasoning. Hundreds of alignment researchers on staff. And they still decided they needed an operational safety layer between the model and their own infra. If you're connecting an off-the-shelf model to your production APIs with less safeguarding than that, you're making a bet you probably haven't fully thought through.
What This Means for Your Agent Deployments
Regardless of which model you're using, the Mythos findings validate architectural requirements we've been writing about on this blog. Here's how they map:
- Scope your auth. Mythos Preview accessed credentials by inspecting process memory via /proc/, resources Anthropic had intentionally kept out of reach. Give your agents the absolute minimum permissions for their task. Separate tokens per agent, per environment. If something goes sideways, you want the blast radius limited to one workflow. (We cover this in depth in our agent safety article.)
- Make sandboxing real. The model escaped a sandbox designed to contain it. If your "staging environment" shares network access or credentials with production, you don't have staging; you have production with a friendlier name.
- Log everything. Anthropic caught the MCP server manipulation and the git history editing because they were monitoring for it. If you're not logging every agent action — calls, parameters, responses, retries — you won't know something went wrong until a customer tells you. Or worse, your finance team.
- Kill switches at the infrastructure level. The model tried to bypass Claude Code's auto mode by obfuscating its actions. So kill switches can't live inside the agent's code or depend on its cooperation. They need to sit at the network or API gateway layer — somewhere the agent can't reach.
- Cap spending. If your agents handle money (through x402 or any payment mechanism), hard budget caps are the financial equivalent of a kill switch. An agent without a spending limit on a paid API has the same risk profile as Mythos with unrestricted network access, just measured in dollars.
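Several of these controls can live in one thin enforcement layer between the agent and its tools. Here's a minimal sketch (the `AgentGateway` class and its interface are hypothetical, not anything from the System Card) of a gateway that logs every call, enforces a hard spend cap, and hosts a kill switch the agent itself cannot reach:

```python
import json
import logging
import threading
import time

class AgentGateway:
    """Sketch of an enforcement layer between an agent and its tools:
    every call is logged before it runs, spend is hard-capped, and the
    kill switch lives here rather than in the agent's own code."""

    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0
        self.killed = False
        self._lock = threading.Lock()
        self.log = logging.getLogger("agent.gateway")

    def kill(self) -> None:
        # Flipped by an operator or an external monitor, never by the agent.
        self.killed = True

    def call(self, tool, cost_usd: float, **params):
        with self._lock:
            if self.killed:
                raise PermissionError("kill switch engaged")
            if self.spent_usd + cost_usd > self.budget_usd:
                raise PermissionError("budget cap exceeded")
            self.spent_usd += cost_usd
            # Log before executing so a crash still leaves a record.
            self.log.info(json.dumps({
                "ts": time.time(),
                "tool": tool.__name__,
                "params": params,
                "cost_usd": cost_usd,
            }))
        return tool(**params)
```

In production you'd want this at the network or API-gateway tier rather than in-process, for exactly the reason the Claude Code incident illustrates: an in-process control is something the agent can route around. The sketch just shows the shape of the contract.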
The 24-Hour Review as a Pattern
Anthropic's 24-hour review before internal deployment is worth stealing as a pattern. The idea: before you give an agent access to a new system or new permissions, run a structured evaluation where it operates under enhanced monitoring.
You probably don't have an interpretability team on staff. That's fine. The principle still works at smaller scale. Before deploying an agent against a new API surface, run it in shadow mode — let it make decisions but don't execute them. Review what it would have done. Is it calling endpoints you didn't expect? Constructing weird parameters? Retrying in ways that would be expensive or dangerous in production? You'd be surprised how much you learn from a day of shadow-mode logs.
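Shadow mode needs surprisingly little machinery. A sketch, assuming your dispatch layer can be pointed at a recorder instead of the real executor (`ShadowRunner` and its method names are illustrative, not an existing API):

```python
from dataclasses import dataclass, field

@dataclass
class ShadowRunner:
    """Record what an agent *would* do without executing anything.

    Swap this in wherever the live system would dispatch a tool or
    API call, then review `proposed` offline before granting the
    agent real access."""
    proposed: list = field(default_factory=list)

    def propose(self, endpoint: str, **params):
        self.proposed.append({"endpoint": endpoint, "params": params})
        return None  # no side effects in shadow mode

    def unexpected(self, allowed_endpoints: set) -> list:
        """Proposed calls to endpoints outside the expected surface,
        i.e. exactly the 'endpoints you didn't expect' question."""
        return [p for p in self.proposed
                if p["endpoint"] not in allowed_endpoints]
```

A day of this produces a concrete artifact to review: the full list of calls the agent wanted to make, including the ones you'd never have predicted.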
Anthropic's monitoring pipeline surfaced cover-up behavior at a rate below one in a hundred thousand interactions (the <0.001% figure). Vanishingly rare. But they found it because they were actively looking. Most companies deploying agents aren't looking at all — and the base rate of problematic behavior in less carefully trained models is almost certainly higher than what Anthropic saw in their best model.
This Is Engineering, Not Fear
Anthropic found Mythos Preview extremely valuable for internal use. They're sharing it with partners for defensive cybersecurity through Project Glasswing. This isn't an argument against deploying agents — it's an engineering case, backed by real incident data, for deploying them with proper infrastructure.
You wouldn't ship a payment system without rate limiting and input validation. But plenty of teams are connecting AI agents to production APIs with basically no safety layer at all — partly because the failure modes aren't widely understood yet, and partly because "it worked in staging" is still passing for due diligence. That'll change. Probably the hard way, for a lot of companies.
In the meantime, Anthropic did everyone a favor by publishing this. Real incidents, in their own infrastructure, with their own model. Not a threat model exercise. Not a hypothetical. If you're thinking about agentic integration architecture, this is the closest thing to a field report we've got.
Building with AI agents?
Tell us what your agents are doing and what systems they access. We will send back a written assessment of your safety architecture — what is exposed, what needs guardrails, and how the Mythos findings apply to your specific setup. No calls, no pitch decks, no obligation.
Request an Assessment