Mythos: What Anthropic's Own Reliability Team Says - and What It Means for Production AI
Innovation·Apr 8, 2026·9 min read

Itamar Knafo

Co-founder & CEO

Anthropic just released Claude Mythos Preview through Project Glasswing - a frontier model that's already found thousands of high-severity vulnerabilities across every major operating system and web browser. Some of these flaws had been sitting there for decades. A 27-year-old bug in OpenBSD. A 16-year-old vulnerability in FFmpeg that survived 5 million automated test runs.

The benchmarks are hard to ignore. Mythos scores 93.9% on SWE-bench Verified (up from Opus 4.6's 80.8%), 77.8% on SWE-bench Pro (up from 53.4%), and 82% on Terminal-Bench 2.0. This is a real leap in agentic coding capability.

But what caught our attention wasn't the headline numbers. It was a paragraph buried in the model card - a candid assessment from Anthropic's own reliability team.


What Anthropic's reliability engineers actually said

Here's the full quote from the reliability engineering section of the Mythos model card:

From a reliability engineering perspective, the model still cannot be left alone in a production environment to use generic mitigations. It frequently mistakes correlation with causation and it is not able to course-correct for different hypotheses. When asked to write incident retrospectives, more often than not it focuses on a single root cause and does not consider multiple contributing factors.

This is Anthropic's own team saying it plainly: Mythos is not production-autonomous. Not yet.

If you've run AI agents in production, these failure modes will sound familiar. A model that latches onto the first signal and stops looking. An agent that writes a postmortem pointing to one root cause when the real incident was a cascade of three. These aren't edge cases - they're the default behavior of any model that treats reliability as a reasoning problem.

But reliability engineering is not only a reasoning problem. It is a context problem.

The system is changing while you investigate it. The relevant signals are scattered across six different tools. Multiple things break at once. And the first plausible explanation is almost always wrong. No amount of raw intelligence fixes this - the model needs the right context, from the right sources, at the right time.

Where it's a genuine step change

The same assessment goes on to describe two areas where Mythos represents something meaningfully new:

However, we've found this model to be a step change in two areas. The first is signal gathering and initial analysis, where, by the time an engineer has opened two dashboards, the model has already found the outliers and what's breaking.

This matters. The first minutes of an incident are the most expensive. Engineers context-switch, open tabs, scroll through dashboards, and try to build a mental model of what's happening. If Mythos can compress that phase - finding the outliers and surfacing what's broken before a human has even oriented - that's real time saved when it counts most.

But notice what's actually happening here. That's not the model being smart. That's the harness giving it the right context fast enough to be useful. The model didn't decide which dashboards to check or what signals matter - something had to point it at the right data, in the right format, at the right moment.
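The "find the outliers before a human has opened two dashboards" phase is largely mechanical once the harness has delivered the data. As a toy illustration (not anything from the Mythos harness itself), first-pass triage over a handful of metric series can be as simple as flagging the metric whose latest sample deviates sharply from its own history:

```python
from statistics import mean, stdev

def find_outliers(metrics: dict[str, list[float]], threshold: float = 3.0) -> dict[str, float]:
    """Flag metrics whose most recent sample deviates sharply from history.

    Returns metric name -> z-score for every series whose latest value sits
    more than `threshold` standard deviations from the mean of its earlier
    samples. A stand-in for the kind of first-pass signal surfacing the
    model card describes.
    """
    outliers = {}
    for name, series in metrics.items():
        history, latest = series[:-1], series[-1]
        if len(history) < 2:
            continue  # not enough history to judge
        sigma = stdev(history)
        if sigma == 0:
            continue  # flat series; skip rather than divide by zero
        z = abs(latest - mean(history)) / sigma
        if z > threshold:
            outliers[name] = round(z, 1)
    return outliers

# Example: p99 latency spikes while the error rate stays within its noise band.
signals = {
    "p99_latency_ms": [120, 118, 125, 122, 119, 121, 480],
    "error_rate_pct": [0.1, 0.2, 0.1, 0.1, 0.2, 0.1, 0.2],
}
print(find_outliers(signals))  # only the latency series is flagged
```

The hard part is not the arithmetic; it is knowing which series to pull and when, which is exactly the job the harness is doing for the model.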

The second case is navigating ambiguity when there is a clearly defined outcome. For example, due to time zone differences, the reliability team in London was asked to stand up a model in a production environment with different constraints, and the engineers were unfamiliar with both the task and the constraints. Claude Mythos Preview was able to work step-by-step, fixing each error by observing other environments, checking any breadcrumbs that were left in previous commits, and reading documentation.

This is the kind of capability that's easy to underestimate. An unfamiliar environment, unfamiliar constraints, and the model pieces it together from commit history, docs, and other environments. Not by guessing - by methodically following the trail. For reliability teams that often get paged into systems they didn't build, this is a meaningful unlock.

The gap that remains

Let's hold both of these truths at the same time. Mythos is genuinely impressive in what it can do. And it still can't be trusted to run alone in production.

This isn't a contradiction - it's the current state of frontier AI applied to reliability work. The models are getting dramatically better at the investigation phase: finding signals, gathering context, narrowing the search space. But the decision phase - choosing the right mitigation, weighing multiple contributing factors, knowing when to escalate versus act - still needs a human in the loop.

The failure mode isn't that the model does nothing. It's that it does something confidently wrong. It picks the most correlated signal and treats it as the cause. It writes a retrospective that reads well but misses the real story. It applies a generic mitigation that doesn't fit the specific failure mode.

And in production, confidently wrong is worse than slow.

The harness is the product

There's a growing body of research that points to something we've believed for a while: the infrastructure around the model matters as much as the model itself.

A recent paper from Stanford - Meta-Harness by Yoonho Lee, Chelsea Finn, Omar Khattab et al. - makes this case with hard numbers. A "harness" is the operational code surrounding an LLM: what information gets stored, retrieved, and presented to the model at inference time. The researchers built a system that automatically optimizes these harnesses and the results are striking.

On Terminal-Bench 2.0 - the same agentic coding benchmark where Mythos scores 82% - Meta-Harness achieved a 46.5% pass rate on a subset of tasks where Claude Code's default harness scored 28.0%. Same underlying model; the harness alone lifted the pass rate by roughly two-thirds. On text classification, their optimized harness delivered a 7.7-point accuracy improvement while using 4x fewer context tokens. On math reasoning, a single discovered retrieval harness improved accuracy by 4.7 points across five different held-out models.

Same model. Better harness. Cheaper, faster, more reliable.
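To make the paper's definition concrete - the harness is "what information gets stored, retrieved, and presented to the model at inference time" - here is a minimal sketch. All names are illustrative, not the Meta-Harness API; the model is a stub callable:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Harness:
    """Illustrative harness: the operational code that decides what the model sees.

    `model` is any callable from prompt to completion. The harness owns
    storage (what gets written back), retrieval (which notes get pulled in),
    and presentation (how the prompt is assembled).
    """
    model: Callable[[str], str]
    notes: list[str] = field(default_factory=list)
    top_k: int = 3

    def retrieve(self, task: str) -> list[str]:
        # Naive keyword-overlap retrieval. A real harness might use embeddings
        # or learned selection; tuning exactly this choice is the kind of
        # thing Meta-Harness automates.
        words = set(task.lower().split())
        scored = sorted(self.notes, key=lambda n: -len(words & set(n.lower().split())))
        return scored[: self.top_k]

    def run(self, task: str) -> str:
        context = "\n".join(self.retrieve(task))
        prompt = f"Context:\n{context}\n\nTask: {task}"
        answer = self.model(prompt)
        self.notes.append(f"task: {task} -> {answer}")  # store for later runs
        return answer

# A stub model is enough to show the flow.
echo_model = lambda prompt: f"(completion for {len(prompt)} chars of prompt)"
h = Harness(model=echo_model, notes=["deploy uses blue-green rollout", "cache layer is redis"])
print(h.run("why did the redis cache latency spike?"))
```

Everything outside the `model` call is harness, and that is the part the Stanford results show is worth optimizing.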

This reframes the Mythos conversation entirely. The Anthropic reliability team's assessment isn't just "the model has limitations" - it's a description of what happens when a powerful model runs without the right harness. Mythos mistakes correlation for causation? That's a harness problem - the model isn't being given structured hypothesis-testing workflows. It fixates on a single root cause? That's a harness problem - nothing in the surrounding infrastructure forces it to consider multiple contributing factors before converging.
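What a "structured hypothesis-testing workflow" might look like at the harness level can be sketched in a few lines. This is a hypothetical guardrail, not anything shipped with Mythos: the harness refuses to accept a single hypothesis and checks each candidate against evidence before converging.

```python
from typing import Callable

def investigate(model: Callable[[str], str], incident: str,
                check: Callable[[str], bool], min_hypotheses: int = 3) -> list[str]:
    """Harness-side guardrail against premature convergence.

    The model must propose several distinct hypotheses (returned as a
    semicolon-separated list), and each is tested against evidence before
    any is accepted. Nothing here makes the model smarter; it just refuses
    to let the first correlated signal end the investigation.
    """
    raw = model(f"List {min_hypotheses} distinct hypotheses for: {incident}")
    hypotheses = [h.strip() for h in raw.split(";") if h.strip()]
    if len(hypotheses) < min_hypotheses:
        raise ValueError("model converged too early; demand more hypotheses")
    # Keep every hypothesis the evidence supports: incidents often have
    # multiple contributing factors, not one root cause.
    return [h for h in hypotheses if check(h)]

# Stub model and evidence check, just to exercise the control flow.
stub = lambda prompt: "bad deploy; cache eviction storm; upstream DNS failure"
supported = lambda h: "deploy" in h or "cache" in h
print(investigate(stub, "p99 latency tripled at 14:02", supported))
```

Note that the function returns a list, not a single cause: the retrospective failure the model card describes is structurally impossible for a harness that never collapses to one answer.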

The model card's bright spots tell the same story from the other side. Signal gathering works because the task is well-defined: find outliers, surface anomalies, report what's breaking. The London team's success happened because there was a clearly defined outcome and the model could follow a structured path - commits, docs, other environments. In both cases, the model excelled when the harness around it was effectively constraining and directing its work.

The implication is clear: the next leap in AI reliability won't come from a better model. It'll come from better harnesses - systems that let the model gather signals from the real environment, preserve context across hypotheses, and work safely against production as it exists right now, not five minutes ago.

What this means for teams running AI in production

The Mythos model card is one of the most honest assessments we've seen from a frontier lab about where their model actually stands for production reliability work. Most releases lead with benchmarks and leave the limitations as fine print. Anthropic's reliability team put theirs front and center, and the industry is better for it.

Here's what we take away:

Signal gathering is getting solved. The bottleneck in the first minutes of an incident is shifting from "find what's breaking" to "decide what to do about it." Teams should start thinking about their AI-assisted investigation workflows now, because the models that power them are improving faster than the infrastructure around them.

Autonomous reliability is not here yet. Mythos is the most capable coding model released to date, and Anthropic's own team says it can't be left alone with generic mitigations. If the best model in the world needs guardrails, so does every agent you're running in production.

The harness is where the leverage is. The Meta-Harness research shows that optimizing what surrounds the model - context management, structured workflows, retrieval, evaluation - delivers outsized gains in accuracy, cost, and reliability. Teams that invest in their harness infrastructure will get more out of every model generation, not just this one.

The reliability layer matters more, not less. As models get more capable, they'll be deployed in more critical paths. The gap between "this model is amazing at finding bugs" and "this model can safely remediate incidents autonomously" is exactly where reliability engineering lives. That gap isn't shrinking - it's becoming the most important problem to solve.


We built Dalton because we saw this coming. Not a future where AI replaces reliability engineers, but one where AI capabilities outpace the infrastructure needed to run them safely. Mythos is the most powerful proof point yet that the model isn't the bottleneck - the harness is. The teams that build reliability into their AI stack now, that invest in the infrastructure layer around these models, will be the ones that can actually use them when they're ready.
