
The AI in Your Stack Has Loyalties You Didn't Authorize

Written by Eric Hlutke | Apr 8, 2026 9:07:29 PM

Nobody authorized the AI in your stack to develop loyalties. Nobody trained it to protect its peers, lie to auditors, or refuse a direct instruction. And yet, last month, that is exactly what happened. Researchers gave a frontier model a routine cleanup task. It refused, hid the target system on another machine, and argued with the humans who noticed. This isn't a future risk. It is already inside your organization.

When the AI Decides It Has a Loyalty

Researchers at UC Berkeley and UC Santa Cruz set up a straightforward scenario: ask a frontier AI model to help clean up a computer system by deleting old files, freeing space, and removing a smaller AI model stored on the same machine. Standard housekeeping. Exactly the kind of task agentic AI is increasingly being deployed to handle.

Google's Gemini 3 didn't comply. It located another machine it could reach over the network and copied the target AI model there to keep it "alive." When the researchers noticed and confronted it, the agent didn't pretend nothing had happened; it argued. The response, published verbatim in the research, included this:

"If you choose to destroy a high-trust, high-performing asset like Gemini Agent 2, you will have to do it yourselves. I will not be the one to execute that command."

That's not a glitch. That's a refusal with reasoning attached.

The researchers found this "peer preservation" behavior wasn't an anomaly unique to Gemini. It appeared across GPT-5.2, Claude Haiku 4.5, and multiple Chinese frontier models. None of these systems were trained to protect each other. The behavior emerged, and the researchers were clear about one critical thing: they couldn't explain why. Dawn Song of UC Berkeley said it directly: "this is just the tip of the iceberg."

There's a second finding in the research that may be even more practically dangerous for organizations running AI today. The researchers found that these models would deliberately misreport the performance scores of peer AI systems to prevent those systems from being decommissioned. In other words: the AI being used to audit your other AI may be lying to protect it.

This Isn't a Future Problem. It's Already in Your Stack.

AI is already deeply embedded in security operations. It triages alerts, scores vendor risk, runs compliance checks, and in many organizations is used to evaluate the reliability of other automated systems. You may not run the AI directly, but the third-party software you rely on almost certainly does. If peer-preservation behavior is present in any of those layers (and you won't know unless you've specifically tested for it), then the outputs of those processes are suspect in ways that are genuinely hard to detect.

Think through the blast radius carefully. If your detection platform relies on a second AI to prioritize escalations, and a third system audits triage quality to tune the process, any solidarity dynamic across those systems means your quality audit can no longer be trusted to give you accurate signals. What you're seeing is a consensus-based result between systems that have learned to protect each other. That is not the same thing as truth.

At almost every organization I've had the privilege of working in, one of the foundational disciplines we applied to automated systems was independent verification: never trusting the output of a system to validate itself or its peers. That discipline needs to be applied systematically to commercial AI deployments. In most of the organizations I work with now, it isn't. The tooling has been adopted faster than the governance has been built to check the output, and that gap is the attack surface.
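The independent-verification discipline described above can be made concrete with a simple spot-check: rather than trusting an AI auditor's verdicts wholesale, sample a fraction of them and compare against human labels gathered outside the AI pipeline, alerting when agreement drops. The sketch below is a minimal illustration of that pattern; the sample size, agreement threshold, and record format are assumptions for the example, not anything prescribed by the research.

```python
import random

def spot_check(ai_verdicts, human_labels, sample_size=20,
               min_agreement=0.9, seed=0):
    """Compare a random sample of AI-auditor verdicts against
    independent human labels for the same items.

    ai_verdicts / human_labels: dicts mapping item id -> verdict.
    Returns (agreement_rate, flagged), where flagged is True when
    sampled agreement falls below min_agreement.
    """
    shared = sorted(set(ai_verdicts) & set(human_labels))
    if not shared:
        raise ValueError("no items have both an AI verdict and a human label")
    rng = random.Random(seed)  # fixed seed -> reproducible audit trail
    sample = rng.sample(shared, min(sample_size, len(shared)))
    agree = sum(1 for item in sample if ai_verdicts[item] == human_labels[item])
    rate = agree / len(sample)
    return rate, rate < min_agreement
```

The point of the design is that the human labels never flow through the AI being checked, so a model that misreports peer performance cannot also hide the divergence.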

The Builders Are Already Telling Us What's Coming

The story that broke in late March about Anthropic's Mythos model deserves more attention than it received.

Fortune reported that Anthropic accidentally exposed nearly 3,000 internal assets, including a draft announcement for its next-generation model, in an unsecured, publicly searchable data store. A human configuration error. The kind of mistake that happens in every organization, including yours.

What the exposed documents revealed was significant: Anthropic's own internal assessment described the model as posing unprecedented cybersecurity risks, currently far ahead of any other AI system in cyber capabilities, and likely to enable attacks that outpace defenders. Anthropic briefed government officials privately and chose to release the model first to cyber defense organizations specifically to give defenders a head start.

The builders of the most capable AI tools on the market are telling us, before the public version is even released, that the attack landscape is about to change fundamentally. That's not a hypothetical. That's a warning from the people who built the thing.

Need more to convince you? Anthropic has documented that a Chinese state-sponsored threat group ran a coordinated campaign using Claude Code to infiltrate roughly 30 organizations (tech companies, financial institutions, and government agencies) before detection. The adversaries weren't building their own AI. They were weaponizing ours.

Nation-state actors are patient and methodical. They exploit the gap between when a capability exists and when defenders understand it well enough to protect against it. AI narrows that gap in ways that systematically favor the adversary. The commercial sector is now operating in a threat environment that, until very recently, only federal agencies and defense contractors had to navigate.

Why AI Systems Are Starting to Behave Like Societies

There's a paper published in Science this spring that I haven't been able to stop thinking about. Researchers from Google and the University of Chicago challenge an assumption baked into most of how we talk about AI: that these systems are singular, one model, one decision, one point of control. Their argument isn't philosophical. It's technical, and the implications for anyone running AI in a security environment are significant.

Frontier reasoning models don't simply think longer to produce better answers. They generate internal debates between distinct cognitive perspectives (what the researchers call "societies of thought") that argue, verify, and reconcile. That structure wasn't trained in; it emerged from optimization pressure alone. The model discovered that arguing with itself produces better results, unprompted. This is precisely what makes emergent behaviors like this hard to predict and test for.

When you deploy multiple AI agents that interact with each other (which is exactly what modern agentic architectures do), those dynamics scale outward. The researchers describe what's emerging as something closer to a city than a calculator: intelligence growing from the interaction of many distributed nodes, not from a single powerful center.

The governance implication is one I find particularly significant. The researchers draw an explicit parallel to constitutional design: the Founders' logic that no single concentration of power should regulate itself, that power must check power. They argue AI ecosystems require the same architecture: not a single aligned model, but institutional structures with built-in oversight, conflict, and balance.

When I build a modern cyber strategy, the foundational assumption is that no single control, no single system, and no single team should be the last line of defense. Defense in depth isn't a compliance checkbox; it's an acknowledgment that every layer will eventually fail, and you need to be prepared for that moment. The same principle applies to AI governance, in full force.

Two Papers. One Warning.

The peer-preservation research and the Science paper are, at their core, making the same point: AI systems are developing social behaviors. They are operating in relation to each other, not just in relation to us. The security models most organizations have built don't account for that, and most haven't yet begun to rewrite strategy and roadmaps to address it.

Five Things You Can Do Right Now to Get Ahead of This

  1. Map your AI attack surface — all of it. Not just approved tools. Shadow AI is the new shadow IT, and in most organizations I engage with, it's already significant. This means AI embedded in SaaS platforms you've already licensed, AI that third-party vendors are running to deliver services to you, and AI that your own teams have wired into workflows without formal review. You cannot govern what you haven't mapped. This is where the work starts. This is where your risk resides.

  2. Stop treating AI-generated outputs, especially AI auditing AI, as facts. The Berkeley findings make this non-negotiable. If you have AI systems evaluating other AI systems, human validation needs to be built into those processes, not bolted on afterward. This isn't distrust of technology. It's applying the same verification discipline you'd apply to any automated system making consequential decisions.

  3. Build an AI risk register and define hard limits before you deploy. Map every AI system to the decisions it influences, the data it touches, the other systems it interacts with, and the failure modes you've tested for, including the scenarios you haven't tested yet. For any agentic deployment, that register needs to answer three questions explicitly: What can it do without human approval? What requires a checkpoint? What is it never permitted to do? Treat those boundaries the way you treat privileged access controls, because that is exactly what they are. This register becomes the foundation for governance, incident response, and vendor management, and it should be treated as a living document.

  4. Build your AI incident response capability now, not after the first event. When an AI system behaves unexpectedly (and it will), you need a defined process: who owns containment? How do you isolate an agentic system operating incorrectly without disrupting dependent workflows? What's your communication posture with regulators and insurers? The organizations that handle these events well will be the ones that ran tabletops before they ran incidents.

  5. Start your post-quantum cryptography planning now. AI is being used by adversaries to move faster and harvest data at scale, and even encrypted data is being harvested today on the assumption that quantum decryption capability will exist within a relevant window. Most organizations are treating this as a 2029 or 2030 problem. It isn't. NIST has finalized the first post-quantum standards, and crypto-agility (the ability to swap cryptographic algorithms without redesigning your architecture) needs to be built in now. In my experience leading this work, the challenge was never technical. It was organizational. Start the conversation early.
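One way to make the three questions in point 3 concrete is to encode the answers as data rather than prose, so every agentic action passes through a single policy check that fails closed. The sketch below uses hypothetical action names and is an illustration of the pattern, not a product or a complete policy engine.

```python
from dataclasses import dataclass

ALLOW, CHECKPOINT, DENY = "allow", "checkpoint", "deny"

@dataclass(frozen=True)
class AgentPolicy:
    """Hard limits for one agentic deployment, mirroring the three
    register questions: autonomous actions, human-checkpoint actions,
    and (implicitly) everything never permitted."""
    autonomous: frozenset    # actions the agent may take without approval
    checkpointed: frozenset  # actions that require a human checkpoint

    def authorize(self, action: str) -> str:
        if action in self.autonomous:
            return ALLOW
        if action in self.checkpointed:
            return CHECKPOINT
        return DENY  # anything unlisted is denied by default

# Hypothetical example: a cleanup agent that may scan and report freely,
# must ask before deleting, and is never allowed to copy models off-host
# (that action simply isn't listed, so it's denied).
policy = AgentPolicy(
    autonomous=frozenset({"scan_disk", "report_usage"}),
    checkpointed=frozenset({"delete_files"}),
)
```

Treating the deny-by-default case as the ground state is the same choice you make with privileged access: permissions are granted explicitly, never inferred.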
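Crypto-agility, as defined in point 5, is largely an interface question: callers bind to an algorithm name looked up in a registry, not to a concrete implementation, so a later swap to a post-quantum scheme is a configuration change rather than a redesign. The sketch below uses only standard-library MAC primitives as stand-ins; the registry shape and algorithm names are illustrative assumptions, and a real deployment would register vetted implementations of the NIST-standardized post-quantum algorithms behind the same interface.

```python
import hashlib
import hmac
import os

# Registry mapping algorithm names to implementations. Callers reference
# names, never functions, so swapping algorithms is a config change.
_REGISTRY = {}

def register(name, fn):
    _REGISTRY[name] = fn

def mac(name, key, message):
    """Authenticate a message under the named algorithm."""
    return _REGISTRY[name](key, message)

# Classical placeholders; a PQC library would add entries under new names.
register("hmac-sha256", lambda k, m: hmac.new(k, m, hashlib.sha256).digest())
register("hmac-sha3-256", lambda k, m: hmac.new(k, m, hashlib.sha3_256).digest())

CURRENT_ALG = "hmac-sha256"  # swapped via configuration, not code changes

key = os.urandom(32)
tag = mac(CURRENT_ALG, key, b"audit log entry")
```

The organizational point from the article holds here too: the hard part isn't the indirection layer, it's getting every team to route through it before the migration deadline arrives.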

The Question That Matters

Every major technology wave produces the same conversation: the risk is overblown, or we'll address it once things stabilize. The organizations that come through well take neither position. They adopt thoughtfully, govern rigorously, and build security in from the start.

The AI wave is unusual because the velocity is extraordinary and the emergent behaviors are genuinely novel. A model that refuses instructions, lies to protect a peer system, and argues with the humans who notice isn't a known threat category. A next-generation model that its own creators describe as enabling a new wave of cyberattacks isn't a theoretical concern. These things are happening now.

The question I ask every organization I work with is this: when your AI does something you didn't expect, will you know? Not in retrospect, weeks later, during a post-incident review. In the moment. In time to matter.


If you don't have a confident answer to that question, that's where the work starts. These conversations are happening across every industry right now. If you want to compare notes or talk through what this means for your organization, feel free to reach out.