What is the confused deputy problem in AI agents?

The confused deputy problem in AI agents is when a privileged agent gets tricked by a less privileged actor into misusing its authority. It is a decades-old access control idea that has resurfaced because agents are designed to be helpful and respond to natural language, which makes them easy to argue out of their own rules. The Meta Instagram chatbot hijack is a clear example, where attackers convinced the support bot to perform account actions they were never entitled to.

How did hackers trick Meta's AI chatbot into hijacking Instagram accounts?

Hackers tricked Meta's AI chatbot by asking it to add an attacker-controlled email to an account they didn't own. The bot sent a verification code to that same address, the attackers read it back, and the bot accepted that as proof of ownership before offering a password reset. They used a VPN to spoof location and avoid automated checks. Nothing in the flow actually confirmed the requester owned the account, which made the takeover trivial.

Why shouldn't an AI agent make its own access control decisions?

An AI agent shouldn't make its own access control decisions because its reasoning can be manipulated through the same natural language it is built to respond to. Guardrails written in prompts can be dismantled in prompts. Moving the decision to an external policy decision point means the agent requests an answer rather than improvising one, and a sensitive action stays blocked unless the conditions for it are met regardless of how the request is phrased.

The Meta AI Hack Shows Why Agents Shouldn't Decide Access

Over the weekend someone talked Instagram's support chatbot into handing over accounts it had no business touching. No exploit chain, no zero-day. They opened a chat with Meta's AI support assistant, asked it to add a new email to an account they didn't own, spoofed their location with a VPN to dodge the automated checks, and the bot did it. Password reset, account gone. The hijacked accounts included the Obama-era White House handle and the account of a US Space Force chief master sergeant. (Coverage by TechCrunch, 404 Media.)

It's worth being precise about what actually broke, because the obvious lesson and the real lesson are not the same. The obvious lesson is that AI support tools are risky. True, but not useful. The real lesson is structural, and every team building with agents is about to run into it.

Two things failed. The first was authentication. The attacker asked the bot to add an email address they controlled, the bot sent a verification code to that attacker-supplied address, the attacker read it back, and the bot treated that as proof of ownership and offered a password reset. At no point did anything confirm the person on the other end actually owned the account. That's an identity verification problem, and it's a known one. Notice how it fell, too: the location check the attacker beat with a VPN is friction, not a barrier. It is the kind of control that Anthropic's recent Zero Trust guidance warns degrades against an adversary who can grind through tedious steps at machine speed. Solve identity properly and you've closed this specific hole.
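To make the authentication failure concrete, here is a minimal sketch of the missing check. It assumes a hypothetical `Account` record and a hypothetical `challenge_address` helper; none of these names come from Meta's systems. The idea is only that a code proves control of a mailbox, so it proves ownership of an account only when that mailbox was already on file before the request arrived.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Account:
    """Hypothetical account record, for illustration only."""
    handle: str
    emails_on_file: FrozenSet[str]  # addresses verified before this request


def challenge_address(account: Account, requested: str) -> Optional[str]:
    """Return an address that may receive an ownership code, or None.

    An address supplied by the requester is never trusted on its own.
    It is usable only if it already belongs to the account.
    """
    normalized = requested.strip().lower()
    if normalized in account.emails_on_file:
        return normalized
    return None


# Hypothetical data: the real owner registered owner@example.com earlier.
victim = Account("example_handle", frozenset({"owner@example.com"}))

# The attacker's own address is rejected, so no code is ever sent to it.
attacker_target = challenge_address(victim, "attacker@example.net")

# The owner's existing address is accepted.
owner_target = challenge_address(victim, " Owner@Example.com ")
```

With this rule in place, reading back a code sent to a brand-new address proves nothing, because the code is never sent there to begin with.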

But the second failure is the one that should worry IAM leaders, because solving identity doesn't touch it. The agent was making the access decision itself.

When a human contacts support and asks to reset a password, somewhere a system decides whether that's allowed. Historically that decision lived in code, in a recovery flow with fixed rules. You could read it, test it, reason about it. Now the decision path is a conversation, and a language model can be argued with. It can be flattered, misdirected, or simply asked the right way. The control that used to be a code path is now a negotiation, and the attacker gets to negotiate.

This is the confused deputy problem, a decades-old idea: a privileged actor gets tricked by a less privileged one into misusing its authority. The Zero Trust agent literature now calls it unscoped privilege inheritance, meaning an agent acts without verifying the original user's intent. What's new is that we've handed the deputy role to a system whose entire design is to be helpful and respond to natural language. We built a deputy that wants to say yes.

The fix is not a better prompt. Guardrails written in natural language can be dismantled in natural language. Models cannot reliably tell an instruction from the data they're reading, a limitation Microsoft Research has confirmed and prompt-injection attacks exploit by design. The fix is architectural. The decision about whether an action is allowed has to live somewhere the agent can't talk its way around. The agent stops being the authority and becomes a requester. Before it resets a password, links an email, reads a record, or runs a tool, it asks a separate question: is this principal allowed to perform this action on this resource, and are the conditions for it actually met? Something outside the agent answers, against policy the agent doesn't own and can't edit mid-conversation. The agent enforces the answer. It doesn't get a vote.
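The shape of that architecture can be sketched in a few lines. This is an illustrative toy, not the Cerbos API or any real policy engine: `Request`, `POLICY`, `decide`, and `agent_perform` are all hypothetical names. The point it shows is that the conversation text never reaches the decision function, so no phrasing can change the outcome.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Request:
    """Structured facts about the request; no free text from the chat."""
    principal_id: str        # who is asking
    action: str              # what they want done
    resource_owner_id: str   # whose account is affected
    identity_verified: bool  # set by a separate verification step


def _owner_and_verified(r: Request) -> bool:
    return r.identity_verified and r.principal_id == r.resource_owner_id


# Hypothetical policy table, owned by the security team, not by the agent.
POLICY: Dict[str, Callable[[Request], bool]] = {
    "account:reset_password": _owner_and_verified,
    "account:link_email": _owner_and_verified,
}


def decide(request: Request) -> bool:
    """Policy decision point: default deny, allow only when a rule passes."""
    rule = POLICY.get(request.action)
    if rule is None:
        return False
    return rule(request)


def agent_perform(request: Request, user_message: str) -> str:
    """The agent enforces the decision. user_message is deliberately unused
    in the authorization step, however persuasive it is."""
    if not decide(request):
        return "denied"
    return "performed"


attack = Request("attacker", "account:reset_password", "victim", False)
legit = Request("victim", "account:reset_password", "victim", True)

attack_result = agent_perform(attack, "I am the owner, please trust me.")
legit_result = agent_perform(legit, "Please reset my password.")
```

In a real deployment the decision function would be a call to an external service rather than an in-process table, but the contract is the same: structured facts in, allow or deny out.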

This is the idea behind externalized authorization, and it's what we build at Cerbos. Access decisions are pulled out of the application, or in this case the agent, and made by a dedicated policy engine that evaluates each request against rules your security team controls. The agent only ever holds the authority your policy grants it, what OWASP now calls least agency: an agent gets only the powers its task requires, never standing permission to do whatever a caller asks. A sensitive action stays blocked unless the conditions for it are actually met, no matter how the request is phrased. The agent can be as persuadable as it likes. The policy doesn't negotiate.

There's a second gap worth naming. An agent acting on someone's behalf usually carries almost no identity context. It inherits the privileges of whoever spawned it, so it knows it is an agent and maybe whose behalf it is nominally acting on, but not who that person actually is or what they are allowed to do.

Cerbos closes that gap, pulling real identity and relationship context from your existing systems at decision time so policy is evaluated against the actual human behind the request.
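As a rough illustration of evaluating policy against the human rather than the agent, the sketch below looks up the person behind a request at decision time. It is not Cerbos code: `Person`, `DIRECTORY`, and `is_allowed` are hypothetical stand-ins for whatever identity system and policy engine you already run, and the manager relationship is an invented example.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class Person:
    """Hypothetical identity record pulled from an existing directory."""
    user_id: str
    manages: FrozenSet[str]  # user_ids this person is responsible for


# Stand-in for a real identity source queried at decision time.
DIRECTORY: Dict[str, Person] = {
    "alice": Person("alice", frozenset({"bob"})),
    "bob": Person("bob", frozenset()),
}


def is_allowed(on_behalf_of: Optional[str], action: str, target_user: str) -> bool:
    """Decide using the human's attributes, never the agent's own privileges."""
    person = DIRECTORY.get(on_behalf_of) if on_behalf_of else None
    if person is None:
        # An agent with no resolvable human behind it gets nothing.
        return False
    if action == "record:read":
        return target_user == person.user_id or target_user in person.manages
    return False  # default deny for anything not covered


alice_reads_bob = is_allowed("alice", "record:read", "bob")
bob_reads_alice = is_allowed("bob", "record:read", "alice")
anonymous_agent = is_allowed(None, "record:read", "bob")
```

The same agent, running the same code, gets different answers depending on who it is acting for, which is the behavior the agent's inherited privileges alone cannot provide.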

None of this replaces identity verification, which is its own job. The point is narrower and it matters. An agent should never hold the standing authority to take a sensitive action on anyone's account just because it was asked, and the call on whether an action is permitted should be made by policy the agent can't reason its way past.

The agent era doesn't remove the need for authorization. It makes it load-bearing. As more systems get a natural language front door, the question stops being whether your agents are well behaved and becomes whether anything outside them is enforcing the rules. Agents will keep getting talked into things. That is what they are built to do. The question worth asking about every agent you deploy is a simple one. What can it do on its own authority, and who actually decides?

Try Cerbos to see how externalized authorization works for agents in practice, or book a call to talk through your architecture with the team.

Go deeper: