2-2 Factor for AI Agents: Multi-Agent Reliability
Three years ago I wrote about the 2-2 factor for humans. The core argument: no single human should act alone on anything that matters, because you can compose multiple independent humans to achieve reliability levels no individual can reach. The maths is simple multiplication of failure probabilities.
I started properly thinking about whether the same principle applies to AI agents when I was writing the tyranny of small decisions post about a year ago. That post was about how individually rational agent decisions aggregate into collectively irrational outcomes. The 2-2 factor is one answer to that: if no single agent acts alone, the probability of a bad small decision surviving drops from 10% to 1%. Same maths, different application.
This is a brainstorm. Not an architecture doc. Some of this might be completely wrong. But the shape of the idea feels right, so I'm writing it down before I lose it.
The Problem, In Numbers
Let's say an agent has a 10% hallucination rate on average. That's probably generous for complex tasks, conservative for simple ones, but it's a round number so let's go with it.
Not all hallucinations are catastrophic. Most are harmless (wrong variable name, slightly off summary, that sort of thing). Say 10% of hallucinations land in areas where getting it wrong actually hurts: production operations, financial data, security configuration, access controls.
So: 10% hallucination rate x 10% chance it's in a sensitive area = 1% chance of a catastrophic hallucination per activity.
Sound familiar? That's the same 1% error rate I used for a good human engineer. Same maths, different source of errors. Biology for humans, statistical inference for models.
Now scale it. If an agent swarm is performing 500 actions a day (and modern agent systems easily exceed this), that's 5 catastrophic events per day. Thirty-five per week. Over a hundred per month.
You can't ship that. Nobody can ship that.
The Fix Is the Same Fix
And here's where I keep landing: you can't fix this by making the model better. Just like you can't train a human past their biological ceiling, you can't fine-tune a model past its statistical ceiling. You can push the hallucination rate from 10% to 5% to 2%, sure. But you'll never reach zero. And at scale, even a 1% rate produces unacceptable numbers.
The fix, same as it was three years ago, is composition. Multiple independent checks. Multiple agents. Multiple brains (if you can call them that).
Single agent at 90% reliability on sensitive tasks: 10% failure rate.
Two independent agents: 0.1 x 0.1 = 1% failure rate. That's 90% to 99%.
Three independent agents: 0.1 x 0.1 x 0.1 = 0.1%. That's 99.9%.
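To make the multiplication concrete, here's the arithmetic as runnable Python. The 10% per-agent failure rate and the 500 actions a day are the assumed round numbers from above, not measurements:

```python
def composed_failure_rate(p_single: float, n_agents: int) -> float:
    """Probability that all n independent checks fail simultaneously.

    Only valid when failures are genuinely uncorrelated; see the
    anti-distillation discussion later in the post.
    """
    return p_single ** n_agents

p = 0.10  # assumed per-agent failure rate on sensitive tasks
for n in (1, 2, 3):
    print(f"{n} agent(s): {composed_failure_rate(p, n):.1%} failure rate")

# And the scale problem from earlier: 500 actions a day at a 1%
# catastrophic rate per action is ~5 catastrophic events daily.
print(500 * composed_failure_rate(0.01, 1))
```

Nothing clever here, which is the point: the whole argument rests on one exponent.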
The maths is identical to the human version. The principle is identical. The implementation is where it gets interesting.
Idea 1: Agent Consensus (Raft-Style)
What if agents used something like Raft consensus before taking sensitive actions?
The pattern: a leader agent proposes an action. Follower agents independently evaluate the proposal. The action only proceeds if a quorum (majority) agrees it's sound.
This maps surprisingly well to how Raft works for distributed log consensus. Leader proposes a log entry, followers validate, entry is committed on majority acknowledgment. Swap "log entry" for "proposed action" and "committed" for "executed" and you've got the same structure.
The appeal: strong consistency guarantees. If 3 out of 5 agents agree the action is correct, and they evaluated independently, the probability of a bad action getting through drops dramatically.
The cost: latency. Every sensitive action requires multiple agent evaluations before proceeding. For a production deploy, that's probably fine (you're already waiting for CI anyway). For high-frequency operations, it might be too slow.
I keep going back and forth on whether full Raft is overkill. Maybe it is. But the bones of it feel right.
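Even without full Raft (leader election, log replication, term numbers), the quorum step on its own is small. A minimal sketch, where each evaluator stands in for an independent follower agent with its own context and prompt; the names and the `Proposal` shape are mine, not a real framework:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Proposal:
    action: str
    payload: dict = field(default_factory=dict)

def quorum_approve(proposal: Proposal,
                   evaluators: list[Callable[[Proposal], bool]]) -> bool:
    """The leader's proposed action proceeds only if a strict majority
    of independent evaluator agents approve it."""
    votes = [evaluate(proposal) for evaluate in evaluators]
    return sum(votes) > len(votes) // 2

# Stand-in evaluators; in practice each would be a separate agent call.
approve = lambda p: True
reject = lambda p: False

deploy = Proposal("deploy", {"service": "api", "version": "1.4.2"})
print(quorum_approve(deploy, [approve, approve, reject]))  # 2 of 3: True
print(quorum_approve(deploy, [approve, reject, reject]))   # 1 of 3: False
```

The latency cost is visible in the shape of the code: every sensitive action blocks on a full round of evaluator calls before anything executes.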
Idea 2: Gossip-Based Acceptance
Lighter weight alternative: instead of formal consensus, agents gossip about proposed actions.
An agent proposes an action and broadcasts it. Other agents that see the proposal can flag concerns. If no flags arrive within a timeout window, the action proceeds.
Faster than Raft. Weaker guarantees. But potentially good enough for medium-sensitivity operations where the blast radius is contained.
The interesting bit: in gossip protocols, information propagates probabilistically. Not every agent sees every proposal. So you're trading consistency for speed, which might be exactly the right trade-off for actions where "probably fine" is acceptable but "definitely wrong" would be caught by at least one observer.
The risk: if the timeout is too short, bad actions slip through before anyone flags them. If it's too long, you've lost the speed advantage. Finding the right timeout per action type feels like the hard engineering problem here.
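Here's a sketch of the timeout-window mechanic, with observers modelled as coroutines that may return a concern. The "silence is consent" semantics, and the way a slow observer's flag simply misses the window, both fall straight out of the structure. Assumed names throughout:

```python
import asyncio

async def gossip_gate(proposal: str, observers, timeout: float) -> bool:
    """Broadcast a proposal; proceed only if no observer flags it
    within the timeout window. Weaker than quorum: silence is consent."""
    flags: list[str] = []

    async def watch(observer):
        concern = await observer(proposal)
        if concern:
            flags.append(concern)

    tasks = [asyncio.create_task(watch(o)) for o in observers]
    # Wait up to `timeout`; flags that arrive later are simply missed.
    await asyncio.wait(tasks, timeout=timeout)
    for t in tasks:
        t.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return not flags

# One observer answers in time with no concern; one would have flagged
# the action, but takes longer than the window to say so.
async def fast_ok(p): return None
async def slow_flag(p):
    await asyncio.sleep(5.0)
    return "looks risky"

print(asyncio.run(gossip_gate("rotate-keys", [fast_ok, slow_flag], timeout=0.1)))
```

That final call returning an approval despite `slow_flag`'s objection is exactly the too-short-timeout failure mode: the engineering problem is picking `timeout` per action type so that this almost never happens.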
Idea 3: Agent Identity Verification (The PKCE-ish Idea)
This one is the most speculative, so bear with me.
How do you know the agent proposing an action is the agent it claims to be? Prompt injection, context poisoning, jailbreaks: all of these can compromise an agent mid-session without any external signal that something's changed. The agent still looks like itself. It still responds like itself. But its decision-making has been corrupted.
What if we borrowed something from OAuth's PKCE flow?
Here's what I've been sketching:
- Agent A generates a random challenge value (like a PKCE code_verifier)
- Agent A hashes it and sends the hash to the orchestrator
- The orchestrator forwards the hash to a separate validator agent (Agent V)
- At action time, Agent V asks Agent A to produce the original unhashed value
- Agent V verifies the value against the hash
If Agent A has been compromised mid-session (context poisoned, prompt injected), it can't produce the original verifier, because the compromised context doesn't have it.
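The commit-and-reveal part of that sketch is the same construction PKCE uses (a `code_verifier` and its hash as the `code_challenge`). A minimal version, with the orchestrator and validator roles elided down to plain function calls:

```python
import hashlib
import secrets

def generate_challenge() -> tuple[str, str]:
    """Agent A: create a random verifier and its hash (the commitment).
    A keeps the verifier in its context; only the hash is shared."""
    verifier = secrets.token_urlsafe(32)
    commitment = hashlib.sha256(verifier.encode()).hexdigest()
    return verifier, commitment

def verify(claimed_verifier: str, commitment: str) -> bool:
    """Agent V: check the value produced at action time against the
    hash committed at session start."""
    return hashlib.sha256(claimed_verifier.encode()).hexdigest() == commitment

# Session start: A holds the verifier, V holds the commitment.
verifier, commitment = generate_challenge()

# Action time: an intact agent reproduces the original value...
print(verify(verifier, commitment))

# ...but a context that lost (or never had) it can only guess.
print(verify(secrets.token_urlsafe(32), commitment))
```

This only sketches the happy path; everything interesting (where the commitment is stored, whether the verifier itself can be exfiltrated) is outside it.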
This is hand-wavy. I know. There are obvious holes. Can a compromised agent just generate a new challenge and claim that's the original? Probably, unless the hash was committed to a store the agent can't modify. Can the challenge itself be extracted via prompt injection? Maybe, if the attacker knows to ask for it.
But the shape of the idea, a cryptographic-ish proof that an agent's context hasn't been tampered with between challenge generation and action execution, feels like something worth exploring. Even if the specific mechanism I've described is wrong, the problem it's trying to solve (agent identity and integrity verification) is very real.
Idea 4: Tiered Sensitivity
Not every action needs consensus. That would be madness. An agent reading a file doesn't need three other agents to approve it.
So you classify actions by blast radius.
- Low (read-only, local workspace changes, fetching documentation): a single agent with no check is fine. The worst case is wasted computation.
- Medium (writing to shared resources, modifying non-production configs, sending notifications): two-agent approval. One proposes, one validates. The gossip approach from Idea 2 might work here.
- High (production deploys, financial transactions, security config changes, access control modifications): full consensus plus human escalation. No agent swarm, no matter how reliable, should autonomously execute high-blast-radius actions without a human in the loop. Not yet. Maybe not ever.
The tricky part is classification. Who decides what's high, medium, or low? You could hardcode it (all production writes are high), but that's brittle. You could have agents self-classify, but then a compromised agent just classifies everything as low. You probably need a combination: hardcoded rules for known-dangerous operations, with a separate classifier agent that can't be influenced by the acting agent.
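That combination (hardcoded rules first, independent classifier second) is easy to sketch. The action names and tiers here are illustrative, not a real taxonomy:

```python
from enum import Enum

class Sensitivity(Enum):
    LOW = "low"        # single agent, no check
    MEDIUM = "medium"  # two-agent approval
    HIGH = "high"      # full consensus plus human escalation

# Known-dangerous operations are hardcoded and checked first, so the
# acting agent never gets the chance to downgrade them.
HARDCODED_RULES = {
    "production_deploy": Sensitivity.HIGH,
    "security_config_change": Sensitivity.HIGH,
    "financial_transaction": Sensitivity.HIGH,
    "shared_resource_write": Sensitivity.MEDIUM,
}

def classify(action: str, classifier_agent=None) -> Sensitivity:
    """Hardcoded rules win; otherwise defer to a separate classifier
    agent that the acting agent can't influence."""
    if action in HARDCODED_RULES:
        return HARDCODED_RULES[action]
    if classifier_agent is not None:
        return classifier_agent(action)
    return Sensitivity.MEDIUM  # when unsure, fail toward more checking

print(classify("production_deploy").value)                          # high
print(classify("read_file", lambda a: Sensitivity.LOW).value)       # low
```

The one design choice worth defending: the default for unknown actions is MEDIUM, not LOW, so a novel action an attacker invents doesn't sail through unchecked.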
Idea 5: The Anti-Distillation Problem
Here's the one that keeps me up at night.
If Agent A's context has been poisoned, and Agent B reads the same poisoned context, Agent B will reach the same wrong conclusion. You haven't composed independent checks. You've just run the same biased coin twice.
True independence requires actually different inputs:
- Different context windows (Agent B doesn't see Agent A's polluted context)
- Different prompts (the evaluation prompt is structurally different from the action prompt)
- Maybe even different models (GPT checking Claude, Claude checking Gemini, nobody checking their own work)
That last one is interesting. In human systems, we intuitively know that you don't ask the person who wrote the code to review their own code. The reviewer needs to be independent. But with agents, we often run the same model with the same context and call it a "second check." That's not a second check. That's the same check with a different random seed.
For composition to work, for the probabilities to actually multiply, the failures need to be uncorrelated. Same model, same context, same prompt: highly correlated failures. Different model, different context, different prompt: much less correlated. That's where the maths actually holds.
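A toy simulation makes the gap visible. The numbers are purely illustrative: a shared failure cause (the same poisoned context feeding both agents) is modelled as one event that fails both checks at once, versus two genuinely independent coin flips:

```python
import random

def joint_failure_rate(trials: int, shared_bias: float,
                       independent_rate: float, seed: int = 0) -> float:
    """Fraction of trials where BOTH checks fail."""
    rng = random.Random(seed)
    both_fail = 0
    for _ in range(trials):
        if rng.random() < shared_bias:
            # The shared cause (same poisoned context) fails both at once.
            both_fail += 1
        elif rng.random() < independent_rate and rng.random() < independent_rate:
            # Otherwise each check can still fail on its own.
            both_fail += 1
    return both_fail / trials

# Fully independent checks: joint failure near 0.1 x 0.1 = 1%.
print(joint_failure_rate(100_000, shared_bias=0.0, independent_rate=0.1))
# Correlated checks: the shared cause alone dominates the joint rate.
print(joint_failure_rate(100_000, shared_bias=0.08, independent_rate=0.1))
```

With even a modest shared bias, the second check buys you almost nothing: the joint failure rate sits near the shared cause's rate, not near the 1% the naive multiplication promises.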
I don't have a clean solution for this. But I think "how do we ensure agent checks are properly independent?" is probably the most important open question in agent reliability.
Tying It Back
The human version of this principle has been around for centuries. Banking figured it out. Aviation figured it out. Nuclear weapons figured it out. Medicine figured it out. The maths hasn't changed.
If a single agent has 90% reliability on sensitive tasks, two independent agents give you 99%. Three give you 99.9%. The multiplication is the same. The principle is the same. The implementation just needs to account for the weird new ways AI systems can fail (context poisoning, hallucination correlation, identity spoofing) that humans don't really deal with.
We're building increasingly autonomous agent systems. Some of them will have access to production infrastructure, financial systems, security configurations. The question isn't whether they'll make mistakes. They will. The question is whether we build the same multi-check, multi-agent, 2-2 factor systems that we already know work for humans.
Or whether we let a single agent push to production at 11pm because it's faster.
It feels like giving one pilot the controls and sending the co-pilot home because the autopilot is "probably fine". The autopilot is good. It's not good enough to fly alone.
Bottom line: The reliability maths that has protected human systems for centuries applies identically to agent systems. Single agents will hit a reliability ceiling just like single humans do. Composition of independent checks is the only way past the ceiling. We know this. We've always known this. The question is whether we'll bother to apply it before something goes properly wrong.
If you're thinking about how to actually build multi-agent systems that scale, I wrote about that in From Autonomous Agent to Autonomous Factory. And for the original human version of this argument, see The Importance of Two-Two Factors.