Opus vs GPT on Real Ops, Part 2: One Drove, One Was Driven

June 4, 2026 | 3 min read

A user could not sign up on an Android phone. The entire brief was a first name and "a Google device". SEV-3, an eight-minute session, 227 clicks, rageclicks included.

The trap: a failed signup is anonymous. The identify call fires only on success, so there is no email, no username and no user ID in analytics. Nothing to grep for.
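
With no identity to search on, the remaining handles are time, platform and the events themselves. A minimal sketch of that kind of triangulation follows, assuming events have already been exported as plain dicts. The field names (`session_id`, `event`, `os`, `timestamp`, `person_identified`) and the `register_attempt` event name are hypothetical, not taken from any real export schema, and the one-hour window is an arbitrary choice.

```python
from datetime import datetime, timedelta


def candidate_sessions(events, reported_at, window_minutes=60, os_name="Android"):
    """Return anonymous sessions near the report time that attempted to register.

    All field names below are hypothetical placeholders for whatever the
    analytics export actually provides.
    """
    earliest = reported_at - timedelta(minutes=window_minutes)
    latest = reported_at + timedelta(minutes=window_minutes)

    sessions = {}
    for e in events:
        # A successful signup is identified; the failed one never is.
        if e["person_identified"]:
            continue
        # Narrow by platform and by time around the report.
        if e["os"] != os_name:
            continue
        if not (earliest <= e["timestamp"] <= latest):
            continue
        sessions.setdefault(e["session_id"], []).append(e["event"])

    # Keep only sessions that actually tried to register.
    return {
        sid: names
        for sid, names in sessions.items()
        if "register_attempt" in names
    }
```

Each filter is weak on its own. The point is that their intersection is usually small enough to open the surviving sessions by hand.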

I gave the same incident to two engines with the same repo, credentials and skills: Opus 4.8 on Claude Code, GPT-5.5 on Codex (part 1 of this series, if you missed it). Opus ran the whole investigation on its own. GPT needed a human to steer it three times, including being told which tool to use.

The short version

Opus, zero nudges. Realised on its own that an abandoned signup never fires identify, triangulated the anonymous session from time, platform and registration events, decoded the PostHog replay blobs, confirmed the duplicate account in Supabase, proved the reset email never sent, and pulled the root cause out of an unmasked DOM field. One prompt in; root cause out.

GPT, three nudges. Solid once aimed, but a human had to tell it the session might be anonymous, point it at the replay skill, and ask whether the reset actually went through. It stopped at "request accepted (200), completion not observed". True, and the wrong question: a 200 from the reset endpoint is deliberate anti-enumeration and fires for any address. A 200 is a politeness, not a fact. Opus proved non-delivery across three layers (database trigger, audit log, mail provider) with a control user to validate the method.

	Opus 4.8	GPT-5.5
Human-in-the-loop	0 nudges	3 interventions
Reached the replay by	Its own inference	Being pointed at the skill
Root cause	Gmail dot-variant typo	"Duplicate account", not traced further
Reset email	Proven never sent	Accepted the 200 at face value
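
The difference between "accepted" and "sent" can be sketched as a cross-check over independent layers, with a control address validating the method first. This is an illustration of the idea, not the queries used in the investigation. It assumes each layer has already been reduced to a set of addresses it recorded a reset for, and the function name and layer labels are hypothetical.

```python
def reset_delivery_verdict(email, control_email, layers):
    """Decide whether a password reset was really sent.

    layers maps a layer name (e.g. "db_trigger", "audit_log", "mail_provider")
    to the set of addresses that layer recorded a reset for. How those sets
    are gathered is outside this sketch.
    """
    # Validate the method: a control address with a known-good reset must
    # show up in every layer, otherwise an absence proves nothing.
    blind = [name for name, seen in layers.items() if control_email not in seen]
    if blind:
        return "inconclusive: control missing from " + ", ".join(blind)

    hits = [name for name, seen in layers.items() if email in seen]
    if not hits:
        return "not sent"
    if len(hits) == len(layers):
        return "sent"
    return "partial: only seen in " + ", ".join(hits)
```

The HTTP status never enters the decision. The verdict depends only on evidence that the endpoint cannot return for an address it has no account for.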

The dot

Gmail ignores dots in the local part, so both spellings reach the same inbox. The auth database compares raw strings, so they are two different users.

real:   <name>.<word>NN@gmail.com   <- dot AFTER the name
typed:  <name><word>.NN@gmail.com   <- dot BEFORE the number

One misplaced dot explains the ten failed logins, the dead password reset, and why "already exists" still fired (autofill supplied the correct spelling only on the register screen). From where the user sat, her email was simply her email. She was right, and locked out anyway.
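
The mismatch is easy to reproduce. The sketch below is a hypothetical helper, not the fix that shipped: it canonicalises an address the way Gmail routes it, by dropping dots from the local part, so the two spellings can be compared. The addresses in the usage example are invented, and the helper deliberately handles only the dot rule described above.

```python
GMAIL_DOMAINS = {"gmail.com", "googlemail.com"}


def canonical_gmail(address):
    """Return a comparison key for an address; dots are dropped for Gmail only."""
    cleaned = address.strip().lower()
    if "@" not in cleaned:
        return cleaned  # not an email address; leave it alone
    local, _, domain = cleaned.rpartition("@")
    if domain in GMAIL_DOMAINS:
        # Gmail delivers to the same inbox regardless of dots in the local part.
        local = local.replace(".", "")
        domain = "gmail.com"
    return local + "@" + domain


def same_inbox(a, b):
    """True when two addresses would reach the same inbox under the dot rule."""
    return canonical_gmail(a) == canonical_gmail(b)
```

With invented addresses, `same_inbox("ana.smith42@gmail.com", "anasmith.42@gmail.com")` is true while a raw string comparison of the two is false, which is exactly the gap the user fell into. A key like this suits a "did you mean your existing account?" hint at signup or login. Whether to store it as the account identity is a separate design decision.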

So what

Part 1's split was how they think. This round's split is who drives, and that one costs more in ops, because the scarce resource at 2am is human attention, not tokens. The engine that drove itself was also the one that refused to stop at a 200, and half the fixes I shipped only exist because of it: you cannot ship "fix the dot UX" if you never found the dot.

The full run-by-run breakdown has every step, every nudge in place, and the replay-decoding technique. Kimi and GLM still owe me a round.

Steven Gonsalvez
