Opus vs GPT on Real Ops, Part 2: One Drove, One Was Driven
A user could not sign up on an Android phone. The entire brief was a first name and "a Google device". SEV-3, an eight-minute session, 227 clicks, rageclicks included.
The trap: a failed signup is anonymous. The identify call fires only on success, so analytics holds no email, no username and no user ID. Nothing to grep for.
I gave the same incident to two engines with the same repo, credentials and skills: Opus 4.8 on Claude Code, GPT-5.5 on Codex (part 1 of this series, if you missed it). Opus ran the whole investigation on its own. GPT needed a human to steer it three times, including being told which tool to use.
Open the full side-by-side investigation: every step, who drove it, how deep each engine went.
The short version
Opus, zero nudges. Realised on its own that an abandoned signup never fires identify, triangulated the anonymous session from time, platform and registration events, decoded the PostHog replay blobs, confirmed the duplicate account in Supabase, proved the reset email never sent, and pulled the root cause out of an unmasked DOM field. One prompt in; root cause out.
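Triangulating an anonymous session from time, platform and registration events can be sketched roughly as below. This is a minimal illustration, not the query either engine ran: the event field names (`session_id`, `person_id`, `os`, `event`, `ts`), the event names and the sample rows are all made up for the example and are not PostHog's export schema.

```python
from collections import defaultdict
from datetime import datetime


def candidate_sessions(events, start, end, os_name, signal_events):
    """Return anonymous session IDs on one platform that contain signal
    events inside the time window, most signal events first."""
    by_session = defaultdict(list)
    for e in events:
        by_session[e["session_id"]].append(e)

    scored = []
    for sid, evs in by_session.items():
        # An identified session means identify fired, i.e. the signup worked.
        if any(e["person_id"] is not None for e in evs):
            continue
        # Keep only sessions that ran entirely on the reported platform.
        if any(e["os"] != os_name for e in evs):
            continue
        signals = [
            e for e in evs
            if e["event"] in signal_events and start <= e["ts"] <= end
        ]
        if signals:
            scored.append((len(signals), sid))

    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [sid for _, sid in scored]


def _e(sid, person, os_name, event, hh, mm):
    # Helper that builds one hypothetical event row.
    return {"session_id": sid, "person_id": person, "os": os_name,
            "event": event, "ts": datetime(2025, 1, 1, hh, mm)}


# Made-up sample data: only s2 is anonymous, on Android, in the window,
# and contains registration events.
SAMPLE = [
    _e("s1", "user_1", "Android", "register_submitted", 14, 1),
    _e("s2", None, "Android", "register_submitted", 14, 3),
    _e("s2", None, "Android", "register_failed", 14, 4),
    _e("s3", None, "iOS", "register_submitted", 14, 5),
    _e("s4", None, "Android", "pageview", 14, 6),
    _e("s5", None, "Android", "register_submitted", 18, 0),
]

candidates = candidate_sessions(
    SAMPLE,
    start=datetime(2025, 1, 1, 14, 0),
    end=datetime(2025, 1, 1, 14, 30),
    os_name="Android",
    signal_events={"register_submitted", "register_failed"},
)
print(candidates)
```

The point of the filter order is that the absence of an identity is itself a signal: a session that never identified, on the right platform, with registration attempts in the right window, is a short list even when there is nothing to grep for.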
GPT, three nudges. Solid once aimed, but a human had to tell it the session might be anonymous, point it at the replay skill, and ask whether the reset actually went through. It stopped at "request accepted (200), completion not observed". True, and the wrong question: a 200 from the reset endpoint is deliberate anti-enumeration and fires for any address. A 200 is a politeness, not a fact. Opus proved non-delivery across three layers (database trigger, audit log, mail provider) with a control user to validate the method.
| | Opus 4.8 | GPT-5.5 |
|---|---|---|
| Human-in-the-loop | 0 nudges | 3 interventions |
| Reached the replay by | Its own inference | Being pointed at the skill |
| Root cause | Gmail dot-variant typo | "Duplicate account", not traced further |
| Reset email | Proven never sent | Accepted the 200 at face value |
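The logic of the three-layer check with a control user can be sketched as follows. It is an illustration of the reasoning only: the `Evidence` fields and verdict strings are hypothetical names, and how each layer is actually queried (database trigger, audit log, mail provider) is deliberately left out because the post does not describe those queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """What each layer shows for one address (hypothetical field names)."""
    trigger_fired: bool   # database trigger produced a reset record
    audit_logged: bool    # audit log has a matching recovery entry
    provider_sent: bool   # mail provider reports a send to the address

    def layers(self):
        return (self.trigger_fired, self.audit_logged, self.provider_sent)


def reset_verdict(target: Evidence, control: Evidence) -> str:
    """Decide whether the target's reset email was sent.

    The control is a user whose reset is known to have worked. If the
    control does not show up in every layer, the method itself is broken
    and a negative result for the target proves nothing.
    """
    if not all(control.layers()):
        return "method unverified"
    if not any(target.layers()):
        return "never sent"
    if all(target.layers()):
        return "sent"
    return "inconclusive"


# The HTTP 200 from the reset endpoint is intentionally not an input here:
# it is returned for any address, so it carries no evidence either way.
control = Evidence(trigger_fired=True, audit_logged=True, provider_sent=True)
target = Evidence(trigger_fired=False, audit_logged=False, provider_sent=False)
print(reset_verdict(target, control))
```

The control user is what turns "I found nothing" into "there was nothing to find": the same lookups must return a positive for a reset that is known to have gone out.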
The dot
Gmail ignores dots in the local part, so both spellings reach the same inbox. The auth database compares raw strings, so they are two different users.
real: <name>.<word>NN@gmail.com <- dot AFTER the name
typed: <name><word>.NN@gmail.com <- dot BEFORE the number
One misplaced dot explains the ten failed logins, the dead password reset, and why "already exists" still fired (autofill supplied the correct spelling only on the register screen). From where the user sat, her email was simply her email. She was right, and locked out anyway.
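The mismatch between the inbox and the auth database can be reproduced with a small sketch. The function names are hypothetical and the addresses are invented stand-ins for the redacted patterns above; the only rule applied is the one stated here, that Gmail ignores dots in the local part. Lowercasing is an extra assumption added for the comparison.

```python
def canonical_email(address: str) -> str:
    """Return a comparison key for an address.

    For gmail.com, dots in the local part are dropped, since they do not
    change which inbox receives the mail. Other domains are left alone.
    """
    local, sep, domain = address.strip().rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    domain = domain.lower()
    if domain == "gmail.com":
        # Assumption: lowercase as well as strip dots for Gmail.
        local = local.lower().replace(".", "")
    return f"{local}@{domain}"


def same_inbox(a: str, b: str) -> bool:
    """True when two addresses are delivered to the same mailbox."""
    return canonical_email(a) == canonical_email(b)


# Invented examples mirroring the two patterns: dot after the name vs.
# dot before the number.
real = "jane.doe42@gmail.com"
typed = "janedoe.42@gmail.com"

print(real == typed)            # raw string comparison, as the auth DB does
print(same_inbox(real, typed))  # what the mail provider effectively does
```

A check like `same_inbox` could back a signup or login hint along the lines of "an account exists under a different spelling of this address", which is one possible shape for a dot UX fix rather than a description of what was shipped.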
So what
Part 1's split was how they think. This round's split is who drives, and that one costs more in ops, because the scarce resource at 2am is human attention, not tokens. The engine that drove itself was also the one that refused to stop at a 200, and half the fixes I shipped only exist because of it: you cannot ship "fix the dot UX" if you never found the dot.
The full run-by-run breakdown has every step, every nudge in place, and the replay-decoding technique. Kimi and GLM still owe me a round.