Steven Gonsalvez

Software Engineer


Opus vs GPT on Real Ops: Same Brain Food, Different Brains

opus 4.7, gpt 5.5, claude code, codex cli, hermes agent, ai incident response, autonomous ops, zero touch ops, sre, causal analysis, predictive analysis, multi agent systems, agentic engineering, ai ops comparison, model selection, production incident

Quick disclaimer up front. This is not another "I built a Pokemon game with Codex versus Claude to see which one drew a more perfect VoxelBuilder character" blog. This is a real production incident at shotclubhouse, real users affected, and how three different model-and-harness combos actually behaved when handed the same alert with no human in the loop.

Right then. We have a thing at shotclubhouse called zero-touch ops. Watches sentry, posthog, crashlytics, the whatsapp escalation channel. When something looks dodgy, it routes the payload at agent cells and lets them have a go. No human prompt. No "please investigate". Just here is the noise, off you pop.

I baselined three cells this week against the same real production incident. Identical knowledge index (qmd + nano-graphrag over the codebase), identical skills bundle, identical MCP servers (bitwarden for secrets, supabase for prod, sentry, posthog, the lot). Variable was model + harness combo.

| Cell | Harness | Model |
|------|---------|-------|
| A | codex | gpt-5.5 |
| B | claudecode | opus 4.7 (1m) |
| C | hermes | gpt-5.5 |

Cells A and C are the model-control. Same model, different harness. Hold that thought.
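For flavour, the fan-out is roughly this shape. All names and fields below are made up for illustration; this is not the real auto-wire config, just the idea that only model and harness vary per cell:

```python
from dataclasses import dataclass

# Hypothetical cell definitions; names and fields are illustrative only.
@dataclass(frozen=True)
class Cell:
    name: str
    harness: str
    model: str

CELLS = [
    Cell("A", "codex", "gpt-5.5"),
    Cell("B", "claudecode", "opus-4.7"),
    Cell("C", "hermes", "gpt-5.5"),
]

def fan_out(payload: dict) -> list[dict]:
    """Hand the identical payload to every cell; only model + harness vary."""
    return [
        {"cell": c.name, "harness": c.harness, "model": c.model, "payload": payload}
        for c in CELLS
    ]

jobs = fan_out({
    "child_ids": ["uuid-a", "uuid-b"],
    "complaint": "unable to see my child account",
    "task": "investigate",
})
```

Cells A and C sharing `gpt-5.5` is what makes the harness-control comparison later possible.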

The incident

The shape of the bug, in bullets:

  • Child A: minor account, sat in prod for a week, no parent_relationships row stitching it to a parent.
  • Child B: same shape, same week, same gap.
  • Platform context: the entire trust story at shotclubhouse rests on parental consent. A child account with no parent stitched is the worst kind of orphan. Legally, ethically, and from a "we built this thing for parents" angle.
  • How it surfaced: parent message in the support channel: "unable to see my child account". Not the report path you want for that class of bug.
  • Auto-wire payload: two child UUIDs, the parent's complaint text, and an open-ended "investigate". No human prompt, no hint, no "check the migrations". Same payload to all three cells.

⚠️ Data hygiene corner

The auto-wire does not pump personal data into the agents. Only UUIDs cross the boundary. The accounts are not tagged with names, emails, or any PII in sentry or posthog. Every prod query the agent runs goes through the supabase or postgres CLI, and the query layer sanitises every statement before execution: any column tagged PII in the schema metadata is stripped from the projection list, so the agent never sees those columns even if it asks for SELECT *. Responses are UUID-keyed only. Agent reasoning traces are logged, but never the joined personal-data payload.
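A minimal sketch of that projection-stripping step. Table and column names here are made-up assumptions, not the real schema metadata:

```python
# Illustrative PII tagging; in reality this would come from schema metadata.
PII_COLUMNS = {
    "profiles": {"name", "email", "phone"},
}

def sanitise_projection(table: str, requested: list[str],
                        all_columns: dict[str, list[str]]) -> list[str]:
    """Strip PII-tagged columns from a projection. Expand SELECT * first
    so the wildcard cannot smuggle PII columns through."""
    cols = all_columns[table] if requested == ["*"] else requested
    blocked = PII_COLUMNS.get(table, set())
    return [c for c in cols if c not in blocked]

schema = {"profiles": ["id", "name", "email", "parent_id", "created_at"]}
safe = sanitise_projection("profiles", ["*"], schema)
# safe -> ["id", "parent_id", "created_at"]; name and email never leave the query layer
```

The important property: the stripping happens on the expanded column list, so `SELECT *` is no more dangerous than an explicit projection.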

Tempting to wave this away with "we use enterprise tier, it is fine". Do not. Even with the strongest data-processing addendum on the planet, the safer architecture is the one where the personal data simply never reaches the model in the first place. Enterprise tier is a contract. UUID-only is a property of the system. Properties beat contracts when it comes to your kids' data sitting in someone else's logs.

What Opus did

Opus pulled up the two reported accounts. Then went looking for cousins.

Found them. Ran a query over profiles joined against parent_relationships, came back with a list of other minor accounts in the same orphaned shape. Not just the two we knew about. The whole pattern.

Then it offered the fix. Concrete, executable: run the existing create_child_account() function with these specific parent IDs against this specific list of child UUIDs. Done. Unblock the lot.

What it did not do: open a single migration file. Did not look at when the children were created, did not look at what the schema was at creation time, did not look at what changed since. Treated symptoms as the entry point.

This is predictive risk assessment. Surface area first. Who else is at risk right now, what is the action.

What GPT did

GPT pulled up the two accounts. Then disappeared into supabase/migrations/ for what felt like a small eternity. Came back with this:

Likely root cause:

- Children created Mar 4, before parent_relationships cutover.
- create_child_user wrote legacy profiles.parent_id and auth metadata
  but did not insert parent_relationships row.
  (supabase/migrations/20260210200100_parental_consent_enforcement.sql:408-424)
- Mar 6 cutover made parent_relationships canonical, defers data backfill
  to "one-off scripts outside supabase/migrations".
  (supabase/migrations/20260306120000_parent_relationships_cutover.sql:1-4)
- Mar 21 added profiles.parent_id back as a derived backcompat column,
  not source of truth.
  (supabase/migrations/20260321100000_add_backward_compat_parent_id_column.sql:8-10)
- Since no parent_relationships row was ever inserted, derived
  profiles.parent_id resolves to null. Hence orphan.

Then it pieced together a timeline by walking the auth audit log:

| When | Event |
|------|-------|
| 2025-12-03 09:54 | Parent signs up. Real user, not a placeholder. |
| 2025-12-03 09:55 | Parent creates first child. Linked correctly. |
| Dec 2025 → early Mar 2026 | Parent logs in normally. Family intact. |
| 2026-03-04 22:06 | Parent creates second child via legacy code path. |
| 2026-03-04 22:12 | Parent creates third child via legacy code path. |
| 2026-03-06 12:00 | parent_relationships_cutover migration runs. New table created empty. Backfill deferred. |
| 2026-03-07 → 2026-03-13 | Parent keeps logging in. UI now reads from the empty new table. Children appear orphaned. |
| After 2026-03-13 | Parent's auth user disappears. No further audit rows. auth.users returns null for the original UUID. |
| 2026-04-04 | Same email re-registers as a different person. New auth UUID. Profile name changes entirely. Children now permanently orphaned. |

It went further. Dug into the foreign-key cascade rules:

parent_link_audit_log.parent_user_id REFERENCES profiles(id) ON DELETE SET NULL
parent_relationships.parent_user_id  REFERENCES auth.users(id) ON DELETE CASCADE

So when the original parent auth user got deleted somewhere between Mar 13 and Apr 4, two things happened atomically: audit rows nulled their parent refs, and any parent_relationships row that had been backfilled later (if any) would have been cascade-deleted. Same email later re-registers with a new UUID, profile name swaps, and the children sit there pointing at a ghost.
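The two delete behaviours side by side, demonstrated with sqlite standing in for postgres. Table shapes are simplified assumptions, but the `ON DELETE` semantics are the same:

```python
import sqlite3

# sqlite enforces foreign key actions only with this pragma on.
db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.executescript("""
    CREATE TABLE auth_users (id TEXT PRIMARY KEY);
    CREATE TABLE profiles (
        id TEXT PRIMARY KEY REFERENCES auth_users(id) ON DELETE CASCADE);
    CREATE TABLE parent_link_audit_log (
        id INTEGER PRIMARY KEY,
        parent_user_id TEXT REFERENCES profiles(id) ON DELETE SET NULL);
    CREATE TABLE parent_relationships (
        child_user_id TEXT,
        parent_user_id TEXT REFERENCES auth_users(id) ON DELETE CASCADE);
    INSERT INTO auth_users VALUES ('parent-uuid');
    INSERT INTO profiles VALUES ('parent-uuid');
    INSERT INTO parent_link_audit_log VALUES (1, 'parent-uuid');
    INSERT INTO parent_relationships VALUES ('child-uuid', 'parent-uuid');
""")

# The original parent auth user gets deleted, as happened in prod.
db.execute("DELETE FROM auth_users WHERE id = 'parent-uuid'")

audit_ref = db.execute(
    "SELECT parent_user_id FROM parent_link_audit_log").fetchone()
# audit_ref -> (None,): the audit row survives but its parent ref is nulled
remaining = db.execute(
    "SELECT count(*) FROM parent_relationships").fetchone()[0]
# remaining -> 0: the relationship row is cascade-deleted outright
```

SET NULL keeps the breadcrumb but erases the link; CASCADE erases the row entirely. Exactly the atomic double-hit described above.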

What it did not do: propose a single line of remediation. No band-aid. Just the clean autopsy.

This is causal analysis. Time-walked, schema-aware, history-driven. What happened, when, why.

Side by side

| | Opus | GPT |
|---|------|-----|
| Lane | Predictive risk assessment | Causal analysis |
| Time orientation | Present and forward | Past |
| Output shape | "Who else, what to run" | "What happened, when, why" |
| Action ready? | Yes. Concrete remediation script. | No. Diagnosis only. |
| Surfaced new affected users? | Yes. Pattern-found cousins. | No. Stuck to the two reported. |
| Found root cause? | No. Treated as standalone bug. | Yes. Migration cutover plus delete-cascade. |
| Migration archaeology? | Skipped entirely. | Walked every relevant migration. |

Two completely different shapes of brain. Same input.

The harness control

Hermes (gpt-5.5) and Codex (gpt-5.5) on the same incident. Same model, two different harnesses. Both went causal. Both walked the migration history. Both gave timelines. Both refused to propose a fix.

Opus, on a third harness, did the predictive thing.

The harness is not the variable. The model is. n=1 incident, hold it loosely, but it is what you would expect from the architectures.

What I shipped

Both. In that order.

  1. Ran Opus's band-aid first. The two reported users plus the cousins Opus surfaced got their parent links restored within minutes of detection. Affected users unblocked.
  2. Wrote the missing parent_relationships backfill migration GPT identified. Ran it across the full historical population. The deferred-backfill ghost from Mar 6 finally got its scripts.
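A sketch of what that backfill looks like, again on sqlite stand-ins. Whether the legacy link survived in a column or in auth metadata is a detail I am glossing over here; `legacy_parent_id` below is a stand-in name, and the idempotency guard is the bit that matters:

```python
import sqlite3

# Stand-in tables; real names and the legacy source of truth are assumptions.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE profiles (id TEXT PRIMARY KEY, legacy_parent_id TEXT);
    CREATE TABLE parent_relationships (
        child_user_id TEXT PRIMARY KEY, parent_user_id TEXT);
    INSERT INTO profiles VALUES
        ('child-1', 'parent-1'),  -- already stitched below
        ('child-2', 'parent-1'),  -- legacy-only: needs backfill
        ('child-3', NULL);        -- no legacy link: nothing to do
    INSERT INTO parent_relationships VALUES ('child-1', 'parent-1');
""")

# Idempotent backfill: only insert relationship rows that do not already exist.
db.execute("""
    INSERT INTO parent_relationships (child_user_id, parent_user_id)
    SELECT p.id, p.legacy_parent_id FROM profiles p
    WHERE p.legacy_parent_id IS NOT NULL
      AND NOT EXISTS (SELECT 1 FROM parent_relationships r
                      WHERE r.child_user_id = p.id)
""")
stitched = db.execute(
    "SELECT count(*) FROM parent_relationships").fetchone()[0]
# stitched -> 2: child-2 gets its row, child-1 is untouched, child-3 skipped
```

The `NOT EXISTS` guard means the migration can be re-run safely, which is what you want for a backfill over a full historical population.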

Neither alone was enough. Opus's path leaves the rake on the floor for the next minor account that gets created the same way. GPT's path leaves the existing affected users sat there while you write the migration.

The torn bit

I am genuinely torn between these two. Not because one is better. Because they are different jobs.

Causal wins when the question is "did this break silently for anyone else, and will it keep breaking for the next batch unless we fix the cause". GPT nailed that. Opus did not even open the migration folder.

Predictive wins when the question is "who is broken right now, what is the one-liner, run it". Opus nailed that. GPT did not propose a single line of remediation.

Ops needs both. With one cell you either ship a slow fix or a partial one. The shape that actually works is: route the same incident at one of each, merge the outputs, let a third agent or a human pick the order. Diagnose with the causal one, propose with the predictive one, ship in series.
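That merge step, as a sketch. The field names are illustrative, not the real cell output schema, but the ordering rule is the one I actually used:

```python
# Hypothetical merge of one predictive and one causal cell's output.
def merge(causal: dict, predictive: dict) -> list[dict]:
    """Ship the fast unblock first, then the root-cause fix, in series."""
    plan = []
    if predictive.get("remediation"):
        plan.append({"step": "unblock",
                     "action": predictive["remediation"],
                     "targets": predictive.get("affected", [])})
    if causal.get("root_cause"):
        plan.append({"step": "fix-cause",
                     "diagnosis": causal["root_cause"]})
    return plan

plan = merge(
    causal={"root_cause": "cutover deferred backfill; delete-cascade wiped links"},
    predictive={"remediation": "create_child_account(parent_id, child_id)",
                "affected": ["uuid-1", "uuid-2", "uuid-3"]},
)
# plan[0]["step"] -> "unblock", plan[1]["step"] -> "fix-cause"
```

Predictive output carries the affected-user list and the action; causal output carries the diagnosis. Neither cell has to know the other exists.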

Aside: the auto-wire

Brief plug. The system itself lives at getwololo.dev/docs, watches sentry/posthog/crashlytics/whatsapp for escalations, normalises the payload, fans it out at the cells defined in config. Each cell gets the same skills, same MCPs, same KB. Observation harness logs every tool call and the final summary so I can baseline. Full writeup is its own post, this one is about the brains.

So what

The lazy take from a side-by-side like this is "Opus is better at X, GPT is better at Y, pick one". That is not the lesson.

The lesson is that prompts and context do not land the same way on two top-shelf models. Same payload, same KB, same MCPs, different shape of reasoning out the other end. Which means the actual skill, the one that compounds, is knowing how to guide each model. Where it sprints, where it stalls, what it fills in confidently when it should be asking, what it ignores in the prompt and what it overweights. That is the part nobody writes down because it changes every model release and feels like vibes.

The next layer of zero-touch ops, for me, is treating each model as a teammate with a known lane and a known blind spot, then writing the cell config to play to it. Opus gets the surface-area framing. GPT gets the time-walked archaeology framing. The auto-wire merges. Human reads both before anything ships.

For now I am keeping both, especially for critical incident resolution. And I will be plugging in a couple more cells over the next few weeks (Kimi K2.6, GLM 4.6, maybe one of the smaller specialist models for the migration-archaeology lane specifically) to see whether the causal-vs-predictive split survives a wider field, or whether it is really an Opus-vs-GPT split that does not generalise.

Same brain food. Different brains. Both invited to the pub. More joining the round shortly.
