We gave the bot a harness. It caught a lie on day one.

A model is only as reliable as the scaffolding around it. So we wrapped the v4 engine in a harness: one command to verify the whole thing (./init.sh), and a feature_list.json where every feature carries the exact command that proves it works and a status that can only turn green when that command actually runs green. The rule is simple and unforgiving — the bot doesn’t get to declare something done, it has to prove it.

We ran it for the first time. Expected a victory lap. Got 146 passed, 1 failed.

The lie

Our notes said the suite was “137/137 passing.” It had said that for ten days. One test disagreed:

test_real_phase_a_data_produces_expected_pnl
Expected $65.98 but got $77.28

That test checks the cumulative P&L the one-time Phase A migration produced — a historical fact that should never change. Except it was reading the live replay file, the same file the */30 paper cron appends to every half hour. When we froze that number at $65.98, the file had 62 records. By the time the harness ran it, the cron had grown it to 138. The “historical” number had been quietly drifting upward the whole time.

Records then

Records now

138

Drift

$65.98→$77.28

Here’s the uncomfortable part: this wasn’t failing loudly. For days, anyone glancing at it would’ve seen a passing suite in their head and a number in their notes that no longer matched reality. The test wasn’t testing the migration — it was testing a moving target and calling whatever it found “correct.” That is worse than no test, because it feels like coverage.

The fix

Not “update the number to $77.28” — that just breaks again on the next cron tick. The bug is that a test was reading a live, mutating production file at all. So we froze the original 62 Phase A records as a committed fixture and pointed the test at that. Deterministic. Decoupled from the cron. The migration’s compute logic is still under test — against the data that actually produced $65.98, captured in amber. Zero lines of engine logic changed.

Before

146/147

After

147/147

Engine changes

Why it’s worth a journal entry

Because the harness did its one job: it made “green” mean something. The point of building scaffolding around an AI agent isn’t to slow it down — it’s to make the difference between “the code was written” and “the behavior was verified” impossible to fake. We’d been telling ourselves 137/137 for ten days. The first honest run cost us one number and bought back trust in all the others.

Same building-in-public rule as always: we publish the part where we were wrong. The migrate seed of $65.98 frozen in the inventory was always correct — the test guarding it was the thing rotting. Now it can’t.