An empirical study
If a prompt plus a capable agent can rebuild a whole project, is a prompt just a compressed file? We measured when that holds — and when it falls apart.
Abstract
A prompt is not a zip file. It behaves like a lossy, prior-relative compressor: it only has to encode the gap between your project and what the model already expects, and it trades size for accuracy along a curve — whereas a zip file is a single point fixed at perfect fidelity. We test this across four projects spanning a novelty spectrum, rebuilt 120 times under strict isolation, and graded only on behaviour. A fourth, realistic project sharpens the claim: a prompt collapses only when novelty is pervasive — when novelty is localized to a few arbitrary rules, even a thin prompt reconstructs most of the system, and the loss concentrates exactly on those rules.
The setup
We wrote all four from scratch so no model could have memorised them. They differ in how much behaviour is fixed by convention the model already knows versus arbitrary choices that must be stated — and, for the high-novelty pair, in whether that novelty is pervasive (touches every output) or localized (hides in a few rules).
The encoding
The same project, written two ways: a human-readable prompt and a compressed binary archive (xz of the source). Pick a project to see both, then every encoding drawn to scale by byte count.
The rules are universal convention. The model already knows them — almost nothing has to be said.
A conventional shape, but with semi-arbitrary cutoffs and one specific rounding rule the model has to be told.
Arbitrary discount and tax tables, a fixed order of operations, banker's rounding. Pure surprise — and every output depends on it, so it touches almost every check.
A realistic RBAC/ABAC engine: role inheritance, wildcard resource patterns, conditions. Most of it is conventional — the novelty hides in a few arbitrary precedence rules, so it touches only a minority of checks.
The measurement
No grading by eye. Each rebuild is run against a held-out test — a battery of inputs with known-correct outputs, exactly like a unit-test suite — that the agent never sees. Its score, fidelity, is just the fraction of checks it passes.
What a “check” is: one input→output assertion — e.g. price this specific order, expect total = $1,197.03. Each project ships its own test: 65 checks for roman numerals, 41 for the grade calculator, 53 for the pricing engine. Each cell is run 5×; how often those 5 rebuilds disagree is the divergence we report later.
Headline numbers
Every figure below traces straight back to the 80 runs. The small print under each number shows the arithmetic.
Results
120 reconstructions: 4 projects × 4 prompt tiers × 5 repeats on Haiku, plus Sonnet arms on the two high-novelty projects.
The fourth project
The pricing engine collapses to ~15% on a thin prompt. A bigger, more realistic project — an RBAC authorization engine, 7.5 KB of source across four files — does the opposite: its thinnest one-line prompt still passes ~83% of the test.
The difference is where the novelty lives. Pricing’s arbitrary constants feed every output, so a vague prompt gets almost everything wrong. RBAC is mostly conventional — basic allow, default-deny, role inheritance, wildcard matching are all in the model’s prior — and the genuinely arbitrary parts (a counterintuitive precedence rule, an edge in pattern matching) live in only a handful of checks. So the prompt reconstructs the bulk for free, and the lossiness concentrates precisely on those few rules. One rule — exact-action beating a wildcard deny — was missed by 27 of 40 rebuilds, even when spelled out.
Takeaway
A prompt is not a zip file. It is a lossy, prior-relative compressor: it encodes only the gap between your project and what the model already expects, and it trades size for fidelity along a curve. The archive is one point, pinned at 100%. The prompt is the whole curve — cheap and faithful when the work is conventional, expensive and unreliable exactly where it is genuinely, pervasively new.