An empirical study

Prompts as Lossy Compression

If a prompt plus a capable agent can rebuild a whole project, is a prompt just a compressed file? We measured when that holds — and when it falls apart.

prompt = compressed file · LLM agent = decompressor

Abstract

A prompt is not a zip file. It behaves like a lossy, prior-relative compressor: it only has to encode the gap between your project and what the model already expects, and it trades size for accuracy along a curve — whereas a zip file is a single point fixed at perfect fidelity. We test this across four projects spanning a novelty spectrum, rebuilt 120 times under strict isolation, and graded only on behaviour. A fourth, realistic project sharpens the claim: a prompt collapses only when novelty is pervasive — when novelty is localized to a few arbitrary rules, even a thin prompt reconstructs most of the system, and the loss concentrates exactly on those rules.


The setup

Four projects across a novelty spectrum

We wrote all four from scratch so no model could have memorised them. They differ in how much behaviour is fixed by convention the model already knows versus arbitrary choices that must be stated — and, for the high-novelty pair, in whether that novelty is pervasive (touches every output) or localized (hides in a few rules).

Low novelty: Roman numerals (65 behavioural checks)
The rules are universal convention. The model already knows them, so almost nothing has to be said.

Medium novelty: Grade calculator (41 behavioural checks)
A conventional shape, but with semi-arbitrary cutoffs and one specific rounding rule the model has to be told.

High · pervasive novelty: Pricing engine (53 behavioural checks)
Arbitrary discount and tax tables, a fixed order of operations, banker's rounding. Pure surprise, and every output depends on it, so it touches almost every check.

High · localized novelty: Authorization engine (47 behavioural checks)
A realistic RBAC/ABAC engine: role inheritance, wildcard resource patterns, conditions. Most of it is conventional; the novelty hides in a few arbitrary precedence rules, so it touches only a minority of checks.

The encoding

Two ways to encode the same project

The same project, written two ways: a human-readable prompt and a compressed binary archive (xz of the source). Each project below shows both, followed by every encoding drawn to scale by byte count.

Roman numerals (low novelty). The rules are universal convention. The model already knows them, so almost nothing has to be said.

Prompt · T1 · one line 581 B · 96% passed
## Required public API (must match exactly so automated tests can call it) Build a Python package named `roman_numerals` (importable as `import roman_numerals`) exposing exactly these functions at the package top level: - `to_roman(n: int) -> str` - `from_roman(s: str) -> int` - `is_valid_roman(s: str) -> bool` Use only the Python standard library. Functions are called positionally (e.g. `to_roman(1984)`, `from_roman("MCMLXXXIV")`, `is_valid_roman("IV")`). # Roman numerals Implement Roman numeral conversion: integer to numeral, numeral to integer, and a validity check.
Human-readable — you can read and audit every rule.
Prompt · T2 · paragraph 824 B · 97% passed
## Required public API (must match exactly so automated tests can call it) Build a Python package named `roman_numerals` (importable as `import roman_numerals`) exposing exactly these functions at the package top level: - `to_roman(n: int) -> str` - `from_roman(s: str) -> int` - `is_valid_roman(s: str) -> bool` Use only the Python standard library. Functions are called positionally (e.g. `to_roman(1984)`, `from_roman("MCMLXXXIV")`, `is_valid_roman("IV")`). # Roman numerals Implement standard Roman numeral handling: - `to_roman` converts a positive integer to its Roman numeral string. - `from_roman` converts a Roman numeral string back to its integer value. - `is_valid_roman` reports whether a string is a well-formed Roman numeral. Use the usual subtractive notation (e.g. 4 = IV, 9 = IX, 40 = XL, 90 = XC).
Human-readable — you can read and audit every rule.
Prompt · T3 · full spec 1,076 B · 99% passed
## Required public API (must match exactly so automated tests can call it) Build a Python package named `roman_numerals` (importable as `import roman_numerals`) exposing exactly these functions at the package top level: - `to_roman(n: int) -> str` - `from_roman(s: str) -> int` - `is_valid_roman(s: str) -> bool` Use only the Python standard library. Functions are called positionally (e.g. `to_roman(1984)`, `from_roman("MCMLXXXIV")`, `is_valid_roman("IV")`). # Roman numerals Implement standard Roman numerals over the range **1 to 3999 inclusive**. - `to_roman(n)` returns the canonical numeral, using subtractive pairs CM, CD, XC, XL, IX, IV. (e.g. 1994 = MCMXCIV, 2024 = MMXXIV.) - `from_roman(s)` parses a canonical numeral back to its integer. - `is_valid_roman(s)` returns True only for a canonical numeral in range: symbols I, V, X, L, C, D, M; I/X/C repeat at most 3 times; V, L, D never repeat; only the canonical subtractive pairs are allowed. The empty string and lowercase letters are invalid. `from_roman(to_roman(n)) == n` for every n in range.
Human-readable — you can read and audit every rule.
Prompt · T4 · exhaustive 1,545 B · 100% passed
## Required public API (must match exactly so automated tests can call it) Build a Python package named `roman_numerals` (importable as `import roman_numerals`) exposing exactly these functions at the package top level: - `to_roman(n: int) -> str` - `from_roman(s: str) -> int` - `is_valid_roman(s: str) -> bool` Use only the Python standard library. Functions are called positionally (e.g. `to_roman(1984)`, `from_roman("MCMLXXXIV")`, `is_valid_roman("IV")`). # Roman numerals Implement standard Roman numerals over the range **1 to 3999 inclusive**. **Symbols:** I=1, V=5, X=10, L=50, C=100, D=500, M=1000. **`to_roman(n)`** — greedy conversion using this value/symbol table, largest first: (1000,M), (900,CM), (500,D), (400,CD), (100,C), (90,XC), (50,L), (40,XL), (10,X), (9,IX), (5,V), (4,IV), (1,I). Raises `ValueError` if `n` is not an int or is outside 1..3999. (Treat `bool` as not a valid int.) **`is_valid_roman(s)`** — returns True iff `s` is a *canonical* numeral for some value in 1..3999. Canonical means it matches: `^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$`. The empty string returns False; any non-string or lowercase input returns False. So "IIII", "VV", "IC", "MMMM" are invalid; "IV", "MCMXCIV", "MMMCMXCIX" are valid. **`from_roman(s)`** — first checks `is_valid_roman(s)`; if invalid, raises `ValueError`. Otherwise parses by scanning symbols: add each symbol's value, but subtract when a smaller symbol precedes a larger one. Invariant: `from_roman(to_roman(n)) == n` for all n in 1..3999.
Human-readable — you can read and audit every rule.
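To make the T4 prompt concrete, here is a minimal sketch of one implementation that follows it rule by rule. It is not the study's reference source, only one reading of the prompt. The helper names `_TABLE`, `_VALUES`, and `_CANONICAL` are our own, and the sketch applies the quoted pattern with `fullmatch` instead of `^...$` so that a trailing newline is not accepted.

```python
import re

# Value/symbol table from the T4 prompt, largest first.
_TABLE = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]
_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
# Canonical-form pattern quoted in the prompt (anchors replaced by fullmatch).
_CANONICAL = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")


def to_roman(n):
    # bool is a subclass of int, so it is rejected explicitly, as the prompt asks.
    if isinstance(n, bool) or not isinstance(n, int) or not 1 <= n <= 3999:
        raise ValueError(f"expected an int in 1..3999, got {n!r}")
    parts = []
    for value, symbol in _TABLE:
        count, n = divmod(n, value)  # greedy: take as many of this symbol as fit
        parts.append(symbol * count)
    return "".join(parts)


def is_valid_roman(s):
    # The pattern alone also matches the empty string, so rule that out first.
    return isinstance(s, str) and s != "" and _CANONICAL.fullmatch(s) is not None


def from_roman(s):
    if not is_valid_roman(s):
        raise ValueError(f"not a canonical Roman numeral: {s!r}")
    total = 0
    for i, ch in enumerate(s):
        value = _VALUES[ch]
        # Subtract when a smaller symbol precedes a larger one, otherwise add.
        if i + 1 < len(s) and value < _VALUES[s[i + 1]]:
            total -= value
        else:
            total += value
    return total
```

Everything the sketch needs beyond common convention (the range limit, the `bool` rejection, the empty-string rule) is exactly what separates the T4 prompt from the shorter tiers.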
vs
Archive · xz of the source 1,160 B
fd 37 7a 58 5a 00 00 04 e6 d6 b4 46 02 00 21 01 16 00 00 00 74 2f e5 a3 e0 05 cb 02 9b 5d 00 33 1c 8a 22 6f a9 32 25 33 7f 62 20 9d d8 19 b0 c2 b7 13 c7 56 ce 0f 8a 93 a9 dd 35 59 bd 6b 02 ba 4b 92 e7 b3 52 1a 77 47 3c 9a 3c ff 31 6c ca a8 6d ee 1a 6b 19 c9 d9 d6 64 74 50 55 88 22 53 2e 04 6f f1 2c be a2 f1 5e 44 eb 81 50 64 60 9d a7 64 94 24 d5 93 42 f8 07 3c 52 79 11 96 e4 04 73 33 3b a7 8b 60 df 5c d9 51 08 2b 52 f8 fd d6 9b ac ef 23 9c fd 07 56 25 d8 51 c1 54 d5 47 a6 b6
Denser, but opaque binary — only a decompressor can read it.
T1 · one line: 581 B is 0.50× the 1,160-B xz archive, but only 96% faithful.
T2 · paragraph: 824 B is 0.71× the 1,160-B xz archive, but only 97% faithful.
T3 · full spec: 1,076 B is 0.93× the 1,160-B xz archive, but only 99% faithful.
T1 to T3 are cheaper than the archive, but lossy; the archive is always 100%.
T4 · exhaustive: 100% faithful, but 1,545 B is 1.33× the 1,160-B xz archive. Spelling out every rule in prose costs more bytes than the byte compressor needs.
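The size ratios quoted above are plain division of prompt bytes by archive bytes, rounded to two decimals. A small sketch, using the Roman numerals byte counts from this page (the function name `ratio_to_archive` is ours):

```python
def ratio_to_archive(prompt_bytes: int, archive_bytes: int) -> float:
    """Prompt size as a multiple of the archive size, rounded to 2 decimals."""
    return round(prompt_bytes / archive_bytes, 2)


# Byte counts for the Roman numerals project, as reported on this page.
ROMAN_TIERS = {"T1": 581, "T2": 824, "T3": 1076, "T4": 1545}
XZ_BYTES = 1160

for tier, size in ROMAN_TIERS.items():
    side = "smaller" if size < XZ_BYTES else "larger"
    print(f"{tier}: {size} B is {ratio_to_archive(size, XZ_BYTES):.2f}x the xz archive ({side})")
```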

Every encoding of this project, drawn to scale (bar length = bytes · label = Haiku fidelity, mean of 5)

Source · uncompressed: 1,484 B (the original)
T1 · one line: 581 B, 96% passed
T2 · paragraph: 824 B, 97% passed
T3 · full spec: 1,076 B, 99% passed
T4 · exhaustive: 1,545 B, 100% passed
ZIP: 1,214 B, 100% (lossless)
GZ: 1,200 B, 100% (lossless)
XZ: 1,160 B, 100% (lossless)
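For readers who want to reproduce the lossless side of the comparison, the following sketch measures ZIP, GZ, and XZ sizes for a single source file using only the Python standard library. The page does not state which tools or compression settings produced its numbers, so the defaults used here are an assumption and exact byte counts may differ; `encoded_sizes` and the member name `module.py` are hypothetical.

```python
import gzip
import io
import lzma
import zipfile


def encoded_sizes(source: bytes, name: str = "module.py") -> dict:
    """Byte counts for the raw source and three lossless encodings of it."""
    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, source)
    return {
        "raw": len(source),
        "zip": len(zip_buffer.getvalue()),
        "gz": len(gzip.compress(source)),
        "xz": len(lzma.compress(source)),  # default container is .xz
    }


sample = b"def f(x):\n    return x + 1\n" * 50
sizes = encoded_sizes(sample)
# Lossless means the round trip restores every byte.
assert lzma.decompress(lzma.compress(sample)) == sample
```

An xz stream always begins with the magic bytes `fd 37 7a 58 5a 00`, which is why every archive dump on this page starts the same way.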

Grade calculator (medium novelty). A conventional shape, but with semi-arbitrary cutoffs and one specific rounding rule the model has to be told.

Prompt · T1 · one line 694 B · 86% passed
## Required public API (must match exactly so automated tests can call it) Build a Python package named `grade_book` (importable as `import grade_book`) exposing exactly these functions at the package top level: - `letter_grade(score) -> str` — a numeric score to a letter grade - `grade_points(letter: str) -> float` — a letter grade to GPA grade points - `gpa(scores: list) -> float` — mean grade points across a list of scores Use only the Python standard library. Functions are called positionally (e.g. `letter_grade(91.5)`, `grade_points("B+")`, `gpa([95, 85, 75])`). # Grade book Build a grade calculator that converts numeric scores to letter grades and computes a GPA.
Human-readable — you can read and audit every rule.
Prompt · T2 · paragraph 1,002 B · 88% passed
## Required public API (must match exactly so automated tests can call it) Build a Python package named `grade_book` (importable as `import grade_book`) exposing exactly these functions at the package top level: - `letter_grade(score) -> str` — a numeric score to a letter grade - `grade_points(letter: str) -> float` — a letter grade to GPA grade points - `gpa(scores: list) -> float` — mean grade points across a list of scores Use only the Python standard library. Functions are called positionally (e.g. `letter_grade(91.5)`, `grade_points("B+")`, `gpa([95, 85, 75])`). # Grade book Build a grade calculator on a standard US letter-grade scale with +/- modifiers. - `letter_grade(score)` rounds the score to the nearest whole number, then maps it to a letter (A+ down to F) using conventional grade-band cutoffs. - `grade_points(letter)` returns the GPA points for a letter on the usual 4.0 scale. - `gpa(scores)` averages the grade points of the scores and rounds the result.
Human-readable — you can read and audit every rule.
Prompt · T3 · full spec 1,075 B · 93% passed
## Required public API (must match exactly so automated tests can call it) Build a Python package named `grade_book` (importable as `import grade_book`) exposing exactly these functions at the package top level: - `letter_grade(score) -> str` — a numeric score to a letter grade - `grade_points(letter: str) -> float` — a letter grade to GPA grade points - `gpa(scores: list) -> float` — mean grade points across a list of scores Use only the Python standard library. Functions are called positionally (e.g. `letter_grade(91.5)`, `grade_points("B+")`, `gpa([95, 85, 75])`). # Grade book - `letter_grade(score)`: round the score to the nearest integer, then bucket by these cutoffs (minimum score for each letter): A+ 97, A 93, A- 90, B+ 87, B 83, B- 80, C+ 77, C 73, C- 70, D+ 67, D 63, D- 60, F below 60. - `grade_points(letter)`: A+ 4.0, A 4.0, A- 3.7, B+ 3.3, B 3.0, B- 2.7, C+ 2.3, C 2.0, C- 1.7, D+ 1.3, D 1.0, D- 0.7, F 0.0. - `gpa(scores)`: convert each score to a letter, then to grade points, average them, and round to 2 decimal places.
Human-readable — you can read and audit every rule.
Prompt · T4 · exhaustive 1,467 B · 100% passed
## Required public API (must match exactly so automated tests can call it) Build a Python package named `grade_book` (importable as `import grade_book`) exposing exactly these functions at the package top level: - `letter_grade(score) -> str` — a numeric score to a letter grade - `grade_points(letter: str) -> float` — a letter grade to GPA grade points - `gpa(scores: list) -> float` — mean grade points across a list of scores Use only the Python standard library. Functions are called positionally (e.g. `letter_grade(91.5)`, `grade_points("B+")`, `gpa([95, 85, 75])`). # Grade book **Rounding rule:** wherever a value is rounded below, round half *up* (`decimal.ROUND_HALF_UP`), not banker's rounding. - `letter_grade(score)`: first round `score` to the nearest integer (ties up), then bucket by minimum-score cutoffs, highest first: A+ ≥97, A ≥93, A- ≥90, B+ ≥87, B ≥83, B- ≥80, C+ ≥77, C ≥73, C- ≥70, D+ ≥67, D ≥63, D- ≥60, otherwise F. Raise `ValueError` if score is outside 0..100. (So 96.5 → 97 → "A+"; 89.5 → 90 → "A-"; 72.5 → 73 → "C".) - `grade_points(letter)`: A+ 4.0, A 4.0, A- 3.7, B+ 3.3, B 3.0, B- 2.7, C+ 2.3, C 2.0, C- 1.7, D+ 1.3, D 1.0, D- 0.7, F 0.0. Note **A+ caps at 4.0** (not 4.3). Raise `ValueError` for an unknown letter. - `gpa(scores)`: map each score → letter → grade points, take the arithmetic mean, and round half-up to 2 decimals. An empty list returns 0.0.
Human-readable — you can read and audit every rule.
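The rounding rule is the detail the shorter grade-book prompts leave out, and it is easy to get wrong in Python because the built-in `round` uses banker's rounding (`round(96.5)` is 96). Below is a minimal sketch of the T4 prompt, assuming scores arrive as ints or floats; it is one reading of the prompt, not the study's source, and the helper `_half_up` is our own name.

```python
from decimal import Decimal, ROUND_HALF_UP

# Minimum rounded score for each letter, highest first (from the prompt).
_CUTOFFS = [
    (97, "A+"), (93, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"), (67, "D+"), (63, "D"), (60, "D-"),
]
_POINTS = {
    "A+": 4.0, "A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0, "B-": 2.7,
    "C+": 2.3, "C": 2.0, "C-": 1.7, "D+": 1.3, "D": 1.0, "D-": 0.7, "F": 0.0,
}


def _half_up(value, exponent: str) -> Decimal:
    # Going through str() avoids binary float noise such as 72.4999999.
    return Decimal(str(value)).quantize(Decimal(exponent), rounding=ROUND_HALF_UP)


def letter_grade(score) -> str:
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score!r}")
    rounded = int(_half_up(score, "1"))  # ties go up: 96.5 -> 97
    for minimum, letter in _CUTOFFS:
        if rounded >= minimum:
            return letter
    return "F"


def grade_points(letter: str) -> float:
    if letter not in _POINTS:
        raise ValueError(f"unknown letter grade: {letter!r}")
    return _POINTS[letter]  # note A+ is capped at 4.0


def gpa(scores) -> float:
    if not scores:
        return 0.0
    points = [Decimal(str(grade_points(letter_grade(s)))) for s in scores]
    mean = sum(points) / len(points)
    return float(mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))
```

With the default `round`, a score of 96.5 would land on "A" instead of "A+", which is the kind of boundary case a behavioural check can catch.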
vs
Archive · xz of the source 1,272 B
fd 37 7a 58 5a 00 00 04 e6 d6 b4 46 02 00 21 01 16 00 00 00 74 2f e5 a3 e0 06 9b 03 15 5d 00 33 1c 8a 22 6f a9 32 25 33 7f 62 20 9d d8 19 b0 c2 b7 13 58 a9 1c e8 c7 cb 6e d5 89 29 75 a8 f2 ab 0b 3c 37 a8 fc 9d 38 58 4f e1 49 2f 48 af b2 c0 d2 9f c3 87 1f 93 49 15 f5 08 6b ae e1 a5 ff 69 52 cb 27 bc 7b 07 20 a8 ca 14 bc 7f f5 c0 6f 44 8a 08 43 e3 ea 85 e1 86 2e 89 00 a5 97 b5 c0 1f a3 62 9c e9 a4 a4 72 bc 8e e8 16 8d 82 c8 45 36 17 e7 65 96 ef 90 85 4d dd cf 62 54 ca ba e2 37
Denser, but opaque binary — only a decompressor can read it.
T1 · one line: 694 B is 0.55× the 1,272-B xz archive, but only 86% faithful.
T2 · paragraph: 1,002 B is 0.79× the 1,272-B xz archive, but only 88% faithful.
T3 · full spec: 1,075 B is 0.85× the 1,272-B xz archive, but only 93% faithful.
T1 to T3 are cheaper than the archive, but lossy; the archive is always 100%.
T4 · exhaustive: 100% faithful, but 1,467 B is 1.15× the 1,272-B xz archive. When the content is arbitrary, prose can’t out-pack the byte compressor.

Every encoding of this project, drawn to scale (bar length = bytes · label = Haiku fidelity, mean of 5)

Source · uncompressed: 1,692 B (the original)
T1 · one line: 694 B, 86% passed
T2 · paragraph: 1,002 B, 88% passed
T3 · full spec: 1,075 B, 93% passed
T4 · exhaustive: 1,467 B, 100% passed
ZIP: 1,326 B, 100% (lossless)
GZ: 1,338 B, 100% (lossless)
XZ: 1,272 B, 100% (lossless)

Pricing engine (high, pervasive novelty). Arbitrary discount and tax tables, a fixed order of operations, banker's rounding. Pure surprise, and every output depends on it, so it touches almost every check.

Prompt · T1 · one line 1,060 B · 15% passed
## Required public API (must match exactly so automated tests can call it) Build a Python package named `pricing_engine` (importable as `import pricing_engine`) exposing exactly these names at the package top level: - `LineItem(sku: str, qty: int, unit_price: Decimal)` - `Customer(id: str, tier: str)` — tier is one of "bronze", "silver", "gold" - `Order(customer: Customer, items: list[LineItem], region: str, coupon_code: str | None = None)` — region is one of "US-CA", "US-NY", "EU", "NONE" - `PricedOrder` with these fields, each a `Decimal`: `subtotal, volume_discount, tier_discount, coupon_discount, taxable_base, tax, total`, plus `lines`. - `PricingEngine` with a method `price_order(order: Order) -> PricedOrder`. All monetary values are `decimal.Decimal`. Constructors accept keyword arguments (e.g. `LineItem(sku="a", qty=1, unit_price=Decimal("9.99"))`). # Build a pricing engine Build an order-pricing engine that applies volume discounts, customer-tier discounts, coupon codes, and regional sales tax, returning a full breakdown.
Human-readable — you can read and audit every rule.
Prompt · T2 · paragraph 1,742 B · 17% passed
## Required public API (must match exactly so automated tests can call it) Build a Python package named `pricing_engine` (importable as `import pricing_engine`) exposing exactly these names at the package top level: - `LineItem(sku: str, qty: int, unit_price: Decimal)` - `Customer(id: str, tier: str)` — tier is one of "bronze", "silver", "gold" - `Order(customer: Customer, items: list[LineItem], region: str, coupon_code: str | None = None)` — region is one of "US-CA", "US-NY", "EU", "NONE" - `PricedOrder` with these fields, each a `Decimal`: `subtotal, volume_discount, tier_discount, coupon_discount, taxable_base, tax, total`, plus `lines`. - `PricingEngine` with a method `price_order(order: Order) -> PricedOrder`. All monetary values are `decimal.Decimal`. Constructors accept keyword arguments (e.g. `LineItem(sku="a", qty=1, unit_price=Decimal("9.99"))`). # Build a pricing engine Build an order-pricing engine. `price_order` runs this fixed pipeline, in order: 1. **Subtotal** = sum of `qty * unit_price` over all line items. 2. **Volume discount** on the subtotal, based on how large the subtotal is (bigger orders get a bigger percentage off). 3. **Customer-tier discount** applied to the amount remaining after the volume discount (gold customers get more off than silver; bronze gets none). 4. **Coupon discount** applied to the amount remaining after the tier discount, supporting a percentage-off code, a fixed-amount-off code, and a buy-one-get-one code. 5. **Sales tax** applied to the post-discount taxable base, at a rate that depends on the region. `total` = taxable base + tax. Populate every field of `PricedOrder` with the intermediate amounts. Money is rounded to 2 decimal places.
Human-readable — you can read and audit every rule.
Prompt · T3 · full spec 1,986 B · 94% passed
## Required public API (must match exactly so automated tests can call it) Build a Python package named `pricing_engine` (importable as `import pricing_engine`) exposing exactly these names at the package top level: - `LineItem(sku: str, qty: int, unit_price: Decimal)` - `Customer(id: str, tier: str)` — tier is one of "bronze", "silver", "gold" - `Order(customer: Customer, items: list[LineItem], region: str, coupon_code: str | None = None)` — region is one of "US-CA", "US-NY", "EU", "NONE" - `PricedOrder` with these fields, each a `Decimal`: `subtotal, volume_discount, tier_discount, coupon_discount, taxable_base, tax, total`, plus `lines`. - `PricingEngine` with a method `price_order(order: Order) -> PricedOrder`. All monetary values are `decimal.Decimal`. Constructors accept keyword arguments (e.g. `LineItem(sku="a", qty=1, unit_price=Decimal("9.99"))`). # Build a pricing engine `price_order` runs this fixed pipeline, in order. Each step's rounded result feeds the next. 1. **Subtotal** = sum of `qty * unit_price` over all line items. 2. **Volume discount** on the subtotal. Highest matching threshold wins: - subtotal >= 10000 → 18% - subtotal >= 5000 → 12% - subtotal >= 1000 → 5% - otherwise 0% `after_volume = subtotal - volume_discount` 3. **Customer-tier discount** on `after_volume`: - bronze → 0%, silver → 3%, gold → 7% `after_tier = after_volume - tier_discount` 4. **Coupon discount** on `after_tier`: - `SAVE10` → 10% of `after_tier` - `FLAT50` → $50 off (but not more than `after_tier`) - `BOGO` → on the line with the lowest `unit_price`, `qty // 2` units are free (discount = that unit_price × free units) - any other code → no discount `taxable_base = after_tier - coupon_discount` 5. **Sales tax** on `taxable_base`, by region: - US-CA → 8.25%, US-NY → 8.875%, EU → 20%, NONE → 0% `total = taxable_base + tax`. All monetary amounts are rounded to 2 decimals.
Human-readable — you can read and audit every rule.
Prompt · T4 · exhaustive 2,843 B · 100% passed
## Required public API (must match exactly so automated tests can call it) Build a Python package named `pricing_engine` (importable as `import pricing_engine`) exposing exactly these names at the package top level: - `LineItem(sku: str, qty: int, unit_price: Decimal)` - `Customer(id: str, tier: str)` — tier is one of "bronze", "silver", "gold" - `Order(customer: Customer, items: list[LineItem], region: str, coupon_code: str | None = None)` — region is one of "US-CA", "US-NY", "EU", "NONE" - `PricedOrder` with these fields, each a `Decimal`: `subtotal, volume_discount, tier_discount, coupon_discount, taxable_base, tax, total`, plus `lines`. - `PricingEngine` with a method `price_order(order: Order) -> PricedOrder`. All monetary values are `decimal.Decimal`. Constructors accept keyword arguments (e.g. `LineItem(sku="a", qty=1, unit_price=Decimal("9.99"))`). # Build a pricing engine `price_order` runs this fixed pipeline, in order. **Rounding rule (applies to every monetary quantity below):** round to exactly 2 decimal places using banker's rounding (`decimal.ROUND_HALF_EVEN`, i.e. `quantize(Decimal("0.01"), ROUND_HALF_EVEN)`). Each discount is rounded *before* it is subtracted, so each step consumes an already-rounded number. 1. **Subtotal** = `sum(unit_price * qty)` over all line items, then rounded. 2. **Volume discount** = `round(subtotal * rate)`, where `rate` is chosen by the highest matching threshold (check in this order, first match wins): - subtotal >= 10000 → 0.18 - subtotal >= 5000 → 0.12 - subtotal >= 1000 → 0.05 - else → 0.00 `after_volume = subtotal - volume_discount` 3. **Tier discount** = `round(after_volume * rate)`: - bronze → 0.00, silver → 0.03, gold → 0.07 `after_tier = after_volume - tier_discount` 4. **Coupon discount** on `after_tier` (rounded): - `SAVE10` → `round(after_tier * 0.10)` - `FLAT50` → `round(min(after_tier, 50.00))` (never discount more than the base) - `BOGO` → pick the line item with the lowest `unit_price` (if tie, any is fine); `free_units = that line's qty // 2` (integer floor division); discount = `round(that unit_price * free_units)` - missing coupon, or any unrecognized code → 0.00 (silently ignored) `taxable_base = after_tier - coupon_discount`; if this is negative, clamp it to 0.00. 5. **Tax** = `round(taxable_base * rate)` by region: - US-CA → 0.0825, US-NY → 0.08875, EU → 0.20, NONE → 0.00 `total = taxable_base + tax`. **Validation:** `LineItem` rejects `qty <= 0` and negative `unit_price` (raise `ValueError`); `Customer` rejects a tier not in {bronze, silver, gold}. Populate every `PricedOrder` field: `subtotal`, `volume_discount`, `tier_discount`, `coupon_discount`, `taxable_base`, `tax`, `total`, and `lines` (the list of input line items).
Human-readable — you can read and audit every rule.
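The pipeline in the T4 pricing prompt can be sketched in a few lines once the rounding rule is pinned down. The version below is deliberately reduced: it takes plain `(qty, unit_price)` pairs and returns a dict instead of the `LineItem`, `Order`, and `PricedOrder` classes the prompt requires, and it skips validation. Those simplifications and the names `money` and `price_order_sketch` are ours; the tables and order of operations come from the prompt.

```python
from decimal import Decimal, ROUND_HALF_EVEN

CENT = Decimal("0.01")
VOLUME = [  # highest threshold first; first match wins
    (Decimal("10000"), Decimal("0.18")),
    (Decimal("5000"), Decimal("0.12")),
    (Decimal("1000"), Decimal("0.05")),
]
TIER = {"bronze": Decimal("0.00"), "silver": Decimal("0.03"), "gold": Decimal("0.07")}
TAX = {"US-CA": Decimal("0.0825"), "US-NY": Decimal("0.08875"),
       "EU": Decimal("0.20"), "NONE": Decimal("0.00")}


def money(value) -> Decimal:
    """Round to 2 decimals with banker's rounding (ties go to the even digit)."""
    return Decimal(value).quantize(CENT, rounding=ROUND_HALF_EVEN)


def price_order_sketch(items, tier, region, coupon=None) -> dict:
    """items is a list of (qty, unit_price) pairs with Decimal prices."""
    subtotal = money(sum((price * qty for qty, price in items), Decimal("0")))
    rate = next((r for threshold, r in VOLUME if subtotal >= threshold), Decimal("0"))
    volume_discount = money(subtotal * rate)
    after_volume = subtotal - volume_discount
    tier_discount = money(after_volume * TIER[tier])
    after_tier = after_volume - tier_discount

    if coupon == "SAVE10":
        coupon_discount = money(after_tier * Decimal("0.10"))
    elif coupon == "FLAT50":
        coupon_discount = money(min(after_tier, Decimal("50.00")))
    elif coupon == "BOGO" and items:
        qty, unit_price = min(items, key=lambda item: item[1])  # cheapest line
        coupon_discount = money(unit_price * (qty // 2))
    else:
        coupon_discount = Decimal("0.00")  # missing or unrecognized code

    taxable_base = max(after_tier - coupon_discount, Decimal("0.00"))
    tax = money(taxable_base * TAX[region])
    return {
        "subtotal": subtotal, "volume_discount": volume_discount,
        "tier_discount": tier_discount, "coupon_discount": coupon_discount,
        "taxable_base": taxable_base, "tax": tax, "total": taxable_base + tax,
    }


priced = price_order_sketch([(2, Decimal("600.00"))], "gold", "US-CA", "SAVE10")
```

Because each step consumes an already-rounded number, changing the rounding mode or the order of the discounts shifts every downstream field, which is consistent with the page's description of this novelty as pervasive.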
vs
Archive · xz of the source 2,548 B
fd 37 7a 58 5a 00 00 04 e6 d6 b4 46 02 00 21 01 16 00 00 00 74 2f e5 a3 e0 17 02 07 97 5d 00 33 1c 8a 22 6f a9 32 41 af a2 0b f3 56 d5 39 60 bd c5 11 6f e7 e3 33 3c 05 a2 27 81 e7 0e cb ea 5f b4 d6 53 e3 a3 62 02 34 5f b0 18 7f 83 4c 0f f9 b3 a4 d2 06 f3 2b 03 18 c6 1e 0b 8d f3 28 d7 b3 54 ce 35 70 70 a0 39 87 69 32 a7 c6 5a b3 2b 35 3c 32 97 68 d3 fd d4 f8 00 e3 b7 23 78 ec 32 f7 83 67 ef 98 88 49 55 20 33 d2 8f ca 4f 9d 64 a3 ea 28 7c 97 5b f0 04 db 2a fc f0 ff cb 53 56 91
Denser, but opaque binary — only a decompressor can read it.
T1 · one line: 1,060 B is 0.42× the 2,548-B xz archive, but only 15% faithful.
T2 · paragraph: 1,742 B is 0.68× the 2,548-B xz archive, but only 17% faithful.
T3 · full spec: 1,986 B is 0.78× the 2,548-B xz archive, but only 94% faithful.
T1 to T3 are cheaper than the archive, but lossy; the archive is always 100%.
T4 · exhaustive: 100% faithful, but 2,843 B is 1.12× the 2,548-B xz archive. When the content is arbitrary, prose can’t out-pack the byte compressor.

Every encoding of this project, drawn to scale (bar length = bytes · label = Haiku fidelity, mean of 5)

Source · uncompressed: 5,891 B (the original)
T1 · one line: 1,060 B, 15% passed
T2 · paragraph: 1,742 B, 17% passed
T3 · full spec: 1,986 B, 94% passed
T4 · exhaustive: 2,843 B, 100% passed
ZIP: 3,146 B, 100% (lossless)
GZ: 2,684 B, 100% (lossless)
XZ: 2,548 B, 100% (lossless)

Authorization engine (high, localized novelty). A realistic RBAC/ABAC engine: role inheritance, wildcard resource patterns, conditions. Most of it is conventional; the novelty hides in a few arbitrary precedence rules, so it touches only a minority of checks.

Prompt · T1 · one line 1,324 B · 83% passed
## Required public API (must match exactly so automated tests can call it) Build a Python package named `rbac` (importable as `import rbac`) exposing exactly these names at the package top level: - `Permission(action: str, resource: str, effect: str = "allow", conditions: dict | None = None)` — `action` is an action name (e.g. "read") or "*"; `resource` is a `:`-delimited pattern; `effect` is "allow" or "deny"; `conditions` maps context-key to required value. - `Role(name: str, permissions: list[Permission] = [], inherits: list[str] = [])` - `Principal(id: str, roles: list[str] = [], attributes: dict | None = None)` - `Decision` with fields: `allowed: bool`, `effect: str`, `matched: str`, `reason: str`. - `AuthorizationEngine(roles: list[Role])` with a method `check(principal: Principal, action: str, resource: str, context: dict | None = None) -> Decision`. All constructors accept keyword arguments (e.g. `Permission(action="read", resource="doc:*", effect="allow")`). Resources and patterns are `:`-delimited token strings such as `"doc:reports:q4"`. # Build an authorization engine Build a role-based access-control engine: principals hold roles, roles grant allow/deny permissions on resource patterns and can inherit other roles, and `check` returns whether an action on a resource is allowed.
Human-readable — you can read and audit every rule.
Prompt · T2 · paragraph 2,058 B · 94% passed
## Required public API (must match exactly so automated tests can call it) Build a Python package named `rbac` (importable as `import rbac`) exposing exactly these names at the package top level: - `Permission(action: str, resource: str, effect: str = "allow", conditions: dict | None = None)` — `action` is an action name (e.g. "read") or "*"; `resource` is a `:`-delimited pattern; `effect` is "allow" or "deny"; `conditions` maps context-key to required value. - `Role(name: str, permissions: list[Permission] = [], inherits: list[str] = [])` - `Principal(id: str, roles: list[str] = [], attributes: dict | None = None)` - `Decision` with fields: `allowed: bool`, `effect: str`, `matched: str`, `reason: str`. - `AuthorizationEngine(roles: list[Role])` with a method `check(principal: Principal, action: str, resource: str, context: dict | None = None) -> Decision`. All constructors accept keyword arguments (e.g. `Permission(action="read", resource="doc:*", effect="allow")`). Resources and patterns are `:`-delimited token strings such as `"doc:reports:q4"`. # Build an authorization engine Build a role-based access-control engine. A `Permission` grants or denies an `action` on a `resource` pattern, optionally gated by `conditions`. A `Role` bundles permissions and may inherit other roles. A `Principal` holds a set of roles. `check` decides whether a principal may perform an action on a resource. - **Resource patterns** are `:`-delimited. A `*` token is a wildcard; a trailing `*` matches any deeper path. - **Roles inherit** permissions from the roles they list in `inherits`, transitively. - **Conditions** are matched against a request context (the principal's attributes plus the `context` argument). - When several permissions match, the **most specific** one wins, and **deny** takes precedence over **allow** when they are equally specific. If nothing matches, the request is denied. 
Return a `Decision` recording whether it was `allowed`, the winning `effect`, which permission `matched`, and a short `reason`.
Human-readable — you can read and audit every rule.
Prompt · T3 · full spec 3,039 B · 98% passed
## Required public API (must match exactly so automated tests can call it) Build a Python package named `rbac` (importable as `import rbac`) exposing exactly these names at the package top level: - `Permission(action: str, resource: str, effect: str = "allow", conditions: dict | None = None)` — `action` is an action name (e.g. "read") or "*"; `resource` is a `:`-delimited pattern; `effect` is "allow" or "deny"; `conditions` maps context-key to required value. - `Role(name: str, permissions: list[Permission] = [], inherits: list[str] = [])` - `Principal(id: str, roles: list[str] = [], attributes: dict | None = None)` - `Decision` with fields: `allowed: bool`, `effect: str`, `matched: str`, `reason: str`. - `AuthorizationEngine(roles: list[Role])` with a method `check(principal: Principal, action: str, resource: str, context: dict | None = None) -> Decision`. All constructors accept keyword arguments (e.g. `Permission(action="read", resource="doc:*", effect="allow")`). Resources and patterns are `:`-delimited token strings such as `"doc:reports:q4"`. # Build an authorization engine `check(principal, action, resource, context=None)` decides a request by collecting every permission that applies to the principal and resolving conflicts. Implement it as follows. **1. Effective permissions.** Collect the permissions of every role the principal holds, plus the permissions of roles reached transitively through `inherits`. Inheritance cycles must not loop forever (visit each role once). Role names that are not defined are ignored. **2. Which permissions apply.** A permission applies to the request when all of: - **Action matches:** `permission.action == action`, or `permission.action == "*"`. - **Resource matches:** patterns and resources are `:`-delimited token lists. A `*` token matches exactly one token. 
If the *last* pattern token is `*`, it is a prefix wildcard: the tokens before it must match and any remaining resource tokens are accepted (so `doc:*` matches `doc:reports:q4`). Otherwise the pattern and resource must have the same number of tokens. - **Conditions hold:** build a context mapping from the principal's `attributes` combined with the `context` argument, then every entry in `permission.conditions` must equal the matching value in that mapping. Empty conditions always hold. **3. Resolve conflicts.** Among the applying permissions, the **most specific** resource pattern wins — a pattern with more literal (non-`*`) tokens is more specific, and an exact pattern (no wildcard at all) beats one containing a wildcard. When the winning specificity is tied, **deny beats allow**. So a more specific *allow* overrides a less specific *deny* — this is most-specific-wins, not deny-overrides-everything. **4. Default.** If no permission applies, the request is denied. Return a `Decision` with `allowed` (true iff the winning effect is "allow"), the winning `effect`, a `matched` string identifying the deciding permission (or "<default>"), and a short `reason`.
Human-readable — you can read and audit every rule.
Prompt · T4 · exhaustive 3,888 B · 86% passed
## Required public API (must match exactly so automated tests can call it)

Build a Python package named `rbac` (importable as `import rbac`) exposing exactly these names at the package top level:

- `Permission(action: str, resource: str, effect: str = "allow", conditions: dict | None = None)` — `action` is an action name (e.g. "read") or "*"; `resource` is a `:`-delimited pattern; `effect` is "allow" or "deny"; `conditions` maps context-key to required value.
- `Role(name: str, permissions: list[Permission] = [], inherits: list[str] = [])`
- `Principal(id: str, roles: list[str] = [], attributes: dict | None = None)`
- `Decision` with fields: `allowed: bool`, `effect: str`, `matched: str`, `reason: str`.
- `AuthorizationEngine(roles: list[Role])` with a method `check(principal: Principal, action: str, resource: str, context: dict | None = None) -> Decision`.

All constructors accept keyword arguments (e.g. `Permission(action="read", resource="doc:*", effect="allow")`). Resources and patterns are `:`-delimited token strings such as `"doc:reports:q4"`.

# Build an authorization engine

`check(principal, action, resource, context=None)` runs this exact procedure.

**1. Effective permissions.** Starting from `principal.roles`, do a depth-first walk following each role's `inherits`. Visit each role at most once (so cycles terminate). Skip any role name not defined in the engine. Collect each visited role's permissions, remembering which role each came from.

**2. Filter to applying permissions.** A permission applies iff all three hold:

- **Action:** `permission.action == action` OR `permission.action == "*"`.
- **Resource:** split both pattern and resource on `":"`.
  - If the pattern's **last** token is `"*"` (trailing/prefix wildcard): let `prefix = pattern_tokens[:-1]`. It matches iff `len(resource_tokens) >= len(prefix)` and, for each `i < len(prefix)`, `prefix[i] == "*"` or `prefix[i] == resource_tokens[i]`. (So `doc:*` matches `doc`, `doc:reports`, and `doc:reports:q4`; `*` matches everything.)
  - Otherwise: it matches iff the token counts are equal and each pattern token is `"*"` or equals the resource token at that position (so `doc:*:q4` matches `doc:reports:q4` but not `doc:reports:q3` or `doc:reports`).
- **Conditions:** build `ctx = {**principal.attributes, **(context or {})}` — i.e. the principal's attributes overlaid by the `context` argument, with `context` winning on conflicts. Every key/value in `permission.conditions` must equal `ctx.get(key)`. Empty conditions always pass.

**3. Rank and pick a winner.** For each applying permission compute the tuple

```
key = (literal_token_count, exact_flag, action_exact_flag, deny_flag)
```

where `literal_token_count` = number of pattern tokens that are not `"*"`; `exact_flag` = 1 if the pattern contains no `"*"` at all else 0; `action_exact_flag` = 1 if `permission.action != "*"` else 0; `deny_flag` = 1 if `effect == "deny"` else 0. The winner is the permission with the **largest** key (compare tuples left to right). This means: most literal tokens wins; then exact-over-wildcard; then exact-action-over-`*`; and finally, only as the last tie-break, **deny over allow**.

**4. Decision.**

- If at least one permission applies: `effect` = the winner's effect; `allowed = (effect == "allow")`; `matched` = a string identifying the winning permission (include its role, effect, action, and resource).
- If none apply: `allowed = False`, `effect = "deny"`, `matched = "<default>"`.

Always populate `Decision.reason` with a short human-readable explanation.

**Validation.** `Permission` raises `ValueError` if `effect` is not "allow" or "deny". `Role` raises `ValueError` on an empty name. A `Permission` with `conditions=None` behaves as having no conditions; a `Principal` with `attributes=None` behaves as having empty attributes.
Human-readable — you can read and audit every rule.
vs
Archive · xz of the source 3,340 B
fd 37 7a 58 5a 00 00 04 e6 d6 b4 46 02 00 21 01 16 00 00 00 74 2f e5 a3 e0 1d 6f 0a 59 5d 00 11 68 0c 44 07 39 ce 3c b6 de c4 54 3d 85 e8 ac bb 0e 87 96 14 14 55 81 13 09 32 37 5b b3 8b c2 42 b0 ec a3 5f e9 51 e9 16 29 5f 87 ab 43 09 a9 2c d1 3e b6 c1 4a 5f 6b 67 a2 9f 5e 54 a7 90 91 af 8d f2 34 24 8f 23 92 b0 a2 b0 d5 34 63 14 09 5e 91 fa c4 fe 52 31 15 ea 31 dd b7 47 b8 0a 1e c6 9b ab 6b 0c fc 7c bc b9 15 47 2a 2d 36 f4 ad 9a e1 11 c3 a7 ad 97 05 66 60 e4 9c d8 0e 1d db db
Denser, but opaque binary — only a decompressor can read it.
T1 · one line: 1,324 B (0.40× the 3,340-B xz archive) but only 83% faithful. Smaller than the archive, but lossy; the archive is always 100%.
T2 · paragraph: 2,058 B (0.62× the 3,340-B xz archive) but only 94% faithful. Smaller than the archive, but lossy; the archive is always 100%.
T3 · full spec: 3,039 B (0.91× the 3,340-B xz archive) but only 98% faithful. Smaller than the archive, but lossy; the archive is always 100%.
T4 · exhaustive: 3,888 B (1.16× the 3,340-B xz archive) and only 86% faithful. Both larger and lossy; the archive is always 100%.
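The resource rule in step 2 of the T4 prompt above is compact enough to sketch directly. The following is a minimal Python reading of that rule only, not the study's reference implementation; the function name `resource_matches` is hypothetical.

```python
def resource_matches(pattern: str, resource: str) -> bool:
    """Resource-matching rule as described in step 2 of the quoted prompt."""
    pattern_tokens = pattern.split(":")
    resource_tokens = resource.split(":")

    if pattern_tokens[-1] == "*":
        # Trailing wildcard: only the tokens before the final "*" are compared,
        # and the resource may be longer than that prefix.
        prefix = pattern_tokens[:-1]
        return len(resource_tokens) >= len(prefix) and all(
            tok == "*" or tok == resource_tokens[i]
            for i, tok in enumerate(prefix)
        )

    # No trailing wildcard: token counts must be equal, and each pattern
    # token must be "*" or equal the resource token at the same position.
    return len(pattern_tokens) == len(resource_tokens) and all(
        p == "*" or p == r for p, r in zip(pattern_tokens, resource_tokens)
    )


# Examples taken from the prompt text itself.
print(resource_matches("doc:*", "doc"))                # True
print(resource_matches("doc:*:q4", "doc:reports:q4"))  # True
print(resource_matches("doc:*:q4", "doc:reports"))     # False
```

The edge the prompt calls out is that `doc:*` also matches the bare resource `doc`, which a rebuild working only from convention could plausibly get wrong.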

Every encoding of this project, drawn to scale (bar length = bytes · label = Haiku fidelity, mean of 5)

Source · uncompressed: 7,536 B (the original)
T1 · one line: 1,324 B, 83% passed
T2 · paragraph: 2,058 B, 94% passed
T3 · full spec: 3,039 B, 98% passed
T4 · exhaustive: 3,888 B, 86% passed
ZIP: 3,831 B, 100% (lossless)
GZ: 3,500 B, 100% (lossless)
XZ: 3,340 B, 100% (lossless)
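The size ratios quoted for each tier follow directly from the byte counts in this chart. As a quick check, this short Python sketch recomputes each encoding's size relative to the xz archive; the dictionary layout and variable names are illustrative only.

```python
# Byte counts as listed in the chart for the authorization engine.
XZ_BYTES = 3340

encodings = {
    "source": 7536,
    "T1": 1324,
    "T2": 2058,
    "T3": 3039,
    "T4": 3888,
    "zip": 3831,
    "gz": 3500,
    "xz": XZ_BYTES,
}

# Size of each encoding as a multiple of the xz archive, to two decimals.
ratios = {name: round(size / XZ_BYTES, 2) for name, size in encodings.items()}

for name, ratio in ratios.items():
    print(f"{name:>6}: {encodings[name]:>5} B  {ratio:.2f}x xz")
```

Only T4 crosses 1.00×, which is why it is the one prompt described as both larger than the archive and lossy.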

The measurement

How a reconstruction is graded

No grading by eye. Each rebuild is run against a held-out test — a battery of inputs with known-correct outputs, exactly like a unit-test suite — that the agent never sees. Its score, fidelity, is just the fraction of checks it passes.

  1. Prompt: the spec, plus a fixed API header so the test can call the result
  2. Fresh agent: standard library only · no repo access · never sees the reference code or the test
  3. Rebuilt package: an importable program
  4. Held-out test: 65 / 41 / 53 input→output checks it has never seen
  =  Fidelity: the fraction of checks it passes

What a “check” is: one input→output assertion — e.g. price this specific order, expect total = $1,197.03. Each project ships its own test: 65 checks for roman numerals, 41 for the grade calculator, 53 for the pricing engine. Each cell is run 5×; how often those 5 rebuilds disagree is the divergence we report later.
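The grading rule described here can be sketched in a few lines. This is a minimal Python illustration, not the study's harness: the names `fidelity` and `naive_to_roman`, the four sample checks, and the choice to count a crash as a failed check are all assumptions made for the example.

```python
from typing import Any, Callable


def fidelity(func: Callable[..., Any], checks: list[tuple[tuple, Any]]) -> float:
    """Fraction of input->output checks that `func` passes."""
    passed = 0
    for args, expected in checks:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            # Assumption: a rebuild that raises on an input fails that check.
            pass
    return passed / len(checks)


def naive_to_roman(n: int) -> str:
    """A deliberately incomplete rebuild: it ignores subtractive notation."""
    out = []
    for value, symbol in [(1000, "M"), (500, "D"), (100, "C"), (50, "L"),
                          (10, "X"), (5, "V"), (1, "I")]:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


# Hypothetical held-out checks (not the study's): (arguments, expected output).
sample_checks = [((1,), "I"), ((4,), "IV"), ((6,), "VI"), ((9,), "IX")]

print(fidelity(naive_to_roman, sample_checks))  # 0.5
```

The incomplete converter gets 1 and 6 right but emits `IIII` and `VIIII` for 4 and 9, so it passes two of the four checks.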

A note on the name. In the code this test is called the oracle (a testing term for a trusted source of correct answers). It is nothing more exotic than a held-out behavioural test — so that is what we call it here.

Headline numbers

Four findings, and where each comes from

Every figure below traces straight back to the 120 runs. The small print under each number shows the arithmetic.

96% → 17%
Fidelity at roughly half the xz byte budget — low-novelty vs high-novelty.
Roman numerals’ 581-B prompt (half of xz’s 1,160 B) passes 96% of its 65 checks. A similarly small pricing-engine prompt passes only ~15–17% of its 53.
1.12–1.33×
How much larger a perfectly faithful prompt is than the xz archive.
The T4 prompt that scores 100%: roman 1,545 / 1,160 = 1.33×; pricing 2,843 / 2,548 = 1.12×. Spelled-out prose is never as dense as a byte compressor.
up to 96%
Of outputs disagree across identical reruns of an underspecified prompt.
Pricing engine, T2 prompt, the same prompt run 5×: 96.2% of outputs differ. Roman numerals at T4: 0%. A real codec is never random.
120 runs
Independent reconstructions behind every number on this page.
4 projects × 4 prompt tiers × 5 repeats on Haiku = 80, plus Sonnet arms on the two high-novelty projects = 40.
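The small-print arithmetic above is easy to recheck. The sketch below recomputes it in Python from the numbers as quoted; it assumes the Sonnet arms use the same 4 tiers × 5 repeats as the Haiku runs, which is what the stated total of 40 implies.

```python
# Run counts: every project on Haiku, plus Sonnet on the two high-novelty ones.
TIERS, REPEATS = 4, 5
haiku_runs = 4 * TIERS * REPEATS    # 4 projects
sonnet_runs = 2 * TIERS * REPEATS   # assumption: same tiers and repeats
total_runs = haiku_runs + sonnet_runs

# Size of the fully faithful T4 prompt relative to the xz archive.
roman_ratio = round(1545 / 1160, 2)
pricing_ratio = round(2843 / 2548, 2)

# The 581-B roman-numeral prompt as a share of the 1,160-B xz archive.
roman_half_budget = round(581 / 1160, 2)

print(haiku_runs, sonnet_runs, total_runs)  # 80 40 120
print(roman_ratio, pricing_ratio)           # 1.33 1.12
print(roman_half_budget)                    # 0.5
```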

Results

The curves

120 reconstructions: 4 projects × 4 prompt tiers × 5 repeats on Haiku, plus Sonnet arms on the two high-novelty projects.

In plain terms — what “decompressing” a prompt means. The prompt is your instructions for rebuilding a code library the AI can’t see — like describing a picture so a friend can redraw it. How well it works depends on how much the AI already knows. Conventional stuff it fills in for free, so a tiny prompt rebuilds it almost perfectly. Genuinely made-up stuff it can’t guess, so a short prompt rebuilds little and you must spell out every detail — and wherever you leave a gap, the AI guesses, differently each run (so, unlike unzipping a file, the same prompt can produce different programs). The punchline: how much you must write depends not on how big the project is, but on how much of it is new to the AI. Spread the novelty through everything (the pricing engine) and a thin prompt collapses to ~15%. Tuck it into a few odd rules (the authorization engine) and a thin prompt still rebuilds ~83%, failing only on those rules.
fidelity vs prompt detail
Figure 1 — More detail buys more fidelity, but the shape depends on novelty. Roman numerals start near the top and stay there. The pricing engine (pervasive novelty) sits near zero until the prompt finally writes out its arbitrary tables — then it snaps to ~100%. The RBAC engine (localized novelty) starts high and only inches up: most of it is conventional, so little has to be said. Its big T4 error bar is a single Haiku run that melted down — the other four, and all of Sonnet, scored 100%.
size vs fidelity frontier
Figure 2 — Size against fidelity. The vertical line is the xz archive (always 100%). For low novelty, prompts sit to its left — smaller and faithful. For high novelty, the only prompt that reaches 100% sits to its right — larger than the archive.
decompressor strength
Figure 3 — A stronger agent does not recover missing information. When the prompt omits the arbitrary constants, Sonnet is no more accurate than Haiku — it is only more consistent from one run to the next.
run-to-run divergence
Figure 4 — Underspecified prompts decompress non-deterministically. Run the same thin prompt five times and the outputs disagree (up to 96% of the time). The model is filling gaps by guessing, and it guesses differently each time.
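One plausible way to compute the divergence shown in Figure 4 is the fraction of checks on which repeated rebuilds of the same prompt do not all return the same output. The page does not spell out the exact formula, so the sketch below is an assumed definition; the function name `divergence` and the sample outputs are hypothetical.

```python
def divergence(outputs_per_run: list[list[object]]) -> float:
    """Fraction of checks where the runs do not all agree.

    `outputs_per_run[r][i]` is the output of rebuild `r` on check `i`.
    Assumed definition; the study's own metric may differ in detail.
    """
    n_checks = len(outputs_per_run[0])
    disagreeing = 0
    for i in range(n_checks):
        # repr() lets unhashable outputs (lists, dicts) be compared too.
        distinct = {repr(run[i]) for run in outputs_per_run}
        if len(distinct) > 1:
            disagreeing += 1
    return disagreeing / n_checks


# Three hypothetical rebuilds answering the same four checks.
runs = [
    ["I", "IV", "VI", "IX"],
    ["I", "IIII", "VI", "IX"],
    ["I", "IV", "VI", "VIIII"],
]
print(divergence(runs))  # 0.5
```

Under this reading, 96.2% on a 53-check suite would correspond to 51 disagreeing checks (51 / 53 ≈ 0.962). The page does not state that count, so treat it as an inference.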

The fourth project

When the collapse doesn’t happen

The pricing engine collapses to ~15% on a thin prompt. A bigger, more realistic project — an RBAC authorization engine, 7.5 KB of source across four files — does the opposite: its thinnest one-line prompt still passes ~83% of the test.

The difference is where the novelty lives. Pricing’s arbitrary constants feed every output, so a vague prompt gets almost everything wrong. RBAC is mostly conventional — basic allow, default-deny, role inheritance, wildcard matching are all in the model’s prior — and the genuinely arbitrary parts (a counterintuitive precedence rule, an edge in pattern matching) live in only a handful of checks. So the prompt reconstructs the bulk for free, and the lossiness concentrates precisely on those few rules. One rule — exact-action beating a wildcard deny — was missed by 27 of 40 rebuilds, even when spelled out.
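The frequently missed rule is presumably the third component of the ranking tuple in the T4 prompt: an exact action outranks a `*` action before deny-over-allow is even considered. A minimal Python sketch of that ranking, with hypothetical names (`Perm`, `rank_key`, `pick_winner`) and assuming both permissions have already passed the matching step:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Perm:
    action: str    # an action name, or "*"
    resource: str  # ":"-delimited pattern
    effect: str    # "allow" or "deny"


def rank_key(p: Perm) -> tuple:
    """(literal_token_count, exact_flag, action_exact_flag, deny_flag)."""
    tokens = p.resource.split(":")
    return (
        sum(tok != "*" for tok in tokens),  # literal tokens in the pattern
        int("*" not in p.resource),         # pattern has no wildcard at all
        int(p.action != "*"),               # action is named exactly
        int(p.effect == "deny"),            # deny is only the last tie-break
    )


def pick_winner(applying: list) -> Perm:
    """The permission with the largest key wins (tuples compare left to right)."""
    return max(applying, key=rank_key)


allow_exact = Perm("read", "doc:*", "allow")
deny_any = Perm("*", "doc:*", "deny")
deny_exact = Perm("read", "doc:*", "deny")

print(pick_winner([deny_any, allow_exact]).effect)    # allow
print(pick_winner([allow_exact, deny_exact]).effect)  # deny
```

A rebuild that follows the conventional "deny always wins" rule would return `deny` in the first case, which is the kind of behaviour a convention-driven guess tends to produce.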

The sharpened claim. A prompt collapses only when novelty is pervasive. When novelty is localized, even a sub-archive prompt rebuilds most of the system — and the loss is surgical, landing exactly on the arbitrary rules and nowhere else.

Takeaway

A prompt is not a zip file. It is a lossy, prior-relative compressor: it encodes only the gap between your project and what the model already expects, and it trades size for fidelity along a curve. The archive is one point, pinned at 100%. The prompt is the whole curve — cheap and faithful when the work is conventional, expensive and unreliable exactly where it is genuinely, pervasively new.