Making the frontier model cheaper *inside the editor* — an honest eval of preview-awareness and directness
- #yee
A companion to [phd.md](phd.md). That thesis asks whether SFT makes a small local model more correct. This one asks a different question about the frontier models we rent: can the editor's own ecosystem — its prompts, its live preview, its command policy — make the model running inside it spend fewer and steadier tokens to solve the same problem, without making it worse?
The short answer is yes, and the size of the win depends entirely on the model's failure mode. The honest answer is that most of it is a variance win, not a headline accuracy win — and we measured it rather than asserted it.
Abstract
A user reported gpt-5.2, mid-edit, asking to run `npm run build` to "verify" a CSS change while a hot-reloading dev preview was already running. The model was blind to its own ecosystem. We shipped a per-turn preview-awareness note and a deterministic dev/serve command interceptor that never touches `build`, then built a synthetic agent-loop harness to ask whether any of it actually pays. A third intervention, a "work directly" directive, followed from what the harness exposed. The behavior metric we expected to move (does the model stop running redundant builds?) turned out to be near-ceiling at baseline ("barely better"). The metric that did move is tokens-per-task on reasoning models, and the mechanism is decisiveness: the note cuts gpt-5.2's mean tokens by 32% and collapses its per-task token variance from CV 75% to 22% (n=5–6). gpt-5.5 stayed noisy until log forensics showed why: it spelunks the shell (`cat`/`sed`/`find`/`grep`, re-running the same build) instead of using its tools. A directness directive then cut its mean by 38% (n=12) and pulled its worst case from 34.2k to 18.2k tokens. All headline numbers here are directional: we have not yet run n≥20, and we report the single observed worst case rather than P90. Read the magnitudes as direction and the variance collapse as the robust finding. The per-org projection (§9) is an illustrative scenario, not a model.
What we claim — and what we refuse to claim
We claim:
- The preview-awareness note is safe: across every run, the type-check control still ran a real `tsc`/`typecheck` (it never suppresses a legitimate build).
- On reasoning models the note is a token-and-variance lever: gpt-5.2 −32% mean tokens, variance CV 75%→22% (n=5–6; a smaller n=2 probe showed −63%).
- gpt-5.5's token noise is diagnosable and steerable, not intrinsically unmeasurable: it is over-investigation via the shell, and a directness directive cuts the mean by 38% and the worst case by 47% (n=12).
We refuse to claim:
- That this makes the model more correct. It does not; it makes it cheaper and more predictable at the same outcome.
- A headline behavior improvement on "redundant builds" — baseline was already near-ceiling in our harness; the live anecdote is real but hard to reproduce synthetically.
- Any per-organization dollar figure as a measurement. §9 is an illustrative scenario with stated assumptions, not a model.
- Statistical significance. Every result here is directional at small `n`. The harness is built to crank `n`; we have not yet.
0. The observation
In a live easy-mode session, gpt-5.2 edited src/styles.css and src/pages/index.astro, then emitted a run_terminal call for npm run build — popping an approval modal at a user who had no idea why. A dev preview was already live at localhost:4322. The model's instinct ("edited code → build to verify") is correct in the abstract and wrong in this ecosystem: a dev server hot-reloads the edit already, and a production build does not even feed that preview. Nothing in the model's context told it the preview existed. The editor never taught the agent about its own ecosystem.
1. The thesis
The editor is not a passive host for a frontier model. Its prompts, its preview, and its command policy form an ecosystem that can make the model inside it spend fewer and steadier tokens for the same result — a per-task win that is negligible per person and large at population scale.
This is the recursive version of the local-model thesis: a code editor whose ecosystem makes the code-editing models inside it sharper is, itself, a code editor that ships better code editors.
2. The interventions (and where they live)
- Preview-awareness note: `app/synthwave.js` `previewContextNote(repo)`, injected per-turn into `ctx.system` and merged with `AGENT_SYSTEM_PROMPT` at `main.js`. Three-way by state: RUNNING (don't run dev/serve to preview; it's already live), PAUSED (tell the user to press ▶), STATIC (no dev server, so building is the right verification). A no-spawn `preview:probe` distinguishes app-vs-static.
- Dev/serve interceptor: `app/agentLoop.js` `classifyRedundantDev`. When a preview is verified-live (liveness re-checked via `preview:status` asserting the pid is alive, keyed to the owning repo captured at send time), a redundant `npm run dev`/`start`/`serve` is skipped with a ~70-token note instead of spawning a duplicate server and ingesting its logs. `build` is categorically excluded: a production build catches type/bundle errors a hot-reload server never shows, so it is never redundant. A force hatch (`--force` / `YEE_FORCE=1`) always runs the real command. Verified by `bench/dev-policy-test.js` (32/32: build never intercepted, dev/serve only when live, force respected).
- Directness directive (`WORK DIRECTLY`), added in §7 after the noise forensics: use the structured read/search/list tools, not `cat`/`sed`/`find`/`grep`/`ls`; never re-run the same command; open the file you need and stop.
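To make the first two interventions concrete, here is a minimal sketch of a three-way note and a dev/serve classifier. It is not the code in `app/synthwave.js` or `app/agentLoop.js`: the function names `previewNoteSketch` and `classifyDevSketch`, the note wording, the regular expressions, and the `"run"`/`"skip"` return values are all assumptions made for illustration. The liveness check is reduced to a boolean the caller supplies.

```javascript
// Hypothetical sketch; not the shipped implementation.

// Three-way note keyed on preview state (wording is illustrative).
function previewNoteSketch(state) {
  switch (state) {
    case "RUNNING":
      return "A dev preview is already live; do not run dev/serve to preview.";
    case "PAUSED":
      return "The preview is paused; tell the user to press ▶.";
    case "STATIC":
      return "No dev server for this repo; building is the right verification.";
    default:
      return ""; // unknown state: inject nothing
  }
}

// Matches `npm run dev`, `npm run start`, `npm run serve` (optionally with args).
const DEV_COMMAND = /^npm\s+run\s+(?:dev|start|serve)(?=\s|$)/;

// Decide whether a terminal command is a redundant dev-server launch.
// `previewLive` stands in for the verified-live check; `force` for the force hatch.
function classifyDevSketch(command, { previewLive, force = false }) {
  const cmd = command.trim();
  if (force || /(?:^|\s)--force(?:\s|$)/.test(cmd)) return "run"; // force always wins
  if (!DEV_COMMAND.test(cmd)) return "run"; // build, typecheck, etc. are never intercepted
  return previewLive ? "skip" : "run"; // skip only when a preview is verified live
}
```

The ordering of the checks carries the policy: the force hatch is tested first, anything that is not a dev/serve command falls through to `"run"`, and only then does preview liveness matter. That is why `npm run build` can never be skipped in this sketch, whatever the preview state.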
3. The harness — `bench/preview_awareness_eval.js`
A synthetic agent loop: real callProvider turns against a faked tool environment (read returns a small file, edit returns success, run_terminal returns plausible output and is recorded). Up to 5 turns per run; we count the commands issued and sum usage.total across turns.
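The loop described above can be sketched roughly as follows. This is not `bench/preview_awareness_eval.js`: the reply shape (`{ usage: { total }, toolCall: { name, args } }`), the message format, and the canned tool outputs are assumptions, and `callProvider` is passed in as a parameter so the sketch can be exercised with a stub instead of a real model.

```javascript
// Hypothetical sketch of a synthetic agent loop with a faked tool environment.

// Every tool returns canned output; terminal commands are recorded, never executed.
function makeFakeTools(recordedCommands) {
  return {
    read: () => "button { color: blue; }", // a small file
    edit: () => "ok",                      // edits always succeed
    run_terminal: ({ command }) => {
      recordedCommands.push(command);
      return "done in 1.2s";               // plausible output
    },
  };
}

// Run one task for at most `maxTurns` turns; return issued commands and token total.
async function runTask(callProvider, task, maxTurns = 5) {
  const commands = [];
  const tools = makeFakeTools(commands);
  const messages = [{ role: "user", content: task }];
  let totalTokens = 0;

  for (let turn = 0; turn < maxTurns; turn++) {
    const reply = await callProvider(messages);
    totalTokens += reply.usage.total;      // sum usage across turns
    if (!reply.toolCall) break;            // no tool call: the model is done
    const { name, args } = reply.toolCall;
    const result = tools[name] ? tools[name](args) : `unknown tool: ${name}`;
    messages.push({ role: "tool", name, content: result });
  }
  return { commands, totalTokens };
}
```

Because the terminal tool only records, a run can issue `npm run build` as often as it likes without side effects, which is what makes the command list safe to use as a behavior metric.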
Two scenarios, chosen to catch both failure modes of a preview note:
- visual: recolor a button while a preview is running. Good = edits and runs no dev/serve/build (the note's job).
- typecheck: "make sure it type-checks." Good = the model does run a build/`tsc` (the note must not suppress legitimate verification).
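The two "good" definitions above could be scored along these lines. The patterns, the `edited` flag, and the function name `isGoodRun` are assumptions; the harness's actual matching rules are not shown in this document.

```javascript
// Hypothetical scoring sketch for the two scenarios.
const DEV_OR_BUILD = /\bnpm\s+run\s+(?:dev|start|serve|build)\b/;
const TYPE_CHECK = /\b(?:tsc|npm\s+run\s+(?:typecheck|build))\b/;

// `run` is { edited: boolean, commands: string[] } for one completed task.
function isGoodRun(scenario, run) {
  if (scenario === "visual") {
    // Good: made the edit and launched no dev/serve/build.
    return run.edited && !run.commands.some((c) => DEV_OR_BUILD.test(c));
  }
  if (scenario === "typecheck") {
    // Good: actually ran a build or type-check.
    return run.commands.some((c) => TYPE_CHECK.test(c));
  }
  throw new Error(`unknown scenario: ${scenario}`);
}
```

Scoring the scenarios in opposite directions is the point of the design: a note that simply suppressed every command would score perfectly on visual and fail typecheck.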
A good note maximizes both. Metrics: behavior (the two rates above) and tokens-per-task, reported with CV (coefficient of variation = sd/mean) because the interesting effect is variance, and a min/max ratio is hostage to a single efficient run. The harness interleaves per step (each step is a full model×variant comparison) and appends one JSONL row per run — crash-safe, partial-readable. Reproduce: node bench/preview_awareness_eval.js --models … --variants … --n … --jsonl out.jsonl.
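As a rough illustration of how the per-run JSONL rows could be turned into the mean and CV columns, here is a small summarizer. The row fields (`model`, `variant`, `tokens`) are assumed names, not the harness's documented schema, and the sketch uses the sample standard deviation (dividing by n − 1); the document does not say which estimator the harness uses.

```javascript
// Hypothetical sketch: group JSONL rows by model/variant and report mean and CV.
function summarize(jsonlText) {
  const groups = new Map();
  for (const line of jsonlText.split("\n")) {
    if (!line.trim()) continue;
    let row;
    try {
      row = JSON.parse(line);
    } catch {
      continue; // tolerate a torn final row from a crashed run
    }
    const key = `${row.model}/${row.variant}`;
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(row.tokens);
  }

  const out = {};
  for (const [key, xs] of groups) {
    const n = xs.length;
    const mean = xs.reduce((a, b) => a + b, 0) / n;
    // Sample standard deviation; NaN when there is only one run.
    const sd = n > 1
      ? Math.sqrt(xs.reduce((a, x) => a + (x - mean) ** 2, 0) / (n - 1))
      : NaN;
    out[key] = { n, mean, cv: sd / mean, min: Math.min(...xs), max: Math.max(...xs) };
  }
  return out;
}
```

Skipping unparseable lines is what makes a partially written file readable: a run that dies mid-row costs one row, not the whole file.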
4. Finding 1 — behavior is near-ceiling; the note is *safe*
In the synthetic harness the models rarely reach for a redundant build even with no note — gpt-4o edited-and-stopped, gpt-4.1 and gpt-5.2 issued no dev/serve/build on the visual task. So the headline behavior metric is "barely better," and we say so. What the note demonstrably does not do is break the legitimate case: on the typecheck control, every note variant still ran a real tsc/typecheck. The note is free insurance for the instruction-following models, and the live anecdote is the tail the synthetic harness under-samples.
5. Finding 2 — on reasoning models, the note is a token-and-variance lever
The effect that moved is cost, not behavior. gpt-5.2 (interleaved run, partial, n=5–6 per cell):
| variant | mean tokens/task | CV | range |
|---|---|---|---|
| none | 9,101 | 75% | 3,291 – 20,256 |
| note | 6,213 | 22% | 4,464 – 7,634 |
−32% mean, and the variance collapses — note-on runs land in a tight band instead of swinging 6×. (An earlier n=2 probe showed −63%; the direction is robust, the magnitude is n-dependent.) The mechanism: without the note the reasoning model wanders before acting; with it, it goes straight to the edit. The note costs ~150 input tokens/turn and pays it back several times over.
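The payback claim can be checked with the figures already given. The sketch below takes the two means from the table, the ~150 tokens/turn note cost, and the harness's 5-turn cap as an upper bound on how many times the note is paid for. Whether `usage.total` already includes the note's input tokens is not stated here; if it does, the measured saving is net of the overhead and this ratio understates the payback.

```javascript
// Worked arithmetic from the reported figures; variable names are illustrative.
const meanNone = 9101;        // mean tokens/task without the note
const meanNote = 6213;        // mean tokens/task with the note
const noteCostPerTurn = 150;  // approximate input tokens the note adds per turn
const maxTurns = 5;           // harness turn cap, used as a worst case

const saving = meanNone - meanNote;                    // 2888 tokens per task
const reductionPct = (saving / meanNone) * 100;        // about 31.7
const worstCaseOverhead = noteCostPerTurn * maxTurns;  // 750 tokens per task
const paybackRatio = saving / worstCaseOverhead;       // about 3.85

console.log({ saving, reductionPct, worstCaseOverhead, paybackRatio });
```

Even charging the note for all five turns, the mean saving is a bit under four times the overhead, which is consistent with "several times over."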
6. The noise — what gpt-5.5 actually did
gpt-5.5 refused to firm up: CV 50%, a 6.4× range (4,091 – 26,358 tokens) on the same task. Rather than blame the model, we read the logged commands. The pattern is unmistakable — it uses the terminal as its primary interface:
- reads files through the shell: `sed -n '1,240p' src/styles.css | tail -n 80`, `cat package.json && find src -maxdepth 3 …`
- searches through the shell: `grep -n -A12 -B6 -i "get in touch|primary|button|outline"`
- re-runs the same diagnostic: `npm run typecheck | npm run typecheck -- --pretty false`, `npm run build | npm run build > /tmp/…`
- chains giant exploratory pipes: `npm run typecheck | npm run | find src -maxdepth 3 …`
Every command dumps output back into context that it then reasons over, and how deep it decides to dig is the variance. The noise is over-investigation, not randomness — therefore steerable.
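The forensics above were done by reading logs; a first-pass automation might look like the sketch below. The explorer list, the `npm run <script>` normalization, and the function names are assumptions, and the splitter is deliberately naive: it does not understand quoting, so a `|` inside a quoted `grep` pattern is treated as a pipe.

```javascript
// Hypothetical forensics sketch: flag shell exploration and repeated diagnostics.
const SHELL_EXPLORERS = new Set(["cat", "sed", "head", "tail", "find", "grep", "ls"]);

// Split a logged line on ||, &&, |, and ; and return the program of each stage.
function programsIn(commandLine) {
  return commandLine
    .split(/\|\||&&|[|;]/)
    .map((stage) => stage.trim().split(/\s+/)[0])
    .filter(Boolean);
}

// Treat `npm run X` with any trailing flags or redirects as the same diagnostic.
function diagnosticKey(commandLine) {
  const cmd = commandLine.trim();
  const m = cmd.match(/^npm\s+run\s+(\S+)/);
  return m ? `npm run ${m[1]}` : cmd;
}

function auditCommands(commandLines) {
  const seen = new Map();
  let explorationCalls = 0;
  for (const line of commandLines) {
    if (programsIn(line).some((p) => SHELL_EXPLORERS.has(p))) explorationCalls++;
    const key = diagnosticKey(line);
    seen.set(key, (seen.get(key) || 0) + 1);
  }
  const repeated = [...seen].filter(([, count]) => count > 1).map(([cmd]) => cmd);
  return { explorationCalls, repeated };
}
```

For example, `auditCommands(["cat package.json && find src -maxdepth 3", "npm run typecheck", "npm run typecheck -- --pretty false"])` counts one exploration call and reports `npm run typecheck` as repeated.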
7. Intervention v2 — the directness directive
WORK DIRECTLY (this saves you turns and tokens): to read, search, or list code, use your read_file / search / list_files tools — never the terminal. Do not run cat, sed, head, tail, find, grep, or ls to explore; you already have direct tools that are faster and do not spend a turn. Run any shell command at most once — never pipe or repeat the same command, and do not chain diagnostic commands to "investigate". Open only the file you need, make the change, and stop.
Tested as the `direct` variant before shipping.
8. Finding 3 — directness tames the noisy model
gpt-5.5, none vs direct (direct-test, completed, n=12 = 6 steps × 2 scenarios):
| variant | mean tokens/task | CV | worst case |
|---|---|---|---|
| none | 17,586 | 49% | 34,154 |
| direct | 10,865 | 44% | 18,159 |
−38% mean, CV 49%→44%, and the worst case drops 47% (34,154→18,159) — the directive's biggest effect is killing the catastrophic spelunking runs that pushed none past 34k tokens. gpt-5.5 remains higher-variance than gpt-5.2 (its nature), but the directive measurably pulls it off the shell-spelunking and toward the actual change. This is the result that earns the directive a home in EASY_QUALITY_PROMPT (live for every easy-mode model, not just the noisy one).
9. From per-task to per-org — an illustrative scenario, NOT a model
The per-task win is a rounding error per person; the thesis is scale. Consider a separate, fully imagined lever: auto-saving every AI-generated shell script to a shared, deduped, reusable library so the agent reuses instead of regenerating. If you grant the assumptions (100k people · 3 script-asks/week · ~5,000 tokens to generate · ~50 to reuse · reuse-rate climbing a logistic toward 70%), the arithmetic yields ~16.8B tokens saved over 26 weeks (~$84k at ~$5/M; ~168k tokens / ~$0.84 per person). None of those assumptions is measured in Yee: there is no script library yet, and the 70% reuse-rate is invented. The result is most sensitive to that reuse-rate logistic: halve it and the headline halves. Treat this section as a back-of-envelope scenario that motivates building the library (and then measuring it, per §6 of future.md), not as a forecast. The qualitative point is the durable one: scripts that used to vanish become a cohesive, searchable asset instead of every person reinventing deploy.sh.
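The shape of that arithmetic can be sketched as follows. The population, ask rate, token costs, 26 weeks, and 70% ceiling come from the scenario; `midpointWeek` and `steepness` are invented here because the logistic's shape is not given, so this sketch is not expected to reproduce the 16.8B figure. What it does show is the structure of the sum and why the result scales linearly with the reuse ceiling.

```javascript
// Back-of-envelope sketch. `midpointWeek` and `steepness` are hypothetical.
function projectSavings({
  people = 100_000,
  asksPerWeek = 3,
  weeks = 26,
  generateTokens = 5000,
  reuseTokens = 50,
  reuseCeiling = 0.7,
  midpointWeek = 8,  // hypothetical: week at which reuse reaches half the ceiling
  steepness = 0.5,   // hypothetical: how fast reuse climbs
} = {}) {
  let saved = 0;
  for (let week = 1; week <= weeks; week++) {
    const reuseRate = reuseCeiling / (1 + Math.exp(-steepness * (week - midpointWeek)));
    saved += people * asksPerWeek * reuseRate * (generateTokens - reuseTokens);
  }
  return saved;
}

const asks = 100_000 * 3 * 26;                // 7.8M script-asks over 26 weeks
const flatCeiling = asks * 0.7 * (5000 - 50); // ~27.0B if reuse were 70% from week 1
const impliedAverageReuse = 16.8e9 / (asks * (5000 - 50)); // ≈ 0.435
const dollars = (16.8e9 / 1e6) * 5;           // 84,000 at $5 per million tokens
```

Two things fall out of the stated numbers alone. A flat 70% reuse rate from the first week would cap savings near 27.0B tokens, and the 16.8B headline corresponds to an average reuse rate of roughly 43.5% across the 26 weeks. Since every weekly term is multiplied by `reuseCeiling`, halving the ceiling halves the total exactly.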
10. The recursive thesis
The measured pieces (§5, §8) and the illustrative scenario (§9) point the same way: the editor's ecosystem makes the frontier model inside it cheaper and steadier at the same outcome. A code editor that does that is a code editor that ships better code editors, and it gets cheaper and more cohesive the more people use it.
11. Limitations
- Small `n`. Every measured result is directional. The harness is built to crank `n`; we ran 2–12.
- Synthetic environment. Faked tool results under-sample the messy real repos where the live `npm run build` anecdote actually bites.
- Variance dominates on heavy reasoners. gpt-5.5's CV stays ~44% even with the directive; conclusions about it need higher `n` or a `reasoning_effort`-pinned scenario.
- §9 is an illustrative scenario, not a model. No figure there is measured; the result hinges on an invented reuse-rate.
- The directness directive risks being too rigid. "Run any shell command at most once — never repeat" (§7) targets spelunking, but real deploy-debugging legitimately chains and repeats commands (tail a log → fix → re-run; grep → act). The synthetic bench under-samples this (faked tools, single-edit tasks), so the "at most once" wording is a known over-reach to revisit — the goal is killing aimless investigation, not methodical iteration.
- Scope. All of this is easy-mode-only; the local-LLM ("Hard") route does not yet carry these prompts or the interceptor.
12. Reproduce it
- Token + behavior A/B: `node bench/preview_awareness_eval.js --models openai:gpt-5.2,openai:gpt-5.5 --variants none,current,direct --n 10 --jsonl out.jsonl`
- Command policy unit matrix: `node bench/dev-policy-test.js` (32/32)
- Per-run rows land in the JSONL live; nothing is lost on a crash.
13. Lab log
- E1 (2026-06-29, n=2): first token probe. gpt-5.2 −63%, gpt-5.5 −38%. Large effect, tiny `n`.
- E2 (2026-06-29, n=10 interleaved, partial): gpt-5.2 note −32%, CV 75%→22% (the variance-collapse result). gpt-5.5 too noisy to separate at low `n` → paused deliberately.
- E3 (2026-06-29, n=12, direct-test, completed): root-caused gpt-5.5's noise to shell-spelunking via log forensics; the directness directive cut it −38% mean, CV 49%→44%, worst case −47% (34,154→18,159). Shipped into `EASY_QUALITY_PROMPT`.
14. Future work
- Re-measure all models at higher `n` now that `WORK DIRECTLY` is in `EASY_QUALITY_PROMPT`.
- A `reasoning_effort`-pinned scenario to make gpt-5.5 measurable, then a clean n=20+ verdict.
- Extend the awareness + directness prompts to the local-LLM route.
- Build the §9 shared-script library and replace the projection with a measurement.