Using Expensive + Cheap LLMs Together: A Synthetic Data Architecture

The Core Idea

This architecture exploits a simple cost asymmetry: expensive models are good at design and verification, cheap models are good at bulk execution. You use the expensive one to build the scaffolding, the cheap one to do the repetitive work, then the expensive one again for a final quality pass rendered as an interactive HTML viewer you can inspect by eye.

The pipeline has four phases:

Scaffold: the expensive model designs your dataset schema, validation rules, and output format
Generate: the cheap model produces the bulk data at a fraction of the cost
Inspect: the expensive model builds a single-shot HTML viewer, and you eyeball the rows
Repair: you copy the bad rows and paste them back into Codex for a bulk fix

Models are plug and play. This piece uses GPT-5.5 and Opus 4.7 as the expensive tier, and GLM, Kimi 2.6, and DeepSeek V4 Flash as the cheap tier. Swap them freely as pricing and capability shift.

Why JSON and XML Fail for Human Review

Structured formats are machine-native, not human-native. A 5000-row JSON dataset is unreadable. XML is verbose and collapses under its own angle brackets. Both formats require mental parsing, where you have to hold the structure in your head while scanning for anomalies.

HTML solves this because it maps directly to visual layout. Tables, color coding, conditional formatting, and collapsible sections are all things the human visual system processes in parallel. As Andrej Karpathy put it on May 11, 2026, asking your LLM to “structure your response as HTML” and viewing the result in a browser just works. Roughly a third of the brain is devoted to vision, which makes HTML the natural output format for anything a human needs to judge.

Phase 1: Scaffold With the Expensive Model

Prompt the expensive model to design everything up front. This is the one phase where the per-token price barely matters, because the output is small:

Dataset schema: column names, types, constraints, allowed values
Synthetic generation rules: distributions, edge cases, realistic noise patterns
Validation rules: what makes a row good vs. bad
HTML viewer template: the exact rendering logic for the inspection pass

The expensive model will produce a precise, well-structured spec. Keep it as a reference artifact, as you’ll feed it to the cheap model in Phase 2.

A typical scaffold prompt looks like:

Design a synthetic dataset of 500 customer support tickets with the following
schema: ticket_id, category, priority, customer_sentiment, resolution_time_hours,
agent_response_quality, contains_personal_data. Include realistic distributions,
edge cases, and a set of validation rules. Output as a structured spec I can
feed to another LLM for generation.
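
The validation rules in the spec can also be kept as executable checks, so every later phase agrees on what "bad" means. Below is a minimal sketch, assuming the ticket schema from the prompt above. The allowed categories and priorities, and the name validate_row, are illustrative assumptions rather than anything the scaffold model is guaranteed to produce.

```python
# Allowed values are illustrative assumptions, not taken from a real spec.
ALLOWED_CATEGORIES = {"billing", "technical", "account", "shipping"}
ALLOWED_PRIORITIES = {"low", "medium", "high", "urgent"}
REQUIRED_FIELDS = (
    "ticket_id", "category", "priority", "customer_sentiment",
    "resolution_time_hours", "agent_response_quality", "contains_personal_data",
)


def validate_row(row: dict) -> list[str]:
    """Return a list of problems found in one row; an empty list means it passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        if row.get(field) is None:
            problems.append(f"missing field: {field}")

    category = row.get("category")
    if category is not None and category not in ALLOWED_CATEGORIES:
        problems.append(f"unknown category: {category}")

    priority = row.get("priority")
    if priority is not None and priority not in ALLOWED_PRIORITIES:
        problems.append(f"unknown priority: {priority}")

    hours = row.get("resolution_time_hours")
    if hours is not None:
        is_number = isinstance(hours, (int, float)) and not isinstance(hours, bool)
        if not is_number or hours < 0:
            problems.append("impossible resolution_time_hours")

    flag = row.get("contains_personal_data")
    if flag is not None and not isinstance(flag, bool):
        problems.append("contains_personal_data must be true or false")
    return problems
```

Returning a list of messages, rather than a single pass/fail flag, lets the same function drive both the red-flagging in the viewer and the later repair report.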

Phase 2: Bulk Generation With the Cheap Model

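Feed the scaffold spec to a cheap model and let it generate the bulk dataset. This is where the savings come from: most of the pipeline's tokens are produced here, so running them on a cheap model can save 90% or more of the budget compared with using the expensive model throughout.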

Model options for this tier (as of mid-2026):

Model	Cost (per 1M tokens)	Strengths
GLM-4 Flash	~$0.01	Solid general-purpose, handles structured output well
Kimi 2.6 Flash	~$0.02	Strong instruction following, good at constrained generation
DeepSeek V4 Flash	~$0.01	Fast, reliable, handles tabular data generation cleanly
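
The savings claim is simple arithmetic and worth checking against your own prices. The sketch below uses the approximate cheap-tier price from the table. The expensive-tier price and the token count are hypothetical placeholders, not quoted figures, so substitute real numbers before drawing conclusions.

```python
def cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Cost of producing `tokens` tokens at a given price per 1M tokens."""
    return tokens / 1_000_000 * price_per_million_usd


def savings_fraction(cheap_price: float, expensive_price: float) -> float:
    """Share of bulk-generation spend avoided by using the cheap model."""
    return 1 - cheap_price / expensive_price


BULK_TOKENS = 2_000_000   # hypothetical size of the bulk output
CHEAP_PRICE = 0.01        # approximate value from the table above
EXPENSIVE_PRICE = 10.00   # hypothetical placeholder, NOT a quoted price

print(f"cheap tier:     ${cost_usd(BULK_TOKENS, CHEAP_PRICE):.2f}")
print(f"expensive tier: ${cost_usd(BULK_TOKENS, EXPENSIVE_PRICE):.2f}")
print(f"savings:        {savings_fraction(CHEAP_PRICE, EXPENSIVE_PRICE):.1%}")
```

For the same token count, the savings pass 90% whenever the expensive price is more than ten times the cheap price.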

The prompt for the cheap model should include:

The schema and constraints from Phase 1
The exact output format (JSON array, one object per row)
An instruction to intentionally include 5–10% “bad” rows (duplicates, missing fields, impossible values, format violations) so the inspection phase has something real to find

This intentional error injection is critical. Perfect synthetic data teaches you nothing about your pipeline’s failure modes.
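
The three prompt ingredients listed above can be assembled programmatically, so the spec and the error-injection rate stay in one place. This is a rough sketch. The function name build_generation_prompt, the default rate of 8%, and the exact wording are assumptions for illustration.

```python
def build_generation_prompt(spec: str, n_rows: int, bad_fraction: float = 0.08) -> str:
    """Assemble the cheap-model prompt: spec, output format, error injection."""
    if not 0 < bad_fraction < 1:
        raise ValueError("bad_fraction must be between 0 and 1")
    n_bad = round(n_rows * bad_fraction)
    return "\n\n".join([
        "Generate synthetic data that follows this spec exactly:",
        spec.strip(),
        f"Output a JSON array of {n_rows} objects, one object per row, and nothing else.",
        f"Make about {n_bad} of the {n_rows} rows intentionally bad: duplicates, "
        "missing fields, impossible values, or format violations.",
    ])
```

Stating the number of bad rows as a count rather than a percentage gives you a concrete figure to compare against what the inspection pass actually finds.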

Phase 3: HTML Viewer for Manual Inspection

Now send a sample of the generated data back to the expensive model with this instruction:

Take this dataset and build a single HTML file that renders it as an
interactive, visually scannable table. Red-flag bad rows with background
color. Allow collapsing/expanding row detail. Include summary statistics
at the top. One-shot it — no dependencies, inline CSS and JS only.

The expensive model produces a self-contained .html file you open in any browser. No framework, no build step, no server. Just double-click and look.

What you get:

Visual scanning: color-coded rows, sortable columns, highlighted anomalies
Context preservation: each row shows full detail on expand, not just the summary
Judgment surface: your eyes catch things automated validators miss, such as tone inconsistencies, semantically wrong values, and “looks right but isn’t” cases

The HTML viewer is the bridge between machine generation and human judgment. Without it, you’re staring at raw data.
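
To make the target concrete, here is a stripped-down sketch of the kind of file the expensive model is being asked for, written by hand in Python. It is an illustration, not the model's actual output. The name render_viewer and the is_bad callback are hypothetical, and sorting is left out to keep it short. Row detail collapses through the standard HTML details element, so no JavaScript is needed.

```python
import html
import json


def render_viewer(rows: list[dict], is_bad) -> str:
    """Build a self-contained HTML page; rows where is_bad(row) is true turn red."""
    # Collect column names in first-seen order across all rows.
    columns = list(dict.fromkeys(key for row in rows for key in row))
    bad_count = sum(1 for row in rows if is_bad(row))

    header = "".join(f"<th>{html.escape(str(col))}</th>" for col in columns)
    body_rows = []
    for row in rows:
        css = ' class="bad"' if is_bad(row) else ""
        cells = "".join(
            "<td>" + html.escape(str(row.get(col, ""))) + "</td>" for col in columns
        )
        # Full row detail, hidden until the reader expands it.
        detail = html.escape(json.dumps(row, indent=2, ensure_ascii=False))
        body_rows.append(
            f"<tr{css}>{cells}<td><details><summary>detail</summary>"
            f"<pre>{detail}</pre></details></td></tr>"
        )

    table_body = "".join(body_rows)
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        "<style>table{border-collapse:collapse}"
        "td,th{border:1px solid #ccc;padding:4px}"
        "tr.bad{background:#fdd}</style></head><body>"
        f"<p>{len(rows)} rows, {bad_count} flagged</p>"
        f"<table><tr>{header}<th>detail</th></tr>{table_body}</table>"
        "</body></html>"
    )
```

Writing the returned string to a file such as viewer.html and double-clicking it is enough to inspect the data. Every value passes through html.escape, so a generated row containing markup cannot break the page.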

Phase 4: Copy, Paste, Fix

Once you’ve identified bad rows in the HTML viewer:

Select and copy the problematic rows (the HTML table makes this trivial — visual selection, not grep)
Paste them back into Codex (or any capable code environment)
Ask for a full corpus scan: “Here are examples of bad rows. Scan the entire dataset and fix all similar issues.”

Codex excels at this pattern because it can:

Parse the JSON programmatically
Identify all instances of each bug class (not just the ones you spotted)
Apply fixes consistently across the full dataset
Output a clean version and a diff report

This loop is the key insight: human inspection finds the pattern, automation fixes it everywhere.
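
The script Codex writes for this step will depend on the bug classes you hand it. As a rough sketch of its shape, the function below handles three hypothetical classes on the ticket schema: missing or duplicate ticket_id, impossible resolution_time_hours, and inconsistent priority formatting. The policy of dropping unfixable rows and normalizing fixable ones is an assumption. Your dataset may call for a different one.

```python
def repair_dataset(rows: list[dict]) -> tuple[list[dict], list[tuple[int, str]]]:
    """Scan every row, fix or drop known bug classes, and report each change."""
    seen_ids = set()
    clean, report = [], []
    for index, original in enumerate(rows):
        row = dict(original)  # shallow copy so the input stays untouched

        ticket_id = row.get("ticket_id")
        if ticket_id is None or ticket_id in seen_ids:
            report.append((index, "dropped: missing or duplicate ticket_id"))
            continue

        hours = row.get("resolution_time_hours")
        is_number = isinstance(hours, (int, float)) and not isinstance(hours, bool)
        if not is_number or hours < 0:
            report.append((index, "dropped: missing or impossible resolution_time_hours"))
            continue

        priority = row.get("priority")
        if isinstance(priority, str) and priority != priority.strip().lower():
            row["priority"] = priority.strip().lower()
            report.append((index, "fixed: normalized priority"))

        seen_ids.add(ticket_id)
        clean.append(row)
    return clean, report
```

The report list plays the role of the diff report: each entry names the original row index and what happened to it, so you can spot-check the fixes in the viewer afterward.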

The Broader Pattern

This isn’t just about synthetic data. It’s a general architecture for any pipeline where:

An expensive model does the hard design thinking (small output, high quality)
A cheap model does the bulk execution (large output, acceptable quality)
The expensive model returns for verification in a human-readable format
A developer or QA agent closes the loop

The expensive model is the architect and inspector. The cheap model is the workforce. HTML is the interface between them and you.
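
Stripped of the specific models, the pattern is a short chain of five steps. The sketch below expresses it with plain callables, so any model or tool can be plugged into each slot. All parameter names are hypothetical, and in practice the review step is a person looking at a browser rather than a function.

```python
from typing import Any, Callable


def run_pipeline(
    design: Callable[[], Any],            # expensive model: produce the spec
    generate: Callable[[Any], list],      # cheap model: spec -> bulk rows
    render: Callable[[list], str],        # expensive model: rows -> HTML viewer
    review: Callable[[str], list],        # human or QA agent: viewer -> bad examples
    repair: Callable[[list, list], list], # automation: rows + examples -> clean rows
) -> list:
    """Run design, bulk generation, rendering, review, and repair in order."""
    spec = design()
    rows = generate(spec)
    page = render(rows)
    bad_examples = review(page)
    return repair(rows, bad_examples)
```

Keeping each stage behind a callable is what makes the model layer swappable: changing the cheap model means passing a different generate function, and nothing else moves.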

Why This Works in 2026

The input/output progression Karpathy described is real and accelerating:

Raw text was hard and effortful
Markdown made it scannable
HTML gives us layout, color, interactivity — a genuine perceptual upgrade
(Future) Interactive neural simulations

Right now, asking an LLM to “structure as HTML” is a practical, immediate improvement over reading raw model output. It costs nothing extra and leverages the visual processing hardware already sitting in your skull.

The model layer is plug and play. Frontier models get better and cheaper every quarter. The architecture stays the same: expensive for design, cheap for bulk, HTML for the human in the loop.

References

Andrej Karpathy, May 11, 2026: “structure your response as HTML, then view the generated file in your browser” x.com/karpathy
Original corpus inspection pattern: x.com/trq212
OpenAI Codex: openai.com/codex
GLM-4: glm.ai
Kimi 2.6: moonshot.ai
DeepSeek V4: deepseek.ai

Crepi il lupo! 🐺