Is prompt engineering more leveraged than the harness, long-term?
As base models keep getting better at following short, loose prompts, does the long-term leverage shift back to prompt-craft, or does it stay with the harness around the agent (typed adapters, file-driven planning, adversarial review, test scaffolds)? For code I expect to maintain in a year or three, where’s the compounding return?
The tension
Prompt techniques are getting cheaper, stronger base models need less prompting. Harness work stays expensive to build and to keep current. If model capability keeps climbing fast enough, a lot of scaffolding could look like over-engineering in three years. On the other hand, prompts go stale every time a model changes; typed adapters, file-driven phases, and adversarial review loops don’t.
My current lean
Harness over prompt. The decisions and playbooks already on this site embody that lean:
- Why I keep LLMs behind typed adapters, structural, outlives prompt drift.
- A file-driven planning framework for AI-assisted coding, each phase reads and writes a file, crash-resumable.
- Adversarial review via a second AI model, a second model with different blindspots, priced in as a step.
The bet is that harness pays off more on maintenance than prompt-craft does, because maintenance is when the model, the codebase, and the team have all changed.
What would change my mind
Experienced practitioners shipping and maintaining long-running agent systems, Karpathy and the small group of people doing this work in public, landing on “the harness was over-engineering; prompt discipline would have carried it.” If the consensus among that group shifts toward prompt-first for long-lived code, revisit.
Status
Open. Not holding the lean tightly.