OpenAI's Symphony project shipped with a bold claim: give your favorite coding agent a specification document and it will build a working agent orchestrator in any language. The README invited exactly that experiment. Gabriella Gonzalez, a Haskell engineer and author of the popular Haskell for all blog, took them up on it. She pointed Claude Code at the SPEC.md file, asked it to build Symphony in Haskell, and documented every result.
The implementation did not work. Multiple bugs required manual prompting to fix. Even after those fixes got the code compiling cleanly, the orchestrator spun silently on a trivial test ticket — "create a blank git repository" — without making progress.
That outcome alone would be a useful data point. But Gonzalez went further and examined why it failed, and her findings turn the entire "specs replace code" argument inside out.
SPEC.md reads like the code it was supposed to replace
Gonzalez methodically catalogues what Symphony's specification actually contains. It includes prose dumps of database schemas (field names, types, nullability). It includes backoff formulas written in pseudocode (delay = min(10000 * 2^(attempt - 1), agent.max_retry_backoff_ms)). It includes a section explicitly labeled "Cheat Sheet" that the document admits is "intentionally redundant so a coding agent can implement the config layer quickly." And it includes a full reference implementation section with function signatures, state dictionaries, and control flow written in a language-agnostic pseudocode that is, functionally, code.
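That backoff pseudocode transcribes almost verbatim into real code, which is the point: the "spec" is already an implementation. A minimal Python sketch (the 60-second default cap is an illustrative assumption; in Symphony the cap comes from per-agent configuration):

```python
def retry_delay_ms(attempt: int, max_retry_backoff_ms: int = 60_000) -> int:
    """The spec's formula, verbatim:
    delay = min(10000 * 2^(attempt - 1), agent.max_retry_backoff_ms)."""
    return min(10_000 * 2 ** (attempt - 1), max_retry_backoff_ms)

# First attempt waits 10 s, doubling until the per-agent cap kicks in.
print([retry_delay_ms(a) for a in range(1, 5)])  # [10000, 20000, 40000, 60000]
```

The translation required no design decisions at all — every constant, operator, and field name was already fixed by the "specification."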
The SPEC.md file clocks in at roughly one-sixth the length of Symphony's included Elixir implementation. That ratio matters. The promise of spec-driven development is that specifications are simpler than code — that writing a spec is a cheaper, higher-leverage activity than writing the implementation. A spec that is already one-sixth the size of the codebase and still growing toward it has not escaped the complexity; it has redistributed it into a format with less tooling, no compiler, and no test suite.
Gonzalez invokes Dijkstra's 1979 observation on the fantasy of communicating with machines in natural language: "We have to challenge the assumption that this would simplify man's life." Greek mathematics stalled because it stayed verbal. Modern mathematics only accelerated when it embraced formal symbolism. The parallel is direct — you cannot make a specification precise enough to generate reliable code without the specification converging on code.
The YAML test: even mature specs produce non-conforming implementations
One of the strongest pieces of evidence in the post has nothing to do with Symphony. Gonzalez points to the YAML specification — one of the most detailed, widely-used, community-tested specs in software — and notes that the vast majority of YAML implementations still do not fully conform to it. YAML has a formal conformance test suite. It has decades of iteration. Implementations still diverge.
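A concrete instance of that divergence is the well-known "Norway problem": YAML 1.1 resolves bare scalars such as no and off to booleans, while the YAML 1.2 core schema resolves only true and false, so two reasonable parsers disagree about the same document. A simplified sketch of the two resolution rules (the real tag-resolution logic handles case variants and many more types):

```python
# YAML 1.1's plain-scalar booleans (lowercased subset of the full list)
YAML_11_BOOLS = {"y", "yes", "n", "no", "true", "false", "on", "off"}
# The YAML 1.2 core schema recognizes only these
YAML_12_BOOLS = {"true", "false"}

def resolve(scalar: str, bools: set) -> object:
    """Resolve a bare scalar to a bool, or leave it as a string."""
    if scalar.lower() in bools:
        return scalar.lower() in {"y", "yes", "true", "on"}
    return scalar

# The same document, two answers: the value in `country: no`
print(resolve("no", YAML_11_BOOLS))  # False (YAML 1.1 reads a boolean)
print(resolve("no", YAML_12_BOOLS))  # 'no'  (YAML 1.2 reads a string)
```

Both behaviors can be defended from a spec; neither parser is obviously "wrong." That is what incomplete conformance looks like in practice.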
If a specification that mature and that heavily scrutinized cannot reliably produce conforming implementations, expecting a markdown document in a GitHub repo to do so is unreasonable on its face. The Symphony SPEC.md has no conformance test suite, no formal grammar, and sections that Gonzalez describes as reading like "an agent's work product: lacking coherence, purpose, or understanding of the bigger picture."
Where this breaks for teams buying agentic coding tools
A solo developer or an agency owner evaluating agentic coding tools in 2026 is hearing two pitches simultaneously. Pitch one: agents will write your code, and you manage them by writing specs. Pitch two: agentic coding makes developers more productive by handling implementation details while humans focus on architecture.
Gonzalez's analysis collapses pitch one. A specification document detailed enough to produce working code is roughly as expensive to write as the code, requires the same engineering judgment to get right, and offers no compilation, no type checking, and no automated testing to catch mistakes along the way. The Symphony experiment demonstrates a spec that was detailed and intentional — and still could not generate a working result.
Pitch two survives, but with a critical constraint that most tool marketing glosses over: the human still needs to read and evaluate the code. Dex Horthy, an engineer who builds developer tools, summarized the tension on X in response to Gonzalez's post: "A spec that is sufficiently detailed to generate code with a reliable degree of quality is roughly the same length and detail as the code itself. So don't review those things, just review the code at that point." His conclusion — find a way to steer the model before it produces thousands of lines, not after — captures the gap in current agentic tooling that no spec format has solved.
The procurement question nobody is asking
If your team is evaluating an agentic coding platform that markets spec-to-code as a core workflow, ask for the conformance data. What percentage of generated implementations pass the project's own test suite on the first run? What is the average number of human correction cycles before the output is production-ready? How does spec length scale relative to codebase size as project complexity increases?
Those questions will almost certainly not have answers yet, because the industry has not standardized how to measure spec-driven generation quality. That absence is itself informative. The agentic coding space is selling a workflow — specs in, working software out — without the instrumentation to verify whether the workflow actually works at the scale customers care about.
Gonzalez's closing line applies beyond Symphony: "Specifications were never meant to be time-saving devices." For anyone budgeting engineering hours around the assumption that specs will shrink them, that sentence is worth taping to the monitor.
