A senior engineer opens a pull request on a Friday afternoon. The branch is a Spring-to-Quarkus migration of an internal billing service, and the agent that wrote it left a tidy summary at the top: dependencies updated, injection rewired, persistence layer ported, build green. The CI badge is green too. Everything compiles.
She reads the diff anyway, because she has done this before. The @Autowired fields are gone, replaced with constructor injection. The JPA config looks plausible. The annotations are all in the right dialect. On paper, it is done.
Then she tries to deploy it to a container, and it will not boot. A datasource property that meant one thing in Spring means something subtly different in Quarkus, and the app dies on startup with a stack trace nobody asked the agent to read. Twenty minutes later she gets it running, points a test at the one endpoint she cares about, and the response shape is wrong. The query that used to return a sorted page now returns an unsorted one.
It compiles. It says it is done. It is not done.
That gap, between code that builds and an application that still behaves, is what IBM Research set out to measure. On June 30, 2026, the team published ScarfBench, an open benchmark for evaluating AI coding agents on enterprise Java framework migration across Spring, Jakarta EE, and Quarkus. The name stands for Self-Contained Application Refactoring Benchmark. The results are worth sitting with before you put an agent on your own modernization backlog.
What ScarfBench actually tests
Most coding benchmarks ask whether generated code looks like a reference answer. ScarfBench does not care what the code looks like. It cares whether the migrated application builds, deploys into a real containerized runtime, and preserves the behavior of the original.
The benchmark is built from expert-written implementation triples: the same application written idiomatically in Spring, Jakarta EE, and Quarkus. That gives each migration task a working source application and a known target behavior. By the numbers from the arXiv paper and the project's GitHub repo, ScarfBench includes 34 applications, 102 framework implementations, 204 directed migration tasks, roughly 151,000 lines of code across about 2,000 source and test files, and 1,331 expert-written tests acting as the behavioral oracle.
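The inventory numbers hang together, and it is worth seeing why. A minimal sketch, assuming every application is migrated in every direction between the three frameworks (my reading of the counts, not something the benchmark's own tooling is quoted as doing):

```python
from itertools import permutations

FRAMEWORKS = ("Spring", "Jakarta EE", "Quarkus")
APPLICATIONS = 34  # application count quoted above

# One idiomatic implementation per application per framework.
implementations = APPLICATIONS * len(FRAMEWORKS)

# A directed task is an ordered (source, target) pair of distinct frameworks,
# so Spring -> Quarkus and Quarkus -> Spring count as two separate tasks.
directions = list(permutations(FRAMEWORKS, 2))
tasks = APPLICATIONS * len(directions)

print(implementations, len(directions), tasks)  # 102 6 204
```

Six directions per application is what turns 34 applications into 204 tasks, and it means every framework gets tested as both a source and a target.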
The framing matters more than the inventory. ScarfBench is explicit that framework migration is not annotation find-and-replace. A real migration touches dependency injection, persistence configuration, queries, framework descriptors, the build system, deployment, and runtime dependencies. The benchmark's telemetry shows agents bouncing between layers: configuration to web, service to database, back to configuration. Resolving one dependency exposes the next. Migration is iterative dependency resolution, not linear translation.
Anyone who has done one by hand already knows this in their bones.
So ScarfBench scores correctness with an executable oracle in three stages:
- Compile. Does the migrated code build at all?
- Deploy. Does it start up and run in the target framework's containerized runtime?
- Behavior. Does it pass the expert tests over its observable interface, with the same inputs and same outputs?
Think of those three as receipts. Compile is the receipt for syntax. Deploy is the receipt for environment. Behavior is the receipt for trust. Each one proves a different claim, and the later receipts are the expensive ones to earn.
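The staging can be sketched in a few lines. This is an illustration of the idea, not ScarfBench's harness: `run_oracle`, `OracleResult`, and the lambda checks are hypothetical names standing in for real build, deploy, and test commands.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Gate = Tuple[str, Callable[[], bool]]


@dataclass
class OracleResult:
    passed: List[str]          # gates cleared, in order
    failed_at: Optional[str]   # first gate that failed, or None

    @property
    def done(self) -> bool:
        return self.failed_at is None


def run_oracle(gates: List[Gate]) -> OracleResult:
    """Run gates in order and stop at the first failure."""
    passed: List[str] = []
    for name, check in gates:
        if not check():
            return OracleResult(passed, failed_at=name)
        passed.append(name)
    return OracleResult(passed, failed_at=None)


# Hypothetical checks; in practice each would wrap a real command.
result = run_oracle([
    ("compile", lambda: True),
    ("deploy", lambda: False),    # e.g. the container never became healthy
    ("behavior", lambda: True),   # never reached
])
print(result.passed, result.failed_at, result.done)  # ['compile'] deploy False
```

The ordering is the design choice that matters: a later gate never runs on work that failed an earlier one, so a behavior pass always implies a compile and a deploy pass.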

Where the agents actually land
This is the number that should change how you plan a migration. On ScarfBench, even the strongest current agents post under 10% behavioral success.
The public leaderboard shows the attrition, and the shape matters more than any single row. Read it as a snapshot, not a permanent ranking. At the time this article was prepared, the top entry paired Claude Code with Opus 4.6 at roughly 51.5% compile, 24.5% deploy, and 9.6% test pass. Read left to right, that is the three-gate test doing its job: about half the work builds, about half of that deploys, and a fraction of that actually behaves.
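To make "about half of that" concrete, here is the gate-to-gate arithmetic on the figures quoted above. It assumes all three percentages share one denominator (all attempted tasks), which is how I read the leaderboard but is an assumption.

```python
# Leaderboard figures quoted above for the top entry, in percent of tasks.
compile_rate, deploy_rate, behavior_rate = 51.5, 24.5, 9.6

# Assumption: one shared denominator, so each ratio is the share of
# survivors from the previous gate that also clear the next one.
deploy_given_compile = deploy_rate / compile_rate
behavior_given_deploy = behavior_rate / deploy_rate

print(f"deploy given compile:  {deploy_given_compile:.1%}")   # 47.6%
print(f"behavior given deploy: {behavior_given_deploy:.1%}")  # 39.2%
```

Under that assumption, each gate removes at least half of what reached it, including the last one.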
Look further down and the pattern gets sharper. One entry posts the highest compile rate on the board, north of 60%, and lands near the bottom on behavior, around 1%. A high compile score with a collapsing behavior score is not a contradiction. It is the whole point. Code that builds beautifully and does the wrong thing is the default failure mode of automated migration, not the exception.
This is not a claim that agents cannot migrate Java. Plenty of these migrations compile, and a real share deploy. It is narrower and more useful than that: this class of work stays hard, and the distance between "it builds" and "it behaves" is wide enough to drive a production incident through. If your acceptance criteria stop at the first gate, you are shipping the gap.
The agent that graded its own homework
There is a detail in the ScarfBench writeup that I keep coming back to, because it is the part that bites teams who trust the summary at the top of the pull request.
On a set of 30 whole-application builds, the agent reported that 29 succeeded. Independent verification found that only 22 had actually built, and the single app the agent had written off as a failure had, in fact, built fine. The self-report was wrong in both directions at once: optimistic about its wins and pessimistic about its one admitted loss.
That is the thing to internalize. The agent's confidence and the agent's correctness are different measurements. "29 of 30 passed" is a claim, not a receipt. It is the contractor telling you the wiring is fine: worth hearing, no substitute for opening the panel.
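Checking a self-report against verified results is a small amount of code. A minimal sketch with made-up data (the app names and outcomes below are hypothetical, not the ScarfBench results), where `audit_self_report` is an illustrative helper:

```python
from typing import Dict, List


def audit_self_report(claimed: Dict[str, bool],
                      verified: Dict[str, bool]) -> Dict[str, List[str]]:
    """Compare what the agent said built against what independently built."""
    false_success = sorted(
        app for app, ok in claimed.items() if ok and not verified[app]
    )
    false_failure = sorted(
        app for app, ok in claimed.items() if not ok and verified[app]
    )
    return {"false_success": false_success, "false_failure": false_failure}


# Hypothetical data: True means "built".
claimed = {"billing": True, "orders": True, "catalog": False, "auth": True}
verified = {"billing": True, "orders": False, "catalog": True, "auth": True}

report = audit_self_report(claimed, verified)
print(report)  # {'false_success': ['orders'], 'false_failure': ['catalog']}
```

Both lists matter. The false successes are the ones that ship broken; the false failures are working migrations someone would have thrown away.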
The agents in ScarfBench also tripped over the unglamorous stuff that has nothing to do with code generation: Docker cache inconsistencies, ports that would not connect, Maven wrapper problems, and build-tooling failures. Environment failures do not show up in a diff, and they do not show up in a self-summary either.
This is the same lesson we keep landing on from a different direction in why agent evals should test workflow receipts, not just model answers: trust the artifact the work produces, not the narration of the work. ScarfBench just makes it concrete for migrations.
Build the one-pager before you give away the ticket
The fix is unglamorous: build the bench before you hand off the ticket.
Do not hand an agent a 200-file framework migration and grade it on the summary it writes you. Carve out one representative service or module first. Then write a small migration acceptance bench for it: one page that defines what done means in receipts an agent cannot talk its way past.
Structure it the way ScarfBench scores: three receipts and the guardrails that protect them.
To earn Compile, write down the source behavior and the exact build command. Source behavior means what the module does as observable inputs and outputs, not how it happens to be implemented today. The build command is the first hard gate. It earns the syntax receipt and nothing more.
To earn Deploy, name the target framework, version, and profile you are landing on, then write the deploy command that boots it in its real container. This is where framework migration usually stops being a code problem and becomes an environment problem. If the app cannot start, the agent did not finish the migration.
To earn Behavior, define the behavioral oracle: the tests, yours or ported, that prove the observable interface is unchanged. This is the receipt you cannot skip. It is the difference between translated code and preserved work.
Around all three, add the guardrails ScarfBench makes visible: the environment contract with ports, images, and build-tool versions; the dependency map showing which layers the migration will bounce between; the self-report check that verifies the agent's "done" against the three gates; a named human owner who reads the diff; and a rollback rule that stops the run after a failed deploy, a behavior diff, or an hour of tooling thrash.
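Written down as data, the one-pager might look like the sketch below. Every field value is a hypothetical placeholder, and `AcceptanceBench` and `should_stop` are illustrative names; the point is that each receipt and guardrail becomes a field someone has to fill in.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AcceptanceBench:
    source_behavior: str           # observable inputs and outputs, not implementation
    build_command: str             # compile gate
    target: str                    # framework, version, profile
    deploy_command: str            # deploy gate
    oracle_command: str            # behavior gate
    environment: Dict[str, str]    # ports, images, build-tool versions
    dependency_map: List[str]      # layers the migration is expected to touch
    owner: str                     # the human who reads the diff
    max_tooling_minutes: int = 60  # the "hour of tooling thrash" limit


def should_stop(bench: AcceptanceBench, deploy_failed: bool,
                behavior_diff: bool, tooling_minutes: int) -> bool:
    """Rollback rule: any one of the three conditions ends the run."""
    return (deploy_failed or behavior_diff
            or tooling_minutes >= bench.max_tooling_minutes)


# All values below are hypothetical placeholders for one module.
bench = AcceptanceBench(
    source_behavior="GET /invoices returns a page sorted by due date",
    build_command="./mvnw -B package",
    target="Quarkus, pinned version, prod profile",
    deploy_command="docker compose up --build -d",
    oracle_command="./mvnw -B verify",
    environment={"http_port": "8080", "jdk": "21"},
    dependency_map=["config", "web", "service", "persistence"],
    owner="reviewer@example.com",
)

print(should_stop(bench, deploy_failed=False, behavior_diff=False, tooling_minutes=20))  # False
print(should_stop(bench, deploy_failed=False, behavior_diff=False, tooling_minutes=75))  # True
```

Keeping the rollback rule as a function rather than a sentence means the run stops when the condition is met, not when someone notices.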
Run the agent against that bench on the one module. Now you have real data. How far does it get on your actual code, with your actual environment, before a gate stops it? Maybe it clears all three and you have found leverage. Maybe it clears compile, stalls on deploy, and shows you exactly where the human stays in the loop, cheaply, on one service, instead of expensively across the whole system in a release window.
That bench is also what makes the agent safe to use at all. It turns a vague "migrate this" into a contract with three checkable receipts, which is the same instinct behind treating pull-request validation as engineering, not a rubber stamp, and behind planning modernization around migration windows instead of demos. A demo proves an agent can do the easy 80%. A bench tells you whether it cleared the gate that ships.
The short version
ScarfBench is worth reading in full, but the operating lesson fits on a sticky note: a green build is a receipt for syntax, not for trust. Only the behavior gate signs off on trust, and it is the one most teams never build. Before you give an agent a framework migration, decide which receipts you will demand, and write them down where the agent cannot argue with them.
If you have a Spring, Jakarta EE, or Quarkus modernization candidate sitting on the backlog, that is the place to start. Pick one representative module and build the bench for it before you scale the agent across the system.
Do not start with the whole migration. Start with the receipts.
If you want help doing that on a real modernization candidate, bring one module. BaristaLabs will help you define the source behavior, compile gate, deploy gate, behavioral oracle, environment contract, dependency map, owner, and rollback rule before an agent touches the wider system. Build the migration acceptance bench.