Patrick Klitzke, an engineer at Mercari, goes looking for one deprecated call. It's the kind of ticket that starts small: a library method the team wants gone, a pattern that needs to change everywhere it appears. He runs the search across the company's repositories, and the number comes back larger than expected: around 80. "I found around 80 potential repos affected," he says.
Eighty is not a number you fix by hand on a Friday. It's also not a number most teams have a real process for, even now that an AI agent can plausibly write all eighty pull requests. The interesting question was never whether the agent could produce a correct-looking diff. It's what happens between repo one and repo two: the moment a team decides whether the first success was a rehearsal or a green light for the other seventy-nine.
What actually shipped
That Mercari example comes from Sourcegraph, which took its existing Batch Changes tool and, on June 30, 2026, released an agentic layer in front of it as a public beta. Batch Changes itself isn't new: Sourcegraph's documentation describes an Enterprise feature that has spent years shipping scripted changes, dependency bumps, and CVE patches across many repositories and code hosts at once, tracking each pull request through review and merge.
The new part is the agent sitting in front of the script. Instead of writing one transform and hoping it fits every repo, a team can describe a migration and let an agent scope which repositories are affected, execute the change, react to what CI reports back, and keep iterating until a pull request is actually mergeable, not just opened. Sourcegraph's own framing is blunt about where the difficulty sits: "The hard part is not knowing what change to make. The hard part is executing it safely at scale."
That's the right diagnosis. A senior engineer can usually describe the fix for a deprecated API in a sentence. What eats the week is the twelfth repo that pins an old build tool, the twenty-third that a different team owns and reviews on its own schedule, and the forty-fourth where the "same" function call sits inside a slightly different abstraction. Sourcegraph's pitch is that the agent can absorb that variance instead of a script choking on it. Klitzke described why a plain find-and-replace wasn't enough: "With the help of Agentic Batch Changes, you're able to handle repos that have similar, but not identical setups. A normal scripted change would most likely be a text search and replace operation without any context of how it's actually used." When the pattern gets that inconsistent, the tool can hand the harder repos to a coding agent (Sourcegraph names Claude Code and Codex as options through its own model-routing layer) while leaving the uniform ones to a deterministic script.
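One way to picture that split between the script path and the agent path is the sketch below. It assumes a hypothetical per-repo record of the call-site lines a search returned; `RepoHits`, `split_paths`, the repo names, and the "every call site matches one pattern" heuristic are all illustrative assumptions, not Sourcegraph's API or its actual routing logic.

```python
import re
from dataclasses import dataclass


@dataclass
class RepoHits:
    """Hypothetical record: one repo and the matching lines a search found in it."""
    repo: str
    call_sites: list[str]


def split_paths(hits: list[RepoHits], uniform_pattern: str) -> tuple[list[str], list[str]]:
    """Send a repo down the script path only if every call site matches the
    single pattern the scripted transform was written for. Anything else
    goes to the coding-agent path, where context can be read."""
    pattern = re.compile(uniform_pattern)
    script_path, agent_path = [], []
    for hit in hits:
        uniform = bool(hit.call_sites) and all(
            pattern.fullmatch(line.strip()) for line in hit.call_sites
        )
        (script_path if uniform else agent_path).append(hit.repo)
    return script_path, agent_path


# Made-up repos and a made-up deprecated call, for illustration only.
hits = [
    RepoHits("checkout-api", ["client.old_fetch(url)"]),
    RepoHits("search-web", ["client.old_fetch(url)", "self._wrap(client.old_fetch)(url)"]),
]
script_repos, agent_repos = split_paths(hits, r"client\.old_fetch\(\w+\)")
```

The second repo lands on the agent path because one of its call sites wraps the function instead of calling it directly, which is the kind of "similar, but not identical" setup a text replace would miss.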
To Sourcegraph's credit, the design keeps a human in the loop at the merge line: "Engineers review and approve every changeset before it merges." That's the right gate to have. It is also not, by itself, the gate that determines whether a fleet-wide migration goes well.
The gate that actually matters is earlier
Review-before-merge catches a bad diff in one repository. It does nothing to catch a bad assumption baked into the migration plan itself, because that assumption looks fine in every individual review. Each reviewer is looking at their own repo, their own diff, their own green CI run. None of those reviewers is positioned to notice that the agent's approach quietly depends on a directory layout, a config convention, or a dependency version that only holds for the repos it happened to try first.
Call it the "green first repo, broken fleet" pattern. It shows up in a predictable order. The pilot repo passes: clean diff, clean tests, quick approval. Repo two looks almost the same, so the second review goes almost as fast. By repo ten, the reviews have become procedural. Everyone assumes the pattern holds because it held nine times already. Then the migration hits a repo where a different team owns the code, where CI runs a slower suite that hasn't reported back yet, where the package manager pins a version the script didn't expect, or where a config file encodes a production setting nobody wrote down anywhere the agent could read it. The individual PR for that repo might still look reasonable on its face. What broke wasn't the diff. It was the assumption that repo one's success generalized, and nobody had written down the point at which the team was supposed to stop and check.
This is not a knock on Sourcegraph specifically, and it isn't unique to agentic tooling. Scripted Batch Changes runs have hit the same wall for years. What's new is the speed. An agent that can scope, execute, and iterate across hundreds of repositories can also generalize a wrong assumption across hundreds of repositories, faster than any human reviewer will notice the pattern.
The fix isn't a smarter agent. It's deciding, in writing, what "the pilot worked" is allowed to authorize before the second pull request opens.
Write the fleet contract before repo two
A fleet-wide migration contract is a short, specific document that a team fills out once per migration, before an agent runs across more than one repository. It doesn't replace code review. It sits above it, answering the questions that no single repo's reviewer is positioned to ask: which repos are actually in scope, what the pilot proved and didn't prove, when the rollout has to stop on its own, and who owns putting a merged change back the way it was.

The fleet-wide migration contract
Multi-repo rollout packet
Write this once, before an agent opens a second pull request, and treat every answer as something the first repo has to earn.
- 01 Migration intent
  Pins down: The exact change, in one sentence, and the reason it has to happen now
  Why it matters: If the intent is vague, every repo's edge case becomes a reason to improvise a different fix.
- 02 Repo discovery query and exclusion list
  Pins down: How the agent finds affected repositories, and which repos are off the list no matter what the query returns
  Why it matters: A search pattern that finds 80 repos will also find the three you never meant to touch.
- 03 First-repo rehearsal result
  Pins down: The specific repo used as the pilot, and what actually happened when the change ran there
  Why it matters: A pilot is evidence about one repo, not a verdict on the other seventy-nine.
- 04 Script path vs. coding-agent judgment path
  Pins down: Which repos get a scripted, deterministic transform and which ones need an agent to read context and decide
  Why it matters: The repos that need judgment are the ones most likely to hide the assumption that breaks the fleet.
- 05 CI-log access boundary
  Pins down: What test and build output the agent can read per repo, and what it cannot
  Why it matters: An agent that can't see why a repo's CI failed will guess, and a wrong guess repeats across every repo like it.
- 06 Stop conditions
  Pins down: The specific failure counts, error types, or repo categories that halt the rollout automatically
  Why it matters: "Keep going unless something looks wrong" is not a stop condition. A number is.
- 07 Code-owner routing
  Pins down: Who reviews each changeset, mapped to the team or person who actually owns that repository
  Why it matters: A single approver skimming eighty PRs is a rubber stamp with extra steps.
- 08 Changeset status board
  Pins down: One place that shows every repo's state: not started, rehearsed, opened, merged, blocked, or rolled back
  Why it matters: Without a shared board, "how's the migration going" gets answered from memory instead of from the fleet.
- 09 Rollback owner
  Pins down: The named person who can revert a merged changeset, and the path they use to do it
  Why it matters: Rollback plans that live only in the agent's session disappear the moment something needs reverting at 2 a.m.
- 10 Evidence packet
  Pins down: What each merged changeset leaves behind: the diff, the test result, the reviewer, and the repo-specific exception it hit
  Why it matters: Six months from now, someone will need to know why repo 41 looks different from the other seventy-nine.
One green repo is a data point. Eighty green repos is a claim someone has to be willing to defend.
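For teams that prefer the contract to live next to the code instead of in a wiki page, a subset of those fields can be sketched as a small data structure. Everything here is an assumption for illustration: the class and method names are invented, the repo names are made up, and the gate on the second pull request is just one reading of "the first repo has to earn it."

```python
from dataclasses import dataclass, field
from enum import Enum


class RepoState(Enum):
    """The six board states named in the contract."""
    NOT_STARTED = "not started"
    REHEARSED = "rehearsed"
    OPENED = "opened"
    MERGED = "merged"
    BLOCKED = "blocked"
    ROLLED_BACK = "rolled back"


@dataclass
class FleetContract:
    """Hypothetical, partial encoding of the contract (not a tool's schema)."""
    intent: str
    discovery_query: str
    exclusions: set[str]
    pilot_repo: str
    pilot_result: str      # what the rehearsal proved, in writing
    rollback_owner: str    # a named person, not a team alias
    board: dict[str, RepoState] = field(default_factory=dict)

    def in_scope(self, discovered: list[str]) -> list[str]:
        """The exclusion list wins over whatever the discovery query returned."""
        return [repo for repo in discovered if repo not in self.exclusions]

    def may_open_second_pr(self) -> bool:
        """Repo two stays closed until the written fields are filled in
        and the board shows the pilot as rehearsed or merged."""
        written = all(s.strip() for s in (self.intent, self.pilot_result, self.rollback_owner))
        pilot_done = self.board.get(self.pilot_repo) in (RepoState.REHEARSED, RepoState.MERGED)
        return written and pilot_done


contract = FleetContract(
    intent="Replace the deprecated call with its supported equivalent",
    discovery_query="<search query goes here>",
    exclusions={"legacy-billing"},
    pilot_repo="search-web",
    pilot_result="",
    rollback_owner="",
)
scope = contract.in_scope(["search-web", "legacy-billing", "checkout-api"])
```

With `pilot_result` and `rollback_owner` still blank, `may_open_second_pr()` returns `False` no matter how green the pilot's CI run was, which is the point of writing the answers down first.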
A few of these fields do more work than they appear to on paper.
The first-repo rehearsal result is the one teams skip fastest, because it feels like the migration has already started once repo one is open. Treat it instead as a standalone question with a written answer: what, specifically, did this repo prove? If the pilot repo happened to be the simplest one in the fleet (no unusual dependencies, one clear owner, a fast CI suite), say so in the contract. That fact should lower everyone's confidence in how far the result generalizes, not raise it.
Stop conditions need to be numbers, not moods. "Three consecutive CI failures," "any repo owned by the payments team," and "any changeset that touches a file modified in the last 14 days" are stop conditions an agent or a pipeline can act on without a human parsing tone. "Pause if it looks risky" is not a stop condition. It's a hope that someone will notice in time.
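The three example conditions above are concrete enough to check mechanically. The following is a minimal sketch under stated assumptions: `Changeset` is a hypothetical record rather than a type from any real tool, the thresholds are parameters a team would set in its own contract, and the file modification times are assumed to be supplied by whatever drives the rollout.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Changeset:
    """Hypothetical record of one repo's changeset."""
    repo: str
    owner_team: str
    ci_passed: bool
    # Last-modified times of the files this changeset touches.
    touched_file_mtimes: list[datetime] = field(default_factory=list)


def stop_reason(
    finished: list[Changeset],
    candidate: Changeset,
    now: datetime,
    max_failures: int = 3,
    blocked_teams: frozenset = frozenset({"payments"}),
    recent_days: int = 14,
) -> Optional[str]:
    """Return why the rollout must halt before `candidate`, or None to proceed."""
    # Count CI failures at the tail of the history; a pass resets the streak.
    streak = 0
    for cs in reversed(finished):
        if cs.ci_passed:
            break
        streak += 1
    if streak >= max_failures:
        return f"{streak} consecutive CI failures"
    if candidate.owner_team in blocked_teams:
        return f"{candidate.repo} is owned by {candidate.owner_team}"
    cutoff = now - timedelta(days=recent_days)
    if any(mtime > cutoff for mtime in candidate.touched_file_mtimes):
        return f"{candidate.repo} touches a file modified in the last {recent_days} days"
    return None


now = datetime(2026, 7, 1)
finished = [Changeset("repo-a", "search", True)] + [
    Changeset(f"repo-{n}", "search", False) for n in "bcd"
]
halt = stop_reason(finished, Changeset("repo-e", "search", True), now)
```

The function returns a reason string instead of a boolean so that the halt can be recorded on the status board as written evidence, not just observed.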
Code-owner routing matters because review quality degrades with volume in a specific, predictable way: the fifth review in an afternoon gets less attention than the first, and the fortieth gets almost none. Routing each changeset to the team that actually owns the repository, rather than to whoever is available to click approve, keeps the review meaningful instead of ceremonial.
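A rough sketch of that routing, assuming the team already has some repo-to-owner mapping (from CODEOWNERS files, a service catalog, or a spreadsheet). The function names, the repo and team names, and the per-team cap of five are assumptions chosen for illustration; the cap in particular is a number each team would pick for itself.

```python
from collections import defaultdict


def route_changesets(repos: list[str], owners: dict[str, str]) -> tuple[dict[str, list[str]], list[str]]:
    """Group changesets by owning team. A repo with no recorded owner is
    reported as unrouted rather than handed to a default approver."""
    queues = defaultdict(list)
    unrouted = []
    for repo in repos:
        team = owners.get(repo)
        if team:
            queues[team].append(repo)
        else:
            unrouted.append(repo)
    return dict(queues), unrouted


def overloaded(queues: dict[str, list[str]], max_per_team: int = 5) -> dict[str, int]:
    """Teams whose review queue exceeds the cap, so the rollout can be
    batched instead of dumping every changeset on them at once."""
    return {team: len(q) for team, q in queues.items() if len(q) > max_per_team}


# Made-up ownership data for illustration.
owners = {"checkout-api": "payments", "search-web": "search", "search-index": "search"}
queues, unrouted = route_changesets(
    ["checkout-api", "search-web", "search-index", "old-cron-jobs"], owners
)
```

Surfacing `unrouted` separately matters: a repo nobody claims is exactly the one that should not be approved by whoever happens to be available.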
This is close in spirit to the ticket-level contracts we've written about for agent work on shared project boards: access, evidence, and a human gate, spelled out per task instead of assumed. The fleet contract is the same discipline at a different altitude. A ticket contract governs what one agent may do on one piece of work. A fleet contract governs the order in which a migration is allowed to spread across an entire estate, and what has to be true before it spreads any further.
It also complements, rather than replaces, the other controls a migration needs. ScarfBench's three-receipt model (compile, deploy, behavior) is the right way to judge whether one migrated repository actually works. The fleet contract answers a different question: assuming a repo passes those gates, is it safe to let the same change run on the next seventy-nine unattended? A dependency no-fly list keeps an agent away from packages it shouldn't touch during any single change. The fleet contract keeps a whole rollout from outrunning the humans who are supposed to be watching it.
Start with the repo you'd never let go first
If your team is looking at a Sourcegraph-style rollout, or building the same pattern with scripts and an in-house agent, resist the instinct to pilot on the easiest repository in the fleet. An easy pilot proves the happy path works, which you already suspected. Pilot instead on a repo with a quirk that worries you a little: an unusual dependency, a slower CI suite, a different code owner. Write the contract before that repo's pull request opens.
If the rehearsal repo clears every stop condition cleanly, you've learned something real about how far the pattern generalizes. If it doesn't, you've found the edge case on one repository instead of on the forty-fourth one, after the review habit has already gone numb.
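One crude way to make that pilot choice explicit is to write down the quirks per repo and pick from the worrying end of the list rather than the comfortable one. The three flags below mirror the examples in the text; the `RepoProfile` class, the repo names, and the idea of simply counting quirks are assumptions for illustration, and a team may well prefer a repo with one specific quirk over the one with the most.

```python
from dataclasses import dataclass


@dataclass
class RepoProfile:
    """Hypothetical per-repo notes a team fills in before choosing a pilot."""
    repo: str
    unusual_dependency: bool = False
    slow_ci: bool = False
    different_owner: bool = False

    def quirks(self) -> int:
        # Booleans sum as 0/1, so this is a count of flagged quirks.
        return sum((self.unusual_dependency, self.slow_ci, self.different_owner))


def pick_pilot(fleet: list[RepoProfile]) -> RepoProfile:
    """Choose the repo with the most known quirks; ties go to the first listed."""
    return max(fleet, key=RepoProfile.quirks)


fleet = [
    RepoProfile("search-web"),
    RepoProfile("checkout-api", slow_ci=True, different_owner=True),
    RepoProfile("report-jobs", unusual_dependency=True),
]
pilot = pick_pilot(fleet)
```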
If you're weighing an agentic migration tool against a fleet of repositories and want a second set of eyes on the contract before the first pull request opens, bring BaristaLabs one migration candidate and we'll help you write the discovery query, the stop conditions, and the rollback path before the agent gets to repo two.
