Android Bench cut 38,989 GitHub pull requests down to 100 tasks, then filled 58% of the final set with libraries even though 63% of Android repositories on GitHub are applications. That one choice matters more than the headline leaderboard.
Google’s top-line result is clean enough: Gemini 3.1 Pro Preview leads Android Bench at 72.4%, ahead of Claude Opus 4.6 at 66.6% and GPT-5.2-Codex at 62.5%. Most coverage stopped there.
The more useful read is in the methodology page. Google says it leaned the benchmark toward libraries on purpose so models would face “more restrictions, modularity, and architectural patterns.” In other words, this is not a simple “which model builds Android apps best?” chart. It is closer to “which model survives the maintenance work that wrecks velocity in older Android stacks?”
For a solo dev running Android Studio with Kotlin, Compose, Room, Hilt, and a few inherited modules, that distinction is the whole game. Greenfield app demos are cheap. Untangling navigation migrations, Gradle config drift, permission handling, and SDK breakage is where the bill shows up.
The leaderboard is really a maintenance leaderboard
Google did not build Android Bench out of toy prompts. The company filtered down from 38,989 pull requests, required tests, kept changes from the last three years, and ended with a benchmark whose median task size was 32 changed lines. Nearly half the benchmark sits under 27 changed lines.
That sounds small until you remember what Android maintenance work usually looks like. The expensive tickets are rarely giant rewrites. They are the annoying, local, dependency-sensitive changes that break one module, one build variant, or one edge-case permission flow.
That is why the library skew matters. A model that wins here is not just good at spitting out Compose screens. It is better at working inside constraints: modular code, architectural seams, tests, and existing abstractions.
Google biased the test toward modern Android pain on purpose
The benchmark categories also give the game away. Google explicitly prioritized Jetpack Compose, Coroutines and Flows, Room, and Hilt, then added frequent help-request zones like navigation migrations, Gradle and build configuration, SDK breaking changes, system UI, camera, media, foldables, and granular runtime permissions.
That is a very specific shape of Android work. It favors teams dealing with real codebases, not prompt engineers building fresh demo apps from scratch.
It also explains why the gap between first and third place is not trivial. Gemini 3.1 Pro Preview’s 72.4% score is 9.9 points above GPT-5.2-Codex. On a set of 100 tasks, that is roughly 10 more benchmark tasks cleared. If your shop sees 20 nasty Android maintenance tickets in a quarter, even a fraction of that gap matters. At a blended contractor rate of $140 an hour, avoiding three escalations that would each eat three hours is about $1,260 back in your pocket.
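The back-of-envelope math above is easy to sanity-check. A minimal sketch, where the contractor rate, escalation count, and hours are the article's illustrative assumptions, not benchmark data:

```python
# Scores from the Android Bench leaderboard; everything else is assumed.
gemini_score = 72.4       # Gemini 3.1 Pro Preview, % of 100 tasks
codex_score = 62.5        # GPT-5.2-Codex
gap_points = gemini_score - codex_score  # ~10 more tasks on a 100-task set

rate_per_hour = 140       # blended contractor rate, USD (assumption)
escalations_avoided = 3   # hypothetical quarterly count
hours_per_escalation = 3  # hypothetical

savings = escalations_avoided * hours_per_escalation * rate_per_hour
print(f"score gap: {gap_points:.1f} points (~{round(gap_points)} tasks)")
print(f"quarterly savings: ${savings:,}")
```

Running it confirms the figures in the paragraph: a 9.9-point gap and $1,260 per quarter under those assumptions.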
Where this benchmark can still fool you
Android Bench is strong, but it is opinionated. Google admits it selected more complex repositories, especially libraries and Jetpack Compose projects, to test against current architectural standards. That makes the benchmark better for modern Android engineering and worse as a universal proxy for every mobile team.
If your business still lives in a mostly View-based app with thin test coverage and one giant app module, do not read this leaderboard as a buying guide for all mobile work. The dataset still includes 59% View-based tasks, but its center of gravity leans toward teams already moving closer to Google’s preferred architecture.
That makes the benchmark directionally useful and operationally dangerous if you overread it. A local services company with a four-year-old field app may care less about leaderboard rank than about whether a model can patch Gradle weirdness without trashing legacy XML screens.
The stack decision for a small mobile shop
If you run a two-to-five-person agency or product team, the right move is not “switch everything to the winner today.” It is narrower.
Use Android Bench as a filter for which model deserves your first serious maintenance trial. Take five real tickets from your backlog: one Gradle problem, one navigation migration, one persistence bug, one permission edge case, and one UI fix touching either Compose or Views. Run the same workflow through your current toolchain and one of the top two Android Bench models. Score it on patch quality, test pass rate, and how much cleanup a senior engineer has to do before merge.
That gives you the number that matters: not a public leaderboard score, but your rework rate inside your own stack.
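If you want that rework number to be comparable across trials, it helps to score every ticket the same way. A minimal sketch of such a rubric, where the field names, weights, and review budget are all invented for illustration, not part of Android Bench:

```python
from dataclasses import dataclass

@dataclass
class TicketResult:
    """One backlog ticket run through one model. All fields are hypothetical."""
    name: str
    tests_passed: int     # tests passing after the model's patch
    tests_total: int
    cleanup_minutes: int  # senior-engineer time to make the patch mergeable

def pass_rate(results):
    """Overall test pass rate across the trial tickets."""
    return sum(r.tests_passed for r in results) / sum(r.tests_total for r in results)

def rework_rate(results, review_budget_minutes=60):
    """Fraction of a per-ticket review budget eaten by cleanup, averaged
    over tickets. Lower is better; 0.0 means every patch merged as-is."""
    spent = sum(min(r.cleanup_minutes, review_budget_minutes) for r in results)
    return spent / (review_budget_minutes * len(results))

# The five-ticket trial suggested above, with made-up outcomes.
trial = [
    TicketResult("gradle-config", 4, 4, 10),
    TicketResult("nav-migration", 3, 5, 45),
    TicketResult("room-bug", 6, 6, 0),
    TicketResult("permission-edge", 2, 3, 30),
    TicketResult("compose-ui-fix", 5, 5, 5),
]

print(f"test pass rate: {pass_rate(trial):.0%}")
print(f"rework rate:    {rework_rate(trial):.0%}")
```

Run the same five tickets through each candidate model and compare the two percentages; the rework rate is the one that predicts your actual merge cost.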
Google built a benchmark that quietly says the hard part of AI coding is not generation. It is surviving modular Android maintenance. If you buy anything from this leaderboard, buy the model that makes that work less brittle.
