A dev lead drops a line in the platform channel: "Can we run Qwen Coder for the internal tools team? They want it for code review and a few document workflows."
Reasonable ask. Open-weight model, real use case, nothing exotic. The platform engineer who owns GPU capacity reads it and starts typing questions instead of a yes.
Where does it run: the cloud account, the neocloud boxes rented by the hour, or the on-prem rack? On what GPUs, and how much memory does it need? Which serving engine? Behind which endpoint, and who owns that endpoint when it breaks? What happens at 5 p.m. when everyone's running code review at once and the capacity is gone?
None of that is in the request. A handful of words is doing the work of a dozen decisions. Until someone makes those decisions, "can we run this model" cannot be answered. It can only be argued about in a thread.
That gap, between a model someone wants and a fleet someone has to run it on, is exactly where Modelplane sits. It shipped v0.1.0 on June 23, 2026, and it is worth studying not as a tool every team should adopt this week, since its own README says it is an early release, but as a clean map of the seams in self-hosted inference.
What stops being a notebook
For a while, running an open-weight model is a notebook. One GPU, one engineer, one script, and it either fits in memory or it does not. The whole thing lives on one machine and one person's attention.
It stops being a notebook the moment a second team wants in. Now there is a coding model and a document model, two GPU pools, a compliance question about where one of them can run, and a Tuesday afternoon where demand spikes past what you own. Modelplane's docs have a phrase for this: inference becomes a fleet problem across hardware types, providers, regions, and sovereignty constraints, spanning cloud, neocloud, and on-prem environments. Open-weight models are what get you here, because the reason to self-host them is control over cost, governance, and data sovereignty. You take the control; you also take the fleet.
The bet underneath Modelplane is that Kubernetes is becoming the default place that fleet lives, with the cluster handling device-aware scheduling and accelerator management. You do not have to share the whole bet to use the artifact at the center of this piece. You only have to agree that "can we run this model" has stopped being a one-machine question.
Two lanes, and a rule about who names what
Here is the design decision worth slowing down on: Modelplane splits inference into two lanes and gives each lane a different vocabulary.
Modelplane is software you install in your own environment. It is built on Crossplane and runs as a control plane on its own cluster above your inference clusters. Platform teams describe the fleet. ML teams describe the model. Modelplane composes the rest.
The platform team works in capacity. They create the gateway, clusters, and InferenceClass resources. An InferenceClass is a tested recipe for a GPU node pool that bundles device details and optional provisioning. It is a published lane: this is a kind of GPU capacity we run, and here is what is inside it. Those device descriptions follow the style of Kubernetes Dynamic Resource Allocation (DRA), the mechanism Kubernetes uses to match accelerators to workloads.
The ML team works in requirements. They declare what a model needs with a ModelDeployment: the model, the engine, the GPU and memory it requires. Modelplane's scheduler matches those requirements against the capacity the platform team published, schedules replicas, and exposes a unified, OpenAI-compatible endpoint. From there it keeps the fleet pointed at the declared state: provisioning clusters, scaling replicas, caching weights, and routing traffic.
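The matching step can be pictured with a small sketch. This is not Modelplane's scheduler or its resource schema: the class names, the fields (`gpu_memory_gib`, `free_gpus`, `allowed_regions`), the sample engine value, and the first-fit rule are all assumptions made up for illustration. The only point is that the request carries requirements, the published classes carry capacity, and no cluster name appears on the request side.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CapacityClass:
    """A lane the platform team publishes (hypothetical fields)."""
    name: str
    gpu_memory_gib: int   # memory per GPU in this pool
    free_gpus: int        # GPUs currently unallocated
    region: str


@dataclass(frozen=True)
class ModelRequest:
    """What the ML team declares (hypothetical fields). No cluster name."""
    model: str
    engine: str
    gpus_per_replica: int
    min_gpu_memory_gib: int
    allowed_regions: frozenset


def place(request: ModelRequest, classes: list) -> Optional[CapacityClass]:
    """Return the first published class that satisfies the request, else None."""
    for c in classes:
        fits_memory = c.gpu_memory_gib >= request.min_gpu_memory_gib
        fits_count = c.free_gpus >= request.gpus_per_replica
        fits_region = c.region in request.allowed_regions
        if fits_memory and fits_count and fits_region:
            return c
    return None


classes = [
    CapacityClass("small-24g", gpu_memory_gib=24, free_gpus=8, region="us-east"),
    CapacityClass("large-80g", gpu_memory_gib=80, free_gpus=2, region="eu-west"),
    CapacityClass("large-80g-us", gpu_memory_gib=80, free_gpus=4, region="us-east"),
]
request = ModelRequest(
    model="qwen-coder",
    engine="vllm",  # sample value only; the article does not name an engine
    gpus_per_replica=2,
    min_gpu_memory_gib=40,
    allowed_regions=frozenset({"us-east"}),
)
chosen = place(request, classes)
print(chosen.name if chosen else "no lane fits")
```

A real scheduler would weigh far more than three predicates, but even this toy version shows why a vague memory figure turns into "does not fit" at deploy time: the comparison has nothing to compare against.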
The rule that makes the split real is simple: the ML team never names a cluster. They say what the model needs. The scheduler figures out where that lands. A developer who asks for capacity by name, "put it on the A100 box in us-east," now owns a placement decision without being able to see the whole board. Modelplane takes that pencil out of their hand on purpose.
That is the line worth carrying into your own org, even if you never install any of this:
Do not ask for a model. Ask for a lane it can run in.
The artifact: an inference deployment ticket
This is not a post-run audit log. It is not a generic approval form. It is a request made before any GPU gets bought, any hosted endpoint gets committed, or any internal model API gets exposed. Call it an inference deployment ticket.
You do not need Modelplane to use the ticket, and it will not compile into anything. It is the conversation the one-line request skipped, written down once so the handoff from "we want this model" to "it is serving traffic" becomes portable instead of re-argued every time. If you do run Modelplane, most of these fields map onto a ModelDeployment and an InferenceClass. If you do not, they map onto a meeting you were going to have anyway, but with sharper questions.
| Field | The question it forces | Why it bites later if you skip it |
|---|---|---|
| Model and business job | Which model, doing what work for whom? | "A coding model" and "Qwen Coder for PR review on the payments repo" need different sign-off. |
| Serving engine / container | What actually runs the weights? | The engine sets throughput, memory, and which GPUs even qualify. |
| Hardware class and memory need | Which published capacity lane, how much VRAM? | This is the field that matches a request to real GPUs. Vague here means "does not fit" at deploy time. |
| Fleet boundary | Cloud, neocloud, on-prem, which region? | The boundary is a cost and latency decision wearing a geography costume. |
| Sovereignty / compliance | Any rule about where this data and model may run? | Decide it before traffic flows, not after a customer's data lands somewhere it should not. |
| Scaling target and latency budget | Peak concurrent load, acceptable p95? | "It is slow at 5 p.m." is a missing number, not a surprise. |
| Endpoint shape and owner | What URL do callers hit, and who is paged for it? | An endpoint nobody owns is an outage with no name on it. |
| Overflow / break-glass provider | Where does traffic go when your capacity is gone? | Capacity runs out. The only question is whether you decided the fallback in advance. |
| Cache / weight location | Where do the model weights live and load from? | Cold weight pulls are a latency and egress bill you feel on every scale-up. |
| Rollback and retirement rule | How do you back it out, and when does it die? | Models you cannot retire become load-bearing by accident. |
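
If it helps to keep the ticket somewhere more durable than a chat thread, the ten rows can be held as a plain record with a check for blanks. This is a minimal sketch, not a Modelplane resource: the class name, the field names, and the sample values are assumptions chosen to mirror the table rows.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class InferenceTicket:
    """One field per table row; None or blank means the decision is not made."""
    model_and_job: Optional[str] = None
    serving_engine: Optional[str] = None
    hardware_class_and_memory: Optional[str] = None
    fleet_boundary: Optional[str] = None
    sovereignty: Optional[str] = None
    scaling_and_latency: Optional[str] = None
    endpoint_and_owner: Optional[str] = None
    overflow_provider: Optional[str] = None
    weight_location: Optional[str] = None
    rollback_and_retirement: Optional[str] = None


def open_decisions(ticket: InferenceTicket) -> list:
    """Names of rows still blank (None or whitespace only)."""
    return [
        f.name
        for f in fields(ticket)
        if not (getattr(ticket, f.name) or "").strip()
    ]


# Sample values are placeholders for illustration.
ticket = InferenceTicket(
    model_and_job="Qwen Coder for PR review on the payments repo",
    serving_engine="(engine chosen by the ML team)",
    fleet_boundary="on-prem, single region",
    endpoint_and_owner="",  # URL known, owner not yet named
)
missing = open_decisions(ticket)
print(f"{len(missing)} of {len(fields(ticket))} rows still open:")
for name in missing:
    print(" -", name)
```

Each name the check prints is a decision nobody has made yet, which is the cheapest point at which to find it.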

The whole idea fits in one frame: two lanes that do not collapse into one pile, with the platform team's capacity on one side and the ML team's model request on the other. They meet at exactly one place, a controlled endpoint, and that meeting point is the only thing a calling app ever sees.
Two fields people get wrong
Most of the ticket is straightforward once you are staring at it. Two rows are where teams quietly lose.
The break-glass provider. Modelplane lets a ModelEndpoint point at an inference endpoint it does not run, such as a hosted provider like Together or Baseten. It also lets a single ModelService front your own replicas and that external endpoint behind one URL for overflow or break-glass failover. That is a real architectural option hiding in one ticket field. It means "self-hosted" does not have to mean "down when we are out of GPUs." But only if someone named the fallback ahead of time and confirmed the model is allowed to run there. The overflow you did not plan is the data boundary you did not mean to cross.
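The "named ahead of time and confirmed allowed" condition can be written as a small guard. This is a sketch of the decision, not of how Modelplane routes traffic: `Fallback`, `choose_backend`, the data-class labels, and the capacity counter are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fallback:
    """A pre-approved external endpoint (hypothetical shape)."""
    name: str
    approved_for_data_classes: frozenset


def choose_backend(
    in_flight: int,
    self_hosted_capacity: int,
    data_class: str,
    fallback: Optional[Fallback],
) -> str:
    """Pick where one request goes; raise rather than cross a boundary silently."""
    if in_flight < self_hosted_capacity:
        return "self-hosted"
    if fallback is None:
        raise RuntimeError("capacity exhausted and no fallback was named")
    if data_class not in fallback.approved_for_data_classes:
        raise PermissionError(
            f"{data_class!r} data is not approved to run on {fallback.name}"
        )
    return fallback.name


hosted = Fallback("hosted-provider", frozenset({"public", "internal"}))

print(choose_backend(3, 8, "internal", hosted))    # room on your own replicas
print(choose_backend(8, 8, "internal", hosted))    # overflow, approved
try:
    choose_backend(8, 8, "customer-pii", hosted)   # overflow, not approved
except PermissionError as err:
    print("blocked:", err)
```

The design choice worth copying is the failure mode: when the fallback is missing or not approved for the data, the request fails loudly instead of quietly landing somewhere it should not.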
The endpoint owner. Modelplane's ModelService selects endpoints by label and creates the route that turns a deployment into a URL. Mechanically clean. Organizationally, a URL that real software depends on needs a human whose name is on it, not because the tool requires it, but because the pager does. Leave this blank and you have built an internal API that everyone calls and no one maintains.
Both rows are also where monitoring stops being optional. Once an endpoint fronts a couple of replicas and an overflow provider, "is the output any good right now" becomes its own thing to watch: an output-quality observability problem, sitting one layer up from placement. And the moment a data boundary is in play, the sovereignty row should send you to tighten your data boundaries before traffic flows, not after.
What this costs you to try
The honest version: Modelplane is a v0.1.0 release, Apache-2.0 licensed, and its README says APIs and behavior can change between releases. It is a sharp picture of the problem and an early tool for it, not something to put under production traffic on a Friday. Treat the code as a way to see the seams. Treat the ticket as something you can use Monday.
The cheaper move, before any of it, is to run the economics. Self-hosting an open-weight model earns its keep on some workloads and quietly loses to a hosted endpoint on others. That is a build-vs-buy question worth working honestly before you wire up a fleet to serve it.
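A first pass at that arithmetic fits in a few lines. Everything here is a placeholder: the GPU hourly rate, the ops overhead, and the hosted per-token price are invented inputs, not quotes from any provider, and the model is a deliberately crude assumption of a fixed self-hosted bill against a purely usage-based hosted bill.

```python
def monthly_self_hosted_cost(gpu_hourly_usd: float, gpus: int,
                             hours_per_month: float = 730.0,
                             ops_overhead_usd: float = 0.0) -> float:
    """Cost of keeping a GPU pool up all month, regardless of traffic."""
    return gpu_hourly_usd * gpus * hours_per_month + ops_overhead_usd


def break_even_tokens(self_hosted_monthly_usd: float,
                      hosted_usd_per_million_tokens: float) -> float:
    """Monthly token volume at which hosted spend equals the self-hosted bill."""
    return self_hosted_monthly_usd / hosted_usd_per_million_tokens * 1_000_000


# Placeholder inputs: replace with your own quotes and measurements.
fixed = monthly_self_hosted_cost(gpu_hourly_usd=2.0, gpus=2, ops_overhead_usd=1000.0)
threshold = break_even_tokens(fixed, hosted_usd_per_million_tokens=1.0)
print(f"self-hosted bill: ${fixed:,.0f}/month")
print(f"break-even volume: {threshold / 1e9:.2f}B tokens/month")
```

Below the break-even volume the hosted endpoint is cheaper on these assumptions; above it, self-hosting starts to pay, provided the pool can actually serve that volume within the latency budget, which this sketch does not check.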
Fill one ticket before you spend a dollar
Pick one model-serving workflow you are about to stand up: the coding assistant, the document classifier, the support summarizer, whichever one is closest to real. Before you buy a GPU, commit to a hosted plan, or expose an internal model API, fill the ticket for it. All ten rows. If you cannot fill a row, you have found the decision that was going to bite at deploy time, surfaced cheaply, on paper, while it is still free to change your mind.
When the request that reaches your platform team is a lane instead of a guess, "can we run this model" stops being a thread and becomes a yes.
If you want a hand turning one fuzzy request into a ticket someone can act on, bring us one workflow and we will fill it with you, or map the whole serving path as part of process automation. Either way, you leave with a lane, not a favor.
