AI Development

The useful part of PostHog's agent PR was the launch checklist

A PostHog production-readiness PR shows the controls teams should prove before agents get write access: isolation, events, approvals, auth, egress, and live tests.

Sean McLellan

Lead Architect & Founder

June 9, 2026 · 7 min read

On June 3, a PostHog pull request put a useful name on a problem most teams eventually hit: the agent demo worked, but the production path still had holes.

The PR was not framed as a model upgrade or a clever instruction tweak. It was framed as production-readiness wiring: sandbox, event bus, approvals, gateway. The problem statement was blunt. The agent platform services were deployable in dev, but several pieces still blocked real workloads end to end.

That is the useful part.

A dev-ready agent can call tools, stream progress, and maybe complete a happy path. A production-ready agent has to survive the boring infrastructure questions: where untrusted code runs, where events go, who approves risky actions, which boundary authenticates requests, which network paths are allowed, and whether the live path has actually been exercised.

If your agent can touch customer data, write to business systems, send messages, or trigger money-moving work, those questions are not cleanup. They are the launch checklist.

The gap between a demo and a real agent

The PostHog PR lists blockers that should feel familiar to anyone moving agents out of a lab environment.

One blocker was sandboxing. The custom-tool sandbox defaulted to in-process, which meant JavaScript written by tool authors ran inside the runner pod with no isolation. That may be fine for a local spike. It is a bad default once tools can run arbitrary or semi-trusted code.

PostHog's fix included a Modal sandbox path. Modal describes Sandboxes as secure containers for executing untrusted user or agent code. The important word is not "Modal." The important word is "containers." Production agent work needs a separate place to run code that can fail, loop, reach for the wrong file, or behave differently than expected.

Another blocker was event delivery. Runner and ingress could fall back to an in-memory SessionEventBus when REDIS_URL was unset. That works only while the producer and consumer share a process. Put runner and ingress on different pods and the illusion breaks: an SSE connection on one ingress pod can miss events from a runner on another pod.

Redis describes pub/sub as a pattern where publishers send messages to channels and subscribers receive messages from channels of interest. You do not need Redis specifically, but you need something with that shape. Process-local memory is not an event system for distributed work.
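To make the failure concrete, here is a minimal sketch of a process-local bus. The class and channel names are hypothetical, not PostHog's code, and the two instances stand in for two pods that each construct their own bus.

```typescript
type Listener = (event: string) => void;

// A process-local bus: subscribers exist only in this instance's memory.
class InMemoryEventBus {
  private listeners: Record<string, Listener[] | undefined> = {};

  subscribe(channel: string, listener: Listener): void {
    const existing = this.listeners[channel] ?? [];
    this.listeners[channel] = [...existing, listener];
  }

  // Returns how many subscribers received the event.
  publish(channel: string, event: string): number {
    const targets = this.listeners[channel] ?? [];
    targets.forEach((listener) => listener(event));
    return targets.length;
  }
}

// Each "pod" builds its own bus, as separate processes would.
const runnerPodBus = new InMemoryEventBus();
const ingressPodBus = new InMemoryEventBus();

// The SSE handler on the ingress pod subscribes to its local bus.
const deliveredToStream: string[] = [];
ingressPodBus.subscribe("session:123", (event) => {
  deliveredToStream.push(event);
});

// The runner publishes on its own instance, so the ingress subscriber never sees it.
const reached = runnerPodBus.publish("session:123", "tool_call_started");
```

Nothing throws and nothing is logged. The publish reaches zero subscribers, which is why this gap tends to surface as a quiet UI rather than an error.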

The approvals issue is even sharper. PostHog had implemented PgApprovalStore, but the PR says it was never constructed in production entrypoints. The result: every requires_approval tool was silently ungated, and /approvals/* returned 503.

That is the kind of bug that does not show up in a demo unless the demo includes the uncomfortable path. It appears when a real workflow asks, "Before this agent does the thing, where does the approval actually live?"

The PR also called out dead egress plumbing and an allowlist warning that was not threaded into runtime context. In other words, the policy language existed somewhere, but the runtime boundary was elsewhere.

[Figure: Agent launch checklist with sandbox, event bus, approval gate, gateway, and test path represented as separate control stations. Caption: A production agent launch checklist should prove each control path before agents get more autonomy.]

The launch checklist hiding inside the PR

The PR is useful because it names the gates. Not in consultant language. In production language.

Before an agent gets write access or customer-impacting tools, a team should be able to prove six controls.

1. Sandbox isolation

The first question is simple: where does agent-run code execute?

If the answer is "inside the same process that runs our worker," the agent is still in demo territory. That code path can share memory, filesystem assumptions, credentials, package context, and failure modes with the worker.

A production checklist should ask:

Can agent-authored or user-authored code run outside the runner process?
Does the production entrypoint refuse unsafe sandbox backends?
Are sandbox credentials supplied through the deployment environment, not local defaults?
Can the sandbox be cleaned up after a session?
Do tests cover the sandbox contract, not just the caller?

PostHog's PR moved production selection away from in-process and toward Modal or Docker-backed execution. The exact backend matters less than the enforcement point. A safe option that exists but is not selected by production does not protect anything.
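One way to put the enforcement point at boot is sketched below. The variable name SANDBOX_BACKEND, the backend labels, and the function are assumptions for illustration, not PostHog's actual configuration.

```typescript
// Hypothetical names throughout; this is a sketch, not PostHog's config code.
type SandboxBackend = "in-process" | "docker" | "modal";

interface SandboxEnv {
  NODE_ENV?: string;
  SANDBOX_BACKEND?: string;
}

function isSandboxBackend(value: string): value is SandboxBackend {
  return value === "in-process" || value === "docker" || value === "modal";
}

function selectSandboxBackend(env: SandboxEnv): SandboxBackend {
  const isProduction = env.NODE_ENV === "production";
  const requested = env.SANDBOX_BACKEND;

  if (requested === undefined || requested === "") {
    // The convenient default is allowed only outside production.
    if (isProduction) {
      throw new Error("SANDBOX_BACKEND must be set explicitly in production");
    }
    return "in-process";
  }

  if (!isSandboxBackend(requested)) {
    throw new Error(`Unknown sandbox backend: ${requested}`);
  }

  // An explicit but unsafe choice is also refused in production.
  if (isProduction && requested === "in-process") {
    throw new Error("in-process sandbox is not isolated; refusing to start in production");
  }
  return requested;
}
```

The design choice is that production has no default at all. A missing or unsafe value stops the process instead of quietly selecting the backend that happened to work in development.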

For operators, this belongs in the same conversation as your AI workflow security review. Which tools can run code? Which secrets can that code see? Which files, hosts, or APIs can it reach? Who owns the blast radius when it behaves badly?

2. Shared event and stream path

Agents are long-running work. They plan, call tools, wait, ask for approvals, recover, and report progress. If the user interface depends on events, the event path is part of the product.

A memory-only event bus often survives the first demo because everything runs together. It fails when the system scales in the most ordinary way: multiple pods, separate workers, separate ingress, rolling deploys, or a job resumed by a different process.

A production checklist should ask:

Can a runner publish events that any ingress instance can deliver?
Does production fail at boot if the shared bus is missing?
Are local defaults clearly limited to development and tests?
Can operators trace a session event from worker to stream?
Does the UI degrade honestly if the stream is unavailable?

This is not just an engineering hygiene issue. It changes what the business can trust. If an agent is working on a customer request and the stream drops half the state transitions, the operator sees a false story.

That false story usually turns into manual follow-up, duplicate work, or a bad approval decision.
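The "fail at boot" question can be sketched as a small config resolver. REDIS_URL comes from the PR description; the function name, the config shape, and the use of NODE_ENV are assumptions made for this example.

```typescript
// Hypothetical resolver; only REDIS_URL is taken from the PR description.
interface EventBusEnv {
  NODE_ENV?: string;
  REDIS_URL?: string;
}

type EventBusConfig =
  | { kind: "redis"; url: string }
  | { kind: "in-memory" };

function resolveEventBusConfig(env: EventBusEnv): EventBusConfig {
  const url = env.REDIS_URL?.trim();
  if (url) {
    return { kind: "redis", url };
  }

  // The in-memory fallback is limited to environments that opt in by name.
  if (env.NODE_ENV === "development" || env.NODE_ENV === "test") {
    return { kind: "in-memory" };
  }

  // Anything else, including an unset NODE_ENV, is treated as production.
  throw new Error(
    "REDIS_URL is required outside development and test; refusing to fall back to an in-memory event bus"
  );
}
```

Treating an unset NODE_ENV as production is deliberate in this sketch: a misconfigured deployment should fail loudly rather than inherit the development fallback.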

3. Approval store wired into production

Approval logic has two halves: policy and persistence.

Policy says which tools require review. Persistence says where the request waits, who decided, when they decided, and what the agent did afterward.

The dangerous middle state is "approval-aware code" that is not actually wired into the runtime. PostHog's PR called this out directly: requires_approval tools were silently ungated because the production entrypoints never constructed the approval store.

A production checklist should ask:

Does every production runner receive the same approval store dependency?
Do approval-required tools block when the store is unavailable?
Do approval APIs return real pending, approved, and rejected state?
Is there an audit trail for the request, decision, actor, and resulting tool call?
Are approval failures noisy, or can the agent continue silently?

This is where teams should slow down. An approval system that fails open is worse than no approval system, because it creates confidence without control.
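A fail-closed gate might look like the sketch below. The interface and function names are hypothetical, and the store is synchronous only to keep the example short; a real store backed by a database would be asynchronous.

```typescript
// Hypothetical types; the source describes a requires_approval flag and a store,
// but none of these names or signatures come from PostHog's code.
type ApprovalStatus = "pending" | "approved" | "rejected";

interface ApprovalStore {
  getStatus(sessionId: string, toolName: string): ApprovalStatus | undefined;
}

interface ToolDefinition {
  name: string;
  requiresApproval: boolean;
}

type GateResult = { allowed: true } | { allowed: false; reason: string };

function gateToolCall(
  tool: ToolDefinition,
  sessionId: string,
  store: ApprovalStore | undefined
): GateResult {
  if (!tool.requiresApproval) {
    return { allowed: true };
  }

  // Fail closed: a missing store blocks the call instead of skipping the gate.
  if (store === undefined) {
    return { allowed: false, reason: "approval store not configured" };
  }

  const status = store.getStatus(sessionId, tool.name);
  if (status === "approved") {
    return { allowed: true };
  }
  return {
    allowed: false,
    reason: status === undefined ? "no approval requested" : `approval ${status}`,
  };
}
```

The branch that matters is the second one. If the store is absent, the gated tool is blocked with a stated reason, which is the opposite of the silently ungated behavior the PR described.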

If you are designing this path, start with the approval queue guide and an explicit AI approval policy. The goal is not to make every agent action painful. The goal is to separate low-risk autonomy from actions that need a human gate.

4. Gateway, auth, and routing boundary

Agent platforms accumulate entrypoints quickly: web UI, API, MCP, webhooks, background workers, approval endpoints, tool callbacks, and internal admin routes.

The gateway is where that mess either becomes a boundary or becomes folklore.

PostHog's PR included gateway routing and auth wiring as part of production readiness. That matters because agents do not just need permissions inside the model loop. They need permissions at the edges where sessions start, approvals are decided, tools are invoked, and events are read.

A production checklist should ask:

Which service accepts agent session requests?
Which identity is attached to the session?
Which routes are public, internal, or admin-only?
Can approval endpoints be called only by authorized users or systems?
Are tool callbacks scoped to the session and tenant?
Does routing preserve the same auth context the policy engine expects?

Agent permissions often fail by drift. The UI checks one thing. The worker assumes another. A callback route has a shortcut because it was "internal." Then the agent gets a path nobody meant to expose.

Treat routing as part of the control plane, not plumbing.
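As a rough illustration of scoping at the edge, the checks below assume the gateway has already authenticated the caller and attached a context object. The field names and the "approver" role are invented for the example.

```typescript
// Hypothetical auth context attached by the gateway after authentication.
interface AuthContext {
  tenantId: string;
  sessionId: string;
  roles: readonly string[];
}

interface CallbackTarget {
  tenantId: string;
  sessionId: string;
}

// Tool callbacks must match both the tenant and the session they claim to act on.
function canInvokeToolCallback(ctx: AuthContext | undefined, target: CallbackTarget): boolean {
  if (ctx === undefined) {
    return false; // no shortcut for routes assumed to be "internal"
  }
  return ctx.tenantId === target.tenantId && ctx.sessionId === target.sessionId;
}

// Approval decisions need the right tenant and an explicit role.
function canDecideApproval(ctx: AuthContext | undefined, tenantId: string): boolean {
  if (ctx === undefined) {
    return false;
  }
  return ctx.tenantId === tenantId && ctx.roles.some((role) => role === "approver");
}
```

Both functions take the same context type, which is the point: if the UI, the worker, and the callback route all evaluate one shared shape, there is less room for the drift described above.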

5. Egress and data boundary

Agents become risky when they can pull from one place and push to another.

That risk is not theoretical. A web fetch tool can request a URL from the runner's network position. If that network can see private services, metadata endpoints, cluster-local hosts, or internal dashboards, the agent may be able to retrieve data that no user intended to expose through the tool.

The PostHog PR cleaned up dead egress proxy fields and noted that egress was owned by Smokescreen, the egress proxy, at the infrastructure level. That is a healthy distinction. A warning string in a tool description is not a boundary. Runtime enforcement is a boundary.

A production checklist should ask:

Which outbound hosts can agent tools reach?
Is egress enforced at infrastructure level, application level, or both?
Are private, link-local, cluster-local, and metadata addresses blocked?
Do tool descriptions match the real boundary?
Are denied egress attempts logged?
Can a workflow owner review data movement before launch?
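For the application-level half of that question, a minimal sketch of a host check follows. It is deliberately narrow and not exhaustive: it only recognizes dotted-quad IPv4 literals, does not handle IPv6 or alternate numeric encodings, and does not resolve hostnames, so a real deployment would still need to check resolved addresses and keep infrastructure-level enforcement. All names are hypothetical.

```typescript
type IPv4 = readonly [number, number, number, number];

// Parse a dotted-quad IPv4 literal; returns undefined for anything else.
function parseIPv4(host: string): IPv4 | undefined {
  const match = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (match === null) {
    return undefined;
  }
  const a = Number(match[1]);
  const b = Number(match[2]);
  const c = Number(match[3]);
  const d = Number(match[4]);
  if (a > 255 || b > 255 || c > 255 || d > 255) {
    return undefined;
  }
  return [a, b, c, d];
}

// True for IPv4 literals in ranges an agent tool should not reach.
function isBlockedIPv4(host: string): boolean {
  const octets = parseIPv4(host);
  if (octets === undefined) {
    return false; // not an IPv4 literal; the allowlist decides
  }
  const [a, b] = octets;
  return (
    a === 0 || // "this network"
    a === 10 || // 10.0.0.0/8 private
    a === 127 || // loopback
    (a === 169 && b === 254) || // link-local, including 169.254.169.254 metadata
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12 private
    (a === 192 && b === 168) // 192.168.0.0/16 private
  );
}

// Default deny: a host must be on the allowlist and must not be a blocked literal.
function isEgressAllowed(host: string, allowedHosts: readonly string[]): boolean {
  const normalized = host.toLowerCase();
  if (isBlockedIPv4(normalized)) {
    return false;
  }
  return allowedHosts.some((allowed) => allowed.toLowerCase() === normalized);
}
```

The blocked-range check runs before the allowlist, so a private address stays denied even if someone adds it to the list by mistake.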

This is also where agent specs are getting more serious. The Open Envelope schema includes fields for access policy, required secrets, adapters, human gates, run events, audit trails, secret isolation, and prompt-injection concerns. Whether or not a team uses that schema, the direction is right: agent work is becoming infrastructure work.

For a practical launch packet, pair this with an agent receipt: what the agent accessed, what it changed, which approvals it requested, and which external systems it touched.

6. Live-path test, not just type checks

The most useful detail in the PostHog thread was not the checklist itself. It was the follow-up test.

A PR comment says a live Modal end-to-end test caught a real bug: Modal's exec() rejected timeoutMs: 0 with "timeoutMs must be positive," even though the type definition implied zero meant no timeout. The implementation had passed req.timeoutMs ?? 0 unconditionally. The fix was to pass timeoutMs only when positive.

That is exactly why production readiness cannot stop at TypeScript checks, unit tests, lint, and build. Those are necessary. They do not prove the deployed path works.
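The fix described in that comment can be sketched as follows. The request and options types are hypothetical, and fakeSandboxExec is a stand-in that only mimics the rejection quoted above; it is not Modal's API.

```typescript
// Hypothetical request shape coming from the agent runner.
interface ExecRequest {
  command: string[];
  timeoutMs?: number;
}

// Hypothetical options handed to a sandbox exec call; not Modal's real type.
interface ExecOptions {
  timeoutMs?: number;
}

// Buggy shape described in the PR comment: { timeoutMs: req.timeoutMs ?? 0 }
// Fixed shape: include timeoutMs only when it is a positive number.
function buildExecOptions(req: ExecRequest): ExecOptions {
  const options: ExecOptions = {};
  if (req.timeoutMs !== undefined && req.timeoutMs > 0) {
    options.timeoutMs = req.timeoutMs;
  }
  return options;
}

// Stand-in for the external runtime, mimicking only the reported rejection.
function fakeSandboxExec(options: ExecOptions): string {
  if (options.timeoutMs !== undefined && options.timeoutMs <= 0) {
    throw new Error("timeoutMs must be positive");
  }
  return "ok";
}
```

Both shapes satisfy the same type, which is why a type check could not tell them apart. Only a call against the real runtime, or a faithful stand-in for it, exposes the difference.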

PostHog's listed testing included TypeScript checks, unit tests, lint, build, and an end-to-end harness with 158 passed and 7 skipped. The live Modal test added something different: it exercised the external runtime that production depended on.

A production checklist should ask:

Has the same sandbox backend used in production run a real tool invocation?
Has the approval path been tested through the production entrypoint?
Has the event stream been tested across separate processes or pods?
Has auth been tested on the actual route, not just the service method?
Has an egress denial been tested?
Does the test fail in the same way production would fail?

The phrase "end to end" gets abused. For agent systems, it should mean the agent crosses the same boundaries it will cross in production.

A one-page launch packet beats a model debate

Many teams still treat an agent launch as a model and tool selection exercise.

Which model? Which framework? Which vector database? Which agent library? Which instruction style?

Those choices matter, but they are not the launch gate. A weaker model behind strong controls is usually safer than a stronger model wired into a vague runtime with no approval path and unclear network access.

Before giving an agent more autonomy, build a one-page launch packet:

Control	What to prove	Evidence to attach
Sandbox isolation	Agent-run code cannot execute inside the production worker process by default.	Runtime config, boot failure behavior, sandbox e2e result.
Shared events	Runner events reach the UI or API across separate processes or pods.	Shared bus config, stream test, missing-config boot failure.
Approvals	`requires_approval` tools block until a stored decision exists.	Approval policy, approval store wiring, API test, audit record.
Gateway and auth	Session, approval, event, and tool routes enforce the right identity and scope.	Route map, auth test, tenant and session scoping checks.
Egress boundary	Agent tools can reach only approved outbound destinations.	Allowlist or blocklist config, denial test, security review notes.
Live-path test	The production runtime has handled at least one real tool path.	E2E run, external sandbox or tool result, failure-mode notes.

Keep it short. If the packet gets long, the launch probably is not ready. The point is to make the implicit gates visible enough that an engineering lead, operator, and security reviewer can argue about the same facts.
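If the packet lives in a repository rather than a document, a small check can flag controls that have no evidence attached. This is only a sketch: the control names mirror the table above, while the packet shape and function name are assumptions.

```typescript
// Control names mirror the table above; everything else is hypothetical.
const REQUIRED_CONTROLS = [
  "Sandbox isolation",
  "Shared events",
  "Approvals",
  "Gateway and auth",
  "Egress boundary",
  "Live-path test",
] as const;

type ControlName = (typeof REQUIRED_CONTROLS)[number];

// Each control maps to a list of evidence references (test runs, configs, notes).
type LaunchPacket = Partial<Record<ControlName, string[]>>;

// Returns the controls that are absent or have an empty evidence list.
function controlsMissingEvidence(packet: LaunchPacket): ControlName[] {
  return REQUIRED_CONTROLS.filter((control) => (packet[control] ?? []).length === 0);
}
```

An empty result does not prove the launch is safe. It only shows that every control has something a reviewer can open and argue about.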

This also helps with permission fatigue. If every action asks for approval, people stop reading. If nothing asks for approval, the agent can do real damage. The launch packet should define which actions are autonomous, which require review, and which are not allowed at all.
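Those three tiers can be written down as data. The action names below are invented examples, and the lookup treats anything unlisted as not allowed, which is one reasonable default rather than the only one.

```typescript
type ActionTier = "autonomous" | "requires_review" | "forbidden";

// Hypothetical action names; a real policy would list the team's own tools.
const actionPolicy: Partial<Record<string, ActionTier>> = {
  "crm.read_contact": "autonomous",
  "email.send_external": "requires_review",
  "billing.issue_refund": "requires_review",
  "db.drop_table": "forbidden",
};

// Default deny: actions missing from the policy are treated as forbidden.
function tierFor(action: string): ActionTier {
  // Own-property check so inherited keys such as "toString" are not treated as entries.
  if (!Object.prototype.hasOwnProperty.call(actionPolicy, action)) {
    return "forbidden";
  }
  return actionPolicy[action] ?? "forbidden";
}
```

Keeping the review tier small is what addresses permission fatigue: only the actions that genuinely need a human should land there.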

The operator version of production readiness

For business teams, the takeaway is not "use Modal" or "use Redis" or "copy PostHog's architecture."

The takeaway is simpler: do not let an agent graduate from demo to production until the control path is as real as the happy path.

If the agent can run code, prove isolation.

If the agent streams state, prove shared delivery.

If the agent needs approvals, prove the production runner cannot skip them.

If the agent has routes, prove auth at the edge.

If the agent can fetch or send data, prove the egress boundary.

If the agent depends on an external runtime, run a live-path test against it.

That is a better launch conversation than "which model should we use?"

BaristaLabs helps teams design and ship these controls for real workflows, especially where agents touch operations, customer work, or internal systems. Start with the Responsible AI hub, review your approval design with the approval queue guide, or use the AI workflow security review worksheet before expanding an agent's permissions.

If the workflow is ready to move from experiment to production, our process automation work can help turn the checklist into a deployed control path.

