The invoice script is only off by one field.
An AI agent has written a tiny transformation to normalize vendor names, split line-item notes, and tag anything that looks like a pass-through expense. The code is short enough for a developer to read between meetings. It passes the sample file. The finance lead wants to try it on next week's intake queue.
The dangerous part is not the script.
The dangerous part is that nobody in the room can answer a basic question: what is this script allowed to touch?
Can it read every invoice field or only the fields passed into it? Can it call a pricing service? Can it write back to the accounting system? Can it make a network request? Can it inspect another tenant's configuration? Can it run for five seconds, five minutes, or until a worker dies?
That is where agent tooling is starting to move. The early phase was "let the model write code." The next phase is "give agent-written code a tiny, explicit capability sandbox."
A recent Elixir project, Lua.ex, is a useful signal. Not because every business should suddenly embed Lua. Because it gives operators a concrete way to talk about a problem that is about to show up in more AI workflows: agent-written code needs a contract before it gets runtime power.
What Lua.ex signals
Lua.ex describes itself as "Sandboxed Lua 5.3 on the BEAM" and "built for AI agents." Its homepage frames the runtime as an Elixir-native Lua 5.3 VM for embedding untrusted code, including AI agent tools, user-supplied formulas, and per-tenant plugins. It also makes a few claims that matter to security reviewers: zero NIFs, zero shelling out, every opcode auditable, and sandboxed by default.
The tv-labs/lua GitHub repository describes the project more plainly: "A Lua 5.3 runtime in pure Elixir." The README says the lexer, parser, register-based VM, and standard library all run directly on the BEAM. It positions the library for safely running untrusted scripts, including AI-agent-authored code, game logic, user-defined rules, configuration, and plugins.
The line that should catch an automation lead's eye is this one from the README: "Giving an AI agent a sandboxed runtime where it can only call the Elixir functions you expose is a primary use case."
That is the shift.
The model may still write a formula, transformation, or plugin. But the host application decides which functions exist inside that script's world. The script does not get the whole machine. It gets a small room, a few named tools, and rules on the door.
Lua.ex also gives a specific example of what "sandboxed by default" can mean. The README says Lua.new/1 sandboxes dangerous standard-library paths by default, including os.execute, os.exit, os.getenv, file I/O through io.*, require, load, and dofile. If a script calls a sandboxed function, the runtime raises instead of touching the host.
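The underlying pattern, a script world that contains only host-exposed functions and raises on everything else, can be sketched without any particular runtime. The Python below is a hypothetical illustration, not the Lua.ex API: `SandboxWorld`, `CapabilityBlocked`, and the alias table are assumptions made for this example. It shows only the allowlist idea and is not a secure sandbox for arbitrary Python code.

```python
from typing import Any, Callable, Dict


class CapabilityBlocked(Exception):
    """Raised when a script asks for a function the host never exposed."""


class SandboxWorld:
    """Hypothetical registry holding the only functions a script can reach."""

    def __init__(self) -> None:
        self._exposed: Dict[str, Callable[..., Any]] = {}

    def expose(self, name: str, fn: Callable[..., Any]) -> None:
        # The host, not the script, decides which names exist.
        self._exposed[name] = fn

    def call(self, name: str, *args: Any) -> Any:
        # Anything not explicitly exposed raises instead of reaching the host.
        if name not in self._exposed:
            raise CapabilityBlocked(f"{name} is not available in this sandbox")
        return self._exposed[name](*args)


def lookup_vendor_alias(raw_name: str) -> str:
    # Illustrative stand-in for a real host function; the alias data is made up.
    aliases = {"acme corp.": "Acme Corporation"}
    cleaned = raw_name.strip()
    return aliases.get(cleaned.lower(), cleaned)


world = SandboxWorld()
world.expose("lookup_vendor_alias", lookup_vendor_alias)

# An exposed function works as expected.
normalized = world.call("lookup_vendor_alias", "  ACME Corp. ")

# A name the host never exposed raises instead of running.
try:
    world.call("os.execute", "echo hello")
    blocked = False
except CapabilityBlocked:
    blocked = True
```

The design choice worth noticing is the default: a name that was never exposed fails loudly, so adding power requires a deliberate host-side change rather than a clever script.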
That detail matters more than the star count. Still, the project is no longer a quiet experiment. On June 12, the public repository showed 194 stars, 8 forks, 24 open issues, and a push earlier that day. The Hex package describes it as "A Lua VM implementation in Elixir."
The v1.0.0-rc.2 release, published June 10, is also telling. The release notes mention table-heavy workload improvements of 28-37% at n=1000, protected-call and error-value correctness fixes, and one ordinary known issue: deep recursion remained about 25% slower than rc.0 and still needed work before 1.0.0 final.
That combination is the signal: sandboxing, host-controlled capabilities, error behavior, performance, and correctness are all being treated as runtime concerns. Not just model-message concerns.
The operator artifact is a sandbox contract
A security review usually asks whether a tool is allowed. A sandbox contract asks a sharper question: what exact powers does this piece of agent-written code receive at runtime?
For a business team, the contract should be readable without becoming a compiler manual. It should name the script's job, the data it receives, the functions it can call, the paths it cannot touch, the limits around execution, and the evidence a reviewer gets afterward.
Think of it as the missing middle layer between an approval policy and a code sandbox.
The approval policy says whether the workflow is allowed to exist. The sandbox contract says what the running script can actually do.

A useful manifest might look like this:
| Field | What it answers | Example |
|---|---|---|
| Script purpose | What job is this code allowed to perform? | Normalize invoice line items and return expense tags. |
| Input schema | What data enters the sandbox? | vendor_name, line_item_description, amount, currency, memo. |
| Output schema | What must the script return? | normalized_vendor, expense_category, confidence, review_note. |
| Allowed host functions | Which application functions can the script call? | lookup_vendor_alias, get_category_rules, emit_review_note. |
| Forbidden paths | What is unavailable even if the language normally supports it? | Shell execution, file I/O, environment variables, dynamic imports, network calls. |
| Data boundary | Which records, tenants, or systems are out of reach? | Current invoice only. No customer table, payroll data, or cross-tenant config. |
| Runtime limits | How long and how large can execution get? | 500 ms CPU budget, fixed memory ceiling, recursion cap, max output size. |
| Audit log | What gets recorded for replay? | Script version, inputs hash, function calls, outputs, errors, reviewer decision. |
| Approval owner | Who approves new capabilities? | Finance systems owner plus security reviewer for any new host function. |
| Rollback behavior | What happens when the script fails or gets revoked? | Queue item returns to manual review; previous approved script remains active. |

The table is intentionally boring. Boring is good here. If a script can call only three host functions, those three functions should have names. If network access is forbidden, the contract should say so. If a new capability requires approval, the owner should be named before the first incident.
This also gives software teams a cleaner implementation target. Instead of asking developers to "make the agent safe," the business asks for a small runtime surface that matches the manifest.
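One way to make "a runtime surface that matches the manifest" checkable is to keep part of the manifest as data and compare it against what the runtime actually exposes. The sketch below assumes a hypothetical `SandboxManifest` type and `surface_violations` helper, and covers only a subset of the rows above. The field values come from the example column of the table; `http_get` is an invented name for a capability that was never approved.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class SandboxManifest:
    """Hypothetical machine-readable form of part of the contract table."""
    purpose: str
    input_fields: FrozenSet[str]
    output_fields: FrozenSet[str]
    allowed_host_functions: FrozenSet[str]
    cpu_budget_ms: int
    approval_owner: str


def surface_violations(manifest: SandboxManifest, exposed: FrozenSet[str]) -> FrozenSet[str]:
    # Host functions wired into the runtime that the manifest never approved.
    return exposed - manifest.allowed_host_functions


INVOICE_MANIFEST = SandboxManifest(
    purpose="Normalize invoice line items and return expense tags.",
    input_fields=frozenset(
        {"vendor_name", "line_item_description", "amount", "currency", "memo"}
    ),
    output_fields=frozenset(
        {"normalized_vendor", "expense_category", "confidence", "review_note"}
    ),
    allowed_host_functions=frozenset(
        {"lookup_vendor_alias", "get_category_rules", "emit_review_note"}
    ),
    cpu_budget_ms=500,
    approval_owner="Finance systems owner plus security reviewer",
)

# A runtime that quietly gained a network helper should fail this check.
extra = surface_violations(
    INVOICE_MANIFEST,
    frozenset({"lookup_vendor_alias", "emit_review_note", "http_get"}),
)
```

A check like this could run in CI or at deploy time, so a new host function cannot ship without the manifest, and therefore the approval owner, changing first.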
The manifest changes the buying conversation
Most teams evaluating AI automation still talk about agent-written code as if it belongs to the software team alone. That misses the operational risk.
A formula written by an agent can change a price. A script can classify a support ticket into the wrong escalation lane. A plugin can enrich a vendor record with data that should never have left a private system. A transformation can silently drop a field that compliance expected to retain.
Those are business failures before they are technical failures.
The sandbox contract gives each owner a place to speak.
The automation lead can define the workflow purpose. The software team can expose a narrow set of host functions. The security owner can block filesystem, network, environment, and cross-tenant access. The compliance reviewer can require replay logs. The business owner can decide whether a failure sends the item back to a human queue or uses the last approved script.
Without that artifact, the review tends to collapse into trust theater. Someone says the model is good. Someone says the code is short. Someone says there is a human in the loop. None of those answers prove what the script can touch.
Approval queues handle decisions; sandboxes handle execution
BaristaLabs has written before about writing an AI approval policy before choosing an agent. That policy defines which actions need human approval, which actions can run automatically, and which actions should not be delegated at all.
We have also covered Apache Burr and inspectable agent runs, where actions, transitions, state, observability, persistence, human-in-the-loop pauses, and testing/replay make the run itself easier to inspect.
A sandbox contract sits below both.
The run map shows where the workflow went. The approval queue shows which decision a human accepted or rejected. The sandbox contract shows what the executable code was capable of doing at the moment it ran.
Those layers answer different audit questions:
- Did the workflow follow the approved path?
- Did a human approve the right checkpoint?
- Did the script stay inside its declared capability boundary?
- Can the team replay the failure with the same input, version, and host-function calls?
For teams building reviewable AI operations, these should connect. A proposed agent-authored script should appear in an approval queue with its manifest attached. A completed run should preserve enough evidence to replay what happened. If the script failed, the reviewer should see whether it failed because the logic was wrong, the input was malformed, or the sandbox blocked a forbidden capability.
That last category is especially useful. A blocked capability is not always an incident. Sometimes it is the sandbox doing its job.
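The replay question depends on recording the same few facts for every run. As a minimal sketch, assuming a hypothetical `RunRecord` shape and error labels chosen for this example, the evidence could look like this:

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class RunRecord:
    """Hypothetical audit entry for one script execution."""
    script_version: str
    inputs_hash: str
    function_calls: List[str] = field(default_factory=list)
    output: Optional[Dict[str, Any]] = None
    # Assumed labels: "logic_error", "malformed_input", or "blocked_capability".
    error_kind: Optional[str] = None


def hash_inputs(inputs: Dict[str, Any]) -> str:
    # Canonical JSON so the same input always produces the same hash,
    # regardless of key order.
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def same_run(a: RunRecord, b: RunRecord) -> bool:
    # A replay matches when version, input, and host-function calls all agree.
    return (
        a.script_version == b.script_version
        and a.inputs_hash == b.inputs_hash
        and a.function_calls == b.function_calls
    )


invoice = {"vendor_name": "ACME Corp.", "amount": "120.00", "currency": "USD"}
reordered = {"currency": "USD", "amount": "120.00", "vendor_name": "ACME Corp."}

original = RunRecord(
    "v3",
    hash_inputs(invoice),
    ["lookup_vendor_alias"],
    error_kind="blocked_capability",
)
replayed = RunRecord("v3", hash_inputs(reordered), ["lookup_vendor_alias"])
```

Storing a hash rather than the raw input keeps sensitive invoice data out of the log while still letting a reviewer confirm that a replay used the same payload.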
What to ask before agent-written scripts reach production
Before approving an agent-authored script, ask for two things: one capability manifest and one failure replay.
The manifest should be short enough for a business owner to read. If it takes a security architect to understand it, the contract is probably hiding too much. The business owner should be able to point to a line and say, "This script cannot call the network," or "This script can only update the review note, not the invoice."
The failure replay should show what happens when something goes wrong. Use a malformed input, a forbidden function call, a timeout, or a bad output shape. The replay should prove that the workflow fails into review, not into silent mutation.
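A failure replay can be as small as a few deliberately bad cases run through the same wrapper that production uses. The sketch below is an assumption-heavy illustration: `run_or_review` and the status labels are hypothetical, the script is a plain Python callable standing in for agent-written code, and timeouts and forbidden calls are left out because they need enforcement from the actual runtime.

```python
from typing import Any, Callable, Dict

REQUIRED_INPUT = {"vendor_name", "line_item_description", "amount", "currency", "memo"}
REQUIRED_OUTPUT = {"normalized_vendor", "expense_category", "confidence", "review_note"}


def run_or_review(
    script: Callable[[Dict[str, Any]], Any], invoice: Dict[str, Any]
) -> Dict[str, Any]:
    """Run a script; any contract failure routes the item to manual review."""
    if REQUIRED_INPUT - set(invoice):
        return {"status": "manual_review", "reason": "malformed_input"}
    try:
        # Pass a copy so the script cannot mutate the caller's record.
        result = script(dict(invoice))
    except Exception:
        return {"status": "manual_review", "reason": "script_error"}
    if not isinstance(result, dict) or set(result) != REQUIRED_OUTPUT:
        return {"status": "manual_review", "reason": "bad_output_shape"}
    return {"status": "accepted", "output": result}


def well_behaved(invoice: Dict[str, Any]) -> Dict[str, Any]:
    return {
        "normalized_vendor": invoice["vendor_name"].strip(),
        "expense_category": "office_supplies",
        "confidence": 0.9,
        "review_note": "",
    }


def drops_fields(invoice: Dict[str, Any]) -> Dict[str, Any]:
    # Deliberately omits confidence and review_note.
    return {"normalized_vendor": invoice["vendor_name"], "expense_category": "travel"}


good_invoice = {
    "vendor_name": " ACME Corp. ",
    "line_item_description": "Printer paper",
    "amount": "42.00",
    "currency": "USD",
    "memo": "",
}

accepted = run_or_review(well_behaved, good_invoice)
bad_shape = run_or_review(drops_fields, good_invoice)
malformed = run_or_review(well_behaved, {"vendor_name": "ACME Corp."})
```

Each failing case ends in `manual_review` with a named reason, which is the property the replay is meant to demonstrate: the item goes back to a human rather than being written anywhere.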
A practical review can start with these questions:
- What exact host functions can the script call?
- Which standard-library paths are removed or blocked?
- Can the script read files, environment variables, network resources, or other tenants' data?
- What input schema and output schema does the runtime enforce?
- What happens on timeout, excessive recursion depth, memory pressure, or malformed output?
- What gets logged for audit and replay?
- Who approves a new host function?
- How does the team revoke a bad script and return work to the last safe version?
None of this requires a specific language runtime. Lua.ex is one example of the pattern: a small embeddable language, a default sandbox, and host-exposed functions instead of broad machine access. The same contract idea applies to formulas, workflow extensions, spreadsheet-like rules, LLM-generated SQL helpers, and per-tenant plugins.
The business value only counts if the workflow is reviewable, bounded, and recoverable.
If your team is evaluating agent workflows that may execute code, use the sandbox contract as a buying and implementation artifact. Pair it with an approval policy, a run map, and a replayable failure case. If you need a place to start, BaristaLabs' AI workflow security review worksheet can help your team map the data boundary and approval surface before you automate the risky lane.
Before the next agent-written script goes live, ask for the contract. Then ask someone to break it on purpose.
