A team has one AI policy and four agents.
One agent summarizes internal docs. One drafts customer emails. One updates CRM records after a human approves the change. One auto-closes low-risk support tickets when the answer matches an approved policy.
The policy says: "AI agents must not take action without human review."
That sounds safe until the doc summarizer gets stuck waiting for approval to summarize a page it was already allowed to read. The email drafter becomes slower than a template. The CRM updater works, but reviewers approve too quickly because every change looks routine. The ticket-closing agent keeps acting after an edge case appears because nobody gave it a stop condition.
One policy is now doing four jobs badly.
That is where AI agent governance breaks. Teams argue about whether agents should be "trusted" or "locked down" when they should be asking a narrower question first: what level of autonomy does this agent actually have?
Gartner's warning is about mismatched control
Gartner warned on May 26, 2026 that applying one uniform governance policy across AI agents will lead to enterprise failures.
The headline prediction is blunt: Gartner expects that by 2027, 40% of enterprises will demote or decommission autonomous AI agents because governance gaps only become visible after production incidents.
The useful part is not the prediction. It is the diagnosis.
Gartner says teams fail when they do not distinguish between an agent's ability to act and the scope of access it receives. Shiva Varma, senior director analyst at Gartner, described the common mistake as treating agent governance as binary: either locked down or fully trusted.
That binary model does not survive contact with real workflows.
A read-only research assistant, an agent that recommends a refund, an approval-gated billing update agent, and an autonomous support resolution agent do not need the same controls. They need controls proportional to their autonomy.
That starts with an agent autonomy map.
The artifact: an agent autonomy map
An agent autonomy map is a plain inventory of every AI agent by permission level, business process, systems touched, failure mode, controls, owner, and evidence.
It does not need to be fancy. It does need to be explicit.
For each agent, write down:
- What the agent can read
- What the agent can recommend
- What the agent can write, send, delete, change, or trigger
- Whether a human approves each action
- What logs prove what happened
- What stops the agent when something looks wrong
- Who owns the process when the agent fails
This is not policy theater. It is the difference between "we use AI responsibly" and "this specific agent may update a renewal date in HubSpot only after an account manager approves the proposed field change."
NIST's AI Risk Management Framework organizes AI risk work into four functions: Govern, Map, Measure, and Manage. The language can feel heavier than most SMB teams need day to day, but the operating idea is useful: you cannot manage a risk you have not mapped.
The same applies to agents. Before an agent acts, map its autonomy.
Level 1: observe
A Level 1 agent can read defined data sources and produce output for the requester. It does not recommend a course of action. It does not write back to a business system.
Examples:
- Summarize a policy document
- Search internal knowledge base articles
- Compare contract clauses for a legal reviewer
- Pull recent support themes into a weekly digest
The agent can still cause damage. It can read data it should not see. It can summarize private information into a place where it does not belong. It can give a clean answer from stale or incomplete material. It can leak sensitive context through logs, prompts, plugins, or shared workspaces.
The controls are mostly data boundary controls:
- Scoped data access
- Authentication
- Usage logging
- Basic functional testing
- Basic security testing
- Clear output visibility, usually only to the requester
A Level 1 agent should not need a heavy approval workflow. If every doc summary needs manager approval, the policy is probably too strict. Spend the control budget on access boundaries and logging instead.
BaristaLabs usually treats this as a data security problem before an automation problem. If the agent is allowed to read the wrong corpus, no prompt can fix that.
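One way to picture a data boundary control is a check that runs in code before the agent reads anything. The sketch below is a minimal illustration, not a reference implementation: `ALLOWED_SOURCES`, `read_source`, and the source names are hypothetical, and a real deployment would lean on the permission system of the underlying data store.

```python
import logging
from typing import Callable

logger = logging.getLogger("agent.usage")

# Hypothetical allowlist: the only corpora this Level 1 agent may read.
ALLOWED_SOURCES = frozenset({"policy-docs", "kb-articles"})


class SourceNotAllowed(PermissionError):
    """Raised when an agent requests a source outside its scope."""


def read_source(agent: str, requester: str, source: str,
                fetch: Callable[[str], str]) -> str:
    """Fetch a source only if it is in scope, and log every attempt."""
    allowed = source in ALLOWED_SOURCES
    logger.info("agent=%s requester=%s source=%s allowed=%s",
                agent, requester, source, allowed)
    if not allowed:
        raise SourceNotAllowed(f"{agent} may not read {source}")
    return fetch(source)
```

The refusal happens outside the model, so it holds even when the prompt is ignored, and the log line covers denied attempts as well as successful reads.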
Level 2: advise
A Level 2 agent drafts, recommends, scores, ranks, or proposes. Humans still execute the work manually.
Examples:
- Draft a customer reply for a support rep
- Recommend which invoices need follow-up
- Suggest next steps for an onboarding project
- Generate a renewal risk summary for a customer success manager
The agent can influence decisions without touching the system of record. That makes Level 2 feel safer than it is.
The common failure is quiet over-reliance. A rep accepts the suggested reply because it sounds polished. A manager trusts the churn-risk summary because it has confident bullet points. A finance assistant follows up on the wrong invoice because the agent misread the account status.
Level 2 controls should test output quality, not just access:
- Hallucination testing
- Domain-specific evaluations
- Sampling against real cases
- User training on when not to rely on the answer
- Clear labeling that the output is a recommendation
- Review expectations for high-risk categories
The point is not to make people afraid of every recommendation. It is to stop "the agent suggested it" from becoming a substitute for judgment.
NIST's Generative AI Profile is useful here because it pushes teams toward explicit measurement and controls for generative AI risks. Vibes are not a test plan.
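Sampling against real cases can start very small. The sketch below assumes each reviewed case is stored as a dict with hypothetical `agent` and `human` keys holding the agent's recommendation and what the human actually decided. It measures agreement on a reproducible random sample, which is only one narrow signal and does not replace domain-specific evaluations.

```python
import random


def sampled_agreement(cases: list[dict], sample_size: int, seed: int = 0) -> float:
    """Return the share of sampled cases where the agent matched the human."""
    if not cases:
        raise ValueError("need at least one reviewed case")
    if sample_size < 1:
        raise ValueError("sample_size must be at least 1")
    rng = random.Random(seed)  # fixed seed so the same sample can be re-checked
    sample = rng.sample(cases, min(sample_size, len(cases)))
    matches = sum(1 for case in sample if case["agent"] == case["human"])
    return matches / len(sample)
```

A falling agreement rate is a prompt to look at the disagreements, not a verdict on its own. Sometimes the human was wrong.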
Level 3: act with approval
A Level 3 agent can write data, send communications, modify configurations, trigger workflows, or prepare transactions, but every action needs explicit human approval before execution.
Examples:
- Update CRM fields after a reviewer approves the proposed changes
- Send a customer email after an account owner reviews it
- Create a refund request for approval
- Change a SaaS configuration after an admin approves the diff
This is where many teams think they are safe because "a human is in the loop."
That phrase hides a lot of bad workflow design.
A human approval step is only meaningful if the reviewer can understand the action, the risk, the evidence, and the consequence. If the approval screen says "Agent wants to update customer record. Approve?" the reviewer is rubber-stamping, not governing.
Level 3 needs an actual AI agent approval workflow, not a button bolted onto an agent demo.
A good approval queue shows:
- The exact proposed action
- The before and after state
- The reason the agent proposed it
- The data sources used
- The confidence or evaluation result, if available
- The business impact
- The rollback path
- The receipt that will be logged after approval
This is where AI agent audit trails matter. If a customer asks why their renewal date changed, the answer cannot be "the agent did it." You need the request, approver, timestamp, input evidence, action payload, and resulting state.
We built our approval queue design around that idea because Level 3 is the first point where agent work becomes operationally real. It affects customer records, systems, communications, and money.
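To make the receipt concrete, here is a minimal sketch of an approved change that returns both the new record and a log entry with the request, approver, timestamp, input evidence, action payload, and before and after state. The `Receipt` shape, the `apply_approved_change` function, and the sample values are all assumptions for illustration. They do not describe any particular CRM's API or our own queue.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Receipt:
    request: str
    approver: str
    timestamp: str
    evidence: list[str]
    action: dict[str, str]
    before: dict[str, str]
    after: dict[str, str]


def apply_approved_change(record: dict[str, str], changes: dict[str, str], *,
                          request: str, approver: str,
                          evidence: list[str]) -> tuple[dict[str, str], Receipt]:
    """Apply an approved field change and return (new_record, receipt)."""
    before = dict(record)          # snapshot of the prior state
    after = {**record, **changes}  # the original record is not mutated
    receipt = Receipt(
        request=request,
        approver=approver,
        timestamp=datetime.now(timezone.utc).isoformat(),
        evidence=list(evidence),
        action=dict(changes),
        before=before,
        after=after,
    )
    return after, receipt


# Hypothetical example: a renewal date change approved by an account manager.
record = {"account": "Acme", "renewal_date": "2026-01-31"}
updated, receipt = apply_approved_change(
    record,
    {"renewal_date": "2026-03-31"},
    request="Customer asked to align renewal with fiscal year",
    approver="account.manager@example.com",
    evidence=["email thread from customer", "signed amendment"],
)
print(json.dumps(asdict(receipt), indent=2))
```

Because the receipt carries the before state, it doubles as the input to a rollback: restoring `receipt.before` undoes the change.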
Approval fatigue is a Level 3 risk
If the agent asks for approval too often, reviewers stop reading. If the queue is full of trivial actions, the high-risk action blends in. If every approval looks the same, the reviewer cannot tell what deserves attention.
That is not a people problem. It is a design problem.
Teams should separate approval lanes by risk:
| Action type | Approval pattern |
|---|---|
| Low-risk, reversible field update | Batch review or sampled review after enough evidence |
| Customer-facing message | Human review before send |
| Financial adjustment | Named approver, reason code, audit receipt |
| Permission or configuration change | Admin approval with diff and rollback |
| Policy exception | Escalation to process owner |
This does not mean skipping controls. It means matching review effort to consequence.
If a Level 3 agent generates 200 approvals a day and 195 are low-value confirmations, the system is training people to click. Fix the queue before expanding the agent.
A simple AI approval policy template can help teams define which actions need review, who can approve them, and what evidence must be shown.
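The lanes in the table above can be written down as data, which makes them easier to review than logic buried in a workflow tool. This is a sketch under assumptions: the action type keys and lane labels are hypothetical names for the table rows, and sending unknown action types to the process owner is our choice of a cautious default, not a rule from any standard.

```python
# Lane names are hypothetical labels for the rows in the approval table.
APPROVAL_LANES = {
    "reversible_field_update": "batch_or_sampled_review",
    "customer_message": "human_review_before_send",
    "financial_adjustment": "named_approver_reason_code_receipt",
    "permission_or_config_change": "admin_approval_with_diff_and_rollback",
    "policy_exception": "escalate_to_process_owner",
}


def lane_for(action_type: str) -> str:
    """Pick an approval lane; unknown action types escalate (an assumption)."""
    return APPROVAL_LANES.get(action_type, "escalate_to_process_owner")


def low_value_share(total_approvals: int, low_value: int) -> float:
    """Share of approvals that were low-value confirmations."""
    if total_approvals <= 0:
        raise ValueError("total_approvals must be positive")
    return low_value / total_approvals
```

For the queue described earlier, `low_value_share(200, 195)` comes to 0.975, meaning 97.5% of a reviewer's clicks carry no real decision.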
Level 4: act autonomously
A Level 4 agent acts inside defined guardrails without approval for every action. Humans review exceptions, logs, outcomes, and trend reports.
Examples:
- Auto-close low-risk support tickets when the answer matches an approved policy
- Reorder routine supplies within a spend limit
- Route inbound leads based on defined qualification rules
- Pause a campaign when spend crosses a threshold and performance drops below a set floor
This is where "agent" stops being a productivity feature and starts becoming part of operations.
The risks change. You are no longer asking whether a human will approve one action. You are asking what happens when the system takes 400 actions before anyone notices the pattern is wrong.
Level 4 controls need to live outside the prompt:
- Continuous monitoring
- Enforced guardrails
- Rollback mechanisms
- Circuit breakers
- Exception queues
- Rate limits
- Clear ownership
- Agent-specific incident response
OWASP's Agentic AI Threats and Mitigations is a useful security reference here because agentic systems combine model behavior with tools, permissions, memory, and external systems. Prompt instructions are only one layer.
OWASP's prompt injection guidance also matters. If an agent can read untrusted content and then act, someone can try to steer it through that content. A support ticket, webpage, email, shared doc, or scraped browser result can become an instruction source.
For Level 4 agents, "we told it not to do that" is not a control. The control is a permission boundary, validator, allowlist, spending limit, rollback plan, or circuit breaker that still works when the model is wrong.
This is also why browser agents deserve their own readiness test. CAPTCHAs, login flows, page changes, hidden instructions, and brittle workflows can all break the neat demo path. We wrote about that in our browser-agent readiness test.
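A circuit breaker is the easiest of these controls to sketch. The version below is a deliberately simple illustration: the class, the window size, and the failure threshold are hypothetical, and it assumes something outside the model (a validator, a customer reopen, a reviewer flag) reports whether each action was acceptable.

```python
from collections import deque


class CircuitBreaker:
    """Stop an agent once too many recent actions have failed a check."""

    def __init__(self, window: int = 50, max_failures: int = 5) -> None:
        self.recent = deque(maxlen=window)  # outcomes of the last `window` actions
        self.max_failures = max_failures
        self.tripped = False

    def allow(self) -> bool:
        """Call before every action; False means send work to the exception queue."""
        return not self.tripped

    def record(self, ok: bool) -> None:
        """Call after every action with the external check's verdict."""
        self.recent.append(ok)
        if self.recent.count(False) >= self.max_failures:
            self.tripped = True  # stays tripped until a human resets it

    def reset(self) -> None:
        """Intended for the process owner after reviewing the incident."""
        self.recent.clear()
        self.tripped = False
```

The design choice that matters is that the breaker does not close itself. Once it trips, later successes do not reopen the path; a named owner has to look at what happened and call `reset()`.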
A first-pass worksheet for this week
You can build the first version of an agent autonomy map in a spreadsheet.
Do not start with tooling. Start with the workflows people are already asking agents to touch.
| Field | Question to answer |
|---|---|
| Agent name | What do people call this agent? |
| Business process | Which workflow does it support? |
| Autonomy level | Observe, advise, act with approval, or act autonomously? |
| Systems accessed | What can it read? What can it write? |
| Action types | What actions can it prepare or execute? |
| Human role | Who reviews, approves, monitors, or owns the output? |
| Failure mode | What would hurt a customer, employee, account, system, or dollar? |
| Required controls | Access scope, evals, approval queue, audit trail, rollback, circuit breaker |
| Evidence | What logs, tests, evals, receipts, or reports prove it is working? |
| Expansion rule | What evidence must exist before autonomy increases? |
Then force every proposed agent into one row.
| Agent | Level | Right control |
|---|---|---|
| Internal policy summarizer | Observe | Scoped access and usage logging |
| Support reply drafter | Advise | Quality evals and rep review |
| CRM update agent | Act with approval | Approval queue, before/after diff, receipt log |
| Low-risk ticket closer | Act autonomously | Guardrails, monitoring, rollback, owner, circuit breaker |
The worksheet will expose uncomfortable gaps quickly.
If nobody owns the agent, it is not ready. If the rollback plan is "manual cleanup," write that down. If the approval step does not show enough evidence to approve intelligently, it is not a real approval workflow. If the agent can act on customer data but has no receipt log, stop before production.
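If the worksheet lives in a spreadsheet export, a few lines of code can flag the most obvious gaps. The sketch below is illustrative only: `AgentRow`, the control names, and the per-level minimums in `REQUIRED` are assumptions drawn from the example rows above, and a real team would set its own minimums.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRow:
    name: str
    level: str
    owner: str = ""
    controls: set[str] = field(default_factory=set)


# Hypothetical minimum controls per autonomy level.
REQUIRED = {
    "observe": {"scoped_access", "usage_logging"},
    "advise": {"quality_evals", "human_review"},
    "act_with_approval": {"approval_queue", "before_after_diff", "receipt_log"},
    "act_autonomously": {"guardrails", "monitoring", "rollback", "circuit_breaker"},
}


def gaps(row: AgentRow) -> list[str]:
    """List what is missing before this agent is ready for production."""
    if row.level not in REQUIRED:
        raise ValueError(f"unknown autonomy level: {row.level}")
    problems = [] if row.owner else ["no owner"]
    missing = sorted(REQUIRED[row.level] - row.controls)
    problems += [f"missing control: {name}" for name in missing]
    return problems
```

An empty list does not mean the agent is safe. It only means the row is complete enough to have the harder conversation about whether the controls actually work.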
For teams already building AI into workflows, this map pairs well with production evaluations. We covered that angle in our post on AI agent governance and evaluations in production.
Autonomy is earned
The mistake is granting autonomy because the demo looked good.
A clean demo proves the agent can complete the happy path once. It does not prove the agent should have write access, customer contact rights, configuration privileges, or autonomous execution.
Autonomy should be earned from evidence:
- The agent works on real cases, not just examples
- The error modes are known
- The approval workflow catches meaningful mistakes
- The receipt log can explain what happened
- The rollback plan has been tested
- The circuit breaker has a clear trigger
- A human owner is accountable for the process
That is the practical version of AI agent governance.
Map the autonomy first. Grant the access second. Expand only when the evidence says the agent has earned it.
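The expansion rule can be expressed as a gate that refuses to raise autonomy while any evidence item is missing. This is a sketch under assumptions: the evidence keys are hypothetical names for the checklist above, levels are numbered 1 to 4, and moving up only one level at a time is our conservative choice rather than something the checklist requires.

```python
EVIDENCE_REQUIRED = (
    "works_on_real_cases",
    "error_modes_known",
    "approval_catches_mistakes",
    "receipt_log_explains_actions",
    "rollback_tested",
    "circuit_breaker_trigger_defined",
    "owner_accountable",
)


def missing_evidence(evidence: dict[str, bool]) -> list[str]:
    """Return the checklist items that are absent or false."""
    return [item for item in EVIDENCE_REQUIRED if not evidence.get(item, False)]


def next_level(current: int, evidence: dict[str, bool]) -> int:
    """Raise autonomy by at most one level, and only with complete evidence."""
    if not 1 <= current <= 4:
        raise ValueError("autonomy level must be between 1 and 4")
    if current == 4 or missing_evidence(evidence):
        return current
    return current + 1
```

Calling `missing_evidence` first gives the team a concrete to-do list instead of a bare refusal.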
If your team is deciding where agents can safely help, start with the process map, permission model, approval queue, and audit trail before building the impressive demo. That is the work that keeps automation useful after it reaches production.
