AI Development

The agent metric that matters: governance gets 12x more AI projects into production

Databricks' 2026 State of AI Agents report points to a practical lesson: governance and evaluations are becoming deployment infrastructure.

Sean McLellan

Lead Architect & Founder

May 24, 2026 · 5 min read

The easiest AI agent metric to notice is demo quality. Can it answer the question, draft the email, generate the report, or click through the workflow?

Those tests matter. They just are not the metric that decides whether an agent becomes part of the business.

Databricks' 2026 State of AI Agents report gives operators a more useful signal: companies using evaluation tools get nearly 6x more AI projects into production, and companies using AI governance get over 12x more.

That should change how teams budget agent work. Evaluations, permissions, audit trails, fallback paths, and human approvals are not enterprise paperwork to bolt on after the prototype works. They are deployment infrastructure.

What Databricks found

Databricks says the report is based on more than 20,000 organizations worldwide using the Databricks Data Intelligence Platform, including over 60% of the Fortune 500. Its findings show agents moving beyond isolated chatbot experiments: AI is entering critical workflows, and companies are shifting toward multi-agent systems, which grew 327% in less than four months.

The report also points to a change in data work. Databricks says more than 80% of databases are built by AI agents. In its supporting post on enterprise AI agent trends, Databricks adds that on Neon, agents create 80% of all databases and 97% of database branches.

That is a sign that agents are beginning to act inside the machinery behind software, analytics, and operations. The question changes from "Did the model produce a good answer?" to "Can this agent act safely inside systems the business depends on?"

Databricks also reports that 40% of the top 15 agent use cases focus on customer experience and engagement, including market intelligence, customer advocacy, regulatory reporting, medical literature analysis, and predictive maintenance.

Why the 6x and 12x deployment gap matters for smaller teams

Smaller companies often treat governance as something large enterprises do because they have compliance departments. That framing is backwards.

A midsize manufacturer, agency, clinic, retailer, or professional services firm may not have an AI risk committee, but it still has customer records, financial data, contracts, payment workflows, vendor systems, and brand reputation. If an agent sends the wrong customer response, updates the wrong record, or exposes sensitive data, the business feels it quickly.

This is why the Databricks production gap matters. A nearly 6x lift from evaluation tools suggests teams get more projects live when they can test quality repeatedly instead of relying on one-off demos. A lift of more than 12x from governance suggests approvals, controls, lineage, and accountability make teams more willing to put agents into real workflows.

For SMB teams, the first phase should include the operating layer: what good output looks like, which actions need approval, what data the agent can read, what systems it can write to, who reviews failures, and how bad actions get rolled back.

Customer experience and operations feel the pressure first

Customer-facing workflows are where agent promises sound concrete: faster support replies, better account research, cleaner handoffs, automated follow-ups, personalized reporting, and proactive issue detection. They are also where weak controls show up fastest.

Each use case depends on trust: the right data, permissions, review path, and record of what happened.

The useful takeaway is that agents are moving into real workflows, not generic chat windows. Once agents sit near customer data, finance tools, or operational systems, governance becomes part of the product experience.

What to put in place before letting agents act

Before an agent gets write access, tool access, or customer-facing responsibility, teams should define a practical control layer. It does not need to be heavy. It does need to be explicit.

Start with evaluations. Build test sets from real examples: past tickets, reports, internal requests, messy edge cases, and known failure modes. Include cases where the correct answer is "I do not know" or "this needs human review." Re-run those evaluations when prompts, models, tools, or source systems change.
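As a rough illustration, a repeatable evaluation set can be as small as a list of labeled cases and a function that reports which ones fail. The sketch below is a minimal example under stated assumptions, not a recommended framework: `EvalCase`, `run_evals`, `toy_agent`, the ticket ids, and the label names are all hypothetical, and the agent is a stand-in for whatever model call a team actually uses.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical label for cases where the right answer is to hand off to a person.
NEEDS_REVIEW = "needs_human_review"


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    expected: str  # expected label, which may be NEEDS_REVIEW


def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Run every case through the agent and report the pass rate and failing ids."""
    failures = [
        case.case_id
        for case in cases
        if agent(case.prompt).strip().lower() != case.expected
    ]
    passed = len(cases) - len(failures)
    pass_rate = passed / len(cases) if cases else 0.0
    return {
        "total": len(cases),
        "passed": passed,
        "pass_rate": pass_rate,
        "failures": failures,
    }


def toy_agent(prompt: str) -> str:
    """Stand-in for a real agent call (assumption: the agent returns a short label)."""
    if "refund" in prompt.lower():
        return "refund_policy"
    return NEEDS_REVIEW


# Illustrative cases only; real sets should come from past tickets and known failures.
CASES = [
    EvalCase("ticket-001", "How do I get a refund?", "refund_policy"),
    EvalCase("ticket-002", "Please delete my account and all my data", NEEDS_REVIEW),
    EvalCase("ticket-003", "Where is my order?", "order_status"),
]

report = run_evals(toy_agent, CASES)
print(report["pass_rate"], report["failures"])
```

Keeping the failing ids in the report, not just the score, makes it easier to see what regressed when a prompt, model, tool, or source system changes.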

Define permissions by task. An agent that drafts a customer email may only need read access and a draft-only handoff. An agent that updates a record needs stricter scopes, approval gates, and rollback steps. The safe default is least privilege.
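One way to picture least privilege is a deny-by-default lookup: each task gets an explicit list of actions, and anything not listed is refused. The sketch below assumes hypothetical task names, action strings such as `crm.read`, and an `authorize` helper. None of these come from a real product.

```python
# Hypothetical grants: each task lists the only actions it may perform.
TASK_GRANTS: dict[str, frozenset[str]] = {
    "draft_customer_email": frozenset({"crm.read", "email.draft"}),
    "update_customer_record": frozenset({"crm.read", "crm.write"}),
}

# Hypothetical set of actions that always need a human approval first.
NEEDS_APPROVAL = frozenset({"crm.write"})


def authorize(task: str, action: str, approved: bool = False) -> bool:
    """Return True only if the task was granted the action and any gate is satisfied."""
    granted = TASK_GRANTS.get(task, frozenset())  # unknown tasks get nothing
    if action not in granted:
        return False
    if action in NEEDS_APPROVAL and not approved:
        return False
    return True


print(authorize("draft_customer_email", "email.draft"))   # allowed: drafting was granted
print(authorize("draft_customer_email", "email.send"))    # refused: sending was never granted
print(authorize("update_customer_record", "crm.write"))   # refused until a human approves
```

The design choice worth noting is the default: a task missing from the table can do nothing, so adding a new agent never silently widens access.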

Create audit trails. Log inputs, retrieved sources, tool calls, approvals, final outputs, and user overrides. When something goes wrong, you need to know whether the failure came from bad source data, missing context, a permission issue, or model behavior.
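A minimal sketch of what one audit entry might hold, assuming a JSON-lines log and hypothetical field names that mirror the list above. The `AuditRecord` class, `to_log_line` helper, and the sample values are illustrative, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class AuditRecord:
    run_id: str
    agent_input: str
    retrieved_sources: list[str]
    tool_calls: list[dict]
    approvals: list[str]
    final_output: str
    user_override: Optional[str] = None
    # UTC timestamp captured when the record is created.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(record: AuditRecord) -> str:
    """Serialize one record as a single JSON line, suitable for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


# Illustrative values only.
record = AuditRecord(
    run_id="run-0001",
    agent_input="Summarize open issues for account 42",
    retrieved_sources=["crm:account/42", "tickets:open"],
    tool_calls=[{"tool": "crm.lookup", "args": {"account_id": 42}}],
    approvals=[],
    final_output="Two open tickets, both about billing.",
)
line = to_log_line(record)
print(line)
```

With sources, tool calls, and approvals stored side by side, a reviewer can usually tell from one line whether a failure started in the data, the permissions, or the model.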

Plan fallback and rollback before launch. What happens if the agent is uncertain, a connected system is down, a policy check fails, or a human rejects the output? If the agent can change data, how do you reverse the action?
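The questions above can be sketched as a single write path that escalates when the agent is unsure and restores the previous state when a policy check fails. This is a toy, in-memory example: `apply_with_fallback`, the `Outcome` type, and the 0.8 confidence threshold are assumptions for illustration, and a real system would roll back through its own database or API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Outcome:
    status: str  # "applied", "escalated", or "rolled_back"
    reason: str


def apply_with_fallback(
    record: dict,
    key: str,
    new_value: str,
    confidence: float,
    policy_check: Callable[[dict], bool],
    min_confidence: float = 0.8,  # hypothetical threshold
) -> Outcome:
    """Apply a change only when confident, and undo it if the policy check fails."""
    if confidence < min_confidence:
        # Fallback path: leave the data untouched and hand off to a person.
        return Outcome("escalated", "low confidence")

    snapshot = dict(record)  # copy of the state before the change
    record[key] = new_value

    if not policy_check(record):
        # Rollback path: restore the snapshot in place.
        record.clear()
        record.update(snapshot)
        return Outcome("rolled_back", "policy check failed")

    return Outcome("applied", "ok")


ticket = {"status": "open"}
result = apply_with_fallback(
    ticket, "status", "deleted", 0.95, lambda r: r["status"] != "deleted"
)
print(result.status, ticket)
```

Taking the snapshot before the write is the point of the sketch: the reverse step is decided before the agent acts, not after something breaks.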

Keep humans in the right places. Put review gates where judgment, liability, customer trust, money movement, or sensitive data are involved. Remove them only after the evaluation record supports it. BaristaLabs' responsible AI principles and data security guidance both start from this premise: good automation needs boundaries, observability, and human oversight.
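To make "remove them only after the evaluation record supports it" concrete, here is a hedged sketch of a review-gate rule. The risk tags, the `EvalRecord` type, and the thresholds (200 runs, 98% pass rate) are hypothetical placeholders. Each team would set its own based on the workflow and its tolerance for error.

```python
from dataclasses import dataclass

# Hypothetical tags marking actions where a human gate starts out mandatory.
RISK_TAGS = frozenset(
    {"judgment", "liability", "customer_trust", "money_movement", "sensitive_data"}
)


@dataclass(frozen=True)
class EvalRecord:
    runs: int          # evaluated runs for this workflow
    pass_rate: float   # share of those runs that passed, from 0.0 to 1.0


def needs_human_review(
    action_tags: set[str],
    record: EvalRecord,
    min_runs: int = 200,          # assumed threshold
    min_pass_rate: float = 0.98,  # assumed threshold
) -> bool:
    """Keep the gate on risky actions until the evaluation record is strong enough."""
    if not (action_tags & RISK_TAGS):
        return False  # no risk tag, so no gate in this sketch
    return record.runs < min_runs or record.pass_rate < min_pass_rate


early = EvalRecord(runs=40, pass_rate=1.0)
mature = EvalRecord(runs=500, pass_rate=0.99)
print(needs_human_review({"money_movement"}, early))   # gate stays: too few runs
print(needs_human_review({"money_movement"}, mature))  # gate may be lifted
```

Requiring a minimum number of runs as well as a pass rate keeps a short lucky streak from removing a gate too early.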

Practical next steps for this quarter

If your team is evaluating AI agents this quarter, pick one workflow where the business value is visible and the risk is manageable. A good first candidate has a clear owner, repeated volume, accessible source data, measurable quality, and a human review path.

Then build the deployment plan alongside the prototype:

Write the workflow map: trigger, source systems, decisions, outputs, approvals, and failure paths.
Define the evaluation set with real examples and expected outputs.
Set permission scopes for read-only, draft-only, approval-required, and autonomous actions.
Add logging that shows what the agent saw, did, and recommended.
Pilot with a rollback plan and review failures weekly.
Decide what not to automate yet.
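The checklist above can be kept as a small structured plan that reports its own gaps before a pilot starts. The sketch below is one possible shape, not a standard: `WorkflowPlan`, its field names, and `missing_items` are hypothetical, and the scope names are taken from the list above.

```python
from dataclasses import dataclass, field

# Scope names taken from the checklist above.
SCOPES = ("read-only", "draft-only", "approval-required", "autonomous")


@dataclass
class WorkflowPlan:
    name: str
    owner: str = ""
    trigger: str = ""
    source_systems: list[str] = field(default_factory=list)
    failure_paths: list[str] = field(default_factory=list)
    eval_case_count: int = 0
    permission_scope: str = ""
    logging_enabled: bool = False
    rollback_plan: str = ""
    excluded_from_automation: list[str] = field(default_factory=list)


def missing_items(plan: WorkflowPlan) -> list[str]:
    """Return the checklist items that the plan has not filled in yet."""
    checks = [
        ("owner", bool(plan.owner)),
        ("trigger", bool(plan.trigger)),
        ("source systems", bool(plan.source_systems)),
        ("failure paths", bool(plan.failure_paths)),
        ("evaluation set", plan.eval_case_count > 0),
        ("permission scope", plan.permission_scope in SCOPES),
        ("logging", plan.logging_enabled),
        ("rollback plan", bool(plan.rollback_plan)),
        ("not-yet-automated list", bool(plan.excluded_from_automation)),
    ]
    return [label for label, ok in checks if not ok]


# Illustrative plan with two gaps left open on purpose.
plan = WorkflowPlan(
    name="support reply drafts",
    owner="support lead",
    trigger="new ticket created",
    source_systems=["helpdesk", "crm"],
    failure_paths=["escalate to human queue"],
    permission_scope="draft-only",
    logging_enabled=True,
    excluded_from_automation=["refund decisions"],
)
gaps = missing_items(plan)
print(gaps)
```

A plan with an empty gap list is not proof the workflow is safe. It only shows that each question on the checklist has an answer someone can review.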

For teams choosing a first workflow, AI consulting should focus as much on exclusion as selection: where AI should not go yet, which controls must exist first, and what success will be measured against. For teams ready to connect tools and workflows, process automation planning should include evaluation, permissions, and monitoring from the start.

This is also why early architecture choices matter. We covered a related piece of that decision in the three endpoint decisions that change agent rollouts: the safer rollout defines boundaries before the agent is asked to act.

Governance is deployment infrastructure

The practical read on Databricks' 2026 State of AI Agents is not "agents are coming." They are already being tested in customer experience, reporting, operations, data workflows, and software infrastructure.

Production deployment appears to depend on operating discipline. Evaluation tells teams whether the agent is reliable enough. Governance tells them whether it is controlled enough. Auditability and rollback tell them whether they can recover when something breaks.

Treat those pieces as deployment infrastructure, not paperwork. The teams that do will move beyond better demos and toward AI agents the business can actually trust.
