The demo looked good until the invoice portal asked the agent to prove it was human.
The ops lead had watched the browser agent log in, search for the right vendor record, open the billing tab, copy the invoice total, and prepare the update for the finance queue. It was the kind of pilot that makes everyone in the room lean forward. Slow, maybe. A little awkward. But it worked.
Then the real portal pushed back.
A CAPTCHA appeared. Another screen asked for device approval. A later run made it through the challenge but landed in a review state because the session looked unusual. The model had not suddenly become worse. The workflow had revealed a launch risk the demo had hidden.
That risk matters for any team testing browser agents against vendor portals, finance systems, insurance tools, support consoles, healthcare admin sites, or legacy apps without clean APIs. The question is not only "can the agent complete the task?" It is also "does the site accept the way the task gets completed?"
Those are different questions.
The finding that should change your pilot plan
Roundtable Research recently published a useful piece called "CAPTCHAs can still detect AI agents." Their point is more practical than the headline sounds.
They are not saying AI cannot recognize CAPTCHA images. They say vision-language models can often identify typical CAPTCHA content. The issue is that AI systems do not complete the task the same way humans do.
In their research, Roundtable found statistically significant differences in sequential click patterns, direction changes, and overselection behavior. In plain English: the answer may be right, but the path to the answer looks different.
Their phrase is worth keeping: output equivalence does not equal process equivalence.
That sentence belongs in a browser-agent launch review. A business workflow can look successful if you only inspect the final state. The form was filled. The record was updated. The screenshot matches. The receipt is there.
But some websites judge the process too. They watch cursor movement, timing, device signals, session history, repeated paths, challenge behavior, login patterns, and permission prompts. They may not care that your agent is "smart." They care that the session looks strange.
Developer communities are already circling the same issue. Hacker News picked up the Roundtable piece in a thread titled "CAPTCHAs can still detect AI agents", and another recent Show HN thread, "Continue? Y/N: A 60-second game about AI agent permission fatigue", drew hundreds of points and comments.
The exact examples vary, but the operator problem is consistent: agents need control surfaces, not just more capability.
Browser-agent readiness is not model readiness
Most pilots still over-index on task completion.
Can the agent open the right page? Can it find the invoice? Can it read the table? Can it press the right button? Can it recover from a changed label?
Those are good tests. They are not enough.
Browser-agent readiness includes the target website's tolerance for automation. Some systems assume a human is present. Some allow automation only through an API or partner integration. Some permit scripted access until risk signals trip. Some work fine during low-volume demos and fail when the same process runs every morning at 8:05.
A workflow can be technically automatable and operationally unsafe.
This is especially common when the workflow touches vendor portals, bank systems, payroll tools, card programs, customer support consoles, healthcare admin sites, insurance portals, procurement systems, or any web app that triggers MFA, CAPTCHA, device checks, session review, or suspicious activity banners.
If your launch plan assumes the agent can keep trying until it gets through, the plan is already weak. A human who hits friction has judgment. An agent that hits friction needs boundaries.
This is where the earlier BaristaLabs guidance on separate browser profiles for agents becomes more than housekeeping. Isolation helps you see what the agent is actually doing. It also prevents test runs from borrowing trust signals, cookies, extensions, and saved sessions from a human operator's normal browser.
The goal is not to trick the website. The goal is to stop pretending the agent is just a faster employee with a mouse.
Borrow the control pattern from higher-risk systems
Robinhood offers a useful example because it treats agent access as a bounded operating environment.
TechCrunch reported that Robinhood announced support for AI agentic trading and an agentic credit card. Users can create a separate account for AI agents and connect it to a dedicated wallet. Agents can read and analyze portfolios and suggest investments, but orders can only use the pre-loaded balance in that wallet.
The same report says users receive trade notifications and can monitor agent activity, and that some trades show a preview that requires approval before execution. Robinhood also describes fraud detection protection and review for suspicious trades or disputes. The agentic card pattern includes monthly limits and an option to require approval for every payment.
Do not read that as investment advice. Read it as a control pattern.
Separate account. Dedicated wallet. Pre-loaded limit. Notifications. Monitoring. Approval for higher-risk actions. Fraud review.
Most business browser-agent pilots need the same shape, even if the task is less dramatic than trading stocks. A vendor portal agent should not use a founder's personal login. A finance agent should not have unlimited ability to submit payments. A customer support agent should not silently change account state without leaving a record.
Stanford's AI agent guidelines for CS336 make a different version of the same point. Stanford frames agents as teaching assistants, not solution generators. The guidelines allow agents to explain concepts, ask clarifying questions, point to course materials, review code, suggest sanity checks, propose toy examples, recommend assertions, and suggest profiler investigations. They do not allow agents to write Python or pseudocode, complete TODOs, edit the student repo, run bash commands, or implement core assignment components.
That is not a business compliance framework. It is still a helpful example. The boundary is explicit. The agent has a job. The agent also has things it must not do.
Business teams need that discipline before they point an agent at a browser.
A pilot checklist for vendor portal workflows
Before production, run the workflow through a go/no-go review. Keep it boring. Boring is good here.
Use a real workflow, a non-production account where possible, and the same browser profile the agent will use in production. If the agent will run on a schedule, test it on that schedule. If it will run from a server, test it from that environment. If a human will approve final submission, include that step.
The review sheet should answer these questions.
What site signals can stop the workflow?
Document every point where the portal can interrupt the agent.
That includes CAPTCHA, MFA, device approval, password resets, session timeouts, suspicious activity banners, permission prompts, download warnings, blocked popups, file picker dialogs, and review queues.
Do not hide the messy parts under "login friction." Name each failure mode. If the site blocks the agent twice in ten test runs, that is not a weird edge case. It is a launch condition.
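One way to keep each failure mode named is to tally interruptions per test run. The sketch below is illustrative only: the `BlockSignal` labels and the `summarize_blocks` helper are assumptions made for this example, not part of any portal or agent framework.

```python
from collections import Counter
from enum import Enum


class BlockSignal(Enum):
    """Interruptions from the review sheet (labels are illustrative)."""
    CAPTCHA = "captcha"
    MFA = "mfa"
    DEVICE_APPROVAL = "device_approval"
    PASSWORD_RESET = "password_reset"
    SESSION_TIMEOUT = "session_timeout"
    SUSPICIOUS_ACTIVITY = "suspicious_activity_banner"
    PERMISSION_PROMPT = "permission_prompt"
    DOWNLOAD_WARNING = "download_warning"
    BLOCKED_POPUP = "blocked_popup"
    FILE_PICKER = "file_picker_dialog"
    REVIEW_QUEUE = "review_queue"


def summarize_blocks(runs):
    """Return (block_rate, counts_by_signal).

    `runs` holds one entry per test run: None for a clean run,
    or the BlockSignal that interrupted it.
    """
    blocked = [r for r in runs if r is not None]
    rate = len(blocked) / len(runs) if runs else 0.0
    return rate, Counter(blocked)


# Ten hypothetical pilot runs, two of them interrupted.
pilot_runs = [None] * 8 + [BlockSignal.CAPTCHA, BlockSignal.DEVICE_APPROVAL]
rate, counts = summarize_blocks(pilot_runs)
print(f"block rate: {rate:.0%}")
for signal, n in counts.items():
    print(f"  {signal.value}: {n}")
```

A tally like this turns "login friction" into a per-signal count that the launch review can argue about.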
What is the approved path if the browser path fails?
A browser agent should rarely be the only path.
Check whether the vendor offers an API, SFTP drop, webhook, partner integration, admin export, email ingestion route, or approved automation method. The browser may still be the fastest pilot path, but the fallback should be known before launch.
This is where computer-using agents for legacy workflows can be useful. Browser automation can bridge old systems. It should not become an excuse to ignore better integration paths when they exist.
What account does the agent use?
Create a separate agent account when the system allows it. Give it the minimum permissions needed. Avoid shared human credentials.
If the portal cannot support a separate account, write that down as a risk. It may still be acceptable for a low-risk read-only workflow. It is harder to defend for payments, customer data, regulated records, or irreversible changes.
For sensitive workflows, connect this decision to your security posture. A browser agent that can see customer records is part of your security model, not a lab experiment. See the BaristaLabs data security guidance for the broader operating posture.
What can the agent change without approval?
Separate read, draft, submit, and irreversible actions.
A useful default is: the agent can gather evidence and prepare a draft; a human approves the state-changing action. If the action is low-risk and reversible, you may relax that later. If the action touches money, customer status, access rights, legal records, or compliance data, keep the approval gate.
The approval gate should not be a vague Slack message. It should show the proposed action, the source evidence, the account used, the time, and the expected result. The approval queue pattern exists for exactly this kind of control.
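The read, draft, submit, and irreversible split can be written down as a small policy. This is a minimal sketch under assumed names: `ActionClass`, `ApprovalRequest`, and `requires_approval` are hypothetical, and the sample values are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class ActionClass(Enum):
    READ = "read"
    DRAFT = "draft"
    SUBMIT = "submit"
    IRREVERSIBLE = "irreversible"


# Assumed default: only read and draft run without a human in the loop.
NO_APPROVAL_NEEDED = {ActionClass.READ, ActionClass.DRAFT}


@dataclass(frozen=True)
class ApprovalRequest:
    """The fields a reviewer should see before a state-changing action."""
    proposed_action: str
    source_evidence: str
    account_used: str
    requested_at: str
    expected_result: str


def requires_approval(action_class: ActionClass) -> bool:
    return action_class not in NO_APPROVAL_NEEDED


# Hypothetical request for a submit-class action.
request = ApprovalRequest(
    proposed_action="Update invoice total on vendor record",
    source_evidence="billing-tab snapshot, run 12",
    account_used="agent-finance-readonly-plus-draft",
    requested_at=datetime.now(timezone.utc).isoformat(),
    expected_result="Finance queue shows the updated total",
)
if requires_approval(ActionClass.SUBMIT):
    print("queue for human approval:", request.proposed_action)
```

Relaxing the gate later means moving an action class into the no-approval set, which leaves a visible record of the decision.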
What logs prove what happened?
If the agent updates a portal, you need receipts.
Capture the input, the decision, the screen or page state that justified the action, the submitted value, the confirmation page, and any portal receipt ID. Save enough context that a human can reconstruct the run later without watching a screen recording.
This is the point of an agent receipts log. The log is not busywork. It is how you debug failures, answer customer questions, and decide whether the agent is ready for a wider rollout.
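A receipt can be as simple as one JSON line per run. The sketch below assumes a hypothetical `RunReceipt` shape with one field per item in the capture list. The field names and sample values are illustrative, not a standard.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RunReceipt:
    run_id: str
    input_record: str        # what the agent was asked to process
    decision: str            # what it decided to do, and why
    page_state_ref: str      # pointer to the saved page state that justified it
    submitted_value: str
    confirmation_ref: str    # pointer to the saved confirmation page
    portal_receipt_id: Optional[str] = None  # not every portal issues one


def to_log_line(receipt: RunReceipt) -> str:
    """Serialize one receipt as a single JSON line with stable key order."""
    return json.dumps(asdict(receipt), sort_keys=True)


# Hypothetical receipt for a single run.
receipt = RunReceipt(
    run_id="run-0012",
    input_record="invoice INV-EXAMPLE",
    decision="update total to match billing tab",
    page_state_ref="snapshots/run-0012/billing.html",
    submitted_value="1250.00",
    confirmation_ref="snapshots/run-0012/confirmation.html",
)
line = to_log_line(receipt)
print(line)
```

Appending lines like this to a file or log store gives a reviewer enough to reconstruct a run without replaying a screen recording.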
When does the agent hand off?
Define the handoff rules before the agent gets stuck.
A good handoff rule is specific: "If the portal presents CAPTCHA, MFA, a new permission prompt, a suspicious activity warning, a changed payment amount, or a missing confirmation screen, stop and assign the run to a human reviewer."
A weak handoff rule says: "Ask for help if needed."
Agents are bad at knowing when "needed" has arrived unless you define it. The handoff should include the current page, the last completed step, the intended next action, and the evidence collected so far.
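The specific handoff rule translates almost directly into code. In this sketch the trigger labels, the `Handoff` record, and `check_handoff` are all assumed names for illustration. How a real agent detects each trigger is outside its scope.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Trigger labels mirror the specific handoff rule; the strings are illustrative.
HANDOFF_TRIGGERS = frozenset({
    "captcha",
    "mfa",
    "new_permission_prompt",
    "suspicious_activity_warning",
    "changed_payment_amount",
    "missing_confirmation_screen",
})


@dataclass
class Handoff:
    """Context a human reviewer needs to pick up the run."""
    triggers: List[str]
    current_page: str
    last_completed_step: str
    intended_next_action: str
    evidence: List[str] = field(default_factory=list)


def check_handoff(observed: Set[str], current_page: str,
                  last_completed_step: str, intended_next_action: str,
                  evidence: List[str]) -> Optional[Handoff]:
    """Return a Handoff if any observed signal is a defined trigger, else None."""
    hits = sorted(HANDOFF_TRIGGERS & observed)
    if not hits:
        return None
    return Handoff(hits, current_page, last_completed_step,
                   intended_next_action, list(evidence))


# Hypothetical run state: a CAPTCHA appeared on the billing page.
handoff = check_handoff(
    observed={"captcha", "slow_page"},
    current_page="billing tab",
    last_completed_step="opened vendor record",
    intended_next_action="copy invoice total",
    evidence=["vendor search result snapshot"],
)
if handoff is not None:
    print("stop and assign to reviewer:", handoff.triggers)
```

Signals outside the defined set, such as the slow page here, do not trigger a handoff. That keeps "needed" explicit instead of leaving it to the agent's judgment.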
How many clean runs count as ready?
One successful demo is not a production test.
Run the workflow enough times to see ordinary variance: different records, different vendors, different times of day, expired sessions, new downloads, slow pages, missing fields, and partial data. If the workflow runs daily, test the daily rhythm. If the vendor portal changes often, include a monitoring plan.
Readiness is not "the agent did it once." Readiness is "we know what happens when it works, when it gets blocked, and when it should stop."
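To check that test runs have covered ordinary variance, a small coverage report helps. The condition labels, the `min_vendors` threshold of three, and the `coverage_gaps` helper are assumptions chosen for this sketch. A real pilot would set its own list and thresholds.

```python
from typing import Dict, List, Set

# Conditions the pilot should have exercised at least once (illustrative labels).
REQUIRED_CONDITIONS = frozenset({
    "expired_session",
    "new_download",
    "slow_page",
    "missing_field",
    "partial_data",
})


def coverage_gaps(runs: List[Dict], min_vendors: int = 3) -> Dict[str, object]:
    """Report which conditions were never seen and how many vendors are missing."""
    seen: Set[str] = set()
    vendors: Set[str] = set()
    for run in runs:
        seen.update(run.get("conditions", []))
        vendors.add(run["vendor"])
    return {
        "missing_conditions": sorted(REQUIRED_CONDITIONS - seen),
        "vendors_short_by": max(0, min_vendors - len(vendors)),
    }


# Three hypothetical runs across two vendors.
runs = [
    {"vendor": "vendor-a", "conditions": ["slow_page"]},
    {"vendor": "vendor-b", "conditions": ["expired_session", "missing_field"]},
    {"vendor": "vendor-a", "conditions": []},
]
gaps = coverage_gaps(runs)
print(gaps)
```

An empty gap report does not prove readiness, but a non-empty one shows that the pilot has not yet seen the conditions it will meet in production.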
The go/no-go decision
At the end of the pilot, force a decision.
Go means the workflow has a stable path, bounded permissions, a clear fallback, human review where needed, and enough receipt logging to investigate mistakes.
No-go does not mean the idea failed. It may mean the browser path is the wrong first implementation. Use the vendor API. Ask for a partner integration. Keep the agent in draft-only mode. Move the workflow to a human-assisted queue. Reduce the account permissions. Add a separate profile. Tighten the approval step.
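The go criteria above can be forced into an explicit decision with a few lines of code. This is a sketch only: the criterion keys and the `go_no_go` function are hypothetical names, and each boolean stands in for a judgment the review team still has to make.

```python
from typing import Dict, List, Tuple

# One key per go criterion (key names are illustrative).
GO_CRITERIA = (
    "stable_path",
    "bounded_permissions",
    "clear_fallback",
    "human_review_where_needed",
    "receipt_logging",
)


def go_no_go(review: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Return ("go", []) only if every criterion is met; otherwise list the gaps."""
    missing = [c for c in GO_CRITERIA if not review.get(c, False)]
    return ("go" if not missing else "no-go", missing)


# Hypothetical review: everything is in place except a known fallback path.
decision, missing = go_no_go({
    "stable_path": True,
    "bounded_permissions": True,
    "clear_fallback": False,
    "human_review_where_needed": True,
    "receipt_logging": True,
})
print(decision, missing)
```

A criterion left out of the review counts as unmet, so an incomplete review sheet cannot produce a go by omission.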
The worst outcome is not "we learned the portal blocks automation." That is useful information.
The worst outcome is putting a browser agent into production because the demo worked once, then discovering the launch depends on a website believing the actor is a normal human.
That is not a CAPTCHA problem. It is an operations problem.
BaristaLabs helps teams choose the first workflow, map agent boundaries, and decide where browser automation belongs versus API or integration work. If you are testing browser agents against real portals, start with the readiness sheet. The model's capability is only one part of the launch. The process has to survive the systems around it.
If that is the work in front of you, our process automation team can help turn the pilot into a controlled workflow instead of a fragile demo.
