A lot of small businesses are being sold the same AI story right now: upload your docs, point a chatbot at them, and let it answer customer or employee questions.
That pitch leaves out the part that actually matters.
A new paper, "How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms," tested 35 models across 172 billion evaluation tokens. It is the largest study we have seen on hallucinations in document-grounded question answering. Its headline finding is not subtle: when context gets large, hallucinations get worse.
That should make every business using document Q&A stop and rethink what "good enough" really means.
The big number is not the whole story
The study found the best model hit a 1.19% fabrication rate under optimal conditions at 32K context. That is impressive, but it is also not zero. Top-tier models more commonly landed in the 5% to 7% range. The median across all 35 models was about 25% fabrication even on questions where the answer actually was in the document.
That last number is the one SMB buyers should sit with for a minute.
If you are using AI to answer questions about policies, contracts, product documentation, onboarding materials, or customer support articles, a fabrication rate in that range is not a minor UX issue. It is a trust problem. Customers do not care whether the model was "mostly grounded." They care whether the answer was wrong.
The surprise villain is the context window
Most vendors market context size like horsepower. Bigger number, better machine.
This study suggests that framing is backwards for a lot of real deployments.
At 128K context, fabrication rates nearly tripled compared with shorter windows. At 200K tokens, every single model tested exceeded a 10% fabrication rate. No exceptions. That means the feature many buyers treat as a safety blanket can turn into the thing that makes answers less reliable.
Why does that matter for SMBs? Because the standard implementation advice is often: throw more documents in, include more history, widen the retrieval scope, and let the model sort it out.
That is not a safety strategy. That is a way to bury the answer in noise and hope the model behaves.
If your vendor is leading with context window size but cannot show how accuracy changes at 32K versus 128K versus 200K in your actual workflow, you are not looking at a finished solution. You are looking at a demo with a nice spec sheet.
Retrieval is not the same thing as restraint
Another useful finding in the paper: grounding accuracy and fabrication resistance are separate capabilities.
That is a big deal. A model can be very good at locating relevant facts in a document and still invent answers when the document does not support the claim. The paper notes that some models scoring above 90% on retrieval-related performance could still fabricate answers to roughly half of questions about content that was not actually present.
In plain English, a model can look smart, find the right paragraph, and still lie.
That is why "our eval says it retrieves well" is not enough. Good retrieval does not automatically mean safe answers. If a system is answering customer-facing or employee-facing questions, you need to test both: can it find the right source, and can it refuse to make things up when the source is missing, ambiguous, or out of scope?
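One way to make that distinction concrete is to score the two behaviors separately: grounding accuracy on questions the documents can answer, and fabrication rate on questions they cannot. Here is a minimal sketch of that split. The `ask_model` stub, the refusal markers, and the test-case format are all assumptions for illustration; in practice you would swap in your real LLM call and your own document set.

```python
# Sketch: score grounding and fabrication resistance as SEPARATE metrics.
# Everything below (ask_model, refusal markers, case format) is hypothetical.

REFUSAL_MARKERS = ("could not verify", "not in the document", "no answer")

def ask_model(question: str, document: str) -> str:
    # Hypothetical stub: answers if any question word appears in the
    # document, refuses otherwise. Replace with your actual model call.
    if any(tok in document.lower() for tok in question.lower().split()):
        return document
    return "I could not verify that from the provided documents."

def evaluate(cases, ask=ask_model):
    """cases: dicts with 'question', 'document', 'answerable', 'gold'."""
    grounded_hits = fabrications = answerable = unanswerable = 0
    for case in cases:
        reply = ask(case["question"], case["document"]).lower()
        refused = any(m in reply for m in REFUSAL_MARKERS)
        if case["answerable"]:
            answerable += 1
            if not refused and case["gold"].lower() in reply:
                grounded_hits += 1
        else:
            unanswerable += 1
            if not refused:
                fabrications += 1  # answered without support in the docs
    return {
        "grounding_accuracy": grounded_hits / max(answerable, 1),
        "fabrication_rate": fabrications / max(unanswerable, 1),
    }
```

The unanswerable cases are the ones most evals skip, and they are exactly where the paper says strong retrievers still invent answers.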
What good enough actually looks like
For most SMB use cases, the practical takeaway is pretty simple.
Keep context tighter than you think.
This study makes a strong case that context under 32K tokens is manageable territory. Once you push into the 128K to 200K range, you are on much riskier ground. Bigger windows may still be useful for certain workflows, but they should not be treated as an automatic quality upgrade.
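You can test this directly on your own workload by running the same question set while padding the context with distractor passages to simulate larger windows. A rough sketch, assuming whitespace word counts stand in for real token counts (swap in your tokenizer) and `ask` is your model call:

```python
# Sketch: run one question at several simulated context sizes by padding
# the relevant passage with distractors. Word counts approximate tokens;
# the target sizes and the `ask` callable are assumptions.

def build_context(relevant: str, distractors: list[str], target_tokens: int) -> str:
    parts = [relevant]
    count = len(relevant.split())
    for d in distractors:
        if count >= target_tokens:
            break
        parts.append(d)
        count += len(d.split())
    return "\n\n".join(parts)

def sweep(question, relevant, distractors, ask, sizes=(32_000, 128_000, 200_000)):
    # Same question, same relevant passage, progressively larger haystack.
    return {size: ask(question, build_context(relevant, distractors, size))
            for size in sizes}
```

If accuracy drops as the haystack grows, you have reproduced the paper's core finding on your own documents, which is far more persuasive than any leaderboard.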
The paper also found that model family predicted fabrication resistance better than raw model size. So no, you cannot scale your way out of this by buying a larger model and calling it solved.
Good enough usually looks like a narrower retrieval step, fewer documents per answer, strong citation requirements, and hard refusals when evidence is weak. It also looks like testing on your real documents instead of trusting leaderboard vibes.
What SMBs should do now
If you are deploying document Q&A internally or externally, here is the practical checklist:
- Keep retrieved context lean. More text is not automatically safer.
- Test your system at the actual context lengths your product uses, not just on a short happy path.
- Measure fabrication separately from retrieval accuracy.
- Require cited answers for anything policy, legal, billing, compliance, or customer-impacting.
- Design for abstention. "I could not verify that from the provided documents" is often the right answer.
- Choose models based on observed behavior in your workflow, not parameter count or marketing claims.
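The citation and abstention items on that checklist can be enforced mechanically rather than left to the model's good behavior. A minimal sketch, assuming a draft-answer format where quoted spans are wrapped in square brackets (that format is an assumption for illustration, not a standard):

```python
# Sketch: only release an answer if every quoted span in the draft
# actually appears in the retrieved sources; otherwise abstain.
# The [bracketed-quote] draft format is a hypothetical convention.
import re

ABSTAIN = "I could not verify that from the provided documents."

def release_answer(draft: str, sources: list[str]) -> str:
    quotes = re.findall(r"\[(.+?)\]", draft)
    if not quotes:
        return ABSTAIN  # no citations at all: refuse rather than guess
    joined = " ".join(sources).lower()
    if all(q.lower() in joined for q in quotes):
        return draft
    return ABSTAIN  # at least one cited span is not in the sources
```

A verbatim-substring check like this is crude, and it will reject valid paraphrases, but for policy, billing, or compliance answers a false refusal is usually much cheaper than a confident fabrication.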
The market has spent a year treating giant context windows like a universal upgrade. This paper is a reminder that bigger is not the same as better.
If your AI stack answers questions from business documents, the job is not to cram more context into the prompt. The job is to get reliable answers without giving the model extra room to improvise.
If you want help pressure-testing a document AI workflow before it starts making up answers in front of customers, get in touch.
