A lot of small businesses are being sold the same AI story right now: upload your docs, point a chatbot at them, and let it answer customer or employee questions.
That pitch leaves out the part that actually matters.
A new paper, "How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms," tested 35 models across 172 billion evaluation tokens. It is the largest study we have seen on hallucinations in document-grounded question answering. Its headline finding is not subtle: when context gets large, hallucinations get worse.
That should make every business using document Q&A stop and rethink what "good enough" really means.
The big number is not the whole story
The study found the best model hit a 1.19% fabrication rate under optimal conditions at 32K context. That is impressive, but it is also not zero. Top-tier models more commonly landed in the 5% to 7% range. The median across all 35 models was about 25% fabrication even on questions where the answer actually was in the document.
That last number is the one SMB buyers should sit with for a minute.
If you are using AI to answer questions about policies, contracts, product documentation, onboarding materials, or customer support articles, a fabrication rate in that range is not a minor UX issue. It is a trust problem. Customers do not care whether the model was "mostly grounded." They care whether the answer was wrong.
The surprise villain is the context window
Most vendors market context size like horsepower. Bigger number, better machine.
This study suggests that framing is backwards for a lot of real deployments.
At 128K context, fabrication rates nearly tripled compared with shorter windows. At 200K tokens, every single model tested exceeded a 10% fabrication rate. No exceptions. That means the feature many buyers treat as a safety blanket can turn into the thing that makes answers less reliable.
Why does that matter for SMBs? Because the standard implementation advice is often: throw more documents in, include more history, widen the retrieval scope, and let the model sort it out.
That is not a safety strategy. That is a way to bury the answer in noise and hope the model behaves.
If your vendor is leading with context window size but cannot show how accuracy changes at 32K versus 128K versus 200K in your actual workflow, you are not looking at a finished solution. You are looking at a demo with a nice spec sheet.
Retrieval is not the same thing as restraint
Another useful finding in the paper: grounding accuracy and fabrication resistance are separate capabilities.
That is a big deal. A model can be very good at locating relevant facts in a document and still invent answers when the document does not support the claim. The paper notes that some models scoring above 90% on retrieval-related performance could still fabricate answers to roughly half of questions about content that was not actually present.
In plain English, a model can look smart, find the right paragraph, and still lie.
That is why "our eval says it retrieves well" is not enough. Good retrieval does not automatically mean safe answers. If a system is answering customer-facing or employee-facing questions, you need to test both: can it find the right source, and can it refuse to make things up when the source is missing, ambiguous, or out of scope?
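One way to make that distinction concrete is to score the two behaviors separately: grounding accuracy on questions the documents can answer, and fabrication rate on questions they cannot. Here is a minimal sketch of that split. The `ask_model` stub, the refusal markers, and the test-case format are all assumptions for illustration; in practice you would swap in your real LLM call and your own document set.

```python
# Sketch: score grounding and fabrication resistance as SEPARATE metrics.
# Everything below (ask_model, refusal markers, case format) is hypothetical.

REFUSAL_MARKERS = ("could not verify", "not in the document", "no answer")

def ask_model(question: str, document: str) -> str:
    # Hypothetical stub: answers if any question word appears in the
    # document, refuses otherwise. Replace with your actual model call.
    if any(tok in document.lower() for tok in question.lower().split()):
        return document
    return "I could not verify that from the provided documents."

def evaluate(cases, ask=ask_model):
    """cases: dicts with 'question', 'document', 'answerable', 'gold'."""
    grounded_hits = fabrications = answerable = unanswerable = 0
    for case in cases:
        reply = ask(case["question"], case["document"]).lower()
        refused = any(m in reply for m in REFUSAL_MARKERS)
        if case["answerable"]:
            answerable += 1
            if not refused and case["gold"].lower() in reply:
                grounded_hits += 1
        else:
            unanswerable += 1
            if not refused:
                fabrications += 1  # answered without support in the docs
    return {
        "grounding_accuracy": grounded_hits / max(answerable, 1),
        "fabrication_rate": fabrications / max(unanswerable, 1),
    }
```

The unanswerable cases are the ones most evals skip, and they are exactly where the paper says strong retrievers still invent answers.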
What good enough actually looks like
For most SMB use cases, the practical takeaway is pretty simple.
Keep context tighter than you think.
This study makes a strong case that context under 32K tokens is manageable territory. Once you push into the 128K to 200K range, you are on much riskier ground. Bigger windows may still be useful for certain workflows, but they should not be treated as an automatic quality upgrade.
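You can test this directly on your own workload by running the same question set while padding the context with distractor passages to simulate larger windows. A rough sketch, assuming whitespace word counts stand in for real token counts (swap in your tokenizer) and `ask` is your model call:

```python
# Sketch: run one question at several simulated context sizes by padding
# the relevant passage with distractors. Word counts approximate tokens;
# the target sizes and the `ask` callable are assumptions.

def build_context(relevant: str, distractors: list[str], target_tokens: int) -> str:
    parts = [relevant]
    count = len(relevant.split())
    for d in distractors:
        if count >= target_tokens:
            break
        parts.append(d)
        count += len(d.split())
    return "\n\n".join(parts)

def sweep(question, relevant, distractors, ask, sizes=(32_000, 128_000, 200_000)):
    # Same question, same relevant passage, progressively larger haystack.
    return {size: ask(question, build_context(relevant, distractors, size))
            for size in sizes}
```

If accuracy drops as the haystack grows, you have reproduced the paper's core finding on your own documents, which is far more persuasive than any leaderboard.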
The paper also found that model family predicted fabrication resistance better than raw model size. So no, you cannot scale your way out of this by buying a larger model and calling it solved.
Good enough usually looks like a narrower retrieval step, fewer documents per answer, strong citation requirements, and hard refusals when evidence is weak. It also looks like testing on your real documents instead of trusting leaderboard vibes.
What SMBs should do now
If you are deploying document Q&A internally or externally, here is the practical checklist:
- Keep retrieved context lean. More text is not automatically safer.
- Test your system at the actual context lengths your product uses, not just on a short happy path.
- Measure fabrication separately from retrieval accuracy.
- Require cited answers for anything policy, legal, billing, compliance, or customer-impacting.
- Design for abstention. "I could not verify that from the provided documents" is often the right answer.
- Choose models based on observed behavior in your workflow, not parameter count or marketing claims.
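The citation and abstention items on that checklist can be enforced mechanically rather than left to the model's good behavior. A minimal sketch, assuming a draft-answer format where quoted spans are wrapped in square brackets (that format is an assumption for illustration, not a standard):

```python
# Sketch: only release an answer if every quoted span in the draft
# actually appears in the retrieved sources; otherwise abstain.
# The [bracketed-quote] draft format is a hypothetical convention.
import re

ABSTAIN = "I could not verify that from the provided documents."

def release_answer(draft: str, sources: list[str]) -> str:
    quotes = re.findall(r"\[(.+?)\]", draft)
    if not quotes:
        return ABSTAIN  # no citations at all: refuse rather than guess
    joined = " ".join(sources).lower()
    if all(q.lower() in joined for q in quotes):
        return draft
    return ABSTAIN  # at least one cited span is not in the sources
```

A verbatim-substring check like this is crude, and it will reject valid paraphrases, but for policy, billing, or compliance answers a false refusal is usually much cheaper than a confident fabrication.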
The market has spent a year treating giant context windows like a universal upgrade. This paper is a reminder that bigger is not the same as better.
If your AI stack answers questions from business documents, the job is not to cram more context into the prompt. The job is to get reliable answers without giving the model extra room to improvise.
If you want help pressure-testing a document AI workflow before it starts making up answers in front of customers, get in touch.
