W37 - Hallucinations and Uncertainty of LLMs
Last week brought two important papers, one from OpenAI and one from Thinking Machines. The former explains why large models hallucinate, and the latter pins down the root cause of nondeterminism in LLM inference. Both contain counterintuitive conclusions and genuinely new information.
The first is from OpenAI, titled "Why Language Models Hallucinate". TL;DR: you can skip straight to the Conclusions section.
Counterintuitive conclusion: improving accuracy alone cannot eliminate hallucinations, because many real-world questions have no single verifiable answer. Hallucinations can, however, be avoided if models are allowed to abstain and answer "I don't know." Hallucination rates stay high today largely because mainstream evaluations and leaderboards reward confident guessing and give nothing for abstaining. Training should therefore not optimize for accuracy alone; it must also track error rate, abstention rate, and confidence calibration.
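One way to see the incentive argument is to score answers so that a wrong answer costs more than an abstention, which makes blind guessing strictly worse than saying "I don't know." The sketch below is my illustration, not the paper's metric; the Answer type and the wrong_penalty parameter are assumptions made for the example.

```python
# Minimal sketch: report accuracy, error rate, and abstention rate separately,
# and compute a score that penalizes wrong answers instead of rewarding guessing.
from dataclasses import dataclass

ABSTAIN = "I don't know"

@dataclass
class Answer:
    text: str
    correct: bool  # ground-truth judgment; ignored for abstentions

def evaluate(answers: list[Answer], wrong_penalty: float = 1.0) -> dict:
    total = len(answers)
    abstained = sum(a.text == ABSTAIN for a in answers)
    correct = sum(a.correct for a in answers if a.text != ABSTAIN)
    wrong = total - abstained - correct
    return {
        "accuracy": correct / total,
        "error_rate": wrong / total,
        "abstention_rate": abstained / total,
        # Unlike plain accuracy, this score makes guessing worse than abstaining.
        "penalized_score": (correct - wrong_penalty * wrong) / total,
    }

print(evaluate([Answer("Paris", True), Answer(ABSTAIN, False), Answer("London", False)]))
```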
Combining this with a product architecture article I read, A PM's Guide to AI Agent Architecture: Why Capability Doesn't Equal Adoption, which approaches hallucination control at the architecture level from a user-trust perspective: one counterintuitive conclusion is that showing uncertainty where appropriate wins more adoption than confidently getting things wrong. When the model is unsure, admitting it and abstaining with a transparent explanation builds longer-term trust better than a confident guess. The author highlights a "trust layer": confidence prompts, reasoning transparency, confirmation/permission patterns, graceful boundaries, and human escalation.
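A minimal sketch of what such a trust layer could look like in code, assuming a hypothetical generate() callable that returns an answer plus a self-reported confidence in [0, 1]; the thresholds and the escalation wording are illustrative, not from the article.

```python
from typing import Callable, Tuple

def trusted_reply(
    question: str,
    generate: Callable[[str], Tuple[str, float]],
    confirm_threshold: float = 0.75,
    abstain_threshold: float = 0.40,
) -> str:
    answer, confidence = generate(question)
    if confidence >= confirm_threshold:
        # High confidence: answer, but still expose the confidence to the user.
        return f"{answer}\n(confidence: {confidence:.0%})"
    if confidence >= abstain_threshold:
        # Medium confidence: answer tentatively and ask the user to confirm.
        return (f"I think: {answer}\nI'm not fully sure (confidence {confidence:.0%}). "
                "Want me to double-check?")
    # Low confidence: graceful boundary plus human escalation instead of guessing.
    return "I'm not confident enough to answer this reliably. I've flagged it for a human to review."
```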
My possibly imperfect interpretation: current large models behave like "low precision, high recall" systems. To reduce hallucinations and earn user trust, training and reward models should push them toward "high precision, low recall": first stabilize the share of answers that are correct, then gradually widen the range of questions the model is confident enough to answer.
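A toy illustration of that trade-off, with made-up numbers: raising a confidence threshold trades coverage (a rough stand-in for recall, the share of questions the model agrees to answer) for precision (the share of answered questions that are right).

```python
# Illustrative only: sweep a confidence threshold over (confidence, was_correct) pairs.
def sweep(preds: list[tuple[float, bool]], thresholds=(0.0, 0.5, 0.8, 0.95)) -> None:
    for t in thresholds:
        answered = [ok for conf, ok in preds if conf >= t]
        coverage = len(answered) / len(preds)
        precision = sum(answered) / len(answered) if answered else 1.0
        print(f"threshold={t:.2f}  coverage={coverage:.2f}  precision={precision:.2f}")

sweep([(0.99, True), (0.90, True), (0.70, True), (0.60, False), (0.40, False), (0.20, False)])
```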
The second is from Thinking Machines, titled "Defeating Nondeterminism in LLM Inference". It is the company's first published work since it was founded earlier this year; they reportedly plan a major technical release in Q4, and they remain the last widely anticipated model company in Silicon Valley that has yet to ship anything.
Regarding nondeterminism in LLM inference, the mainstream explanation blames GPU multicore concurrency and atomic adds. This paper argues that a typical LLM forward pass contains no atomic adds at all. The real root cause is that common kernels are not batch-invariant: their reduction order, and hence their floating-point results, depend on batch size and how the batch is sliced, and the batch composition in turn varies with concurrent server load. Once the key operators (RMSNorm, matmul, attention) are made batch-invariant, identical inputs produce bitwise-identical outputs under any concurrent load.
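A small demo of the underlying effect (my sketch, not the paper's code): the same row can come out slightly different when computed alone versus inside a larger batch, because the kernel may choose a different reduction strategy depending on batch size. Whether the comparison actually prints False depends on your hardware and backend.

```python
import torch

torch.manual_seed(0)
a = torch.randn(2048, 2048, dtype=torch.float32)
b = torch.randn(2048, 2048, dtype=torch.float32)

row_in_batch = (a @ b)[:1]   # first row computed as part of the full batch
row_alone = a[:1] @ b        # same row computed with batch size 1

print(torch.equal(row_in_batch, row_alone))            # often False
print((row_in_batch - row_alone).abs().max().item())   # tiny but nonzero difference
```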
Thinking Machines validated this experimentally with Qwen3: the prompt was "Tell me about Richard Feynman", temperature 0, 1,000 tokens generated per run, sampled 1,000 times. With batch-invariant kernels enabled, all 1,000 completions were identical, at the cost of some throughput.
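The experiment is straightforward to reproduce in spirit with vLLM's offline API; the model name below is an assumption (the paper used a Qwen3 model), and without batch-invariant kernels you should expect more than one unique completion.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B")  # assumed model id for illustration
params = SamplingParams(temperature=0.0, max_tokens=1000)

prompts = ["Tell me about Richard Feynman"] * 1000
outputs = llm.generate(prompts, params)

completions = [o.outputs[0].text for o in outputs]
print("unique completions:", len(set(completions)))  # 1 only if the stack is batch-invariant
```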
Asking an AI assistant for practical implications for large-model applications, I got the following:
For fine-tuning and multi-turn dialogue policy work, determinism is a key step toward turning "did this policy change actually help?" from mysticism into engineering;
At zero temperature, deterministic answers greatly reduce the user distrust caused by "different answers to the same question", especially in pipelines that require definitive phrasing, such as Q&A, search query rewriting, generated recommendations, automated customer-service replies, and risk-control explanations;
"Same input, same output" is the foundation for traceability and assignment of responsibility in highly regulated scenarios that demand stability, explainability, and auditability, such as finance, healthcare, and government/enterprise (see the sketch after this list).
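A minimal sketch of why determinism enables that kind of audit trail (my illustration; the model version, parameters, and prompt are hypothetical): once the same input reliably yields the same output, logging a fingerprint of the request and response is enough to later re-run it and verify that the serving stack still reproduces the logged decision.

```python
import hashlib
import json

def fingerprint(model_version: str, params: dict, prompt: str, output: str) -> dict:
    digest = lambda s: hashlib.sha256(s.encode("utf-8")).hexdigest()
    return {
        "model_version": model_version,
        "params": json.dumps(params, sort_keys=True),
        "prompt_sha256": digest(prompt),
        "output_sha256": digest(output),
    }

# On audit, regenerate with the same model/params/prompt and compare output hashes;
# any mismatch means the pipeline no longer reproduces the original decision.
record = fingerprint("qwen3-8b-2025-09", {"temperature": 0.0, "max_tokens": 1000},
                     "Explain why this loan application was declined.", "...model output...")
print(record)
```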