W37 - LLM Hallucinations and Uncertainty

Last week brought two important papers, one from OpenAI and one from Thinking Machines. The former explains why large models hallucinate; the latter identifies the root cause of nondeterminism in LLM inference and validates it experimentally. Both run counter to consensus views in several places and add clear incremental knowledge.

The first is from OpenAI: Why language models hallucinate. TL;DR: you can skip straight to the Conclusions section at the end.

Counterintuitive conclusion: improving model accuracy cannot completely eliminate hallucinations, because many real-world questions have no definitive answer. Hallucinations can, however, be entirely avoided if large models are allowed to abstain and answer “I don’t know.” Current high hallucination rates stem largely from how we evaluate and measure models: accuracy-only scoring rewards confident guessing and gives abstention no credit. Training and evaluation therefore must not focus on accuracy alone; they should also track error rate, abstention rate, and confidence calibration.
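To make the “not accuracy alone” point concrete, here is a minimal sketch (my own construction, not code from the paper) of a grading rule in that spirit: correct answers earn +1, abstentions earn 0, wrong answers are penalized, and the report tracks error and abstention rates alongside accuracy.

```python
def grade(prediction: str | None, gold: str, wrong_penalty: float = 2.0) -> float:
    """Score one eval item: +1 if correct, 0 for abstaining, -wrong_penalty if wrong."""
    if prediction is None:  # abstention ("I don't know")
        return 0.0
    return 1.0 if prediction.strip() == gold.strip() else -wrong_penalty


def evaluate(items: list[tuple[str | None, str]]) -> dict[str, float]:
    """Report score, accuracy, error rate, and abstention rate together."""
    n = len(items)
    abstained = sum(pred is None for pred, _ in items)
    correct = sum(pred is not None and pred.strip() == gold.strip() for pred, gold in items)
    return {
        "score": sum(grade(pred, gold) for pred, gold in items) / n,
        "accuracy": correct / n,
        "error_rate": (n - abstained - correct) / n,
        "abstention_rate": abstained / n,
    }


# Under plain accuracy, a wrong guess and an abstention look the same (both score 0);
# under this rule, the wrong guess is strictly worse than admitting uncertainty.
print(evaluate([("Paris", "Paris"), ("Berlin", "Madrid"), (None, "Oslo")]))
```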

Read together with a product-architecture article, A PM's Guide to AI Agent Architecture: Why Capability Doesn't Equal Adoption, this suggests architectural ways to control hallucinations from the angle of user trust. The counterintuitive takeaway: showing weakness when uncertain wins more adoption than confidently making mistakes. When the model is unsure, admitting uncertainty, abstaining, and explaining itself transparently builds long-term trust more effectively than confident guessing. The author emphasizes a “trust layer”: confidence cues, transparent reasoning, confirmation/permission patterns, graceful boundaries, and human escalation.
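A minimal sketch of what the routing logic of such a trust layer might look like; the calibrated confidence score and the threshold values are my own assumptions, not details from the article.

```python
from dataclasses import dataclass


@dataclass
class AgentReply:
    action: str   # "answer", "confirm", or "escalate"
    message: str


def route_reply(draft_answer: str, confidence: float,
                answer_threshold: float = 0.85,
                confirm_threshold: float = 0.5) -> AgentReply:
    """Route a draft answer based on an (assumed) calibrated confidence score."""
    if confidence >= answer_threshold:
        # High confidence: answer directly, but still surface the confidence cue.
        return AgentReply("answer", f"{draft_answer}\n(confidence: {confidence:.0%})")
    if confidence >= confirm_threshold:
        # Mid confidence: ask the user to confirm before acting (permission pattern).
        return AgentReply("confirm", f"I think: {draft_answer}. Should I proceed?")
    # Low confidence: admit uncertainty and hand off to a human (graceful boundary).
    return AgentReply("escalate", "I'm not confident about this one; routing it to a human agent.")
```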

My own, possibly imperfect, interpretation: current large models lean toward “low precision, high recall.” To curb hallucinations and earn user trust, their reward models should be trained toward “high precision, low recall”: first stabilize the proportion of answers that are correct, then gradually expand the range of questions the model is allowed to answer.
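One way to operationalize “high precision first, then expand recall” is to sweep an abstention threshold over held-out (confidence, correctness) pairs: choose the loosest cutoff that still meets a precision target, then widen coverage as calibration improves. The sketch below is illustrative; the data and target values are made up.

```python
def precision_and_coverage(samples: list[tuple[float, bool]], threshold: float) -> tuple[float, float]:
    """samples are (confidence, is_correct) pairs; the model answers only above the threshold."""
    answered = [ok for conf, ok in samples if conf >= threshold]
    precision = sum(answered) / len(answered) if answered else 1.0
    coverage = len(answered) / len(samples)
    return precision, coverage


def pick_threshold(samples: list[tuple[float, bool]], target_precision: float = 0.95) -> float:
    """Lowest confidence cutoff (i.e., widest coverage) that still meets the precision target."""
    best = 1.0
    for t in sorted({conf for conf, _ in samples}, reverse=True):
        precision, _ = precision_and_coverage(samples, t)
        if precision < target_precision:
            break
        best = t  # precision still holds, so keep lowering the bar
    return best


history = [(0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.60, False)]
t = pick_threshold(history, target_precision=0.9)
print(t, precision_and_coverage(history, t))  # 0.9 (1.0, 0.4): answer 40% of cases, all correct
```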

The second is from Thinking Machines: Defeating Nondeterminism in LLM Inference. This is the company's first public result since its founding earlier this year; they say a major technical release is planned for Q4. It is also the latest of Silicon Valley's widely anticipated model companies.

Regarding nondeterminism in LLM inference, the mainstream view blames GPU multi-core concurrency and atomic adds. This post argues that a typical LLM forward pass involves no atomic adds at all. The real root cause is that inference kernels are not batch-invariant: the reduction strategy they pick depends on batch size and partitioning, so a request's result changes with whatever else happens to be batched alongside it as server load varies. Once the key operators (RMSNorm, MatMul, Attention) are made batch-invariant, the same input yields identical output under any concurrent load.
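The shape-dependence is easy to see through floating-point non-associativity. In the PyTorch sketch below (assuming a CUDA GPU; on some shapes and devices the difference may not show up), the same row multiplied by the same matrix can come out with different low-order bits depending on whether it is computed alone or inside a larger batch:

```python
import torch

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

A = torch.randn(2048, 2048, device=device, dtype=torch.float32)
B = torch.randn(2048, 2048, device=device, dtype=torch.float32)

out_in_batch = (A @ B)[0]   # row 0, computed as part of the full 2048-row matmul
out_alone = (A[:1] @ B)[0]  # row 0, computed on its own

# Mathematically identical, but the kernel may choose a different tiling/reduction
# order for the two shapes, so the results can differ in the low-order bits.
print(torch.equal(out_in_batch, out_alone))
print((out_in_batch - out_alone).abs().max().item())
```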

Thinking Machines validated this experimentally on Qwen3. With the prompt “Tell me about Richard Feynman,” temperature set to 0, and 1,000 tokens generated per sample, they drew 1,000 samples: with the batch-invariant kernels enabled, all 1,000 completions were identical, at the cost of some performance.
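To run a similar check against your own deployment, one simple approach is to send the identical prompt repeatedly at temperature 0 and count distinct completions. A sketch using the OpenAI-compatible Python client; the base URL and model name below are placeholders for your own server:

```python
from openai import OpenAI

# Placeholder endpoint and model name: point these at your own OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

completions = set()
for _ in range(100):  # the Thinking Machines experiment used 1,000 samples
    resp = client.chat.completions.create(
        model="qwen3",  # placeholder
        messages=[{"role": "user", "content": "Tell me about Richard Feynman"}],
        temperature=0,
        max_tokens=1000,
    )
    completions.add(resp.choices[0].message.content)

# With batch-invariant kernels you would expect exactly one unique completion;
# on a typical serving stack under load you will usually see more than one.
print(f"unique completions: {len(completions)}")
```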

On the practical implications for large-model applications, AI gave me the following answers:

  1. For fine-tuning and multi-turn dialogue policy iteration, this is the key step that turns “does this policy improvement actually work?” from mysticism into engineering (see the sketch after this list);

  2. Consistency at temperature 0 can greatly reduce the user distrust caused by “different answers to the same question,” especially in pipelines that require definitive statements, such as Q&A, search-query rewriting, generated recommendations, automated customer-service replies, and risk-control explanations;

  3. A guarantee that the same input yields the same output provides the foundation for traceability and accountability in highly regulated domains such as finance, healthcare, and government/enterprise, where stability, explainability, and auditability are required.
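As an illustration of points 1 and 3 (my own sketch, not from the paper): with deterministic inference you can fingerprint (prompt, completion) pairs and treat the fingerprints as both a regression baseline for policy changes and an audit record.

```python
import hashlib
import json


def fingerprint(prompt: str, completion: str) -> str:
    """Stable hash of an (input, output) pair for audit logs and regression suites."""
    payload = json.dumps({"prompt": prompt, "completion": completion}, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# With deterministic inference, a pinned prompt suite should reproduce the exact
# fingerprints recorded before a model or policy change; any diff then points at
# the change itself rather than at inference noise.
baseline = {
    "feynman": fingerprint("Tell me about Richard Feynman", "<completion recorded at release time>"),
}


def unchanged(prompt_id: str, prompt: str, completion: str) -> bool:
    return fingerprint(prompt, completion) == baseline.get(prompt_id)
```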
