AI is not dependable for fact-checking. The article says AI search answers are wrong about one-third of the time; a 2025 Tow Center study found that more than 60% of AI search responses were inaccurate, while a BBC study put the chatbot error rate at about 45%, or nearly half. When the author tested Grok, Claude, Gemini, and ChatGPT, each produced a plan for checking facts, but none actually carried out the verification.

Benchmark results show the same pattern. In RealFactBench, Claude led with 73% accuracy. OpenAI’s SimpleQA contains more than 4,000 single-answer questions, and no model scored above 50%; after Google cut the question set to 1,000 this year, Gemini 2.5 Pro ranked first at 55.6%. The author notes that ChatGPT itself claimed 90% to 96% accuracy on certain professional tests, but the sources it supplied were muddled or did not exist.

The conclusion is that AI is better suited to helping sort through large volumes of material than to replacing human fact-checkers. Full Fact’s tools are used in more than 40 countries to scan social media posts and transcripts and pass suspect claims to people for confirmation. Angie Holan, who leads the International Fact-Checking Network, likewise argues that fact-checkers should learn to use these tools, but with humans keeping the final say. The author ends by stressing that the critical work remains human: following up with phone calls, reconciling conflicting sources, reading tone and context, and handling knowledge that is not online, none of which AI can reliably recover.