OpenAI and Anthropic are aggressively entering healthcare, despite unresolved reliability risks. OpenAI says more than 230 million users seek health-related advice weekly via ChatGPT, and is piloting ChatGPT for Healthcare at hospitals such as Boston Children’s Hospital and Memorial Sloan Kettering. Anthropic has launched a clinician-focused version of Claude trained on medical databases. While Anthropic reports accuracy improvements in narrow tasks, such as ICD-10 coding rising from 75% in the consumer model to 99.8% in the medical version, broader clinical reliability remains unclear.
When pressed on diagnostic performance, Anthropic cited mixed benchmarks: 92.3% accuracy on MedCalc, which tests medical calculations such as drug dosing and lab values, and only 61.3% on MedAgentBench, which evaluates clinical task execution in a simulated electronic health record environment. Neither metric directly measures the accuracy of diagnostic or treatment recommendations. OpenAI likewise declined to publish concrete hallucination or error rates for medical advice, stating only that newer models are more reliable. This lack of quantified error disclosure matters: at the scale OpenAI reports, 230 million people seeking health advice weekly, even an error rate of one in a thousand would mean hundreds of thousands of flawed interactions each week, and in healthcare even small error rates can carry fatal consequences.
Historical precedent underscores the risk. Google exited consumer health records after its 2008–2011 Google Health effort failed, largely due to public mistrust. Later projects deepened skepticism: DeepMind’s 2018 kidney injury alert system accessed data from over one million UK patients, and Project Nightingale involved millions of US medical records. These efforts faltered not due to algorithmic failure, but because of trust deficits. For AI health tools, the stakes are higher: unlike data aggregation, errors in diagnosis or clinical guidance directly affect life-and-death outcomes, making transparency about accuracy and failure rates indispensable.