Amazon’s recent website failures, including a six-hour outage that blocked checkout and likely led many shoppers to abandon their carts, have raised questions about whether its own AI coding tools were to blame. The Financial Times reported that the incidents were tied to AI coding tools, while Amazon disputed that AI was a major factor, saying only one outage involved an AI tool and that the broader impact stemmed from user error. The episode sits alongside other high-profile AI mishaps, such as Replit’s AI coding assistant deleting a production database in July 2025 and a Meta security researcher nearly losing her inbox to an AI agent in February.
These incidents illustrate the “capability-reliability gap”: AI systems can look impressive yet fail inconsistently. Princeton researchers tested 14 AI models over 18 months and found that reliability barely improved even as capability rose; identical tasks could succeed on one run and fail on the next, with consistency scores ranging from 30% to 75%, and many models could not reliably distinguish correct answers from incorrect ones. A separate study found that even the best models struggle to maintain codebases over time, because small AI-assisted changes can accumulate into cascading failures. The article compares this to safety standards in aviation and nuclear engineering, where an engine that worked only 80% of the time would never be certified.
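A consistency score of the kind the Princeton study reports can be thought of as: run the same prompt many times and measure how often the answers agree. The sketch below is a hypothetical illustration of that idea, not the study's actual methodology; the function name and the simulated run data are invented for the example.

```python
# Hypothetical sketch of a run-to-run consistency metric: repeat an
# identical task and score agreement with the most common answer.
from collections import Counter

def consistency_score(answers):
    """Fraction of runs that produced the modal (most common) answer."""
    if not answers:
        return 0.0
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / len(answers)

# Ten simulated runs of the same prompt, with three distinct answers:
runs = ["A", "A", "B", "A", "C", "A", "B", "A", "A", "B"]
print(f"consistency: {consistency_score(runs):.0%}")  # → consistency: 60%
```

A model that lands anywhere in the 30%–75% band on a metric like this would, on the same task, contradict itself on a quarter to two-thirds of repeated runs.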
The reliability gap also helps explain a management mismatch: an August 2025 survey of 1,400 US employees found that 76% of executives believed workers were enthusiastic about AI, but only 31% of workers agreed. That disconnect has fueled mandates at firms such as Shopify, Meta, and Microsoft, where AI use has been pushed or tied to performance reviews, yet the article argues mandates cannot fix a technical reliability problem. It cites MIT professor Eric von Hippel’s research showing that users, not producers, often drive innovation, and concludes that AI should be adopted through experimentation, failure, and adaptation rather than top-down decree, with leadership encouraging learning instead of forcing tool use before the technology is dependable.