
Leading AI labs are increasingly using advanced mathematics, including unsolved or research-level problems, to gauge model progress beyond earlier perception tasks like identifying cats and dogs. The shift was highlighted when a University of Cambridge undergraduate used OpenAI’s most advanced model to solve a specific instance of the “Erdős problems,” alongside other recent public milestones such as gold-medal performances by OpenAI and Google DeepMind systems at the 2025 International Mathematical Olympiad and the 2025 International Collegiate Programming Contest. Helen Toner of Georgetown’s Center for Security and Emerging Technology described this as a move from simple classification tests to research-level maths as a core yardstick for capability.

DeepMind has built maths-focused tools such as AlphaProof and AlphaGeometry, while Anthropic has begun selling AI systems to scientists, reflecting a broader push toward scientific-discovery workflows. Benchmarks are also formalizing this competition: an advanced-maths tracker from Epoch AI reports OpenAI’s GPT-5.2 leading, followed by Google’s Gemini 3 Pro. Although researchers once expected probabilistic large language models to struggle with maths because of hallucinations, newer “reasoning” models aim to improve reliability by solving problems step by step, tracing errors, and double-checking answers; OpenAI researcher Sébastien Bubeck argued that achievements once thought impossible are now occurring.

Labs are also hiring and monetizing around these capabilities: OpenAI recruited two mathematicians, Ernest Ryu (UCLA) and Mehtaab Sawhney (Columbia), to bolster AI-for-science and advanced-maths performance, and Anthropic’s coding tool Claude Code has helped it compete in a high-accuracy application market tied to a $350bn valuation. Experts caution that autonomous problem-solving remains far off, requiring advances such as continual learning so models can build on experience without forgetting, especially for problems that take weeks or years rather than a single session. For now, OpenAI says its tools are most useful for literature review, summarization, brainstorming, and cross-domain synthesis, while researchers note that maths is attractive because many results are automatically verifiable and large-scale compute can check equations faster than humans.
2026-02-18 (Wednesday) · ffd51ee35d258ffe89163a0bc433cc7e1f19f579