Artificial intelligence researchers are facing a backlash rooted in their own core technology: a flood of low-quality papers and reviews produced with large language models (LLMs), so-called “AI slop”, is eroding confidence in scientific work and risks pushing fabricated claims and content into the literature. In response, major AI conferences have moved quickly in recent months to tighten how LLMs may be used in writing and peer review. Inioluwa Deborah Raji notes the irony: while the industry proclaims that AI will transform every field, the field itself is in turmoil because of widespread AI use. ICLR has updated its guidelines to warn that papers failing to disclose “extensive” LLM use will be rejected, and that reviewers who submit low-quality LLM-generated reviews can be penalized, potentially affecting their own submissions. Hany Farid frames it as a question of scientific trust: if researchers fill the record with errors and shoddy work, why should society trust them as scientists?

Several studies and analyses quantify how far this has spread and the quality risks it carries. A Stanford University study from August 2025 found that up to 22% of computer science papers contained signs of LLM use. The startup Pangram estimated that 21% of ICLR 2025 reviews were fully AI-generated and that more than half used at least some AI assistance (such as polishing the text); among submitted papers, 9% had more than half of their content generated by AI. In November, ICLR reviewers also flagged a suspected AI-generated paper that nonetheless ranked in the top 17% by review score. Separately, a January study by GPTZero reported more than 100 AI-generated errors across 50 NeurIPS papers from last year, roughly two errors per paper.
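As a sanity check on the per-paper figure above, the sketch below simply recomputes it from the two counts GPTZero reported (more than 100 errors across 50 papers); the inputs are the numbers quoted in the article, not new data.

```python
# Back-of-the-envelope check of the GPTZero figure quoted above.
# Inputs are the counts reported in the article, not new measurements.

errors_reported = 100   # "over 100 AI-generated errors" (a lower bound)
papers_examined = 50    # NeurIPS papers covered by the analysis

errors_per_paper = errors_reported / papers_examined
print(f"at least ~{errors_per_paper:.1f} AI-generated errors per paper")
# -> at least ~2.0 AI-generated errors per paper
```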

A sharp rise in volume deepens concerns that incentives are shifting from quality to quantity. NeurIPS said it received 21,575 submissions in 2025, up from 17,491 in 2024 (an increase of about 23%) and 9,467 in 2020 (more than double); one author alone submitted more than 100 papers, far above typical research output. Thomas G Dietterich has likewise observed a sharp increase in computer-science papers on arXiv, but he and other researchers caution that it is hard to tell how much of the growth comes from LLMs rather than from a larger pool of researchers, especially in the absence of industry-wide standards for reliably detecting AI-generated content; telltale signs include hallucinated references and incorrect figures. A deeper, longer-term risk is a data feedback loop: companies such as Google, Anthropic, and OpenAI promote their models as “co-scientists” and train them on academic corpora, so a rising share of AI-generated text in that material could degrade model performance. Prior work suggests that LLMs trained on too much unfiltered AI-generated data can “collapse” into gibberish as the diversity they can learn from shrinks. Kevin Weil stresses that an LLM, like any tool, can dramatically accelerate exploration, but its output must be checked and cannot substitute for rigor.
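The growth figures can be reproduced directly from the submission totals NeurIPS reported; a minimal sketch, assuming only the three counts quoted above:

```python
# Recompute NeurIPS submission growth from the totals quoted above.

submissions = {2020: 9_467, 2024: 17_491, 2025: 21_575}

def growth(base_year: int, year: int) -> tuple[int, float]:
    """Return the absolute and percentage increase between two years."""
    base, latest = submissions[base_year], submissions[year]
    return latest - base, 100 * (latest - base) / base

for base_year in (2024, 2020):
    delta, pct = growth(base_year, 2025)
    print(f"2025 vs {base_year}: +{delta:,} submissions ({pct:+.1f}%)")
# 2025 vs 2024: +4,084 submissions (+23.3%)
# 2025 vs 2020: +12,108 submissions (+127.9%)
```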
