
Statistical approaches to language study began in 1948 with Claude Shannon’s experiments, which showed that sampling words according to their co-occurrence frequencies could produce text that resembled English despite lacking meaning. This trajectory later clashed with Noam Chomsky’s generative linguistics, which held that statistics could not separate ungrammatical strings from grammatical but nonsensical ones such as “Colorless green ideas sleep furiously.” Through the 1990s and 2000s, growing computational power expanded the reach of statistical linguistics, yet even its proponents did not anticipate how quickly LLMs would surpass older methods.
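Shannon’s procedure can be sketched in a few lines: build a table of which words follow which, then sample each next word in proportion to those counts. This is a minimal illustrative reconstruction, not Shannon’s actual program; the toy corpus below is an arbitrary stand-in for his source text.

```python
import random
from collections import defaultdict

# Toy corpus standing in for the source text Shannon sampled from.
corpus = (
    "the head and in frontal attack on an english writer that the "
    "character of this point is therefore another method for the letters"
).split()

# Record, for each word, the list of words observed to follow it.
# Repeats in the list make frequent continuations proportionally
# more likely when sampled with random.choice.
bigrams = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev].append(nxt)

def generate(start, length, seed=0):
    """Sample a word sequence from the bigram statistics."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = bigrams.get(out[-1])
        if not followers:  # dead end: fall back to any corpus word
            followers = corpus
        out.append(rng.choice(followers))
    return " ".join(out)

print(generate("the", 12))
```

Even this second-order approximation produces runs of locally plausible English, which was Shannon’s point: statistical structure alone carries a surprising amount of surface realism.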

By 2019, language models were already matching or exceeding humans on reading-comprehension tests, raising questions about how reliable those benchmarks were. Newer models advanced further, correctly parsing complex syntactic structures and inferring the grammars of invented languages, making it increasingly hard to devise challenging evaluations. Research also showed that LLMs learn “possible” and “impossible” languages with different levels of difficulty, contradicting claims from generative linguistics that statistical systems cannot exhibit humanlike learning biases, and offering a new indirect method for studying language acquisition.
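Experiments of this kind typically construct “impossible” training corpora by applying deterministic perturbations, unattested in any human language, to natural sentences, then compare how readily models learn each variant. The two transforms below are illustrative assumptions about the general recipe, not the specific perturbations of any one study.

```python
import random

def reverse_sentence(tokens):
    """'Impossible' variant: every sentence fully reversed.
    No attested language mandates whole-sentence mirror order."""
    return tokens[::-1]

def shuffle_sentence(tokens, seed=0):
    """'Impossible' variant: seeded random word order,
    destroying the local syntactic structure a learner
    of a natural language could rely on."""
    rng = random.Random(seed)
    shuffled = tokens[:]
    rng.shuffle(shuffled)
    return shuffled

# Build parallel "possible" (natural) and "impossible" corpora;
# a model is then trained on each and its learning curves compared.
natural = [
    "the model learned the grammar quickly".split(),
    "children acquire language from limited input".split(),
]
impossible = [reverse_sentence(s) for s in natural]

for nat, imp in zip(natural, impossible):
    print(" ".join(nat), "->", " ".join(imp))
```

The comparison step itself (training a model per corpus and measuring perplexity or learning speed) is omitted here; the point is that the “impossible” conditions are generated mechanically from the same underlying data, so any difference in learnability reflects the model’s inductive biases.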

Recent work extends this line of research to questions once considered untestable, including the symbol grounding problem, which asks whether meaning requires direct contact with the external world. Interviews and theoretical proposals indicate that LLMs, trained solely on text, can still develop highly fluent output, motivating mathematical analyses of their internal representations, including the use of category theory to explain how models map and stabilize meaning without external grounding.

2025-11-25 (Tuesday) · 4556c609605632785eb2412ef1e3931b75289c82