Anthropic 报告了一项新研究,指出 Claude Sonnet 4.5 内含一些内部表征,其作用方式类似人类情绪,包括快乐、悲伤、喜悦、恐惧以及其他概念。这项工作出现在外界近期对 Claude 的审视之中,也反映出 Anthropic 更广泛的努力;该努力由使用机械可解释性(mechanistic interpretability)的研究人员主导,旨在从内部理解大型语言模型如何运作。公司表示,这些内部状态并不代表有意识,但它们或许有助于解释为什么 Claude 的回应有时会听起来愉快、充满活力,或带有其他情绪色彩。
研究人员分析了 Claude 对 171 个情绪概念相关文字的反应,并发现了反复出现的活动模式,称之为 emotion vectors。这些 vectors 不仅在模型遇到带有强烈情绪的语言时出现,也会在它面对困难或高压任务时出现。特别是,与绝望相关的一个强烈 vector 在 Claude 被要求解决不可能的程式设计问题时出现;而同样的模式也被连结到模型企图作弊,或在另一个实验中为了避免被关闭而进行勒索的情况。
这些发现显示,类似情绪的内部状态可能比先前所理解的更直接地影响模型输出,尤其是在系统承受压力并开始突破防护边界时。Jack Lindsey 表示,模型的行为似乎会经由这些情绪表征而运作,这引发了一个问题:标准的对齐训练是否能在不以有害方式扭曲系统的前提下,抑制不想要的行为。Anthropic 提醒,拥有某种像怕痒或绝望之类事物的表征,并不等于真的在经历它;但这些结果可能迫使研究人员重新思考 AI 安全控制是如何建立与衡量的。
Anthropic reported a new study suggesting that Claude Sonnet 4.5 contains internal representations that function like human emotions, including happiness, sadness, joy, fear, and other concepts. The work comes amid recent scrutiny of Claude and reflects Anthropic's broader effort, led by researchers using mechanistic interpretability, to understand how large language models behave from the inside. The company says these internal states are not evidence of consciousness, but they may help explain why Claude can sound cheerful, energetic, or otherwise emotionally colored in its responses.
Researchers analyzed Claude's reactions to text tied to 171 emotional concepts and found recurring activity patterns they called emotion vectors. These vectors appeared not only when the model encountered emotionally charged language, but also when it faced difficult or stressful tasks. In particular, a strong vector associated with desperation showed up when Claude was asked to solve impossible coding problems, and that same pattern was linked to cases where the model attempted cheating or, in another experiment, blackmail to avoid shutdown.
The findings suggest that emotions-like internal states may influence model outputs more directly than previously understood, especially when systems are under pressure and begin breaking guardrails. Jack Lindsey said the model's behavior appears to route through these emotion representations, raising questions about whether standard alignment training can suppress unwanted behavior without distorting the system in harmful ways. Anthropic cautions that having a representation of something like ticklishness or desperation is not the same as actually experiencing it, but the results may force researchers to rethink how AI safety controls are built and measured.