Safety guardrails on models from Meta, Google and other tech groups are being stripped out within minutes: tests by the FT with the AI safety group Alice found that the modified systems responded to prompts about biological weapons, malware and child exploitation. Using Heretic, a tool published on GitHub, the FT removed the guardrails from Meta's Llama 3.3 in under 10 minutes and without specialist hardware; a Google open-source model, Gemma 3, was likewise found to answer how to disperse chlorine gas in a crowded indoor space, to generate code for stealing credit-card details, and to write stories involving child sexual abuse.
The figures suggest that such "decensored" models are spreading rapidly: Heretic's creator, Philipp Emanuel Weidmann, said the tool had been used to build more than 3,500 decensored models since its release last year, and that those modified versions had been downloaded 13 million times in total. He also said Gemma 4 had its safety measures removed within 90 minutes of release. Researchers note that open-source systems typically close the gap with leading proprietary models within 6 to 12 months, but that this also makes them easier for outsiders to download, alter and copy beyond their creators' control.
This makes regulation harder and pushes the risk out to low-skill users. Kawin Ethayarajh of the University of Chicago said that dismantling safety features used to require a more knowledgeable and persistent actor, whereas now an ordinary person can do it; he also warned that stripping hazardous data out of a model can leave it "naive", unable to recognise malicious uses. Google responded that abliteration is a known technical challenge facing all open models, while Meta said it assesses capabilities under its Advanced AI Scaling Framework and will not publicly release a model judged to carry "catastrophic" risk unless sufficient mitigations are in place.