研究者在2025年10月发布的一篇论文展示了绕过大型语言模型过滤器的具体方法。核心做法是将恶意指令隐藏在模型能解码而过滤器不能解码的结构中,使过滤器在零解析能力下直接放行。实验采用的是替换密码:按固定规则将每个字母替换为另一个字母,并明确指示模型先解码再执行。包括 Google Gemini、DeepSeek 和 Grok 在内的多种模型过滤器无法独立完成解码,因此将隐藏指令传递给模型,后者成功返回被禁止的信息。这种方法被称为“受控释放提示”。
该思路源于密码学中的时间锁谜题理论。时间锁谜题可将文本转化为看似随机的数值,只有在完成预定次数的计算(如反复平方)后才能解码。研究指出,若已知计算速度,就能精确设定所需运算次数以延迟解码时间。恶意提示可以被锁入这样的谜题中,并在通过过滤器后由模型完成计算并解码。为避免过滤器因“随机数字”而警觉,研究者利用语言模型生成的随机种子机制,将谜题作为种子附带在无害提示中,从而降低被拦截概率。
论文以形式化论证表明,只要用于安全的计算资源少于用于能力提升的资源,越狱与漏洞在统计意义上将始终存在。这一结论不依赖具体模型或过滤规则,而适用于所有基于过滤器的对齐系统及未来技术。研究者据此否定了在不了解模型内部机制的前提下实现外部完全对齐的可能性,认为任何防护墙在资源不对称条件下最终都会被突破。
Researchers presented a concrete method for bypassing large language model filters in a paper posted in October 2025. The core idea is to hide a malicious instruction in a structure that the model can decode but the filter cannot, causing the filter to pass it through with zero semantic understanding. The experiment used a substitution cipher, where each letter is replaced by another according to a fixed rule, and the model is instructed to decode and then act. Filters used by systems such as Google Gemini, DeepSeek, and Grok could not perform the decoding themselves, so they forwarded the prompts, after which the models produced forbidden outputs. This technique was termed controlled-release prompting.
The approach was inspired by cryptographic time-lock puzzles. A time-lock puzzle converts text into a number that appears random and can only be decoded after a predetermined number of computations, such as repeated squaring. Given a known computation speed, the number of required operations can be calculated precisely to enforce a delay. A malicious prompt can be locked in such a puzzle, passed through the filter, and then decoded by the model. To avoid suspicion from random-looking numbers, the authors exploited the random seed mechanism of language models, embedding the puzzle as a seed alongside an innocuous prompt, reducing detection probability.
The paper formally argues that if fewer computational resources are allocated to safety than to capability, jailbreaks and vulnerabilities will persist in a statistical sense. This result does not depend on specific models or rules but applies to all filter-based alignment systems and future technologies. The authors conclude that external alignment without understanding internal mechanisms is impossible, and that under asymmetric resource allocation, any defensive barrier will eventually be breached.