Demand for AI computing is shifting from training to inference, weakening the relative advantage of GPUs. McKinsey estimates that by the end of the decade inference will account for three-fifths of demand in AI data centres. Training relies on massive parallelism (Nvidia's B200, for example, has more than 16,000 cores), but inference has two stages, prefill and decode, and the decode stage in particular needs constant access to the model weights and previously generated tokens, making memory-access speed the limiting factor. As models and prompts grow longer, that memory bottleneck is worsening.
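The difference between the two stages can be made concrete with a back-of-envelope measure of arithmetic intensity, the ratio of computation done to bytes of weights read from memory. The model size and token counts below are illustrative assumptions, not figures from the article:

```python
# Toy estimate of arithmetic intensity (FLOPs per byte of weights read)
# for the prefill and decode phases of transformer inference.
# All numbers here are illustrative assumptions, not measurements.

def arithmetic_intensity(batch_tokens: int, n_params: float,
                         bytes_per_param: int = 2) -> float:
    """FLOPs per byte moved when processing `batch_tokens` tokens in one pass.

    A forward pass costs roughly 2 * n_params FLOPs per token, and every
    pass must stream all weights (n_params * bytes_per_param bytes).
    """
    flops = 2 * n_params * batch_tokens
    bytes_moved = n_params * bytes_per_param
    return flops / bytes_moved

N = 70e9  # assumed 70B-parameter model

# Prefill: thousands of prompt tokens share one sweep over the weights.
prefill = arithmetic_intensity(batch_tokens=4096, n_params=N)

# Decode: one new token per sweep -- the weights are re-read for every token.
decode = arithmetic_intensity(batch_tokens=1, n_params=N)

print(f"prefill: {prefill:.0f} FLOPs/byte, decode: {decode:.0f} FLOPs/byte")
# Prefill does thousands of times more work per byte fetched, so prefill
# tends to be compute-bound while decode is memory-bandwidth-bound.
```

Prefill keeps the arithmetic units busy; decode mostly waits on memory, which is why it is so sensitive to memory-access speed.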

The problem is the “memory wall”: AI chips pair small, fast on-chip SRAM with much larger but slower off-chip DRAM, which can be ten times slower to access and far more energy-intensive per bit. One study finds that over the past two decades peak computing performance roughly tripled every two years, whereas off-chip memory bandwidth improved by only about 1.6 times over the same period. Responses so far include software workarounds, such as batching requests, and new hardware designs. Nvidia's new chip pairs about 500 megabytes of SRAM with software orchestration; Cerebras's wafer-scale chip packs 900,000 cores and 44GB of on-chip SRAM, and the firm claims inference up to 15 times faster than conventional designs.
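Compounding those two growth rates shows why the wall keeps rising. Taking the cited figures at face value, with a two-year doubling period assumed:

```python
# Compounding the growth rates cited above: if peak compute improves ~3x
# per two-year period while off-chip memory bandwidth improves ~1.6x,
# the gap between them widens enormously over 20 years (10 periods).
# The per-period rates are taken from the study cited in the text.

periods = 10                       # 20 years at 2 years per period
compute_growth = 3.0 ** periods    # cumulative compute improvement
bandwidth_growth = 1.6 ** periods  # cumulative bandwidth improvement
gap = compute_growth / bandwidth_growth

print(f"compute: {compute_growth:,.0f}x  "
      f"bandwidth: {bandwidth_growth:.0f}x  gap: {gap:,.0f}x")
# Compute grows ~59,049x while bandwidth grows only ~110x, leaving the
# arithmetic units hundreds of times hungrier than memory can feed them.
```

Each generation of chips can therefore do ever more arithmetic per byte it is able to fetch, which is exactly the wrong trade for the memory-bound decode stage.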

Rivals are trying more radical approaches. MatX is building a systolic array that can be split, so that resources can be allocated differently between prefill and decode; d-Matrix uses in-memory computing, merging storage and computation; Etched is designing chips specialised for transformers. Chinese researchers have even proposed encoding model weights directly into a chip's metal wiring, all but eliminating parameter fetching. But heavy specialisation is risky: a new chip typically takes 12 to 18 months to design, while AI algorithms evolve faster. The next phase of AI may well require very different processors, but the eventual winner is far from clear.
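What all of these designs attack is the same ceiling: in the bandwidth-bound decode regime, token rate is capped by how fast the weights can be streamed to the compute units. A minimal sketch, with bandwidth and model-size figures that are purely illustrative assumptions rather than vendor specifications:

```python
# Back-of-envelope ceiling for the bandwidth-bound decode regime:
#   tokens/s <= memory_bandwidth / model_size_in_bytes
# Bandwidth and model-size figures are illustrative assumptions only.

def max_decode_tokens_per_s(bandwidth_bytes_per_s: float,
                            model_bytes: float) -> float:
    """Upper bound on decode throughput when every token requires
    streaming the full set of weights from memory."""
    return bandwidth_bytes_per_s / model_bytes

model_bytes = 70e9 * 2  # assumed 70B params at 2 bytes each

# Hypothetical off-chip HBM system (~8 TB/s) vs a hypothetical
# wafer-scale on-chip SRAM system (~100 TB/s aggregate).
hbm = max_decode_tokens_per_s(8e12, model_bytes)
sram = max_decode_tokens_per_s(100e12, model_bytes)

print(f"HBM-bound: ~{hbm:.0f} tok/s  on-chip SRAM: ~{sram:.0f} tok/s")
# Raising effective bandwidth -- or, as with weights etched into wiring,
# removing the fetch entirely -- is what lifts this ceiling.
```

Under this simple model, any design that keeps weights closer to the arithmetic, whether via on-chip SRAM, in-memory computing, or hard-wired parameters, raises the ceiling in direct proportion to the bandwidth gained.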

2026-03-21 (Saturday) · 48ee5942fdbc5abce768e81a8133bf63e78105ff