The article argues that Nvidia's true moat is CUDA, the software stack for GPU parallel computing, rather than the hardware itself. Open-source AI caused a brief scare when DeepSeek appeared, but the article maintains that frontier open models have still not produced a sustained, clear advantage over proprietary systems. GPUs built for gaming were repurposed as general-purpose compute platforms when Stanford PhD Ian Buck and John Nickolls created CUDA. With CUDA-style task partitioning, a 9×9 multiplication table can shift from a serialized sequence of 81 steps to roughly 45 distinct operations by exploiting commutativity: since a×b = b×a, only 45 of the 81 products are unique. Differences like these matter when a single training run can cost roughly US$100 million.
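The 81-versus-45 claim can be checked directly. A minimal sketch (not from the article's code, just its arithmetic): enumerate every product in the 9×9 table, then keep one representative of each commutative pair.

```python
# The article's 9x9 multiplication-table argument: naively there are
# 81 products, but commutativity (a*b == b*a) leaves only 45 distinct
# pairs to compute -- analogous to how good task partitioning avoids
# redundant work.

naive = [(a, b) for a in range(1, 10) for b in range(1, 10)]

# Keep only one ordering of each pair (a <= b): 36 mixed pairs + 9 squares.
distinct = [(a, b) for a, b in naive if a <= b]

print(len(naive))     # 81
print(len(distinct))  # 45
```

The count works out because 45 = 9 squares plus C(9, 2) = 36 unordered mixed pairs.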
CUDA is treated as a layered platform, not merely a programming language. Its optimized libraries shave nanoseconds off vast numbers of micro-operations, and those gains compound into a substantial performance gap at scale. DeepSeek is cited as tuning directly in PTX, Nvidia's low-level layer, because hand-tuning there can beat the surface convenience of higher-level APIs. The article's example: a matrix multiply that takes three lines in PyTorch expands to 50+ lines in CUDA. This illustrates why kernel optimization remains difficult: it can demand near-machine-level control, which the article likens to specifying a knife's cutting angle and an applied force of 36.2 newtons (it also cites a 2.35-inch, about 5.97 cm, height in a similar quantified illustration).
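The abstraction gap the article describes can be sketched in plain Python (an illustration, not the article's benchmark): a high-level library call hides the explicit loop structure that a hand-written kernel author must spell out and tune.

```python
# Illustrative sketch of the abstraction gap: in PyTorch the whole
# operation is roughly `C = A @ B`, while a kernel author reasons about
# the explicit loops below -- in a real CUDA kernel, each (i, j) output
# element would typically map to one GPU thread.

def matmul_manual(A, B):
    """Explicit triple-loop matrix multiply over lists of lists."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):            # rows of A
        for j in range(m):        # columns of B
            for p in range(k):    # reduction dimension
                C[i][j] += A[i][p] * B[p][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_manual(A, B))  # [[19, 22], [43, 50]]
```

A real CUDA kernel adds tiling, shared-memory staging, and synchronization on top of these loops, which is where the 50+ lines come from.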
The largest advantage is lock-in. Most modern ML frameworks sit on CUDA, and CUDA runs best on Nvidia chips, so AMD hardware can underperform in practice even with a stronger spec sheet; the article cites independent researchers for this gap, including AMD MI300X versus Nvidia H100 comparisons. Alternatives have repeatedly stalled: OpenCL never gained broad adoption, AMD ROCm has been dogged by bugs and compatibility problems, and Intel's oneAPI had not displaced CUDA as of 2026. A potential rival, Modular, led by Chris Lattner, is acknowledged but has yet to shift the landscape. Most importantly, top GPU-kernel engineers are scarce and concentrated at Nvidia, a talent moat the article compares to the ecosystem pull Apple maintains through iOS and the App Store, and one that likewise supports premium pricing.