CUDA 13.1 is described as the largest CUDA update in nearly 20 years. Its centerpiece is CUDA Tile, a tile-based programming model layered above SIMT; the first release supports only NVIDIA Blackwell GPUs with compute capability 10.x and 12.x. The release also exposes lightweight green contexts in the runtime and adds a more flexible `split()` API for fine-grained partitioning of Streaming Multiprocessor (SM) resources. Static SM partitioning is available on Ampere (compute capability 8.0) and newer GPUs; on Hopper (9.0) and newer, partitions are built from chunks of 8 SMs.
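To make the 8-SM chunk granularity concrete, the sketch below works through the arithmetic of dividing an SM budget into whole chunks. It is plain arithmetic, not a CUDA API call: the function names `sm_chunks` and `round_to_chunks`, the round-up policy, and the example count of 132 SMs are all assumptions chosen for illustration.

```python
CHUNK_SMS = 8  # chunk size the text gives for Hopper (9.0) and newer


def sm_chunks(total_sms: int, chunk: int = CHUNK_SMS) -> tuple:
    """Return (whole chunks, leftover SMs) for a device with total_sms SMs."""
    if total_sms < 0 or chunk <= 0:
        raise ValueError("total_sms must be >= 0 and chunk must be > 0")
    return divmod(total_sms, chunk)


def round_to_chunks(requested_sms: int, chunk: int = CHUNK_SMS) -> int:
    """Round a requested SM count up to a whole number of chunks.

    Rounding up is an assumed policy for this sketch, not documented behavior.
    """
    if requested_sms < 0 or chunk <= 0:
        raise ValueError("requested_sms must be >= 0 and chunk must be > 0")
    return -(-requested_sms // chunk) * chunk  # ceiling division, then scale


# Hypothetical device with 132 SMs: 16 whole chunks, 4 SMs left over.
chunks, leftover = sm_chunks(132)

# A request for 20 SMs does not align to the chunk size; it becomes 24.
aligned = round_to_chunks(20)
```

Under these assumptions, a partition request is only meaningful in multiples of 8 SMs, so sizing partitions in whole chunks up front avoids surprises about how many SMs a partition actually receives.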

New memory and device-partitioning features include MLOPart, which creates up to two memory-locality-optimized partitions per GPU on Blackwell devices with compute capability 10.0 and 10.3. It is currently limited to NVIDIA B200 and B300, with GB200 and GB300 support planned. In the math libraries, cuBLAS gains FP64 and FP32 emulation and a grouped GEMM API for FP8 and BF16/FP16. On Blackwell and Hopper architectures, the grouped GEMM API can deliver up to 4x speedup over multi-stream GEMM in Mixture-of-Experts (MoE) workloads.
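In an MoE layer each expert typically receives a different number of tokens, so the work is a set of independent matrix products with different shapes. The pure-Python reference below sketches what a grouped GEMM computes semantically: one call covering every pair of operands, rather than one launch per expert on separate streams. It says nothing about the cuBLAS signature; `matmul` and `grouped_gemm` are hypothetical names, and the shapes and values are made up.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Naive (m x k) @ (k x n) product, used only as a readable reference."""
    k = len(b)
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    n = len(b[0]) if b else 0
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for row in a]


def grouped_gemm(problems: Sequence[Tuple[Matrix, Matrix]]) -> List[Matrix]:
    """Compute every independent product in the group.

    Each problem may have its own (m, k, n); that per-problem shape freedom
    is what separates a grouped GEMM from a fixed-shape batched GEMM.
    """
    return [matmul(a, b) for a, b in problems]


# Two hypothetical experts: expert 0 received 1 token, expert 1 received 2.
expert0 = ([[1.0, 2.0]], [[1.0, 0.0], [0.0, 1.0]])      # (1x2) @ (2x2)
expert1 = ([[1.0, 2.0], [3.0, 4.0]], [[2.0], [1.0]])    # (2x2) @ (2x1)

outputs = grouped_gemm([expert0, expert1])
# outputs[0] is 1x2 and outputs[1] is 2x1: one call, two different shapes.
```

Many small, unevenly sized problems are where per-launch overhead tends to dominate, which is the multi-stream baseline the reported speedup is measured against.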

cuSOLVER improves batched eigendecomposition performance. In a batched SYEV test with 5,000 matrices of 24 to 256 rows, RTX PRO 6000 Blackwell is about 2x faster than NVIDIA L40S, and GEEV also shows speedups for matrix sizes from 1,024 to 32,768. Tooling updates include CUDA Tile kernel profiling in Nsight Compute 2025.4, system-wide CUDA tracing in Nsight Systems 2025.6.1, compile-time patching in Compute Sanitizer 2025.4 via `-fdevice-sanitize=memcheck`, and a determinism mode and single-phase CUB APIs in CCCL 3.1.
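Determinism matters for parallel reductions because floating-point addition is not associative: the same values summed in a different order can differ in the last bits. The sketch below shows that effect in plain Python. It does not use CCCL or CUB, and it illustrates only the motivation for a determinism option, not how CCCL implements one; `sum_in_order` is a hypothetical helper.

```python
import math


def sum_in_order(values):
    """Left-to-right floating-point sum; the result depends on element order."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [0.1, 0.2, 0.3]
forward = sum_in_order(values)          # (0.1 + 0.2) + 0.3
backward = sum_in_order(values[::-1])   # (0.3 + 0.2) + 0.1

# The two orders produce different doubles for the same inputs.
order_matters = forward != backward

# math.fsum returns the correctly rounded sum, so element order cannot change it.
stable = math.fsum(values) == math.fsum(values[::-1])
```

A parallel reduction whose combination order varies from run to run can therefore return slightly different results for identical input, which is the behavior a determinism mode is meant to rule out.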