The article, dated Apr 02, 2026 and written by Anu Srivastava, positions Gemma 4 as a single model family that spans data-center, edge, and on-device deployment, emphasizing secure, low-latency, and cost-sensitive use cases. It highlights capability areas including reasoning, code generation and debugging, structured tool calling, and vision/video/audio support, with multimodal prompts that can mix text and images in one query. It also reports pre-training on over 140 languages and out-of-the-box support for more than 35 languages, indicating broader language coverage than earlier generations.
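To make the interleaved text-and-image prompting concrete, here is a minimal sketch of such a request in the OpenAI-compatible chat format that servers like vLLM and NIM commonly expose. The message shape is an assumption about the serving API, and the model identifier and image URL are placeholders, not values from the article.

```python
# Sketch of a multimodal chat request that interleaves text and an image in
# one query. The payload schema follows the widely used OpenAI-compatible
# chat-completions format; model name and URL are hypothetical placeholders.
import json

request = {
    "model": "gemma-4-31b",  # hypothetical model identifier
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is wrong with this circuit?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/circuit.png"}},
                {"type": "text", "text": "Suggest a fix."},
            ],
        }
    ],
}

# Serialize for POSTing to a chat-completions endpoint.
print(json.dumps(request, indent=2))
```

Because the content field is a list, text and image parts can alternate freely, which is what "mix text and images in one query" amounts to at the API level.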

Gemma 4 comprises four models, among them a 31B dense model and a 26B-A4B MoE model with 128 experts. Effective parameters are 3.8B for the 26B-A4B, 4.5B for the 7.9B E4B, and 2.3B for the 5.1B E2B, giving effective-to-total ratios of about 14.6%, 56.9%, and 45.1%, respectively. Context lengths are 256K tokens for the 31B and 26B-A4B models and 128K for E4B/E2B, the latter using 512-token sliding windows. The family is said to fit on a single NVIDIA H100 and, depending on variant, supports text, audio, vision, and video modalities.
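The effective-to-total ratios quoted above can be checked directly from the parameter counts. The (effective, total) figures in billions are taken from the article; small differences from the quoted percentages come from the parameter counts themselves being rounded.

```python
# Verify the effective-to-total parameter ratios for the MoE/selective models.
# (effective_params, total_params) in billions, as reported in the article.
models = {
    "26B-A4B": (3.8, 26.0),
    "E4B":     (4.5, 7.9),
    "E2B":     (2.3, 5.1),
}

for name, (effective, total) in models.items():
    ratio = effective / total * 100
    print(f"{name}: {ratio:.1f}% of parameters active per token")
```

The low 14.6% ratio of the 26B-A4B is the point of the MoE design: only a few of the 128 experts fire per token, so inference cost tracks the 3.8B effective parameters rather than the 26B total.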

For deployment, DGX Spark with the GB10 Grace Blackwell superchip and 128GB unified memory can run the 31B BF16 model locally, and an NVFP4-quantized 31B checkpoint for Blackwell is expected soon. NVIDIA maps edge workflows across RTX/RTX Pro and Jetson, scaling from Jetson Orin Nano to Jetson Thor. Local inference stacks use vLLM, Ollama, and llama.cpp, while Unsloth and NVIDIA NeMo Automodel support fast deployment and day-0 fine-tuning via SFT and memory-efficient LoRA directly from Hugging Face checkpoints. For production, teams can prototype with the free 31B NIM API from NVIDIA’s catalog and move to self-hosted NIM microservices under enterprise licensing. The article concludes that Apache 2.0 licensing enables quick adoption across scales while balancing speed, security, and cost.
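A back-of-envelope weight-memory estimate shows why these deployment pairings make sense. The bytes-per-parameter figures are standard for each format (BF16 is 2 bytes; NVFP4 packs 4-bit weights, so ~0.5 bytes/parameter is a rough floor before per-block scale overhead); real deployments also need memory for activations and the KV cache on top of the weights.

```python
# Rough weight-memory footprint for the 31B model in the formats mentioned
# above. Weights only: activations, KV cache, and quantization-scale
# metadata add to these numbers in practice.
GB = 1024 ** 3

def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Memory in GiB needed to hold the model weights alone."""
    return params_billion * 1e9 * bytes_per_param / GB

bf16_31b = weight_gb(31, 2.0)    # ~57.7 GB: fits DGX Spark's 128 GB unified memory
nvfp4_31b = weight_gb(31, 0.5)   # ~14.4 GB: why an NVFP4 checkpoint targets Blackwell GPUs

print(f"31B BF16  weights: {bf16_31b:.1f} GB")
print(f"31B NVFP4 weights: {nvfp4_31b:.1f} GB (plus scale/metadata overhead)")
```

At ~58 GB for BF16 weights, the 31B model leaves DGX Spark ample headroom for the KV cache at long context, which is consistent with the article's claim that it runs locally there.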

2026-04-03 (Friday) · de1aa89df4a26f93ae263cd45f951327fb5b8613