The next frontier of AI is reasoning models that think critically and learn during inference to solve complex problems. To train and serve this new class of models, you need infrastructure with the performance and efficiency to handle massive datasets and context windows while still delivering rapid, reliable responses. And to continue pushing the boundaries, you need a system built to handle as-yet-unknown requirements.
Today, we're excited to announce the preview of A4X VMs, powered by NVIDIA GB200 NVL72, a system consisting of 72 NVIDIA Blackwell GPUs and 36 Arm-based NVIDIA Grace CPUs connected via fifth-generation NVIDIA NVLink. With this integrated system, A4X VMs directly address the significant compute and memory demands of reasoning models that use chain-of-thought, unlocking new levels of AI performance and accuracy.
Google Cloud is the first and only cloud provider today to offer both A4 VMs powered by NVIDIA B200 GPUs and A4X VMs powered by NVIDIA GB200 NVL72.
Key A4X features and capabilities
A4X VMs are built upon several key innovations to enable the next frontier of AI:
- NVIDIA GB200 NVL72: This configuration enables 72 Blackwell GPUs to function as a single, unified compute unit with shared memory and high-bandwidth communication. For example, this unified architecture helps achieve low-latency responses for multimodal reasoning across concurrent inference requests.
- NVIDIA Grace CPUs: These custom Arm-based chips connect to the Blackwell GPUs over NVLink chip-to-chip (C2C), enabling the efficient checkpointing, offloading, and rematerializing of model and optimizer state that's required to train and serve the largest models.
- Enhanced training performance: With more than 1 exaflop of compute per GB200 NVL72 system, A4X offers a 4X increase in LLM training performance compared to A3 VMs powered by NVIDIA H100 GPUs.
- Scalability and parallelization: A4X VMs support deploying models across tens of thousands of Blackwell GPUs, using the latest sharding and pipelining strategies to maximize GPU utilization. Google Cloud's high-performance networking, based on RDMA over Converged Ethernet (RoCE), combines NVL72 racks into single, rail-aligned, non-blocking clusters of tens of thousands of GPUs. This isn't just about size; it's about efficiently scaling your most complex models.
- Optimized for reasoning and inference: The A4X architecture, with its 72-GPU NVLink domain, is specifically designed for low-latency inference, especially for reasoning models that employ chain-of-thought techniques. Sharing memory and workload across all 72 GPUs (including the KV cache for long-context models) keeps latency low, while the large NVLink domain also improves batch-size scaling and lowers TCO, so you can serve more concurrent user requests.
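To make the long-context memory pressure concrete, here is a rough KV-cache sizing sketch. The model shape and context length below are illustrative assumptions (a hypothetical large model with grouped-query attention), not the specs of any particular model:

```python
# Rough KV-cache sizing for long-context serving. All model parameters
# here (layers, KV heads, head dim, context length) are illustrative
# assumptions, not a specific model.

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 ctx_len: int, bytes_per_value: int = 2) -> float:
    """Size of the KV cache for one request, in GiB.

    Each layer stores a K tensor and a V tensor of shape
    (ctx_len, kv_heads, head_dim), hence the factor of 2.
    """
    return 2 * layers * kv_heads * head_dim * ctx_len * bytes_per_value / 2**30

# One 1M-token request with 16-bit KV values:
per_request = kv_cache_gib(layers=96, kv_heads=8, head_dim=128,
                           ctx_len=1_000_000)
print(f"~{per_request:.0f} GiB of KV cache for one 1M-token request")  # ~366 GiB
```

At hundreds of GiB per request, even a handful of long-context requests exceeds any single GPU's HBM, which is why pooling memory across a 72-GPU NVLink domain matters for serving concurrency.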
The Google Cloud advantage
A4X VMs are part of our supercomputing architecture, AI Hypercomputer, and benefit from Google Cloud’s data center, infrastructure, and software expertise. With the power of AI Hypercomputer, A4X customers can take advantage of:
- Hypercompute Cluster: Hypercompute Cluster lets you deploy and manage large clusters of A4X VMs with compute, storage, and networking as a single unit. This makes it easy to manage complexity while delivering exceptionally high performance and resilience for large distributed workloads. Specifically for A4X, Hypercompute Cluster's topology-aware scheduling understands the NVL72 domains and ensures that workloads can take advantage of the high-bandwidth NVLink. It also provides observability across the GPUs, the NVLink network, and the data center network fabric, including NCCL profiling to help infrastructure teams detect and resolve issues quickly.
- High-performance networking fabric: A4X VMs include the Titanium ML network adapter, based on NVIDIA ConnectX-7 network interface cards (NICs). The Titanium ML adapter delivers Google Cloud's agility and security without compromising the performance required for ML workloads. The A4X system delivers 28.8 Tbps (72 × 400 Gbps) of non-blocking GPU-to-GPU traffic with RoCE. A4X uses a rail-optimized network design, which reduces latency for GPU collectives and improves performance. Our Jupiter network fabric then lets us combine NVL72 domains and scale to tens of thousands of GPUs in a single non-blocking cluster.
- Advanced liquid cooling: A4X VMs are cooled by Google's third-generation liquid cooling infrastructure. Consistent, efficient cooling is essential to prevent thermal throttling and maintain peak computational performance. Our liquid-cooling infrastructure is built on years of global operational experience. Because we've mastered the complexities of deploying and managing liquid-cooled infrastructure at scale, A4X will be available across a broader range of Google Cloud regions, accelerating access to this powerful technology for customers worldwide.
- Software ecosystem optimization: Software choices are especially critical for the A4X system, with its Arm-based hosts. We have collaborated with NVIDIA to ensure that you have access to performance-optimized software, including libraries and drivers that work well with popular frameworks like PyTorch and JAX. Look out for GPU recipes to help you get started with your inference and training workloads.
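The topology-aware scheduling idea above can be sketched as a packing problem: place a job's GPUs into as few NVL72 domains as possible so that bandwidth-hungry collectives stay on NVLink rather than crossing the data center fabric. The following is a hypothetical greedy illustration, not Google Cloud's actual scheduler:

```python
# Hypothetical sketch of NVL72-aware placement (not Google Cloud's
# actual scheduler): greedily pack a job's GPU request into as few
# 72-GPU NVLink domains as possible.

def pick_domains(free_gpus: dict[str, int], needed: int) -> list[tuple[str, int]]:
    """Return (domain, gpus_taken) pairs covering `needed` GPUs,
    preferring the domains with the most free GPUs first."""
    chosen = []
    for domain, free in sorted(free_gpus.items(), key=lambda kv: -kv[1]):
        if needed == 0:
            break
        take = min(free, needed)
        chosen.append((domain, take))
        needed -= take
    if needed:
        raise RuntimeError("not enough free GPUs in the cluster")
    return chosen

# Three NVL72 racks with varying availability; a 100-GPU job lands in
# two domains instead of being scattered across all three.
placement = pick_domains({"nvl72-a": 72, "nvl72-b": 40, "nvl72-c": 72}, needed=100)
print(placement)  # → [('nvl72-a', 72), ('nvl72-c', 28)]
```

A real scheduler would also weigh fragmentation, rail alignment, and maintenance state; the point is simply that domain-aware placement is an explicit packing decision rather than an afterthought.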
Native integration across Google Cloud
A4X integrates natively with products and services across Google Cloud.
- Storage: A4X VMs are natively integrated with Cloud Storage FUSE for 2.9x better training throughput compared to native ML framework data loaders, while Hyperdisk ML accelerates model load time by up to 11.9x compared to common alternatives.
- Google Kubernetes Engine (GKE): As Google Cloud's industry-leading container management platform, GKE combined with A4X VMs maximizes resource usage while scaling AI/ML training and serving workloads. With support for up to 65,000 nodes per cluster, this combination makes it possible to run extra-large-scale AI workloads with low-latency inference and workload sharing across 72 GPUs, unlocking new levels of AI performance.
- Vertex AI Platform: Vertex AI is a fully managed, open, and integrated AI development platform that accelerates your AI projects. Easily train, tune, or deploy ML models with access to the latest Gemini models from Google, or choose from a wide variety of partner and open models.
A strategic partnership
Additionally, NVIDIA DGX Cloud, a fully managed AI platform, will soon be available on A4X VMs to fast-track customers' AI initiatives.
"Developers and researchers need access to the latest technology to train and deploy AI models for specific applications and industries. Our collaboration with Google provides customers with enhanced performance and scalability, enabling them to tackle the most demanding generative AI, LLM, and scientific computing workloads while benefiting from the ease of use and global reach of Google Cloud." - Alexis Bjorlin, VP of NVIDIA DGX Cloud, NVIDIA
Customers like Magic have chosen to build their cutting-edge models on Google Cloud’s A4X VMs.
“We are excited to partner with Google and NVIDIA to build our next-gen AI supercomputer on Google Cloud. Google Cloud’s A4X VMs powered by NVIDIA’s GB200 NVL72 system will greatly improve inference and training efficiency for our models, and Google Cloud offers us the fastest timeline to scale, and a rich ecosystem of cloud services.” – Eric Steinberger, CEO & Co-founder, Magic
Choosing the right VM: A4 vs. A4X
Google Cloud offers both A4 VMs powered by NVIDIA B200 GPUs and now A4X VMs powered by NVIDIA GB200 NVL72. Here's a guide to choosing what’s best for your workload:
- A4X VMs (powered by NVIDIA GB200 NVL72): Purpose-built for training and serving the most demanding, extra-large-scale AI workloads, particularly reasoning models, large language models with long context windows, and scenarios that require massive concurrency. This is enabled by unified memory across a large GPU domain.
- A4 VMs (powered by NVIDIA B200 GPUs): A4 provides excellent performance and versatility for diverse AI model architectures and workloads, including training, fine-tuning, and serving. A4 offers easy portability from prior generations of Cloud GPUs and optimized performance for training jobs at a range of scales.
For more information about A4X, please reach out to your Google Cloud representative. Learn more about how A4X can help your business at Google Cloud Next.