Unlocking LLM training efficiency with Trillium — a performance analysis

Rapidly evolving generative AI models place unprecedented demands on the performance and efficiency of hardware accelerators. Last month, we launched our sixth-generation Tensor Processing Unit (TPU), Trillium, to address the demands of next-generation models. Trillium is purpose-built for performance at scale, from the chip to the system to our Google data center deployments, to power ultra-large scale training.

Today, we present our first MLPerf training benchmark results for Trillium. The MLPerf 4.1 training benchmarks show that Trillium delivers up to 1.8x better performance-per-dollar compared to prior-generation Cloud TPU v5p and an impressive 99% scaling efficiency (throughput).

In this blog, we offer a concise analysis of Trillium's performance, demonstrating why it stands out as the most performant and cost-efficient TPU training system to date. We begin with a quick overview of system comparison metrics, starting with traditional scaling efficiency. We introduce convergence scaling efficiency as a crucial metric to consider in addition to scaling efficiency. We assess these two metrics along with performance per dollar and present a comparative view of Trillium against Cloud TPU v5p. We conclude with guidance that you can use to make an informed choice for your cloud accelerators.

Traditional performance metrics

Accelerator systems can be evaluated and compared across multiple dimensions, ranging from peak throughput, to effective throughput, to throughput scaling efficiency. Each of these metrics is a helpful indicator, but none of them takes convergence time into consideration.

Hardware specifications and peak performance

Traditionally, comparisons have focused on hardware specifications like peak throughput, memory bandwidth, and network connectivity. While these peak values establish theoretical boundaries, they are poor predictors of real-world performance, which depends heavily on architectural design and software implementation. Since modern ML workloads typically span hundreds or thousands of accelerators, the key metric is the effective throughput of an appropriately sized system for a specific workload.

Utilization performance

System performance can be quantified through utilization metrics like effective model FLOPS utilization (EMFU) and memory bandwidth utilization (MBU), which measure achieved throughput versus peak capacity. However, these hardware efficiency metrics don't directly translate to business-value measures like training time or model quality.
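
To make these definitions concrete, here is a minimal sketch of how EMFU and MBU could be computed; the peak and achieved figures in the example are placeholders, not the specifications of any particular accelerator.

```python
def effective_model_flops_utilization(achieved_model_tflops, peak_tflops):
    """EMFU: model FLOPS actually delivered per chip versus the chip's peak FLOPS."""
    return achieved_model_tflops / peak_tflops

def memory_bandwidth_utilization(achieved_gb_per_s, peak_gb_per_s):
    """MBU: memory bandwidth actually achieved versus peak memory bandwidth."""
    return achieved_gb_per_s / peak_gb_per_s

# Placeholder numbers: 300 TFLOPS delivered on a 900-TFLOPS chip, 1,100 GB/s on a 1,600 GB/s part.
print(f"EMFU: {effective_model_flops_utilization(300, 900):.0%}")
print(f"MBU:  {memory_bandwidth_utilization(1100, 1600):.0%}")
```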

Scaling efficiency and trade-offs

A system's scalability is evaluated through both strong scaling (performance improvement with system size for fixed workloads) and weak scaling (efficiency when increasing both workload and system size proportionally). While both metrics are valuable indicators, the ultimate goal is to achieve high-quality models quickly, sometimes making it worthwhile to trade scaling efficiency for faster training time or better model convergence.
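
For clarity, here is a minimal sketch of how the two scaling efficiencies could be computed from measured runs; the chip counts, times, and throughputs are hypothetical, not benchmark results.

```python
def strong_scaling_efficiency(base_chips, base_time, scaled_chips, scaled_time):
    """Fixed workload: fraction of the ideal speedup realized when adding chips."""
    return (base_time / scaled_time) / (scaled_chips / base_chips)

def weak_scaling_efficiency(base_chips, base_throughput, scaled_chips, scaled_throughput):
    """Workload grows with the cluster: fraction of the ideal throughput gain realized."""
    return (scaled_throughput / base_throughput) / (scaled_chips / base_chips)

# Hypothetical example: 4x the chips yielding 3.8x the throughput -> 95% weak scaling efficiency.
print(f"{weak_scaling_efficiency(256, 1.0, 1024, 3.8):.0%}")
```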

The need for convergence scaling efficiency

While hardware utilization and scaling metrics provide important system insights, convergence scaling efficiency focuses on the fundamental goal of training: reaching model convergence efficiently. Convergence refers to the point where a model’s output stops improving and the error rate becomes constant. Convergence scaling efficiency measures how effectively additional computing resources accelerate the training process to completion.

We define convergence scaling efficiency using two key measurements: the base case, where a cluster of N₀ accelerators achieves convergence in time T₀, and a scaled case, where N₁ accelerators take time T₁ to converge. The ratio of the speedup in convergence time to the increase in cluster size gives us:

Convergence scaling efficiency = (T₀ / T₁) / (N₁ / N₀)

A convergence scaling efficiency of 1 indicates that time-to-solution improves by the same ratio as the cluster size. It is therefore desirable to have convergence scaling efficiency as close to 1 as possible.
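
Expressed as a short function, the definition above looks like the following sketch; the variable names mirror N₀, T₀, N₁, and T₁, and the chip counts and hours in the example are placeholders chosen to reproduce the 0.8 figure discussed later.

```python
def convergence_scaling_efficiency(n0_chips, t0_hours, n1_chips, t1_hours):
    """Ratio of the realized speedup in time-to-convergence to the growth in cluster size."""
    speedup = t0_hours / t1_hours
    cluster_growth = n1_chips / n0_chips
    return speedup / cluster_growth

# For example, 3x the chips converging 2.4x faster yields a CSE of 0.8.
print(convergence_scaling_efficiency(n0_chips=1024, t0_hours=24.0, n1_chips=3072, t1_hours=10.0))
```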

Now let’s apply these concepts to understand our MLPerf submission for the GPT3-175b training task using Trillium and Cloud TPU v5p.

Trillium performance

We submitted GPT3-175b training results for four different Trillium configurations and three different Cloud TPU v5p configurations. In the following analysis, we group the results by cluster sizes with the same total peak FLOPS for comparison purposes. For example, the Cloud TPU v5p-4096 configuration is compared to 4x Trillium-256, Cloud TPU v5p-8192 is compared to 8x Trillium-256, and so on.

All results presented in this analysis are based on MaxText, our high-performance reference implementation for Cloud TPUs and GPUs.

Weak scaling efficiency

For increasing cluster sizes with proportionately larger batch sizes, both Trillium and TPU v5p deliver near-linear scaling efficiency:

https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_IRqkbLo.max-1400x1400.png

Figure-1: Weak scaling comparison for Trillium and Cloud TPU v5p. v5p-4096 and 4x Trillium-256 are used as the baselines for the scaling-factor measurement. n x Trillium-256 corresponds to n Trillium pods with 256 chips in one ICI domain; v5p-n corresponds to n/2 v5p chips in a single ICI domain.

Figure 1 demonstrates relative throughput scaling as cluster sizes increase from the base configuration. Trillium achieves 99% scaling efficiency even when operating across data-center networks using Cloud TPU multislice technology, outperforming the 94% scaling efficiency of the Cloud TPU v5p cluster within a single ICI domain. For these comparisons, we used a base configuration of 1024 chips (4x Trillium-256 pods), establishing a consistent baseline with the smallest v5p submission (v5p-4096, 2048 chips). When measured against our smallest submitted configuration of 2x Trillium-256 pods, Trillium maintains a strong 97.6% scaling efficiency.

Convergence scaling efficiency

As stated above, weak scaling is useful but not a sufficient indicator of value, while convergence scaling efficiency brings time-to-solution into consideration.

https://storage.googleapis.com/gweb-cloudblog-publish/images/image4_zkJuHRW.max-1200x1200.png

Figure-2: Convergence scaling comparison for Trillium and Cloud TPU v5p.

For the largest cluster size, we observed comparable convergence scaling efficiency for Trillium and Cloud TPU v5p. In this example, a CSE of 0.8 means that, for the rightmost configuration, the cluster size was 3x that of the base configuration, while time to convergence improved by 2.4x relative to the base configuration (2.4 / 3 = 0.8).

While convergence scaling efficiency is comparable between Trillium and TPU v5p, Trillium really shines by delivering that convergence at a lower cost, which brings us to the last metric.

Cost-to-train

While weak scaling efficiency and convergence scaling efficiency indicate scaling properties of systems, we’ve yet to look at the most crucial metric: the cost to train.

https://storage.googleapis.com/gweb-cloudblog-publish/images/image2_9nJDm58.max-1200x1200.png

Figure-3: Comparison of cost-to-train based on the wall-clock time and the on-demand list price for Cloud TPU v5p and Trillium.

Trillium lowers the cost to train by up to 1.8x (roughly 45% lower) compared to TPU v5p while delivering convergence to the same validation accuracy.
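
As a concrete illustration of how the metric behind Figure 3 is derived, here is a minimal sketch; the wall-clock hours, chip counts, and per-chip-hour prices below are placeholders, not Cloud TPU list prices or measured MLPerf results.

```python
def cost_to_train(wall_clock_hours, num_chips, price_per_chip_hour_usd):
    """Cost to train = wall-clock time to convergence x cluster size x on-demand price per chip-hour."""
    return wall_clock_hours * num_chips * price_per_chip_hour_usd

# Placeholder numbers only: two hypothetical clusters converging to the same validation accuracy.
system_a = cost_to_train(wall_clock_hours=100, num_chips=2048, price_per_chip_hour_usd=4.00)
system_b = cost_to_train(wall_clock_hours=95, num_chips=1024, price_per_chip_hour_usd=2.50)
print(f"system B costs {system_b / system_a:.2f}x of system A")
```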

Making informed cloud accelerator choices

In this article, we explored the complexities of comparing accelerator systems, emphasizing the importance of looking beyond simple metrics to assess true performance and efficiency. We saw that while peak performance metrics provide a starting point, they often fall short in predicting real-world utility. Instead, metrics like effective model FLOPS utilization (EMFU) and memory bandwidth utilization (MBU) offer more meaningful insights into an accelerator's capabilities.

We also highlighted the critical importance of scaling characteristics — both strong and weak scaling — in evaluating how systems perform as workloads and resources grow. However, the most objective measure we identified is the convergence scaling efficiency, which ensures that we're comparing systems based on their ability to achieve the same end result, rather than just raw speed.

Applying these metrics to our benchmark submission with GPT3-175b training, we demonstrated that Trillium achieves comparable convergence scaling efficiency to Cloud TPU v5p while delivering up to 1.8x better performance per dollar, thereby lowering the cost-to-train. These results highlight the importance of evaluating accelerator systems through multiple dimensions of performance and efficiency.

For ML-accelerator evaluation, we recommend a comprehensive analysis combining resource utilization metrics (EMFU, MBU), scaling characteristics, and convergence scaling efficiency. This multi-faceted approach enables you to make data-driven decisions based on your specific workload requirements and scale.

To learn more about Trillium, please review the launch blog or our documentation.
