Speed, scale and reliability: 25 years of Google data-center networking evolution

2 months ago 111
News Banner

Looking for an Interim or Fractional CTO to support your business?

Read more

Rome wasn’t built in a day, and neither was Google’s network. But 25 years in, we’ve built out network infrastructure with scale and technical sophistication that’s nothing short of remarkable.

It’s all the more impressive because in the beginning, Google’s network infrastructure was relatively simple. But as our user base and the demand for our services grew exponentially, we realized that we needed a network that could handle an unprecedented scale of data and traffic, and that could adapt to dynamic traffic patterns as our workloads changed over time. This ignited a 25-year journey marked by numerous engineering innovations and milestones, ultimately leading to our current fifth-generation Jupiter data center network architecture, which now scales to 13 Petabits/sec of bisectional bandwidth. To put this data rate in perspective, this network could support a video call (@1.5 Mb/s) for all 8 billion people on Earth! 

Today, we have hundreds of Jupiter fabrics deployed around the world, simultaneously supporting hundreds of services, billions of active daily users, all of our Google Cloud customers, and some of the largest ML training and serving infrastructures in the world. I would like to share more about our journey as we look ahead to the next generation of data center network infrastructure.

Guiding principles

Our network evolution has been guided by a few key principles:

  • Anything, anywhere: Our data center networks support efficiency and simplicity by allowing large-scale jobs to be placed anywhere among 100k+ servers within the same network fabric, with high-speed access to needed storage and support services. This scale improves application performance for internal and external workloads and eliminates internal fragmentation. 
  • Predictable, low latency: We prioritize consistent performance and minimizing tail latency by provisioning bandwidth headroom, maintaining 99.999% network availability, and proactively managing congestion through end-host and fabric cooperation.

  • Software-defined and systems-centric: Leveraging software-defined networking (SDN) for flexibility and agility, we qualify and globally release dozens of new features every two weeks across our global network.

  • Incremental evolution and dynamic topology: Incremental evolution helps us to refresh the network granularly (rather than bringing it down wholesale), while dynamic topology helps us to continuously adapt to changing workload demands. The combination of optical circuit switching and SDN supports in-place physical upgrades and an ever-evolving, heterogeneous network that supports multiple hardware generations in a single fabric.

  • Traffic engineering and application-centric QoS: Optimizing traffic flows and ensuring Quality of Service helps us tailor the network to each application's needs.

Integrating across the above principles is the foundation for our work. The network is the foundation of reliability for all other compute services, from storage to AI. As such, the network must fail last and fail least. To support this foundational responsibility, we rigorously define and monitor every bad minute1 across hundreds of clusters and millions of ports across our global network. Our progress on reliability is such that our in-house, software-defined Jupiter networks deliver a factor of 50x more reliability than prior versions of our data center networks. 

2015 - Jupiter, the first Petabit network 

In a seminal paper, we showed that Jupiter data center networks scaled to 1.3 Pb/s of aggregate bandwidth by leveraging merchant switch silicon, Clos topologies and Software Defined Networking (SDN). This generation of Jupiter was the culmination of five generations of data center networks developed in house by the Google networking team. At that time, this data rate — in one Google data center — was more than the estimated aggregate IP traffic data rate for the global internet. 

2022 - Enabling 6 Petabit per second

In 2022 we announced that our Jupiter networks scaled to over 6 Pb/s, with deep integration of optical circuit switching (OCS), wave division multiplexing (WDM), and a highly scalable Orion SDN controller. These technologies unlocked a range of advancements, including incremental network builds, enhanced performance, reduced costs, lower power consumption, dynamic traffic management, and seamless upgrades.

2023 - 13 Petabit per second network

We have further enhanced Jupiter to support native 400 Gb/s link speeds in the network core. The fundamental building block of Jupiter networks (called the aggregation block) now consists of 512 ports of 400 Gb/s of connectivity both to end hosts and to the rest of the data center, for an aggregate of 204.8 Tb/s of bidirectional non-blocking bandwidth per block. We support 64 such blocks for a total bisection bandwidth of 64*204.8 Tb/s = 13.1 Pb/s. This technology has been powering Google's production data centers for over a year, fueling the rapid advancement of artificial intelligence, machine learning, web search, and other data-intensive applications.

2024 and beyond - Extreme networking in the age of AI

While celebrating over two decades of innovation in data center networking, we’re already charting the course for the next generation of network infrastructure to support the age of AI. For example, our teams are busy working on networking infrastructure needs for our upcoming A3 Ultra VMs, that feature NVIDIA ConnectX-7 networking,  supports non-blocking 3.2 Tbps per server of GPU-to-GPU traffic over RoCE (RDMA over converged ethernet) and our future offerings based on NVIDIA GB200 NVL72.

Over the next few years, we will deliver significant advances in network scale and bandwidth, both per-port and network-wide. We will continue to push the boundaries of end-host integration, including the transport and congestion control stack, and streamline network stages to achieve even lower latency with tighter tails. Real-time topology engineering, deeper integration with the compute and storage stacks, and continued refinements to host-based load balancing techniques will further enhance network reliability and latency. With these innovations, our network will remain a cornerstone for the transformative applications and services that enrich the lives of our users throughout the world while simultaneously supporting the groundbreaking AI capabilities that power both our internal services and Google Cloud products.

We are excited to take on these challenges and opportunities to see what the next 25 years hold for Google networking!

Further resources

  • Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network, SIGCOMM ‘15 [paper]

    • Journey of the first Jupiter datacenter network leveraging merchant switch silicon, Clos topologies and Software Defined Networking (SDN).

    • First deployed in production in 2012.

  • Mission Apollo: Landing Optical Circuit Switching at Datacenter Scale, arxiv.org, 2022 [paper]

    • First deployed in production in 2013.

  • Orion: Google's Software-Defined Networking Control Plane. NSDI ‘21 [paper]

    • Google's high-performance, scalable, intent-based distributed SDN platform used in both datacenter and wide area networks.

    • First deployed in production in 2016.

  • Jupiter Evolving: Transforming Google's Datacenter Network via Optical Circuit Switches and Software-Defined Networking, SIGCOMM ’22 [paper]

    • Enabling technologies: OCS (2013), Orion SDN (2016), 200Gbps networking (2020), direct-connect topology (2017), dynamic traffic engineering (2018), dynamic topology engineering (2021).

  • Swift: Delay is Simple and Effective for Congestion Control in the Datacenter, SIGCOMM ‘20 [paper]

    • Swift, a congestion control protocol using hardware timestamps and AIMD control with a delay target, delivers excellent performance in Google datacenters with low flow completion times for short RPCs and high throughput for long RPCs.

    • First deployed in production in 2017

  • PLB: Congestion Signals are Simple and Effective for Network Load Balancing, SIGCOMM ‘22 [paper]

    • Protective Load Balancing (PLB) is a simple, effective host-based load balancing design that reduces network congestion and improves performance by randomly changing paths for congested connections, preferring to repath after idle periods to minimize packet reordering.

    • First deployed in production in 2020


1. Any minute where a statistically significant number of network flows in the data center network experience a total or partial outage above a defined threshold.

Posted in
Read Entire Article