Sustainable silicon to intelligent clouds: collaborating for the future of computing

Editor’s note: Today, we hear from Parthasarathy Ranganathan, Google VP and Technical Fellow, and Amber Huffman, Principal Engineer. Partha delivered a keynote address today at the 2024 OCP Global Summit, an annual conference for leaders, researchers, and pioneers in the open hardware industry. Amber is on the board of directors at the Open Compute Project (OCP). Read on to hear about the past and future of hyperscale computing, and for an overview of all of our activities in the OCP community.

We are in an exciting era of hyperscale computing, one where a new wave of innovations is building the foundation for AI/ML computing in the cloud. Building on Google’s rich 25-year history in hyperscale computing, we look ahead to how co-design and collaboration — across the hardware-software stack, disciplines, and communities — will be key to this exciting new future.

From scrappy beginnings to societal infrastructure

When Google was founded in 1998, it was clear that successful web search would require enormous amounts of computing power and storage. This led to the design of the very first hyperscale computers specialized for search. These early makeshift systems included creative cost-reduction approaches like corkboard servers and off-the-shelf fans from Walmart, and they set the stage for the hardware-software co-design and workload-specific specialization principles that we follow to this day.

Building on these first systems, over the subsequent decade, Google laid the groundwork for modern hyperscale computing, pioneering custom servers, custom networking, and custom data centers, and expanding our services beyond search to include Gmail, YouTube, and Android. All of this presaged the modern multi-workload cloud. During this period, we also developed essential systems software like Borg, Colossus, MapReduce, and Bigtable. In the following years, we focused on scaling these systems, while also prioritizing security, reliability, and power efficiency. The formation of the Open Compute Project (OCP) in 2011 marked the transition of hyperscale computing from niche discipline to more mainstream offering. In the current decade, hyperscale computing is characterized by innovations to counter the slowing of Moore’s law: specialized hardware to support machine learning and video processing as well as software-defined servers to manage heterogeneity.

Today, hyperscale computing has truly come into its own, evolving into the crucial societal infrastructure that drives cloud and AI workloads.

Cross-disciplinary co-design: the heart of innovation

Across all these Google innovations over the past 25 years, one theme has remained constant: a strong commitment to cross-disciplinary systems innovation and co-design. Looking ahead to the AI era, we continue to take a holistic approach: from “mud to cloud,” starting at the very ground on which we build our data centers and reaching up to broader cloud computing services; and from “chip to ship,” designing hardware that we then deploy and use in production. This philosophy has driven some incredible efficiency gains, delivering orders-of-magnitude improvements across multiple generations of systems.

Take our Tensor Processing Units (TPUs). Multiple generations of these purpose-built AI accelerators (including our latest Trillium TPU) have driven significant advances in machine learning, including large language models like Gemini and Nobel Prize-winning scientific breakthroughs like AlphaFold. However, we’ve gone beyond chip design to consider the entire system that surrounds these chips. We’ve coupled TPUs with innovations like liquid cooling, advanced networking systems featuring cutting-edge optics and topology awareness, and a commitment to sustainable power, all in the service of creating a truly amazing AI platform. We’ve then layered open software frameworks like JAX, TensorFlow, OpenXLA, and Kubernetes on top of this hardware foundation, creating what we call the AI Hypercomputer. This hypercomputer is further enhanced by integrating with model gardens and applications, creating a vertically integrated ecosystem that’s optimized for AI workloads.

Cross-industry collaboration: from ideas to impact

But there’s also another aspect of holistic co-design that has served us well: cross-industry collaborations, i.e., building standards and ecosystems. Our partnership with OCP is an important example of this. Since formally joining OCP in 2016, we’ve continued to grow our contributions year after year. Looking ahead, we want to highlight progress and opportunities in four key areas.

Sustainability
Last year, Google, along with fellow hyperscalers, rallied the industry to reduce carbon emissions with an ambitious roadmap towards greener concrete. We have since made good progress, collaborating to develop new metrics and benchmarks, identifying streamlined data center designs that minimize concrete use, and even using AI to research new materials. At a recent event, we demonstrated proof-of-concept concrete mixtures that can reduce carbon emissions by 20% to 40%.

As we work towards net-zero emissions by 2030 across our operations and value chain, there’s a lot more we can do. At OCP this year, we are discussing how to develop product category rules (PCRs) to accurately measure hardware emissions across the lifecycle, make more high-quality carbon data available, and develop clean reliable power backup for our data centers. Further, we’re continuing to look holistically at all aspects of our energy consumption, carbon footprint, and water usage.

Trusted silicon
Trusted silicon is a foundational element of hyperscaler systems. Over the past three years, we have collaborated on Caliptra, a re-usable IP block for root-of-trust management, and delivered an open-source implementation of Caliptra 1.0 that is being integrated by companies across the ecosystem. Google's future TPUs and ARM SoCs will also include Caliptra. Leveraging Caliptra, the OCP L.O.C.K. project will provide layered open-source cryptographic key management for storage devices, improving both trust and sustainability.

In the area of silicon reliability, we are continuing our industry-academia collaborations around a systems approach to addressing silicon faults and silent data errors, including funding six leading academic institutions for novel research. The Server Component Resilience Specification, which addresses silent data corruption (SDC), discusses the opportunities ahead with standardized information exchange, test metrics, and open frameworks for detecting and mitigating errors.

AI accelerators
AI represents a fundamental platform shift requiring us to innovate across hardware and software. Google has played an active role in driving standardization efforts for AI accelerators, particularly in areas like low-precision data formats (e.g., OCP FP8 and MX), software frameworks (e.g., OpenXLA, JAX, TensorFlow), and networking (Falcon, Ultra Ethernet, Ultra Accelerator Link). Working with other hyperscalers and GPU suppliers, we have also aligned on common specifications for firmware updates, management interfaces, and RAS (reliability, availability, serviceability).

But as AI continues to drive exponential demands on computing, we can do more. As part of the OCP AI Strategic Initiative, we are sharing learnings from deploying over 1 GW of liquid-cooled infrastructure to help the industry scale this capability. We are also identifying new power-delivery solutions, from chips to racks to data centers. Notably, just as Google led the industry with 48V racks, at the OCP Summit this year we are proposing 400V DC distribution and rack solutions that can significantly improve data center density and efficiency.
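One general reason higher-voltage distribution can help is basic circuit physics: for a fixed power draw, current falls as voltage rises, and resistive loss in a conductor scales with the square of the current. The sketch below works through that arithmetic. The rack power and path resistance are hypothetical values chosen only for illustration; they are not figures from Google's proposal, and a real 400V design would also change conductor sizing and conversion stages.

```python
def distribution_loss_watts(power_w: float, voltage_v: float, resistance_ohm: float) -> float:
    """Resistive (I^2 * R) loss in a DC path delivering power_w at voltage_v."""
    current_a = power_w / voltage_v          # I = P / V
    return current_a ** 2 * resistance_ohm   # P_loss = I^2 * R


# Hypothetical values, for illustration only.
RACK_POWER_W = 100_000.0       # assumed rack load
PATH_RESISTANCE_OHM = 0.001    # assumed resistance of the distribution path

loss_48 = distribution_loss_watts(RACK_POWER_W, 48.0, PATH_RESISTANCE_OHM)
loss_400 = distribution_loss_watts(RACK_POWER_W, 400.0, PATH_RESISTANCE_OHM)

# With power and resistance held equal, the loss ratio depends only on
# the square of the voltage ratio: (400 / 48) ** 2.
ratio = loss_48 / loss_400

print(f"48V loss:  {loss_48:.1f} W")
print(f"400V loss: {loss_400:.1f} W")
print(f"ratio:     {ratio:.1f}x")
```

Holding the conductor constant is a simplification. In practice, the lower current can instead be traded for thinner busbars and less copper, which is one way the density benefit can show up.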

Systems infrastructure
Finally, we continue to make great progress on foundational systems infrastructure. Google's contributions this past year span NVM Express for the data center (e.g., security enhancements, open test repositories), servers (e.g., OpenTitan platform root of trust), and networking (Falcon, SONiC advancements in telemetry and simulation, advanced PCIe enclosure compatible form factor), as well as new efforts such as open-source random shock and vibration testing. At the same time, we’ve gone beyond technical contributions to form and co-chair the OCP Advisory Board as well as guide the formation of the OCP AI Strategic Initiative.

Looking ahead, we will continue to keep innovating in this space, particularly to meet the next level of scale required by AI infrastructure. Notably, at the OCP Summit this year, we are discussing the adoption of robotics and automation for data centers. Across a range of activities (material movement, monitoring/inspection, servicing/repair, media management), robotics enable data center operations to scale safely and sustainably, and present a fundamental shift in how we build these facilities.

Innovating for the new intelligence revolution

We have a lot to be proud of over the past 25 years of hyperscale computing, but the best is yet to come. With AI, we are at an exciting inflection point in computing: the beginning of the new intelligence revolution. Akin to prior shifts — the industrial revolution for manufacturing or the information revolution with the mobile internet — this revolution will have a profound impact on both technology and society, and holistic system innovations will be key to enabling it. We look forward to collaborating with all of you on this exciting journey.
