As generative AI experiences explosive growth fueled by advances in LLMs (Large Language Models), access to open models is more critical than ever for developers. Open models are publicly available pre-trained foundation LLMs. Platforms like Google Cloud's Vertex AI, Kaggle, and Hugging Face already provide data scientists, ML engineers, and application developers with easy access to open models.
Some of these models require powerful infrastructure and deployment capabilities, which is why today we're excited to announce the capability to deploy and serve open models such as the Llama 3.1 405B FP16 LLM on GKE (Google Kubernetes Engine). Published by Meta, Llama 3.1 with 405 billion parameters demonstrates significant improvements in general knowledge, reasoning abilities, and coding proficiency. At FP16 (16-bit floating point) precision, storing and processing its 405 billion parameters requires more than 750 GB of GPU memory for inference. The GKE solution described in this article makes deploying and serving models of this size considerably easier.
Customer experience
As a Google Cloud customer, you can find the Llama 3.1 LLM by going to Vertex AI Model Garden and selecting the Llama 3.1 model tile.
After clicking the deploy button, you can select GKE and pick the Llama 3.1 405B FP16 model.
On this page, you can find the auto-generated Kubernetes YAML and detailed instructions for deploying and serving Llama 3.1 405B FP16.
Multi-host deployment and serving
The Llama 3.1 405B FP16 LLM requires more than 750 GB of GPU memory and presents considerable challenges for deployment and serving. In addition to the memory consumed by the model weights, factors such as KV (key-value) cache storage and support for longer sequence lengths add to the overall memory requirements. The A3 virtual machine, currently the most powerful GPU offering on Google Cloud, is equipped with eight NVIDIA H100 GPUs, each featuring 80 GB of HBM (High-Bandwidth Memory). For serving LLMs like the FP16 Llama 3.1 405B model, multi-host deployment and serving is the only viable solution. We use LeaderWorkerSet with Ray and vLLM to deploy on GKE.
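The 750 GB figure follows directly from the parameter count. As a back-of-the-envelope sketch (not from the original post), FP16 stores each weight in two bytes:

```python
# Rough estimate of the GPU memory needed just to hold the weights of
# Llama 3.1 405B at FP16 precision (2 bytes per parameter).
NUM_PARAMS = 405e9          # 405 billion parameters
BYTES_PER_PARAM_FP16 = 2    # FP16 = 16 bits = 2 bytes per weight

weight_bytes = NUM_PARAMS * BYTES_PER_PARAM_FP16
weight_gib = weight_bytes / 2**30   # convert bytes to GiB

print(f"{weight_gib:.0f} GiB")  # roughly 754 GiB, i.e. "more than 750 GB"
```

Note that this counts only the weights; the KV cache and activation memory come on top of it, which is why a comfortable margin above 750 GB is needed in practice.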
LeaderWorkerSet
The LeaderWorkerSet (LWS) is a deployment API specifically developed to address the workload requirements of multi-host inference, facilitating the sharding and execution of the model across multiple devices on multiple nodes. Constructed as a Kubernetes deployment API, LWS is both cloud agnostic and accelerator agnostic, and can run on both GPUs and TPUs. LWS leverages the upstream StatefulSet API as its fundamental building block, as illustrated below.
Within the LWS architecture, a group of pods is managed as a single entity:

Each pod within the group is assigned a unique index ranging from 0 to n-1, with the pod bearing index 0 designated as the leader of the group. All pods within the group are created concurrently and share an identical lifecycle.

LWS facilitates rollout and rolling updates at the group level. Each group is regarded as a single unit for rolling updates, scaling, and mapping to an exclusive topology for placement, and each group is upgraded as a single atomic unit, ensuring that all pods within the group are updated simultaneously.

All pods within the same group can be co-located in the same topology, with optional support for topology-aware placement.

The group is also treated as a single entity for failure handling, with optional all-or-nothing restart support. When enabled, all pods within the group are recreated if a single pod in the group fails or if a single container within any of the pods is restarted.
Within the LWS framework, a replica is a group consisting of a single leader and a set of workers. LWS supports two pod templates, one for the leader and one for the workers. LWS also provides a scale endpoint for HPA (Horizontal Pod Autoscaler), enabling dynamic scaling of the number of replicas.
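To make the dual-template structure concrete, here is a minimal LWS manifest sketch. The names, image, and group size below are illustrative assumptions, not the auto-generated YAML from Model Garden:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm-llama3-405b        # illustrative name
spec:
  replicas: 1                   # one replica = one leader plus its workers
  leaderWorkerTemplate:
    size: 2                     # total pods per group (1 leader + 1 worker)
    leaderTemplate:             # template for the pod at index 0
      spec:
        containers:
        - name: vllm-leader
          image: vllm-image:latest   # placeholder image
    workerTemplate:             # template for all other pods in the group
      spec:
        containers:
        - name: vllm-worker
          image: vllm-image:latest   # placeholder image
```

Scaling `spec.replicas` (manually or via HPA through the scale endpoint) adds or removes whole leader-plus-worker groups, never individual pods.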
Multi-host deployment with vLLM and LWS
vLLM is a popular open-source model server that supports multi-node, multi-GPU inference by employing tensor parallelism and pipeline parallelism. vLLM implements distributed tensor parallelism with Megatron-LM's tensor parallel algorithm. For pipeline parallelism, vLLM manages the distributed runtime with Ray for multi-node inference.
Tensor parallelism horizontally partitions the model across multiple GPUs; in this setup, the tensor parallel size equals the number of GPUs within each node. Note that this approach requires fast network communication among the GPUs.
Pipeline parallelism, on the other hand, vertically partitions the model by layer and does not demand constant communication between GPUs. Here, the pipeline parallel size typically corresponds to the number of nodes employed for multi-host serving.
Combining these parallelism strategies is essential to accommodate the entirety of the Llama 3.1 405B FP16 model. Two A3 nodes, each equipped with eight H100 GPUs, provide an aggregate memory capacity of 1280 GB, sufficient to hold the model's roughly 750 GB of weights while leaving buffer memory for the key-value (KV) cache and support for long context lengths. For this LWS deployment, the tensor parallel size is set to 8 and the pipeline parallel size is set to 2.
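Expressed as container arguments for the leader pod, the two parallelism settings might look like the fragment below. This is a hedged illustration; the model identifier is a placeholder, and the exact flags in the generated manifest may differ:

```yaml
# Illustrative leader container command for vLLM.
# --tensor-parallel-size=8: shard each layer across the 8 GPUs of a node.
# --pipeline-parallel-size=2: split the layers across the 2 A3 nodes.
command:
- vllm
- serve
- meta-llama/Llama-3.1-405B   # placeholder model id
- --tensor-parallel-size=8
- --pipeline-parallel-size=2
```

The product of the two sizes (8 x 2 = 16) matches the total GPU count across both nodes, so every H100 participates in serving the model.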
Summary
In this blog, we showed you how LWS provides the essential capabilities required for multi-host serving. The same technique can also serve smaller models, such as Llama 3.1 405B FP8, on more cost-effective machines, optimizing the price-to-performance ratio. To learn more, visit this blog post that shows how to pick a machine type that fits your model. LWS is open source and has strong community engagement; take a look at our GitHub to learn more and contribute directly.
As Google Cloud helps customers adopt gen AI workloads, you can come to Vertex AI Model Garden to deploy and serve open models on a managed Vertex AI backend or on GKE DIY (do-it-yourself) clusters. Our goal is to create a seamless customer experience, and multi-host deployment and serving is one example. We look forward to hearing your feedback.