The revolution in generative AI (gen AI) and large language models (LLMs) is leading to larger model sizes and increased demands on the compute infrastructure. Organizations looking to integrate these advancements into their applications increasingly require distributed computing solutions that offer minimal scheduling overhead. As the need for scalable gen AI solutions grows, Ray, an open-source Python framework designed for scaling and distributing AI workloads, has become increasingly popular.
Traditional Ray deployments on virtual machines (VMs) have limitations when it comes to scalability, resource efficiency, and infrastructure manageability. One alternative is to leverage the power and flexibility of Kubernetes and deploy Ray on Google Kubernetes Engine (GKE) with KubeRay, an open-source Kubernetes operator that simplifies Ray deployment and management.
“With the help of Ray on GKE, our AI practitioners are able to get easy orchestration of infrastructure resources, flexibility and scalability that their applications need without the headache of understanding and managing the intricacies of the underlying platform.” - Nacef Labidi, Head of Infrastructure, Instadeep
In this blog, we discuss the numerous benefits that running Ray on GKE brings to the table — scalability, cost-efficiency, fault tolerance and isolation, and portability, to name a few — and resources on how to get started.
Easy scalability and node auto-provisioning
On VMs, Ray's scalability is inherently limited by the number of VMs in the cluster. Autoscaling and node provisioning, configured for specific clouds (example), require detailed knowledge of machine types and network configurations. In contrast, Kubernetes orchestrates infrastructure resources using containers, pods, and VMs as scheduling units, while Ray distributes data-parallel processes within applications, employing actors and tasks for scheduling.
KubeRay introduces cloud-agnostic autoscaling to the mix, allowing you to define minimum and maximum replicas within the workerGroupSpec. Based on this configuration, the Ray autoscaler schedules more Kubernetes pods as required by its tasks. And if you choose the GKE Autopilot mode of operation, node provisioning happens automatically, eliminating the need for manual configuration.
Greater efficiency and improved startup latency
GKE offers discount-based savings such as committed use discounts, new pricing model and reservations for GPUs in Autopilot mode. In addition, GKE makes it easy to taking advantage of cost-saving measures like spot nodes via YAML configuration.
Low startup latency is critical to optimal resource usage, ensuring quick recovery, faster iterations and elasticity. GKE image streaming lets you initialize and run eligible container images from Artifact Registry, without waiting for the full image to download. Testing demonstrated containers going from `ray-ml` container image going from `ContainerCreating` to `Running` state in 8.82s, compared to 5m17s without image streaming — that’s 35x faster! Image streaming is automatically enabled on Autopilot clusters and available on Standard clusters.
Automated infrastructure management for fault tolerance and isolation
Managing a Ray cluster on VMs offers control over fault tolerance and isolation via detailed VM configuration. However, it lacks the automated, portable self-healing capabilities that Kubernetes provides.
Kubernetes excels at repeatable automation that is expressed with clear declarative and idempotent desired state configuration. It provides automatic self-healing capabilities, which in Kuberay 2.0 or later extends to preventing the Ray cluster from crashing when the head node goes down.
In fact, Ray Serve docs specifically recommend Kubernetes for production workloads, using the RayService custom resource to automatically handle health checking, status reporting, failure recovery and upgrades.
On GKE, the declarative YAML-based approach not only simplifies deployment and management but can also be used to provision security and isolation. This is achieved by integrating Kubernetes' RBAC with Google Cloud's Identity and Access Management (IAM), allowing administrators to finely tune the permissions granted to each Ray cluster. For instance, a Ray cluster that requires access to a Google Cloud Storage bucket for data ingestion or model storage can be assigned specific roles that limit its actions to reading and writing to that bucket only. This is configured by specifying the Kubernetes service account (KSA) as part of the pod template for Ray cluster `workerGroupSpec` and then linking a Google Service account with appropriate permissions to the KSA using the workload identity annotation.
Easy multi-team sharing with Kubernetes namespaces
Out of the box, Ray does not have any security separation between Ray clusters. With Kubernetes you can leverage namespaces to create a Ray cluster per team, and use Kubernetes Role-Based Access Control (RBAC), Resource Quotas and Network Policies. This creates a namespace-based trust boundary to allow multiple teams to each manage their Ray clusters within a larger shared Kubernetes cluster.
Flexibility and portability
You can use Kubernetes for more than just data and AI. As a general-purpose platform, Kubernetes is portable across clouds and on-premises, and has a rich ecosystem.
With Kubernetes, you can mix Ray and non-Ray workloads on the same infrastructure, allowing the central platform team to manage a single common compute layer, while leaving infrastructure and resource management to GKE. Think of it as your own personal SRE.
Get started with Kuberay on GKE
In conclusion, running Ray on GKE is a straightforward way to achieve scalability, cost-efficiency, fault tolerance and isolation for your production workloads, all while ensuring cloud portability. You get the flexibility to adapt quickly to changing demands, making it an ideal choice for forward-thinking organizations in an ever-evolving generative AI landscape.
To get started with Kuberay on GKE, follow these instructions. This repo has Terraform templates to run Kuberay on GPUs and TPUs, and examples for training and serving. You can also find more tutorials and code samples at AI/ML on GKE page.