How to deploy Llama 3.2-1B-Instruct model with Google Cloud Run GPU

As open-source large language models (LLMs) become increasingly popular, developers are looking for better ways to access new models and deploy them on Cloud Run GPU. That’s why Cloud Run now offers fully managed NVIDIA GPUs, which removes the complexity of driver installations and library configurations. This means you’ll benefit from the same on-demand availability and effortless scalability that you love with Cloud Run's CPU and memory, with the added power of NVIDIA GPUs. When your application is idle, your GPU-equipped instances automatically scale down to zero, optimizing your costs.

In this blog post, we'll guide you through deploying the Meta Llama 3.2 1B Instruct model on Cloud Run. We'll also share best practices to streamline your development process by testing the model locally with the Text Generation Inference (TGI) Docker image, making troubleshooting easier and boosting your productivity.

Why Cloud Run with GPU?

There are four critical reasons developers benefit from deploying open models on Cloud Run with GPU:

  • Fully managed: No need to worry about drivers, libraries, or infrastructure.

  • On-demand scaling: Scale up or down automatically based on demand.

  • Cost effective: Only pay for what you use, with automatic scaling down to zero when idle.

  • Performance: NVIDIA GPUs deliver optimized inference performance for Llama 3.2.

Initial Setup

  • First, create a Hugging Face token. 

  • Second, check that your Hugging Face token has permission to access and download the Llama 3.2 model weights (the model is gated on Hugging Face, so you must request access on the model page). Keep your token handy for the next step.

  • Third, use Google Cloud's Secret Manager to store your Hugging Face token securely. In this example, we use Google user credentials. You may need to authenticate with the gcloud CLI, set a default project ID, enable the necessary APIs, and grant access to Secret Manager and Cloud Storage, as in the sketch after this list.
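
Below is a minimal sketch of that setup, assuming a placeholder project ID (my-project), a secret named HF_TOKEN, and the default compute service account; the APIs you enable and the account you grant may differ in your environment.

    # Authenticate with your Google user credentials and pick a project.
    gcloud auth login
    gcloud config set project my-project   # placeholder project ID

    # Enable the services used in this walkthrough.
    gcloud services enable run.googleapis.com secretmanager.googleapis.com storage.googleapis.com

    # Store the Hugging Face token in Secret Manager (replace hf_xxx with your token).
    printf "hf_xxx" | gcloud secrets create HF_TOKEN --data-file=-

    # Let the default compute service account (used by Cloud Run) read the secret.
    # PROJECT_NUMBER is a placeholder for your project's number.
    gcloud secrets add-iam-policy-binding HF_TOKEN \
      --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
      --role="roles/secretmanager.secretAccessor"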

Local debugging

  • Install the huggingface_hub Python package (which provides the huggingface-cli command) in your virtual environment.

  • Run huggingface-cli login to store your Hugging Face credentials locally.

  • Use the TGI Docker image to test your model locally. This allows you to iterate and debug before deploying to Cloud Run; see the sketch after this list.
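
Here is a hedged sketch of that local loop, assuming Docker with the NVIDIA Container Toolkit installed, the meta-llama/Llama-3.2-1B-Instruct model ID, and an HF_TOKEN variable exported in your shell; the image tag is illustrative.

    # Install the Hugging Face CLI into your virtual environment and log in.
    pip install "huggingface_hub[cli]"
    huggingface-cli login

    # Serve the model locally with the TGI container; /data caches the downloaded weights.
    docker run --gpus all -p 8080:80 \
      -v $PWD/data:/data \
      -e HF_TOKEN=$HF_TOKEN \
      ghcr.io/huggingface/text-generation-inference:latest \
      --model-id meta-llama/Llama-3.2-1B-Instruct

    # In another terminal, send a test request to the local endpoint.
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "tgi", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}'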

Deployment to Cloud Run

  • Deploy the model to Cloud Run with an NVIDIA L4 GPU (remember to update SERVICE_NAME); a sketch of the deploy command follows.
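
The command below is a hedged sketch rather than the exact script from the original post: it assumes the us-central1 region, the public TGI image used above, and the HF_TOKEN secret created earlier, and the CPU and memory values are illustrative.

    SERVICE_NAME=llama-3-2-1b-instruct   # update SERVICE_NAME

    gcloud beta run deploy $SERVICE_NAME \
      --image=ghcr.io/huggingface/text-generation-inference:latest \
      --args="--model-id=meta-llama/Llama-3.2-1B-Instruct" \
      --set-secrets=HF_TOKEN=HF_TOKEN:latest \
      --port=80 \
      --cpu=8 --memory=32Gi \
      --gpu=1 --gpu-type=nvidia-l4 \
      --no-cpu-throttling \
      --max-instances=1 \
      --region=us-central1 \
      --allow-unauthenticated

GPU-attached Cloud Run services currently require CPU to be always allocated, which is why --no-cpu-throttling is set.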

Endpoint testing

  • Test your deployed model using curl, as in the example after this list.

  • This sends a request to your Cloud Run service for a chat completion, demonstrating how to interact with the deployed model.
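
Here is a hedged example of such a request, assuming the service allows unauthenticated access and exposes TGI's OpenAI-compatible chat completions endpoint; the prompt and token limit are placeholders.

    # Look up the service URL (match the region to your deployment).
    SERVICE_URL=$(gcloud run services describe $SERVICE_NAME \
      --region=us-central1 --format='value(status.url)')

    curl "$SERVICE_URL/v1/chat/completions" \
      -H "Content-Type: application/json" \
      -d '{
            "model": "tgi",
            "messages": [{"role": "user", "content": "What is Cloud Run?"}],
            "max_tokens": 128
          }'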

Cold start improvements with Cloud Storage FUSE

You’ll notice that a cold start takes more than a minute before the response returns. Can we do better?

We can use Cloud Storage FUSE, an open-source tool that lets you mount Google Cloud Storage buckets as a file system.

First, download the model files and upload them to a Cloud Storage bucket (remember to update GCS_BUCKET); a sketch follows.
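
A hedged sketch of those two steps, assuming the huggingface-cli download command and a bucket you can write to; the local directory name is arbitrary.

    # Download the model snapshot locally (uses the Hugging Face login from earlier).
    huggingface-cli download meta-llama/Llama-3.2-1B-Instruct \
      --local-dir ./Llama-3.2-1B-Instruct

    # Copy the files into your Cloud Storage bucket (update GCS_BUCKET).
    gcloud storage cp -r ./Llama-3.2-1B-Instruct gs://GCS_BUCKET/Llama-3.2-1B-Instruct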

Now, we will create a new Cloud Run service using a deployment script like the one sketched below (remember to update BUCKET_NAME). You may also need to update the network and subnet names.
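
The sketch below assumes Cloud Run's Cloud Storage volume mounts and Direct VPC egress on the default network; the mount path, network, and subnet names are placeholders, and the original post's script may differ.

    gcloud beta run deploy $SERVICE_NAME-gcs \
      --image=ghcr.io/huggingface/text-generation-inference:latest \
      --args="--model-id=/model/Llama-3.2-1B-Instruct" \
      --add-volume=name=model,type=cloud-storage,bucket=BUCKET_NAME \
      --add-volume-mount=volume=model,mount-path=/model \
      --network=default --subnet=default --vpc-egress=private-ranges-only \
      --port=80 \
      --cpu=8 --memory=32Gi \
      --gpu=1 --gpu-type=nvidia-l4 \
      --no-cpu-throttling \
      --max-instances=1 \
      --region=us-central1 \
      --allow-unauthenticated

Because the weights are read from the mounted bucket instead of being downloaded from Hugging Face at startup, cold starts should be noticeably shorter.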

Next Steps

To learn more about Cloud Run with NVIDIA GPUs and to deploy your own open-source model from Hugging Face, check out the Cloud Run GPU documentation and Hugging Face's Google Cloud deployment guides.
