No GPU? No problem. localllm lets you develop gen AI apps on local CPUs

In today’s fast-paced AI landscape, developers face numerous challenges when building applications that use large language models (LLMs). In particular, the scarcity of GPUs, which are traditionally required for running LLMs, poses a significant hurdle. In this post, we introduce a solution that lets developers harness the power of LLMs locally on CPU and memory, right within Cloud Workstations, Google Cloud’s fully managed development environment. The models used in this walkthrough are hosted on Hugging Face, specifically in repositories from “The Bloke,” and are quantized in a way that lets them run on CPUs or low-power GPUs. This approach not only eliminates the need for GPUs but also opens up a world of possibilities for seamless and efficient application development. By combining “quantized models,” Cloud Workstations, a new open-source tool named localllm, and generally available resources, you can develop AI-based applications on a well-equipped development workstation, leveraging your existing processes and workflows.

Quantized models + Cloud Workstations == Productivity

Quantized models are AI models that have been optimized to run on local devices with limited computational resources. They are designed to be more efficient in terms of memory usage and processing power, allowing them to run smoothly on devices such as smartphones, laptops, and other edge devices. In this case, we are running them on Cloud Workstations with ample available resources. Here are some of the ways quantized models can unblock your development loop (a short sketch after the list illustrates the first two points):

  • Improved performance: Quantized models are optimized to perform computations using lower-precision data types such as 8-bit integers, instead of standard 32-bit floating-point numbers. This reduction in precision allows for faster computations and improved performance on devices with limited resources.

  • Reduced memory footprint: Quantization techniques help reduce the memory requirements of AI models. By representing weights and activations with fewer bits, the overall size of the model is reduced, making it easier to fit on devices with limited storage capacity. 

  • Faster inference: Quantized models can perform computations more quickly due to their reduced precision and smaller model size. This enables faster inference times, allowing AI applications to run more smoothly and responsively on local devices.
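
To make the first two points concrete, here is a minimal, illustrative sketch of symmetric 8-bit weight quantization using NumPy. This is not the scheme localllm or GGUF models use internally; it simply shows how trading float32 for int8 shrinks a weight tensor roughly fourfold at the cost of a small approximation error.

    # Illustrative 8-bit symmetric quantization of a single weight tensor.
    # Not the actual GGUF/localllm quantization scheme.
    import numpy as np

    weights = np.random.randn(4096, 4096).astype(np.float32)  # a float32 "layer"

    scale = np.abs(weights).max() / 127.0                   # one scale for the tensor
    q_weights = np.round(weights / scale).astype(np.int8)   # stored as 8-bit integers
    dequantized = q_weights.astype(np.float32) * scale      # approximate reconstruction

    print(f"float32 size:  {weights.nbytes / 1e6:.1f} MB")    # ~67.1 MB
    print(f"int8 size:     {q_weights.nbytes / 1e6:.1f} MB")  # ~16.8 MB
    print(f"max abs error: {np.abs(weights - dequantized).max():.4f}")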

Combining quantized models with Cloud Workstations allows you to take advantage of the flexibility, scalability and cost effectiveness of Cloud Workstations. Moreover, the traditional approach of relying on remote servers or cloud-based GPU instances for LLM-based application development can introduce latency, security concerns, and dependency on third-party services. A solution that lets you leverage LLMs locally, within your Cloud Workstations, without compromising performance, security, or control over your data, can have a lot of benefits.

Introducing localllm

Today, we’re introducing localllm, a set of tools and libraries that provides easy access to quantized models from Hugging Face through a command-line utility. localllm can be a game-changer for developers seeking to leverage LLMs without the constraints of GPU availability. The repository provides a comprehensive framework and tools to run LLMs locally on CPU and memory, right within a Google Cloud Workstation (though you can also run them on your local machine or anywhere else with a sufficiently powerful CPU). By eliminating the dependency on GPUs, you can unlock the full potential of LLMs for your application development needs.

Key features and benefits

GPU-free LLM execution: localllm lets you execute LLMs on CPU and memory, removing the need for scarce GPU resources, so you can integrate LLMs into your application development workflows, without compromising performance or productivity.

Enhanced productivity: With localllm, you use LLMs directly within the Google Cloud ecosystem. This integration streamlines the development process, reducing the complexities associated with remote server setups or reliance on external services. Now, you can focus on building innovative applications without managing GPUs.

Cost efficiency: By leveraging localllm, you can significantly reduce infrastructure costs associated with GPU provisioning. The ability to run LLMs on CPU and memory within the Google Cloud environment lets you optimize resource utilization, resulting in cost savings and improved return on investment.

Improved data security: Running LLMs locally on CPU and memory helps keep sensitive data within your control. With localllm, you can mitigate the risks associated with data transfer and third-party access, enhancing data security and privacy.

Seamless integration with Google Cloud services: localllm integrates with various Google Cloud services, such as data storage and machine learning APIs, so you can leverage the full potential of the Google Cloud ecosystem.

Getting started with localllm

To get started with localllm, visit the GitHub repository at https://github.com/googlecloudplatform/localllm. The repository provides detailed documentation, code samples, and step-by-step instructions to set up and use LLMs locally on CPU and memory within the Google Cloud environment. You can explore the repository, contribute to its development, and leverage its capabilities to enhance your application development workflows.

Once you’ve cloned the repo locally, the following steps run localllm with a quantized model of your choice from The Bloke’s Hugging Face repos, then execute an initial sample prompt query. This example uses Llama 2.

    # Install the tools
    pip3 install openai
    pip3 install ./llm-tool/.

    # Download and run a model
    llm run TheBloke/Llama-2-13B-Ensemble-v5-GGUF 8000

    # Try out a query
    ./querylocal.py
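
If you’d like to query the model from your own code rather than the bundled ./querylocal.py script, the server started by llm run speaks an OpenAI-compatible API. The following sketch assumes a /v1 endpoint on the port passed to llm run and uses the model’s repo name as the model identifier; treat both as assumptions and check querylocal.py in the repo for the canonical example.

    # Hypothetical client sketch: query the locally served model over its
    # OpenAI-compatible API. Endpoint path and model name are assumptions;
    # see querylocal.py in the localllm repo for the canonical version.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",  # port passed to `llm run`
        api_key="not-needed-locally",         # the local server doesn't check keys
    )

    completion = client.completions.create(
        model="TheBloke/Llama-2-13B-Ensemble-v5-GGUF",  # assumed model identifier
        prompt="In one sentence, what is a quantized model?",
        max_tokens=128,
    )
    print(completion.choices[0].text)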

Creating a localllm-enabled Cloud Workstation

To get started with localllm and Cloud Workstations, you’ll need a Google Cloud project and the gcloud CLI installed. First, build a Cloud Workstations container image that includes localllm, then use it as the basis for your developer workstation (which also comes pre-equipped with VS Code).

    gcloud config set project $PROJECT_ID

    # Enable needed services
    gcloud services enable \
      cloudbuild.googleapis.com \
      workstations.googleapis.com \
      container.googleapis.com \
      containeranalysis.googleapis.com \
      containerscanning.googleapis.com \
      artifactregistry.googleapis.com

    # Create AR Docker repository
    gcloud artifacts repositories create localllm \
      --location=us-central1 \
      --repository-format=docker

Next, submit a build of the Dockerfile, which also pushes the image to Artifact Registry.

    gcloud builds submit .

The published image is named:

    us-central1-docker.pkg.dev/$PROJECT_ID/localllm/localllm

The next step is to create and launch a workstation using the custom image. We suggest a machine type of e2-standard-32 (32 vCPUs, 16 cores, and 128 GB of memory), an admittedly beefy machine.
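
As a rough, assumption-laden sanity check on that sizing, a 13B-parameter model quantized to roughly 4-5 bits per weight occupies well under 10 GB, so it fits comfortably in the workstation’s 128 GB of memory with plenty left over for the development environment.

    # Back-of-the-envelope model sizing (illustrative assumptions, not an
    # official formula): a 13B-parameter model at ~4.5 bits per weight.
    params = 13e9
    bits_per_weight = 4.5  # typical for 4-bit GGUF variants, including overhead
    model_gb = params * bits_per_weight / 8 / 1e9
    print(f"approximate model size: {model_gb:.1f} GB")  # ~7.3 GB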

The following example uses gcloud to set up a cluster, a workstation configuration, and a workstation based on the custom image with llm installed. Replace $CLUSTER with your desired cluster name; the command below creates a new one (which takes ~20 minutes).

    gcloud workstations clusters create $CLUSTER \
      --region=us-central1

The next steps create the workstation and start it up. These steps take ~10 minutes to run.

    # Create workstation configuration
    gcloud workstations configs create localllm-workstation \
      --region=us-central1 \
      --cluster=$CLUSTER \
      --machine-type=e2-standard-32 \
      --container-custom-image=us-central1-docker.pkg.dev/$PROJECT_ID/localllm/localllm

    # Create the workstation
    gcloud workstations create localllm-workstation \
      --cluster=$CLUSTER \
      --config=localllm-workstation \
      --region=us-central1

    # Grant access to the default Cloud Workstations service account
    gcloud artifacts repositories add-iam-policy-binding \
      localllm \
      --location=us-central1 \
      --member=serviceAccount:service-$PROJECT_NUM@gcp-sa-workstationsvm.iam.gserviceaccount.com \
      --role=roles/artifactregistry.reader

    # Start the workstation
    gcloud workstations start localllm-workstation \
      --cluster=$CLUSTER \
      --config=localllm-workstation \
      --region=us-central1

You can connect to the workstation using ssh (shown below), or interactively in the browser.

    gcloud workstations ssh localllm-workstation \
      --cluster=$CLUSTER \
      --config=localllm-workstation \
      --region=us-central1

After serving a model (via the llm run command with the port of your choice), you can interact with it by visiting the live OpenAPI documentation page. You can apply this process to any model listed in The Bloke’s repos on Hugging Face; Llama was used in this scenario as an example. First, get the hostname of the workstation using:

    gcloud workstations describe localllm-workstation \
      --cluster=$CLUSTER \
      --config=localllm-workstation \
      --region=us-central1

Then, in the browser, visit https://$PORT-$HOSTNAME/docs.

Conclusion

localllm combined with Cloud Workstations revolutionizes AI-driven application development by letting you use LLMs locally on CPU and memory within the Google Cloud environment. By eliminating the need for GPUs, you can overcome the challenges posed by GPU scarcity and unlock the full potential of LLMs. With enhanced productivity, cost efficiency, and improved data security, localllm lets you build innovative applications with ease. Embrace the power of local LLMs and explore the possibilities within the Google Cloud ecosystem with localllm and Cloud Workstations today!
