Introduction
Generative AI is transforming how businesses operate and accelerating the pace of innovation across the broader AI field. It equips businesses to create human-like text, images, code, and audio, capabilities that were once considered beyond reach. Its applications extend beyond executing prompts and holding interactive conversations: these models are increasingly used for code generation, content summarization, data analysis, and more. Their adoption by enterprises across industries highlights the versatility of large language models (LLMs) in addressing complex tasks. However, these advantages come with a new set of challenges, particularly in training and operationalizing these massive models.
The growing scale of large language models (LLMs) and generative AI greatly increases computational demands, which raises the cost of development and deployment. As datasets grow and models become more complex, training them efficiently requires more substantial resources. This trend underscores the importance of cost-effective solutions like Amazon Elastic Kubernetes Service (Amazon EKS), which provides the scalability and computational power needed to manage these extensive training workloads without incurring prohibitive expenses. According to projections by TIRIAS Research, AI infrastructure costs could surpass $76 billion by 2028. Existing business models struggle to pass these growing costs on to consumers, so either new business models or a substantial reduction in costs is needed to keep generative AI growing and affordable.
Amid rising costs and an increasingly scarce global compute supply, AWS Trainium offers a practical option for model developers. By using AWS Trainium, developers can reduce the cost of training their models by up to 50%, while also optimizing performance in distributed training use cases. For more information about AWS Trainium and its capabilities, see the AWS Trainium product page.
Distributed training architecture with AWS Trainium and Amazon EKS
The solution builds on a Data on Amazon EKS Terraform-based blueprint, which provisions an Amazon EKS cluster along with a managed EKS nodegroup containing Amazon Elastic Compute Cloud (Amazon EC2) Trn1 instances. Each trn1.32xlarge instance has 16 AWS Trainium chips, which can be used for scalable, high-performance, and cost-effective model training. Within the nodegroup, the Trn1 instances are connected via high-speed, low-latency Elastic Fabric Adapter (EFA) networking to enable the collective communications required during distributed training.
Each Llama training job is executed via Kubernetes pods using a container image that includes the Neuron SDK (the software stack for Trn1 instances) and the AWS Neuron Reference for NeMo Megatron – a fork of the open-source packages NeMo and Apex that have been adapted for use with OpenXLA and AWS Neuron. The combined software stack provides advanced training strategies and features including data, tensor, pipeline and sequence parallelism, selective activation checkpointing, and ZeRO-1 optimizer sharding.
The Kubernetes MPI Operator is used to coordinate distributed training across multiple pods, where each worker pod runs on a single trn1.32xlarge instance.
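To make the scale concrete, the arithmetic that links node count to parallelism degrees can be sketched in Python. The figure of 16 Trainium chips per trn1.32xlarge comes from the description above. Two NeuronCores per chip, and one worker process per NeuronCore, are assumptions about first-generation Trainium rather than something stated in this post. The node count and the tensor and pipeline parallel degrees in the example are hypothetical values chosen for illustration, not the settings used by the training scripts.

```python
# Sketch: how node count relates to world size and data-parallel degree.
CHIPS_PER_TRN1_32XLARGE = 16   # from the architecture description
NEURONCORES_PER_CHIP = 2       # assumption: first-generation Trainium


def world_size(num_nodes: int) -> int:
    """Total NeuronCores across all nodes (assuming one worker process each)."""
    return num_nodes * CHIPS_PER_TRN1_32XLARGE * NEURONCORES_PER_CHIP


def data_parallel_degree(num_nodes: int, tensor_parallel: int,
                         pipeline_parallel: int) -> int:
    """Number of model replicas left after tensor and pipeline parallelism."""
    total = world_size(num_nodes)
    model_parallel = tensor_parallel * pipeline_parallel
    if total % model_parallel != 0:
        raise ValueError(
            f"{total} workers cannot be split evenly into groups of {model_parallel}"
        )
    return total // model_parallel


if __name__ == "__main__":
    nodes, tp, pp = 4, 8, 1   # hypothetical example values
    print(f"world size: {world_size(nodes)}")
    print(f"data-parallel replicas: {data_parallel_degree(nodes, tp, pp)}")
```

Under these assumptions, tensor and pipeline parallelism split a single model replica across workers, and whatever capacity remains is used for data-parallel replicas.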
An Amazon FSx for Lustre shared filesystem is attached to the worker pods, providing a shared location to store the dataset, tokenizer files, Llama training scripts, training logs, compilation artifacts, and model checkpoints.
Solution Overview
Training Llama2 using AWS Trainium on Amazon EKS
Note: This post makes use of Meta’s Llama tokenizer, which is protected by a user license that must be accepted before the tokenizer files can be downloaded. Please ensure that you have access to the Llama files by requesting access here.
Prerequisites:
- Amazon EC2 instance (x86) or AWS Cloud9 instance (x86), in either case with at least 100 GB of storage
- AWS Command Line Interface (AWS CLI) v2
- kubectl
- Git (only needed on an Amazon EC2 instance; AWS Cloud9 comes with Git installed by default)
- Docker
- Terraform
- Python, pip, jq, unzip
To install all the prerequisites on Amazon EC2, you can run this script.
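Before moving on, it can help to confirm that the tools are actually on the PATH. The following is a minimal Python sketch of such a check, not part of the blueprint. The executable names (for example `aws` for the AWS CLI, and `python3` and `pip` for Python and pip) are assumptions about how the tools are installed on your instance.

```python
# Sketch: report which prerequisite command-line tools are missing from PATH.
import shutil
from typing import Callable, Iterable, List, Optional

# Assumed executable names for the prerequisites listed above.
REQUIRED_TOOLS = ["aws", "kubectl", "git", "docker", "terraform",
                  "python3", "pip", "jq", "unzip"]


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that `which` cannot locate, in the order given."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing prerequisites: " + ", ".join(missing))
    else:
        print("All prerequisites found.")
```

The `which` parameter exists only so the lookup can be swapped out in tests; in normal use the default `shutil.which` is enough.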
Step 1: Clone the Data on EKS repository
Navigate to the trainium-inferentia directory.
By default, the MPI Operator is not installed: its enable flag is set to false.
For this post, we will run the below export command to set environment variables.
Note: As of January 1, 2024, AWS Trainium instances are only available in the us-west-2, us-east-1, and us-east-2 Regions.
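A small guard can catch an unsupported Region before any infrastructure is provisioned. This Python sketch encodes only the three Regions named in the note above. The function name is hypothetical, and the list may be out of date by the time you read this.

```python
# Sketch: validate a Region against the Trainium Regions listed in this post.
# The set reflects the note above (as of January 1, 2024) and may have changed.
TRAINIUM_REGIONS = frozenset({"us-west-2", "us-east-1", "us-east-2"})


def check_region(region: str) -> str:
    """Return the Region unchanged if it is in the list, else raise ValueError."""
    if region not in TRAINIUM_REGIONS:
        supported = ", ".join(sorted(TRAINIUM_REGIONS))
        raise ValueError(
            f"{region!r} is not in the Regions listed for Trainium: {supported}"
        )
    return region
```

For example, `check_region("us-west-2")` returns the Region name, while `check_region("eu-west-1")` raises a `ValueError`.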
Step 2: Run the install script to provision an Amazon EKS cluster with all the add-ons needed for the solution.
Note: Before you run the script, you can also change the cluster name based on your naming requirements.
Step 3: Configure access to the Amazon EKS cluster, which the following steps require.
Step 4: Navigate to examples/llama2 directory
Run the 1-llama2-neuronx-pretrain-build-image.sh script to build the neuronx-nemo-megatron container image and push the image into Amazon ECR.
When prompted for a Region, enter the Region in which you launched your Amazon EKS cluster (Step 1).
Note: Building the image and pushing it to Amazon ECR takes approximately 10 minutes.
Step 5: Access the shared Amazon FSx storage.
To copy files to this storage, we’ll first launch and connect to a CLI pod running the neuronx-nemo-megatron docker image that you created previously.
Run the following script to launch the CLI pod:
Run the following command to see the CLI pod going into ‘Running’ state:
Step 6: Once the CLI pod is ‘Running’, connect to it using the following command:
From the CLI pod, we’ll download the Llama tokenizer files. First, run the huggingface-cli login command to log in to Hugging Face using your access token.
The access token is found under Settings → Access Tokens on the Hugging Face website.
Paste the access token and press Enter.
Note: Do not add the token as a Git credential
Step 7: Download the Llama2-7B tokenizer files to /shared/llama7b_tokenizer by running the following Python code.
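The download can be sketched with the Hugging Face transformers library as follows. This is a minimal sketch rather than the exact code from the repository. The Hub repository ID `meta-llama/Llama-2-7b-hf` is an assumption, and the call only succeeds after you have accepted the Llama license and logged in with huggingface-cli as described above.

```python
# Sketch: fetch the Llama2-7B tokenizer and save it to the shared filesystem.
TOKENIZER_DIR = "/shared/llama7b_tokenizer"   # target path from this step
MODEL_ID = "meta-llama/Llama-2-7b-hf"         # assumption: gated Hub repository


def download_tokenizer(model_id: str = MODEL_ID,
                       out_dir: str = TOKENIZER_DIR) -> None:
    """Download the tokenizer files from the Hugging Face Hub into out_dir."""
    # Imported here so the module still loads where transformers is absent.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.save_pretrained(out_dir)


# Usage (inside the CLI pod, after logging in to Hugging Face):
#     download_tokenizer()
```

Saving with `save_pretrained` writes the tokenizer files to the shared filesystem so that the training pods can read them later.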
Step 8: Download and tokenize the RedPajama-Data-1T-Sample dataset (a small subset of the full RedPajama dataset that contains 1B tokens).
While still connected to the CLI pod, use Git to download the dataset.
Step 9: Tokenize the dataset using the preprocessing script included with neuronx-nemo-megatron. This preprocessing step will take approximately 60 minutes to run on a trn1.32xlarge instance.
The tokenization process handles 930,500 documents in total.
Note: When we later launch our training jobs in Amazon EKS, the training pods will run the training script from within neuronx-nemo-megatron/nemo/examples directory on FSx. This is convenient, because it will let you modify your training script directly on Amazon FSx without requiring that you rebuild the neuronx-nemo-megatron container for every change.
Step 10: Modify the test_llama script /shared/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/test_llama.sh to update the following two lines. These lines tell the training pod workers where to find the Llama tokenizer and the dataset on the Amazon FSx filesystem.
Run:
Before:
After:
Step 11: When you are finished with the CLI pod you can delete it by running:
Step 12: We are now ready to launch our pre-compilation and training jobs!
Before we can run the training job, we first need to run a pre-compilation job in order to prepare the model artifacts. This step extracts and compiles the underlying compute graphs for the Llama2-7B model and generates AWS Neuron executable files (NEFFs) that can run on the AWS Trainium chips. These NEFFs are stored in a persistent AWS Neuron cache on Amazon FSx so that the training job can later access them.
Before you run the compilation job, make sure the MPI Operator is functional by running this command:
Run the pre-compilation script:
Pre-compilation will take approximately 10 minutes when using 4 trn1.32xlarge nodes.
Periodically run kubectl get pods | grep compile and wait until you see that the compile job shows Completed.
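The manual polling described above can also be automated. The sketch below parses the default table printed by `kubectl get pods` and waits until a pod whose name contains "compile" reports Completed. The function names and the 60-second polling interval are hypothetical, and it assumes `kubectl` is already configured for the cluster.

```python
# Sketch: wait for the pre-compilation job by polling `kubectl get pods`.
import subprocess
import time
from typing import Dict


def pod_statuses(kubectl_output: str) -> Dict[str, str]:
    """Map pod name to STATUS from the default `kubectl get pods` table."""
    statuses = {}
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Columns are NAME, READY, STATUS, ...; skip the header and short lines.
        if len(fields) >= 3 and fields[0] != "NAME":
            statuses[fields[0]] = fields[2]
    return statuses


def compile_done(kubectl_output: str, name_filter: str = "compile") -> bool:
    """True once any pod whose name contains name_filter shows Completed."""
    return any(
        status == "Completed"
        for name, status in pod_statuses(kubectl_output).items()
        if name_filter in name
    )


def wait_for_compile(poll_seconds: int = 60) -> None:
    """Poll the cluster until the compile job reports Completed."""
    while True:
        result = subprocess.run(
            ["kubectl", "get", "pods"],
            capture_output=True, text=True, check=True,
        )
        if compile_done(result.stdout):
            return
        time.sleep(poll_seconds)
```

The check uses `any` rather than `all` because worker pods may still be listed in another state after the launcher pod has completed; adjust this if your job behaves differently.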
Step 13: When pre-compilation is complete, launch the pre-training job on 4 trn1.32xlarge nodes by running the following script:
Step 14: To monitor the training job output, first, find the name of the launcher pod associated with your training job:
Once you have identified the name of the launcher pod and see that it is Running, the next step is to determine its UID.
Replace test-mpi-train-launcher-xxx with your launcher pod name in the following command and it will output the UID:
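An equivalent lookup can be sketched in Python by asking `kubectl` for the pod as JSON and reading its `metadata.uid` field. The function names are hypothetical, and the sketch assumes `kubectl` is already configured for the cluster.

```python
# Sketch: read a pod's UID from the JSON returned by kubectl.
import json
import subprocess


def uid_from_pod_json(pod_json: str) -> str:
    """Extract metadata.uid from a pod object serialized as JSON."""
    return json.loads(pod_json)["metadata"]["uid"]


def launcher_pod_uid(pod_name: str) -> str:
    """Fetch the named pod with kubectl and return its UID."""
    result = subprocess.run(
        ["kubectl", "get", "pod", pod_name, "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    return uid_from_pod_json(result.stdout)


# Usage: replace the placeholder with your launcher pod name.
#     print(launcher_pod_uid("test-mpi-train-launcher-xxx"))
```

Separating the JSON parsing from the `kubectl` call keeps the parsing step easy to test without a cluster.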
Step 15: Use the UID to determine the log path so you can tail the training logs. Replace UID with the previous value.
When you are done viewing the logs, you can press CTRL-C to quit the tail command.
Step 16: To monitor AWS Trainium chip utilization you can use the neuron-top command.
Neuron-top is a console-based tool for monitoring AWS Neuron and system-related performance metrics on trn1/inf2/inf1 instances. You can launch neuron-top on one of the worker pods as follows:
Step 17: Create a Tensorboard deployment to visualize these logs by running the following command:
TensorBoard logs are also available in the /shared/nemo_experiments/ directory on the Amazon FSx for Lustre filesystem. Once the deployment is ready, the script outputs a password-protected load balancer URL for your new TensorBoard deployment.
Open the load balancer URL in your browser to view your training progress:
Cleaning up
To clean up all the provisioned resources for this post, run the cleanup script:
Conclusion
In this post, we showed how AWS Trainium’s integration with neuronx-nemo-megatron on Amazon EKS helps address the rising computational demands and cost challenges of training advanced AI models. AWS Trainium offers up to 50% cost savings in training, coupled with high-performance capabilities. This, along with the Neuron SDK’s compatibility with popular machine learning (ML) frameworks, provides a strong environment for AI model training. The MPI Operator and Data on Amazon EKS (DoEKS) further improve the efficiency and scalability of distributed training, and features like ZeRO-1 optimizer sharding and selective activation checkpointing make cutting-edge ML research more accessible.
Key links & references
AWS Trainium
AWS Neuron Reference for NeMo Megatron