Deploy Stable Diffusion ComfyUI on AWS elastically and efficiently


Introduction

ComfyUI is an open-source node-based workflow solution for Stable Diffusion. It offers the following advantages:

  • Significant performance optimization for SDXL model inference
  • High customizability, allowing users granular control
  • Portable workflows that can be shared easily
  • Developer-friendly

Due to these advantages, ComfyUI is increasingly being used by artistic creators. In this post, we will introduce how to deploy ComfyUI on AWS elastically and efficiently.

Overview of solution

The solution is characterized by the following features:

  • Infrastructure as Code (IaC) deployment: We employ a minimalist approach to operations and maintenance. Using AWS Cloud Development Kit (AWS CDK) and Amazon Elastic Kubernetes Service (Amazon EKS) Blueprints, we manage the Amazon EKS clusters that host and run ComfyUI.
  • Dynamic scaling with Karpenter: Leveraging the capabilities of Karpenter, we customize node scaling strategies to meet business needs.
  • Cost savings with Amazon Spot Instances: We use Amazon Spot Instances to reduce the costs of GPU instances.
  • Optimized use of GPU instance store: By fully utilizing the instance store of GPU instances, we maximize performance for model loading and switching while minimizing the costs associated with model storage and transfer.
  • Direct image writing with Amazon Simple Storage Service (Amazon S3) CSI driver: Images generated are directly written to Amazon S3 using the S3 CSI driver, reducing storage costs.
  • Accelerated dynamic requests with Amazon CloudFront: To facilitate the use of the platform by art studios across different regions, we use Amazon CloudFront for faster dynamic request processing.
  • Serverless event-initiated model synchronization: When models are uploaded to or deleted from Amazon S3, serverless event initiations activate, syncing the model directory data across worker nodes.

Walkthrough

The solution’s architecture is structured into two distinct phases: the deployment phase and the user interaction phase.

Figure 1. Architecture for deploying stable diffusion on ComfyUI

Deployment phase

  1. Model storage in Amazon S3: ComfyUI’s models are stored in an S3 bucket for models, mirroring the directory structure of the native ComfyUI/models directory.
  2. GPU node initialization in Amazon EKS cluster: When GPU nodes in the EKS cluster are initiated, they format the local instance store and synchronize the models from Amazon S3 to the local instance store using user data scripts.
  3. Running ComfyUI pods in EKS: Pods operating ComfyUI effectively link the instance store directory on the node to the pod’s internal models directory, facilitating seamless model access and loading.
  4. Model sync with AWS Lambda: When models are uploaded to or deleted from Amazon S3, an AWS Lambda function synchronizes the models from S3 to the local instance store on all GPU nodes by using SSM commands.
  5. Output mapping to Amazon S3: Pods running ComfyUI map the ComfyUI/output directory to the S3 bucket for outputs through a Persistent Volume Claim (PVC).
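
The model-sync step above (step 4) can be sketched as a small Lambda handler. This is a minimal illustration, not the sample repo's actual code: the local sync path, the node tag, and the helper name `build_sync_command` are assumptions.

```python
def build_sync_command(bucket: str) -> str:
    """Shell command each GPU node runs to mirror the models bucket
    into its local instance store (the local path is illustrative)."""
    return f"aws s3 sync s3://{bucket} /comfyui-models --delete"

def lambda_handler(event, context):
    import boto3  # AWS SDK for Python, provided by the Lambda runtime

    # Triggered by S3 ObjectCreated/ObjectRemoved events on the models bucket.
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    ssm = boto3.client("ssm")
    # Fan the sync out to all GPU nodes via SSM Run Command; targeting
    # nodes by tag is an assumption about how the fleet is labeled.
    ssm.send_command(
        Targets=[{"Key": "tag:NodeGroup", "Values": ["gpu"]}],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [build_sync_command(bucket)]},
    )
```

Because the sync runs `aws s3 sync --delete`, deletions in the bucket propagate to the nodes as well, keeping the instance store an exact mirror.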

User interaction phase

  1. Request routing: When a user request reaches the Amazon EKS pod through CloudFront and the Application Load Balancer (ALB), the pod first loads the model from the instance store.
  2. Post-inference image storage: After inference, the pod stores the image in the ComfyUI/output directory, which is directly written to Amazon S3 using the S3 CSI driver.
  3. Performance advantages of instance store: Thanks to the performance benefits of the instance store, the time taken for initial model loading and model switching is significantly reduced.

You can find the deployment code and detailed instructions in our GitHub samples library.

Image Generation

Once deployed, you can access and use the ComfyUI frontend directly through a browser by visiting the CloudFront domain name or the Kubernetes Ingress domain name.

Figure 2. Accessing ComfyUI through a browser

You can also interact with ComfyUI by saving its workflow as an API-callable JSON file.

Figure 3. Accessing ComfyUI through an API
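
The API flow in Figure 3 boils down to POSTing the exported workflow JSON to ComfyUI's /prompt endpoint. A minimal sketch, assuming a plain-HTTP endpoint; the server address and workflow file below are placeholders, and error handling is omitted:

```python
import json
import uuid
import urllib.request

def build_prompt_payload(workflow: dict, client_id: str) -> bytes:
    """Wrap an exported workflow in the envelope ComfyUI's /prompt endpoint expects."""
    return json.dumps({"prompt": workflow, "client_id": client_id}).encode("utf-8")

def queue_prompt(server_address: str, workflow: dict) -> dict:
    payload = build_prompt_payload(workflow, client_id=str(uuid.uuid4()))
    req = urllib.request.Request(
        f"http://{server_address}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # the response includes the queued prompt_id

# Example: load a workflow exported via "Save (API Format)" and queue it.
# workflow = json.load(open("workflow_api.json"))
# queue_prompt("your-ingress-or-cloudfront-domain", workflow)
```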

Deployment Instructions

Prerequisites

This solution assumes that you have already installed, deployed, and are familiar with the tools used in this walkthrough: the AWS CLI, AWS CDK, Node.js and npm, kubectl, eksctl, Docker, and git.

Make sure that you have enough vCPU quota for G instances (at least 8 vCPUs for the g5.2xlarge/g4dn.2xlarge used in this guidance).

  1. Download the code, check out the branch, and install the npm packages:
    git clone https://github.com/aws-samples/comfyui-on-eks ~/comfyui-on-eks
    cd ~/comfyui-on-eks && git checkout v0.2.0
    npm install
  2. Run npm list to ensure the required packages are installed.
  3. Run cdk list to ensure the environment is set up correctly. You will have the following AWS CloudFormation stacks to deploy:
    Comfyui-Cluster
    CloudFrontEntry
    LambdaModelsSync
    S3OutputsStorage
    ComfyuiEcrRepo

Deploy EKS Cluster

  1. Run the following command:
    cd ~/comfyui-on-eks && cdk deploy Comfyui-Cluster
  2. CloudFormation will create a stack named Comfyui-Cluster to deploy all the resources required for the EKS cluster. This process typically takes around 20 to 30 minutes to complete.
  3. Upon successful deployment, the CDK outputs will present a ConfigCommand. This command is used to update the configuration, enabling access to the EKS cluster via kubectl.

    Figure 4. ConfigCommand output screenshot

  4. Execute the ConfigCommand to authorize kubectl to access the EKS cluster.
  5. To verify that kubectl has been granted access to the EKS cluster, execute the following command:
    kubectl get svc

The deployment of the EKS cluster is complete. Note that the EKS Blueprints output includes KarpenterInstanceNodeRole, the role for the nodes managed by Karpenter. Record this role; it will be configured later.

Deploy an Amazon S3 bucket for storing models and set up AWS Lambda for dynamic model synchronization

  1. Run the following command:
    cd ~/comfyui-on-eks && cdk deploy LambdaModelsSync
  2. The LambdaModelsSync stack primarily creates the following resources:
    • S3 bucket: The S3 bucket is named following the format comfyui-models-{account_id}-{region}; it’s used to store ComfyUI models.
    • Lambda function, along with its associated role and event source: The Lambda function, named comfy-models-sync, is designed to initiate the synchronization of models from the S3 bucket to local storage on GPU instances whenever models are uploaded to or deleted from S3.
  3. Once the S3 bucket for models and the Lambda function are deployed, the S3 bucket will initially be empty. Execute the following command to initialize the S3 bucket and download the SDXL model for testing purposes:
    region="us-west-2" # Modify the region to your current region.
    cd ~/comfyui-on-eks/test/ && bash init_s3_for_models.sh $region

    There’s no need to wait for the model to finish downloading and uploading to S3. You can proceed with the following steps; just ensure the model is uploaded to S3 before starting the GPU nodes.
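
The key idea behind the init script is that S3 keys mirror ComfyUI's local models/ layout, so the node-side sync lands every file where ComfyUI expects it. A hedged sketch of that upload convention (the bucket name, helper names, and file paths are illustrative, not from the sample repo):

```python
def model_key(category: str, filename: str) -> str:
    """S3 keys mirror ComfyUI's local models/ layout, e.g. checkpoints/<file>."""
    return f"{category}/{filename}"

def upload_model(bucket: str, category: str, local_path: str) -> None:
    import boto3  # AWS SDK for Python

    filename = local_path.rsplit("/", 1)[-1]
    boto3.client("s3").upload_file(local_path, bucket, model_key(category, filename))

# Example (names are illustrative):
# upload_model("comfyui-models-123456789012-us-west-2", "checkpoints",
#              "/tmp/sd_xl_base_1.0.safetensors")
```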

Deploy S3 bucket for storing images generated by ComfyUI

Run the following command:
cd ~/comfyui-on-eks && cdk deploy S3OutputsStorage

The S3OutputsStorage stack creates an S3 bucket, named following the pattern comfyui-outputs-{account_id}-{region}, which is used to store images generated by ComfyUI.

Deploy ComfyUI workload

The ComfyUI workload is deployed through Kubernetes.

Build and push ComfyUI Docker image

  1. Run the following command to create an Amazon ECR repository for the ComfyUI image:
    cd ~/comfyui-on-eks && cdk deploy ComfyuiEcrRepo
  2. Run the build_and_push.sh script on a machine where Docker has been successfully installed:
    region="us-west-2" # Modify the region to your current region.
    cd ~/comfyui-on-eks/comfyui_image/ && bash build_and_push.sh $region

    Note:

    • The Dockerfile uses a combination of git clone and git checkout to pin a specific version of ComfyUI. Modify this as needed.
    • The Dockerfile does not install custom nodes; these can be added as needed using the RUN command.
    • To update ComfyUI, you only need to rebuild the image and replace it with the new version.

Deploy Karpenter for managing GPU instance scaling

Get the KarpenterInstanceNodeRole output from the previous section, then run the following command to deploy the Karpenter provisioner:

KarpenterInstanceNodeRole="Comfyui-Cluster-ComfyuiClusterkarpenternoderole" # Modify the role to your own.
sed -i "s/role: KarpenterInstanceNodeRole.*/role: $KarpenterInstanceNodeRole/g" comfyui-on-eks/manifests/Karpenter/karpenter_v1beta1.yaml
kubectl apply -f comfyui-on-eks/manifests/Karpenter/karpenter_v1beta1.yaml

The KarpenterInstanceNodeRole acquired in the previous section needs an additional S3 access permission to allow GPU nodes to sync files from S3. Run the following command:

KarpenterInstanceNodeRole="Comfyui-Cluster-ComfyuiClusterkarpenternoderole" # Modify the role to your own.
aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess --role-name $KarpenterInstanceNodeRole

Deploy S3 PV and PVC to store generated images

Execute the following command to deploy the PV and PVC for S3 CSI:

region="us-west-2" # Modify the region to your current region.
account=$(aws sts get-caller-identity --query Account --output text)
sed -i "s/region .*/region $region/g" comfyui-on-eks/manifests/PersistentVolume/sd-outputs-s3.yaml
sed -i "s/bucketName: .*/bucketName: comfyui-outputs-$account-$region/g" comfyui-on-eks/manifests/PersistentVolume/sd-outputs-s3.yaml
kubectl apply -f comfyui-on-eks/manifests/PersistentVolume/sd-outputs-s3.yaml

Deploy EKS S3 CSI Driver

  1. Run the following command to add your AWS Identity and Access Management (IAM) principal to the EKS cluster:
    identity=$(aws sts get-caller-identity --query 'Arn' --output text --no-cli-pager)
    if [[ $identity == *"assumed-role"* ]]; then
        role_name=$(echo $identity | cut -d'/' -f2)
        account_id=$(echo $identity | cut -d':' -f5)
        identity="arn:aws:iam::$account_id:role/$role_name"
    fi
    aws eks update-cluster-config --name Comfyui-Cluster --access-config authenticationMode=API_AND_CONFIG_MAP
    aws eks create-access-entry --cluster-name Comfyui-Cluster --principal-arn $identity --type STANDARD --username comfyui-user
    aws eks associate-access-policy --cluster-name Comfyui-Cluster --principal-arn $identity --access-scope type=cluster --policy-arn arn:aws:eks::
  2. Execute the following command to create a role and service account for the S3 CSI driver, enabling it to read and write to S3:
    region="us-west-2" # Modify the region to your current region.
    account=$(aws sts get-caller-identity --query Account --output text)
    ROLE_NAME=EKS-S3-CSI-DriverRole-$account-$region
    POLICY_ARN=arn:aws:iam::aws:policy/AmazonS3FullAccess
    eksctl create iamserviceaccount \
        --name s3-csi-driver-sa \
        --namespace kube-system \
        --cluster Comfyui-Cluster \
        --attach-policy-arn $POLICY_ARN \
        --approve \
        --role-name $ROLE_NAME \
        --region $region
  3. Run the following command to install the aws-mountpoint-s3-csi-driver add-on:
    region="us-west-2" # Modify the region to your current region.
    account=$(aws sts get-caller-identity --query Account --output text)
    eksctl create addon --name aws-mountpoint-s3-csi-driver --version v1.0.0-eksbuild.1 --cluster Comfyui-Cluster --service-account-role-arn "arn:aws:iam::${account}:role/EKS-S3-CSI-DriverRole-${account}-${region}" --force

Deploy ComfyUI deployment and service

  1. Run the following command to replace the Docker image:
    region="us-west-2" # Modify the region to your current region.
    account=$(aws sts get-caller-identity --query Account --output text)
    sed -i "s/image: .*/image: ${account}.dkr.ecr.${region}.amazonaws.com\/comfyui-images:latest/g" comfyui-on-eks/manifests/ComfyUI/comfyui_deployment.yaml
  2. Run the following command to deploy ComfyUI Deployment and Service:
    kubectl apply -f comfyui-on-eks/manifests/ComfyUI

Test ComfyUI on EKS

API Test

To test with an API, run the following command in the comfyui-on-eks/test directory:

ingress_address=$(kubectl get ingress | grep comfyui-ingress | awk '{print $4}')
sed -i "s/SERVER_ADDRESS = .*/SERVER_ADDRESS = \"${ingress_address}\"/g" invoke_comfyui_api.py
sed -i "s/HTTPS = .*/HTTPS = False/g" invoke_comfyui_api.py
sed -i "s/SHOW_IMAGES = .*/SHOW_IMAGES = False/g" invoke_comfyui_api.py
./invoke_comfyui_api.py

Test with browser

  1. Run the following command to get the K8S ingress address:
    kubectl get ingress
  2. Access the ingress address through a web browser.

The deployment and testing of ComfyUI on EKS is now complete. Next we will connect the EKS cluster to CloudFront for edge acceleration.

Deploy CloudFront for edge acceleration (Optional)

Execute the following command in the comfyui-on-eks directory to connect the Kubernetes ingress to CloudFront:

cdk deploy CloudFrontEntry

After deployment completes, outputs will be printed, including the CloudFront URL CloudFrontEntry.cloudFrontEntryUrl. Refer to the previous section for testing via the API or browser.

Cleaning up

Run the following command to delete all Kubernetes resources:

kubectl delete -f comfyui-on-eks/manifests/ComfyUI/
kubectl delete -f comfyui-on-eks/manifests/PersistentVolume/
kubectl delete -f comfyui-on-eks/manifests/Karpenter/

Run the following command to delete all deployed resources:

cdk destroy ComfyuiEcrRepo
cdk destroy CloudFrontEntry
cdk destroy S3OutputsStorage
cdk destroy LambdaModelsSync
cdk destroy Comfyui-Cluster

Conclusion

This article introduces a solution for deploying ComfyUI on EKS. By combining instance store and S3, it maximizes model loading and switching performance while reducing storage costs. It also automatically syncs models in a serverless way, leverages spot instances to lower GPU instance costs, and accelerates globally via CloudFront to meet the needs of geographically distributed art studios. The entire solution manages underlying infrastructure as code to minimize operational overhead.
