Introduction
Galaxy is a scientific workflow, data integration, and digital preservation platform that aims to make computational biology accessible to research scientists who do not have computer programming or systems administration experience. Although it was initially developed for genomics research, it is largely domain agnostic and is now used as a general bioinformatics workflow management system, running on everything from academic mainframes to personal computers.
BioNTech has been using Galaxy to run experiments across research teams, and over time the platform was adopted by a large number of scientists. With this growth, the teams faced frequent computing bottlenecks. In addition, they lacked a streamlined workflow for managing the application lifecycle end-to-end.
In this post, we demonstrate how AWS Professional Services helped BioNTech migrate and customize Galaxy on AWS to enable multiple bioinformatics research groups across the organization to accelerate their research outcomes at scale and with predictable cost. The solution uses the Guidance for Galaxy Deployment on AWS, which extends the Amazon EKS Blueprints for CDK accelerator to deploy well-architected Amazon Elastic Kubernetes Service (Amazon EKS) clusters in a repeatable fashion. The AWS Cloud Development Kit (AWS CDK) is an open-source software development framework for defining cloud infrastructure in code and provisioning it through AWS CloudFormation. It accelerates cloud development by using common programming languages to model your applications.
Solution overview
The following sections outline the main solution aspects, architectural decisions, and technological choices undertaken to modernize the Galaxy platform and make it available for secure and scalable usage across BioNTech.
Key drivers to modernize Galaxy
Galaxy is a powerful platform that allows computational biologists to build graphical user interfaces for command line applications. These range from simple text-manipulation tools to complex tools for evaluating data such as sequence data and imaging data, along with a variety of visualizations. At BioNTech, several proprietary and open-source Galaxy tools are available that enable scientists to perform their daily operations, such as sequence homology searches, codon optimization, and oligo design.
The on-premises deployment of Galaxy used its built-in support for running on high-performance computing (HPC) clusters with workload managers such as Slurm for job scheduling. Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.
However, as the number of users grew, the solution introduced several challenges:
- With the increased demand for compute and storage resources, the monolithic server was not able to meet the demand. This led to slower processing times and longer wait times for users.
- The on-premises server did not have GPUs to meet the increasing demand for accelerated compute. This limited the types of analyses that could be performed on the server.
- Scientific tools often have specific requirements on dependencies and library versions, making dependency management burdensome without the ability to containerize tools. This made it difficult to keep the required dependencies installed, up-to-date, and isolated between tools.
- The overhead of managing the infrastructure and a plethora of manual steps in the process led to frequent downtimes and reduced productivity for users.
Container platform evaluation
The previously-mentioned challenges of the on-premises deployment led BioNTech to consider using a flexible, scalable, and declarative container orchestration tool to cater for a variety of workloads running on Galaxy. Another advantage is the ease with which common administrative tasks can be performed reliably and without disruption of service.
As such, BioNTech evaluated the deployment on Kubernetes using the Galaxy Helm charts maintained by the Galaxy community. After evaluating various options, BioNTech opted to go with the Galaxy on AWS Guidance, which is based on the official Helm charts and complements them with fully managed cloud services.
Container platform architecture
The following architecture diagram depicts the core and peripheral infrastructure components used as part of the productive deployment. The clusters were initially provisioned using EKS Kubernetes version 1.24 and were updated to version 1.25.
The original infrastructure as code (CDK) of the Galaxy on AWS Guidance has been extended with the following resources as part of the AWS Professional Services engagement with BioNTech.
- Amazon Simple Storage Service (S3): Storage of reference data (such as FASTA files) and binaries used by the custom developed Galaxy tools
- Amazon Elastic Container Registry (ECR): Repository of container images for running custom tools
- AWS Cloud9: Development and testing environment for custom tools
- Amazon QuickSight: BI dashboards for visualizing Galaxy usage related metrics
- AWS Certificate Manager: Provisioning and management of SSL/TLS server certificates attached on the Application Load Balancer (ALB)
Solution
Isolated networking
BioNTech has strict security policies, such as requirements for private connectivity. To meet them, the Amazon Virtual Private Clouds (VPCs) in both the development and production AWS accounts used for deploying Galaxy were configured without public internet connectivity. This required the following configurations for the solution to be deployable and usable across the organization from within the on-premises network.
- Private EKS cluster: We provisioned fully-private EKS clusters by enabling the eks.default.private-cluster parameter in AWS CDK.
- VPC endpoints: We created VPC interface endpoints in the clusters’ VPC private subnets for the AWS services to which Galaxy pods need access.
- Kubernetes Helm charts: We configured a virtual JFrog Artifactory repository to resolve Helm charts from various remote Helm chart repositories for Galaxy and its add-on extensions. Accessing binaries, artifacts, dependencies, and libraries in general through Artifactory, rather than directly through the source repository URLs, was a security requirement of BioNTech.
- Ingress annotations: We added annotations to Kubernetes ingress for customizing the ALB behavior with respect to scheme, subnet, and TLS certificate discovery.
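As an illustration of the last point, the ALB-related ingress annotations for the AWS Load Balancer Controller might look like the following sketch. The subnet IDs and certificate ARN shown here are hypothetical placeholders, not values from the actual deployment:

```yaml
# Illustrative ingress for the AWS Load Balancer Controller;
# subnet IDs and the certificate ARN are hypothetical placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: galaxy
  annotations:
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/subnets: subnet-0abc1234, subnet-0def5678
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:eu-central-1:111122223333:certificate/example
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
```

The `internal` scheme keeps the load balancer reachable only from the private network, matching the isolated networking requirement.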
User authentication
BioNTech’s IT security team required using the existing identity provider (IdP) for authenticating corporate users to the application. We implemented Galaxy’s natively supported single sign-on experience, which uses the OpenID Connect (OIDC) protocol to enable login to Galaxy without explicitly creating a Galaxy user.
Within Azure Active Directory, we registered an application by supplying the redirect URLs with the destination BioNTech corporate domains. The OIDC-related configuration files are dynamically generated during the CDK deployment process based on the IdP information resolved from AWS Secrets Manager, so that sensitive information is never handled in plain text. Enabling this option turns on OpenID support and displays the OpenID form on the login screen.
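As a hedged sketch, the generated OIDC backend configuration could resemble the following. The provider name, callback URL, and element layout are illustrative and vary by Galaxy version; the client ID and secret are resolved from AWS Secrets Manager at deploy time rather than stored in the file:

```xml
<?xml version="1.0"?>
<!-- Illustrative only: provider name and redirect URI are hypothetical. -->
<OIDC>
    <provider name="azure">
        <client_id>resolved-from-secrets-manager</client_id>
        <client_secret>resolved-from-secrets-manager</client_secret>
        <redirect_uri>https://galaxy.example.internal/authnz/azure/callback</redirect_uri>
    </provider>
</OIDC>
```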
Custom tool development and deployment
BioNTech computational biologists rely on AWS Cloud9 instances to streamline the tool development process. These instances enable building and testing Docker images directly in the AWS Cloud9 terminal and pushing them to Amazon ECR. To help maintain tool quality, computational scientists can review the tools through Cloud9’s Git integration before deploying them to Galaxy.
Thanks to the connectivity to Amazon EKS using kubectl, computational biologists can deploy changes to Galaxy without the need to switch their environment, thereby minimizing the amount of time and effort required to deploy new changes to Galaxy.
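As an illustration, pushing a custom tool image from the AWS Cloud9 terminal to Amazon ECR might look like the following. The account ID, Region, and image name are hypothetical placeholders:

```
# Authenticate Docker to Amazon ECR (placeholder account ID and Region).
aws ecr get-login-password --region eu-central-1 \
  | docker login --username AWS --password-stdin 111122223333.dkr.ecr.eu-central-1.amazonaws.com

# Build, tag, and push the custom Galaxy tool image (hypothetical tool name).
docker build -t galaxy-codon-optimizer:1.0 .
docker tag galaxy-codon-optimizer:1.0 \
  111122223333.dkr.ecr.eu-central-1.amazonaws.com/galaxy-codon-optimizer:1.0
docker push 111122223333.dkr.ecr.eu-central-1.amazonaws.com/galaxy-codon-optimizer:1.0
```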
Job scheduling and autoscaling
To automatically adjust the size of the EKS clusters, we provisioned the Cluster Autoscaler add-on. In total, three managed node groups were deployed, each with different Amazon Elastic Compute Cloud (Amazon EC2) instance types aligned with the tool-specific compute requirements. To enable the Cluster Autoscaler to trigger scale-up events for scheduling pods when users submit jobs, we created a PriorityClass using a Kubernetes manifest, setting a higher value than the default cutoff (-10) that comes with the Galaxy deployment. We then passed this object to the job runner configuration.
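The PriorityClass manifest could look like the following sketch. The name and value are illustrative; any value above the default cutoff of -10 achieves the scale-up behavior:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: galaxy-job-priority   # illustrative name
value: 1000                   # higher than the Galaxy default cutoff of -10
globalDefault: false
description: "Priority for Galaxy job pods so that pending jobs trigger Cluster Autoscaler scale-up"
```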
Custom rules were configured using the TotalPerspectiveVortex (TPV) library to route tools to job destinations and to implement fine-grained control over jobs. Specifically, each tool has its own resource requirements, scheduling behavior, Docker container overrides, and node selector defined.
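An illustrative TPV configuration might route a GPU-dependent tool as follows. The tool IDs, resource values, and destination names are hypothetical, not BioNTech's actual rules:

```yaml
# Illustrative TotalPerspectiveVortex config; IDs, tags, and values are hypothetical.
tools:
  default:
    cores: 2
    mem: cores * 4
  esmfold:
    cores: 8
    mem: 32
    gpus: 1
    scheduling:
      require:
        - gpu            # only destinations tagged "gpu" are eligible

destinations:
  k8s-gpu:
    scheduling:
      accept:
        - gpu            # this destination accepts GPU-tagged tools
```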
To enable GPU-based workloads, such as ESMFold protein structure prediction, we deployed the NVIDIA GPU Operator to automate the management of the NVIDIA software components needed to provision GPUs.
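Deploying the NVIDIA GPU Operator with Helm typically looks like the following. The namespace and release name are conventional choices rather than values mandated by the solution:

```
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace
```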
Observability
Fluent Bit is a lightweight and extensible log processor for Kubernetes. It reads and processes logs from containers and enriches them with metadata. The Guidance is preconfigured to use Fluent Bit and send the Kubernetes container logs to Amazon CloudWatch and group them to improve observability.
In addition, by integrating Galaxy’s Amazon Aurora PostgreSQL database with Amazon QuickSight, system administrators can track and analyze various metrics. This includes monitoring storage consumption and per-tool hardware usage, and identifying execution failures of individual tools. With this integration, administrators can quickly identify and proactively address potential issues with Galaxy tools.
Future improvements
BioNTech is evaluating the following improvements to make it easier, more flexible, and more cost-effective to run Galaxy on EKS clusters:
- Improved compute management by using the advanced autoscaler Karpenter in Amazon EKS instead of the default Kubernetes Cluster Autoscaler to run Galaxy tools: Karpenter is installed through the Karpenter Add-on for EKS and can create EC2 Spot Instances as Kubernetes nodes (with or without GPU support, depending on the requirements of each Galaxy tool). These nodes exist only for the runtime of one or more Galaxy tools and offer significantly lower pricing than On-Demand Instances.
- Powerful DevOps observability: usage of the AWS CDK Observability Accelerator for Amazon EKS to install and configure Amazon Managed Grafana and Amazon Managed Service for Prometheus, as described in this post, for improved monitoring of the Galaxy components on Amazon EKS, for example, to identify trends and potential areas of improvement on Grafana dashboards.
- A more efficient and reliable method of copying files from Amazon S3 to Amazon Elastic File System (Amazon EFS) using AWS DataSync.
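As a sketch of the Karpenter evaluation, a NodePool restricted to Spot capacity could look like the following. The API version shown is v1beta1; the exact schema depends on the installed Karpenter release, and the name, limits, and node class reference are illustrative:

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: galaxy-spot            # illustrative name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]     # provision Spot Instances only
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        name: default          # references an EC2NodeClass defined elsewhere
  limits:
    cpu: "256"                 # illustrative cap on total provisioned vCPUs
```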
Conclusion
In this post, we described how BioNTech modernized its Galaxy server deployment using several AWS services such as Amazon EKS, Amazon Aurora, and Amazon EFS, a testament to the successful collaboration between BioNTech and AWS. With the support of BioNTech leadership, we transformed an on-premises life science research platform into a modern, scalable, and secure platform on AWS and achieved tighter integration with the rest of the BioNTech cloud infrastructure.