Cordial’s journey implementing Bottlerocket and Karpenter in Amazon EKS


Overview

Cordial is a cross-channel marketing platform that offers tools to fully automate marketing strategies. By automating marketing execution, Cordial liberates technology teams to focus on their core strengths: building and creativity. It empowers technology teams to delegate data access and management to marketers, using Cordial’s robust platform to migrate, transform, and deliver complex data solutions. Designed with a customer-centric approach, Cordial offers a fast, flexible, and interconnected suite of artificial intelligence (AI)-powered tools and services that use data to bolster creativity, boost efficiency, cut costs, and drive profitable growth.

Since its launch in 2014, enterprise brands have chosen Cordial to automate high-conversion messages – across email, SMS, mobile app, social media, direct mail, and more – driving scalable revenue and lasting customer connections. Cordial was named as the fastest-growing company in the 2022 and 2023 Deloitte Technology Fast 500™ and has earned several awards for technology, growth, and culture.

To sustain this exponential growth and prepare for future expansion, Cordial set out to build its next-generation infrastructure platform on Amazon Elastic Kubernetes Service (Amazon EKS) as the core compute platform, replacing an earlier platform that ran directly on Amazon Elastic Compute Cloud (Amazon EC2). In this post, we dive into Cordial’s journey implementing Bottlerocket as the operating system (OS) and Karpenter as the node lifecycle manager in its Amazon EKS environments to achieve operational efficiency and improve security posture.

Opportunity for improved security posture and operational efficiency

Before Amazon EKS, our application ran on Amazon EC2, with multiple Auto Scaling groups handling the scaling process. As a cross-channel marketing platform, we had two key tenets for application behavior:

  • Client messages should start going out immediately.
    • Whenever a client triggers a message delivery, we need to send it immediately with no delay.
  • Client messages can be triggered at any time.
    • The real-time nature of the service means messages can be sent at any time, without a predictable pattern.

As clients expanded, we encountered several operational efficiency and security posture limitations, such as the following:

  • Slower application boot up times
  • Compute scaling issues
  • Inability to use cost-efficient Instance Purchasing options such as Spot Instances
  • Inefficient debugging workflow

Challenge-1: Slower boot times and compute scaling issues

Our first big issue was how long it took for nodes to boot and start the application. We used user data scripts on our instances to install and configure the application environment at boot. The tasks automated in these shell scripts included formatting Amazon Elastic Block Store (Amazon EBS) volumes, installing application-level dependencies, and bootstrapping the application environment with specific runtime parameters.

Because of this, it would take up to five minutes beyond the standard Amazon EC2 autoscaling provisioning process before the application could start accepting client requests. This slow startup process prevented us from adding capacity in real-time based on workload demands, and this delay was directly impacting our instant message delivery goals.

To make sure client requests were met instantly, we needed to maintain peak capacity at all times, even though some of that capacity sat idle. This led to an operationally inefficient design.

Challenge-2: Inability to use Spot Instances

Because our node startup process took 5-10 minutes, we also could not safely handle the two-minute Spot Instance interruption notice while maintaining quality of service. This made Spot capacity unusable for us, which led to cost inefficiency.
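The timing constraint can be made concrete with a small sketch. The function name is hypothetical, and it rests on a simplifying assumption that a replacement node is requested at the moment the interruption notice arrives. The startup figures are the ones quoted in this post.

```python
SPOT_WARNING_SECONDS = 120  # the two-minute Spot Instance interruption notice


def replacement_ready_in_time(startup_seconds: float,
                              warning_seconds: float = SPOT_WARNING_SECONDS) -> bool:
    """Return True if a node requested when the notice arrives can be
    serving traffic before the interrupted instance is reclaimed.

    Simplifying assumption: the replacement is requested immediately and
    startup_seconds covers everything up to accepting client requests.
    """
    return startup_seconds <= warning_seconds


# Startup times quoted in this post, converted to seconds.
scenarios = [
    ("EC2 with user data scripts, best case", 5 * 60),
    ("EC2 with user data scripts, worst case", 10 * 60),
    ("Node startup reported after adopting Karpenter", 30),
]

for label, startup in scenarios:
    print(f"{label}: ready in time = {replacement_ready_in_time(startup)}")
```

Under these assumptions, only a startup time inside the two-minute window leaves room to replace interrupted capacity without a gap in service.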

Challenge-3: Improved Security Story

Within our solution, we were using a general-purpose OS with Center for Internet Security (CIS) benchmark checks to meet compliance goals. In planning for future growth, we needed to consider a container-optimized OS with minimal packages, a reduced attack surface, and out-of-the-box hardening to improve our security story.

Challenge-4: Improved debugging workflow

Our debugging workflow involved logging in to worker nodes and installing the strace utility to monitor and debug a process on the underlying node, which is a security risk from an access standpoint. Making sure that debug packages are not installed by default at boot time therefore became a security requirement. It also increased operational overhead, because it required continuous monitoring.

Solution overview

To address the aforementioned challenges, we embarked on a multi-phased journey.

In Phase-1, we started containerizing messaging components so that dependencies are reduced and scalability issues can be addressed.

In Phase-2, we focused on Amazon EKS as the core compute platform for management, scaling, and deployment of containerized applications. We started with the standard setup, using managed node groups to provision worker nodes and Cluster Autoscaler to handle scaling. This phase brought some immediate wins in boot-up time: we could remove the user-data script, because most of the configuration could be set at either the kubelet level or the container level. This significantly reduced our application startup time from the initial 5-10 minute wait. With the faster boot-up process, we also started using Spot capacity for certain processes within the workload.

Managing this setup quickly became more complex and time consuming than we had hoped. As we started onboarding different clients, our clusters needed multiple instance types, sizes, and architectures to accommodate unique requirements and Spot pool diversity. Cluster Autoscaler posed some challenges, because it requires the instance types within a node group to have the same amount of RAM and the same number of CPU cores for accurate scaling calculations. This prevented us from creating node groups with mixed capacity, such as compute optimized and general purpose instance types. It also prevented us from creating node groups with varied instance sizes and architectures (x86 and ARM) within the same Auto Scaling group. Although Cluster Autoscaler does provide priority-based scaling, which let us prefer Spot capacity and define On-Demand as the fallback option, the controller took too much time.

In Phase-3, we addressed scalability challenges by implementing Karpenter, an open-source node lifecycle management project built for Kubernetes. Karpenter’s NodePool custom resource met our requirements around multiple instance types, sizes, and architectures. Karpenter provisions nodes through the EC2 Fleet API using the “instant” request type, which helped us achieve a node startup time of 30 seconds or less. Karpenter’s built-in disruption control flow, in conjunction with disruption budgets, also helped us consolidate unused capacity, leading to a cost-efficient design.
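As a rough illustration of what a NodePool covering mixed architectures, capacity types, and instance categories can look like, the sketch below builds one as a Python dictionary and prints it as JSON, which kubectl accepts as a manifest format. The field names follow the karpenter.sh/v1 NodePool schema as we understand it and should be checked against the Karpenter version in use. The names "messaging" and "bottlerocket" and all values are hypothetical, not Cordial’s actual configuration.

```python
import json


def build_node_pool(name: str, node_class_name: str) -> dict:
    """Build an illustrative Karpenter NodePool manifest.

    All values are assumptions for illustration only.
    """
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    # Reference to an EC2NodeClass assumed to exist already.
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": node_class_name,
                    },
                    "requirements": [
                        # Allow both x86 and ARM nodes in one pool.
                        {"key": "kubernetes.io/arch",
                         "operator": "In",
                         "values": ["amd64", "arm64"]},
                        # Allow both Spot and On-Demand capacity.
                        {"key": "karpenter.sh/capacity-type",
                         "operator": "In",
                         "values": ["spot", "on-demand"]},
                        # Mix compute optimized (c) and general purpose (m).
                        {"key": "karpenter.k8s.aws/instance-category",
                         "operator": "In",
                         "values": ["c", "m"]},
                    ],
                }
            },
            "disruption": {
                # Consolidate empty or underutilized nodes after a short wait.
                "consolidationPolicy": "WhenEmptyOrUnderutilized",
                "consolidateAfter": "1m",
                # Disruption budget: disrupt at most 10% of nodes at once.
                "budgets": [{"nodes": "10%"}],
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_node_pool("messaging", "bottlerocket"), indent=2))
```

A single pool like this expresses constraints that, with Cluster Autoscaler, would have needed a separate node group for each combination of size, family, and architecture.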

In Phase-4, to improve the security posture, we started implementing Bottlerocket, a Linux-based open source OS purpose-built by AWS for running containers.

  • Many of the packages, tools, interpreters, and dependencies installed by default in general-purpose Linux distributions are simply not needed on a host that only runs containers. Bottlerocket reduced our security overhead by excluding this extraneous software.
  • Bottlerocket follows an API-centric design, and its Report API gave us a mechanism to automate OS-level reporting. With the Report API, we were able to run reports against both the Bottlerocket and Kubernetes CIS benchmarks.
  • Until Phase-3, we had to use sed (the stream editor) to modify the kubelet’s JSON configuration file in place, which made for an error-prone workflow. Bottlerocket instead gave us a declarative TOML format for configuring Kubernetes settings, which is simpler and less error prone.
  • Bottlerocket runs SELinux in enforcing mode, assigns the restrictive container_t label to all unprivileged containers, and does not provide secure shell access. Alongside our existing mechanisms, this helped us build the monitoring framework we needed for secure developer debugging.
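To illustrate the declarative TOML approach mentioned in the list above, the following sketch renders a small Bottlerocket settings table from a Python dictionary instead of patching a JSON file with sed. The render_toml helper is hypothetical and handles only flat string, integer, and boolean values. The two settings shown (max-pods and cpu-manager-policy under settings.kubernetes) are Bottlerocket settings as we understand them, and the values are illustrative, not Cordial’s.

```python
def render_toml(table_name: str, values: dict) -> str:
    """Render a flat dict as a single TOML table.

    Hypothetical helper for illustration: supports only str, int, and bool
    values, which is enough for simple Bottlerocket settings.
    """
    lines = [f"[{table_name}]"]
    for key, value in values.items():
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            rendered = "true" if value else "false"
        elif isinstance(value, int):
            rendered = str(value)
        else:
            escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
            rendered = f'"{escaped}"'
        lines.append(f"{key} = {rendered}")
    return "\n".join(lines) + "\n"


# Illustrative kubelet-related settings; values are assumptions.
user_data = render_toml(
    "settings.kubernetes",
    {"max-pods": 110, "cpu-manager-policy": "static"},
)

print(user_data)
# [settings.kubernetes]
# max-pods = 110
# cpu-manager-policy = "static"
```

Because the settings are declared as data, a change shows up as a reviewable diff rather than as an in-place text substitution whose success depends on the file’s exact layout.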

Figure 1: Cordial’s Amazon EKS architecture

Next steps and achieved outcomes

After implementing the preceding design, we also started improving our pod scaling process by configuring the Horizontal Pod Autoscaler to use custom metrics, as shown in the following graphs of pod count and tasks in queue per pod.

Graph: Pod Count

Graph: Tasks in Queue per Pod
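For readers unfamiliar with how the Horizontal Pod Autoscaler turns a per-pod metric into a replica count, the sketch below applies the scaling rule documented for Kubernetes: desired replicas = ceil(current replicas × current metric value / target metric value). The metric is modeled on the "tasks in queue per pod" graph title. The numbers are hypothetical, and the sketch ignores the autoscaler’s tolerance band, minimum and maximum replica bounds, and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int,
                     current_per_pod: float,
                     target_per_pod: float) -> int:
    """Apply the core HPA rule: ceil(current * currentMetric / targetMetric).

    Simplified: no tolerance, min/max bounds, or stabilization window.
    """
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    return math.ceil(current_replicas * current_per_pod / target_per_pod)


# Hypothetical scale-up: 10 pods each see 45 queued tasks against a target of 30.
print(desired_replicas(10, 45, 30))  # 15

# Hypothetical scale-down: the queue drains to 12 tasks per pod.
print(desired_replicas(10, 12, 30))  # 4
```

Scaling on queue depth per pod ties the pod count directly to pending work, which suits a workload where messages can be triggered at any time.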

By using Amazon EKS, Karpenter, and Bottlerocket, we can size our clusters based on the work they need to do in the moment and adjust up and down as needed. With Karpenter’s fast node provisioning, diverse compute options, and node consolidation, we no longer run at peak capacity at all times. On the compute part of the solution alone, this gave us nearly 18% cost savings and an 80% startup time improvement. Bottlerocket’s purpose-built nature and API-centric design raised our security posture, and together with Karpenter we have a solution that’s well-architected for future scaling.

Conclusion

In this post, we discussed how the Cordial team improved user experience by implementing Amazon EKS, Karpenter, and Bottlerocket. The collaboration between AWS and Cordial was essential for the successful modernization of Cordial’s application environment. We also discussed how Cordial is focused on further optimizing the environment as appropriate to enhance business value.
