Introduction
Getir is the pioneer of ultrafast grocery delivery. Founded in 2015, the company revolutionized last-mile delivery with its groceries-in-minutes proposition. Today, Getir is a conglomerate incorporating nine verticals under the same brand.
Challenge
Getir uses Amazon Elastic Kubernetes Service (Amazon EKS) to host applications on AWS. One of Getir’s foremost challenges was the need to respond to dynamic scenarios faster and to scale up as quickly as possible in Kubernetes. Getir’s workload was spiky, as demand increased instantly due to campaigns or events in different geographic locations. These rapid changes in demand needed an immediate response both before and after each event. To meet these agility requirements, Getir started to explore scaling options in Amazon EKS that aligned more closely with their fluctuating workload patterns.
Getir also faced significant complexity in managing a multi-tenant Kubernetes architecture, where different teams and applications had varying requirements. Initially, using a single node group for multiple teams made it hard to cater to diverse Amazon Elastic Compute Cloud (Amazon EC2) instance type requirements. As the platform team handled requests from various developer teams, the operational overhead of managing individual node groups for each team, such as upgrades, key distribution, and cost-efficient workload scaling, became a substantial concern. The default scaling mechanism, which treated all node groups as a single entity, introduced complexities that Getir found difficult to navigate and maintain in the long run, because it limited their ability to meet diverse application requirements in a scalable manner. Consequently, Getir was compelled to reevaluate their approach to avoid unnecessary operational burdens and streamline the configuration process.
When Getir was exploring cost-efficient options for their Amazon EKS workload, they started to consider using Amazon EC2 Spot Instances in their clusters. Up to this point, they had used a single, homogeneous instance type for all Amazon EKS worker nodes. With Amazon EC2 Spot Instances, diversification is important, because it gives Spot a better chance to find and allocate the compute capacity Getir needed for Amazon EKS worker nodes. This diversification has to happen while maintaining a minimum amount of RAM and CPU in the cluster and while integrating the new nodes with the cluster auto scaling processes. The Getir team was looking for ways to reduce the operational burden of configuring and maintaining multiple compute options for the clusters.
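The diversification idea can be illustrated with a minimal Python sketch. It is not Getir's or Karpenter's logic: the catalog, the function names, and the 16 vCPU / 64 GiB floor are assumptions chosen for illustration. For each candidate instance type, it computes how many nodes would be needed to keep the cluster above a minimum CPU and memory floor, which shows how several different types can serve as interchangeable Spot options.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpu: int
    memory_gib: int


# Illustrative catalog only; not Getir's actual instance selection.
CATALOG = [
    InstanceType("m5.large", 2, 8),
    InstanceType("m5.xlarge", 4, 16),
    InstanceType("c5.xlarge", 4, 8),
    InstanceType("r5.large", 2, 16),
]


def nodes_needed(itype: InstanceType, min_vcpu: int, min_memory_gib: int) -> int:
    """Nodes of one type required to satisfy both the CPU and memory floor."""
    by_cpu = math.ceil(min_vcpu / itype.vcpu)
    by_memory = math.ceil(min_memory_gib / itype.memory_gib)
    return max(by_cpu, by_memory)


def diversified_options(catalog, min_vcpu: int, min_memory_gib: int) -> dict:
    """Map each candidate type to the node count that meets the cluster floor."""
    return {t.name: nodes_needed(t, min_vcpu, min_memory_gib) for t in catalog}


# Hypothetical cluster floor: 16 vCPU and 64 GiB of memory.
options = diversified_options(CATALOG, min_vcpu=16, min_memory_gib=64)
for name, count in options.items():
    print(f"{name}: {count} nodes")
```

The more types that can satisfy the same floor, the more Spot capacity pools are available to draw from, which is the point of diversification.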
Evaluation of Karpenter
As Getir was experiencing rapid growth, expanding into new regions, and launching new business verticals, there was an urgent need for a solution to their scaling challenge that did not require architecting from scratch. Karpenter was seen as a promising solution that aligned well with the evolving demands of Getir’s workloads. Its appeal lay in its capacity to help Getir optimize resources by scaling cluster nodes based on workload CPU and memory requirements, rather than relying on static node configurations, which may lead to overprovisioning. Getir also appreciated its ability to optimize costs by facilitating Spot usage with minimal operational overhead. The team found Karpenter’s integration with Spot to be efficient in two ways: its integrated Spot interruption handling, and its simplification of Amazon EC2 instance diversification. These two features made Getir confident in moving forward with using Spot Instances with Karpenter.
Adoption strategy
In the early adoption phase, the team conducted thorough testing in their development environment, specifically with Spot Instances. Within two weeks, they observed a remarkable 60% compute cost reduction compared to previous periods, validating the effectiveness of their decision to integrate Karpenter and Spot into their cluster.
Getir established several performance targets to gauge the effectiveness of Karpenter in their cluster. First, they aimed to maximize Spot usage in development and staging environments by focusing specifically on stateless and fault-tolerant workloads, which constituted 95% of the total. The second target centered on operational simplification. With Karpenter, Getir experienced a straightforward setup process, as Karpenter did not require min-max-desired values, node sizing setup, or homogeneous instance types. This improved the platform team’s efficiency and decreased operational overhead by making both the upgrade process and the use of Spot for their workloads easier. The third target emphasized speed. Getir’s benchmarking showed that Karpenter provisioned instances 30% faster than before, particularly when faced with 3x higher internal pod and node numbers due to unpredictable demand in newly launched cities and new business verticals.
Getir’s adoption of Karpenter in development and staging environments unfolded seamlessly, with teams running Cluster Autoscaler and Karpenter side by side during the transition phase. In production, they started by letting Karpenter handle 10% of the workload and increased this share gradually, while implementing a taint strategy based on double labeling to avoid race conditions between the two autoscalers. Moreover, to ensure a smooth integration into production and mitigate unforeseen issues, Getir gradually decreased the min-max values in the Auto Scaling groups so that Karpenter could automatically scale the rest of the required capacity. When the team set the min-max values to 0, Karpenter managed all of the capacity. This methodical progression allowed Getir to validate the reliability of Karpenter in a production setting, and by the fifth day, they had successfully transitioned to provisioning 100% of instances through Karpenter.
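The gradual hand-over can be sketched as a small Python helper that shrinks an Auto Scaling group's bounds as Karpenter's share grows. This is a hypothetical illustration, not Getir's tooling: the function name `asg_bounds`, the original bounds of 10 and 50, and the intermediate percentages are assumptions. Only the 10% starting point and the final 100% come from the text above.

```python
def asg_bounds(original_min: int, original_max: int, karpenter_percent: int):
    """Scale Auto Scaling group bounds down as Karpenter takes over capacity.

    Integer arithmetic avoids floating-point rounding surprises.
    The minimum is rounded down and the maximum is rounded up.
    """
    if not 0 <= karpenter_percent <= 100:
        raise ValueError("karpenter_percent must be between 0 and 100")
    remaining = 100 - karpenter_percent
    new_min = (original_min * remaining) // 100
    new_max = (original_max * remaining + 99) // 100
    return new_min, new_max


# Hypothetical ramp; only the first and last steps are stated in the post.
ramp = [10, 25, 50, 75, 100]
schedule = [asg_bounds(10, 50, pct) for pct in ramp]
for pct, (new_min, new_max) in zip(ramp, schedule):
    print(f"Karpenter {pct}% -> ASG min={new_min}, max={new_max}")
```

At 100%, both bounds reach 0, which corresponds to the point where Karpenter manages all of the capacity.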
Learnings from Karpenter adoption
The following two points were learned from adopting Karpenter.
Karpenter introduced a different perspective to Amazon EKS worker node patch management:
Getir wanted Karpenter to use Amazon EKS optimized Amazon Linux AMIs for nodes. This allowed Getir to use an Amazon Machine Image (AMI) that was pre-built, thus further reducing the operational burden for the team. However, Getir wanted to control the AMI selection rather than selecting the latest AMI. Therefore, Getir adopted Amazon EKS AMI Tester as part of their pipelines to make sure that new AMIs are validated before Karpenter uses them for node recycling and provisioning.
Karpenter consolidation provided more than the expected compute cost savings:
As consolidation mode is enabled by default, Karpenter continually assesses whether there is an opportunity to improve the efficiency of the cluster. If Karpenter identifies a better compute configuration, then nodes are replaced and workloads are consolidated. Getir found that consolidation replaced nodes with lower-priced Amazon EC2 instances and used the remaining nodes more efficiently, producing a node compute profile that differed from the original one.
Outcome
With the integration of Karpenter, Getir saw average instance efficiency rise from 30% to 85%. Prior to the integration of Karpenter, managing diverse and sizable workloads was a challenge for Getir, especially when it came to adapting to the system’s dynamically changing requirements with a single instance type. This limitation led them to maintain the clusters at full capacity during off-peak hours, instead of scaling them back. Consequently, they often had excess capacity and inefficient resource usage. Karpenter’s flexibility removed this limitation by selecting appropriate instances based on specific workload characteristics. As a result, Getir’s system allocated capacity where it was needed, eliminating the inefficiencies associated with excess capacity and optimizing resource usage.
Thanks to Karpenter, the teams at Getir now have the flexibility of using the right instance types according to their specific requirements. Previously, the system’s scalability depended on a predefined instance pool, which restricted the adaptability needed for diverse workloads.
Karpenter facilitates minor patching for instances through node expiration values: when a node expires, Karpenter automatically terminates it and deploys a replacement with an updated image. This automates security patching, reducing operational burden and enhancing system security through timely updates. It also streamlines processes, grants teams autonomy in defining container sizes, and improves overall efficiency.
The integration of Karpenter at Getir brought about substantial cost optimizations in their Amazon EC2 compute infrastructure. The unit compute cost for Amazon EKS worker nodes decreased by 48%, driven by higher instance usage. This reduction was complemented by a 30% decrease in total instance working hours, showing the overall efficiency gains achieved through Karpenter’s implementation.
The cumulative result of these measures translated into a remarkable 60% reduction in Amazon EC2 compute costs, underscoring the financial benefits and efficiency gains realized by Getir through their strategic adoption of Karpenter.
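As a rough sanity check, the following Python sketch shows how two separate reductions could compound. It rests on an assumption the post does not state: that total compute cost is simply unit cost multiplied by instance hours, so the two reductions multiply. The function name `combined_reduction` is hypothetical, and the post does not describe how its own figure was derived.

```python
def combined_reduction(unit_cost_reduction: float, hours_reduction: float) -> float:
    """Overall reduction when total cost = unit cost x instance hours (assumed)."""
    remaining_fraction = (1 - unit_cost_reduction) * (1 - hours_reduction)
    return 1 - remaining_fraction


# Figures from the post: 48% lower unit cost, 30% fewer instance hours.
combined = combined_reduction(0.48, 0.30)
print(f"Combined reduction under this assumption: {combined:.1%}")
```

Under this simplified model the combined reduction comes out a little above 60%, which is in the same range as the reported overall figure.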
Conclusion
In this post, we shared how Getir addressed the challenges of spiky workloads, which demand fast responses to dynamic scaling requirements. Karpenter’s impact on the infrastructure shows how effective it is in optimizing Amazon EC2 compute costs. This optimization was realized through more appropriate instance type selection, higher usage rates, and the enablement of Spot Instances. To learn more about Karpenter, see its official documentation page and watch this session from re:Invent 2023.