How Thomson Reuters achieved 5X operational efficiency and 30% cost optimization with Plexus on Amazon EKS

Introduction

In today’s dynamic business landscape, operational efficiency and cost optimization are two critical ingredients for a successful business outcome. This is especially true for companies navigating digital transformation. In 2020, Thomson Reuters (TR), a leader at the intersection of content and technology with trusted data, committed to a cloud-first strategy with Amazon Web Services (AWS). Once this migration was complete, the Thomson Reuters team continued to work with AWS to further modernize the migrated applications, improve operability, and reduce overall cloud costs. At this point, TR needed a platform that could not only run modernized applications at scale but also provide the fine-tuning capabilities required to control cost and operational efficiency through customization. In this post, we explain how TR developed a homegrown platform, Plexus Service Mesh (Plexus), to run modern applications on Amazon Elastic Kubernetes Service (Amazon EKS), and how it exceeded its cost optimization goal with a 30% cost reduction and a 5X gain in operational efficiency.

Overview of Plexus

Plexus emerged as the linchpin of TR’s digital transformation. It is a homegrown platform tailored to TR’s specific needs and governance requirements, which TR must meet to operate as a leading global content provider. Plexus is a TR-managed Shared Capability platform, built with cloud-native container and service mesh technologies on top of Amazon EKS. The Plexus platform aims to ease the burden of cloud infrastructure provisioning and application deployment, thus increasing the velocity of application development at TR.

Key features and capabilities

Plexus goes beyond conventional container orchestration solutions. Its features encompass compute and storage management, service discovery and traffic management, observability and monitoring, as well as security and governance, giving TR flexibility and control through customization.

Architecture

Before implementation, TR conducted a thorough analysis of their needs. The decision to adopt Plexus on Amazon EKS was the result of careful consideration, aligning the platform with business objectives.

Figure 1: Plexus node management

Plexus leverages Amazon EKS to simplify Kubernetes management and extends Amazon EKS with EKS add-ons, open source add-ons, and in-house developed add-ons to fulfill cluster management requirements in an automated fashion. This frees the application teams onboarded to Plexus from these responsibilities so that they can focus on the feature development that matters to their users. In 2023, Plexus completed the transition from cluster-autoscaler to Karpenter, which provides flexible management and optimization of Amazon Elastic Compute Cloud (Amazon EC2) nodes. The preceding diagram gives a high-level view of Plexus node management through Karpenter. It also shows how, with the help of Karpenter, Plexus can scale any instance type within minutes and make sure the workloads get the right compute resources (CPU, memory, storage, GPU, etc.).

Operational efficiency: The 5X leap

In 2019, Plexus started by forming alliances with two application teams who recognized the advantages of having a centralized group to oversee the Kubernetes infrastructure, rather than each team creating this capability independently. After these initial teams experienced the cost savings and performance improvements offered by Plexus, their success prompted additional teams to join. Currently, Plexus provides support to over 30 teams within TR. These teams now endorse the benefits of leveraging Plexus, showing how they were able to overcome the initial obstacles with a dedicated team and a strategic approach. The implementation of Plexus resulted in a paradigm shift in operational efficiency at TR.

Before and after scenarios

Before the introduction of Plexus, application teams in TR faced longer timelines for their product launches, often starting with 2-3 months of cloud infrastructure setup. Now, with Plexus in place, applications can be deployed in a production-like environment within a matter of days, a significant improvement in time to market. Furthermore, these teams have noticed significant improvements in the performance and reliability of their applications.

Specific operational improvements

Plexus continuously raised the bar on how TR’s IT operations work. Prior to Plexus, application teams in TR who wanted to start their Kubernetes journey had to build everything from scratch. Teams struggled to decide on the right Kubernetes add-ons for their applications and had trouble working out service ingress and egress management, storage management, and monitoring. Teams also struggled to build solutions that adhered to TR networking and security standards. Post Plexus, application teams can focus on user-facing application development, delivering required features or enhancing functionality, and leave the platform-specific undifferentiated heavy lifting to Plexus. Plexus brought a platform-ops approach and saved teams months of research effort. It also brought down costs for teams by providing a shared platform capability for Kubernetes workloads. Moreover, the Plexus team became a knowledge hub in TR for Kubernetes and service mesh needs, bringing cost optimization and performance improvements to the teams that have onboarded to the platform.

Cost optimization: A 30% reduction

Beyond operational efficiency, Plexus delivered substantial cost savings. For TR’s workloads running on Amazon EKS, the goal was to maximize availability and performance while using resources efficiently. The earlier version of Plexus was built on the Kubernetes cluster-autoscaler (CA). The CA enables some dynamic resource provisioning and cost optimization but has limitations. Node groups must use EC2 instances with similar compute and memory to avoid issues. The CA simulates scheduling based only on the first instance type defined, which risks over-provisioning resources, and it does not consolidate resources after pods terminate. In addition, multiple homogeneous node groups are needed for various workloads. Although the CA removes underused nodes, it does not right-size instances in response to workload changes. Prioritizing Spot over On-Demand capacity for stateless workloads was also challenging. In summary, the CA provided basic auto-scaling functionality, but its restrictive design limits its applicability for production use cases.

Plexus needed a more efficient way to provision diverse workloads without the overhead of multiple node groups. The open-source tool Karpenter addressed this with flexible, high-performance groupless autoscaling. Karpenter dynamically provisions worker capacity based on pod requirements, without restrictive instance type similarity constraints. It evaluates the aggregate resource needs and constraints of pending pods, and then launches the best-fitting Amazon EC2 instances. This allows TR teams to define configurations tailored to their apps and scaling needs. Additionally, Karpenter directly leverages the Amazon EC2 Fleet API for faster provisioning without node groups or autoscaling groups, improving performance service level agreements (SLAs) with millisecond retry times as opposed to minutes. Running the Karpenter controller on AWS Fargate also eliminates managed node groups entirely. In summary, Karpenter delivered the groupless autoscaling flexibility, performance, and efficiency TR sought for diverse workloads, without node group restrictions.
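To make the idea of sizing capacity from the aggregate needs of pending pods more concrete, here is a minimal Python sketch. It is an illustration only and not Karpenter's actual algorithm: it sums the CPU and memory requests of the pending pods and picks the cheapest single instance type that can hold them all. The instance names, sizes, and hourly prices in `CATALOG`, and the `cheapest_fit` helper itself, are hypothetical placeholders rather than real EC2 types or AWS pricing, and the sketch ignores system overhead, DaemonSets, and scheduling constraints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str            # hypothetical label, not a real EC2 instance type
    vcpu: float
    memory_gib: float
    hourly_price: float  # hypothetical price, not AWS pricing


@dataclass(frozen=True)
class Pod:
    name: str
    cpu: float         # requested vCPU
    memory_gib: float  # requested memory in GiB


# Hypothetical catalog of instance shapes to choose from.
CATALOG = [
    InstanceType("compute-large", 8, 16, 0.34),
    InstanceType("general-large", 8, 32, 0.40),
    InstanceType("memory-large", 8, 64, 0.50),
    InstanceType("general-xlarge", 16, 64, 0.80),
]


def cheapest_fit(pods, catalog):
    """Return the cheapest instance type that fits all pods, or None."""
    need_cpu = sum(p.cpu for p in pods)
    need_mem = sum(p.memory_gib for p in pods)
    candidates = [
        t for t in catalog
        if t.vcpu >= need_cpu and t.memory_gib >= need_mem
    ]
    return min(candidates, key=lambda t: t.hourly_price, default=None)


pending = [Pod("api", 2, 4), Pod("cache", 1, 20), Pod("worker", 3, 6)]
choice = cheapest_fit(pending, CATALOG)
print(choice.name if choice else "no single instance type fits")
```

With a single fixed instance shape, the memory-heavy `cache` pod in this example would force a second node or a larger one; choosing from a varied catalog lets the selection follow the workload instead.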

Due to the inflexibility of the CA, Plexus started with a single instance type: c5.2xlarge. The following is a cost report of a Plexus cluster running 51 c5.2xlarge On-Demand instances and supporting 656 pods. CPU allocation was only 62% because of fragmentation created as pods moved between nodes. As indicated in the report, this cluster cost $12,658.20 each month to keep running.

Now, with Karpenter, Plexus can replace the c5.2xlarge instances with a mix of t, m, c, and r instance families in xlarge, 2xlarge, and 4xlarge sizes. All pods, now 552 because there are fewer DaemonSet pods, are spread across 38 nodes. As the following report shows, the compute cost dropped significantly from $12,658.20 to $8,722.04, a savings of just over 30%. With mixed instances, CPU allocation increased from 62.0% to 96.5%, a large reduction in wasted compute capacity that is better for TR and better for the earth.
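The savings can be checked directly from the figures quoted in the two cost reports. The short Python calculation below uses only those numbers; the `percent_reduction` helper is a name introduced here for illustration.

```python
# Figures taken from the two cost reports described in the text.
BEFORE_COST = 12658.20  # monthly cost with 51 c5.2xlarge On-Demand nodes
AFTER_COST = 8722.04    # monthly cost with 38 mixed-type nodes
BEFORE_NODES, AFTER_NODES = 51, 38


def percent_reduction(before: float, after: float) -> float:
    """Return the drop from before to after as a percentage of before."""
    return (before - after) / before * 100


cost_saving = percent_reduction(BEFORE_COST, AFTER_COST)
node_saving = percent_reduction(BEFORE_NODES, AFTER_NODES)
cost_per_node_before = BEFORE_COST / BEFORE_NODES

print(f"Monthly saving: ${BEFORE_COST - AFTER_COST:,.2f} ({cost_saving:.1f}%)")
print(f"Node count reduction: {node_saving:.1f}%")
print(f"Cost per node before: ${cost_per_node_before:,.2f}")
```

The monthly saving works out to $3,936.16, or about 31.1% of the original bill, with roughly a quarter fewer nodes.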

It’s worth mentioning that mixed instance types also reduce the risk of hitting the “AWS running out of capacity” issue, as we are now selecting compute instances from a larger pool. Karpenter’s consolidation feature helps maintain high CPU allocation by monitoring utilization to check whether workloads can run on other nodes or be moved to cheaper instance variants, evaluating factors such as pod counts, node expiry, and Pod Disruption Budgets. In summary, Karpenter delivered savings of more than 30% through its optimization and consolidation capabilities for TR’s stateful and stateless workloads.
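The core question behind consolidation, whether the pods on one node would fit into spare capacity elsewhere, can be sketched in a few lines of Python. This is a simplified illustration and not Karpenter's implementation: the hypothetical `can_consolidate` helper looks only at CPU requests using a first-fit-decreasing pass, and it ignores memory, node expiry, Pod Disruption Budgets, and the other factors mentioned above.

```python
def can_consolidate(candidate_pods, other_nodes_free_cpu):
    """Return True if every pod on the candidate node fits on another node.

    candidate_pods: CPU requests (vCPU) of the pods on the node to remove.
    other_nodes_free_cpu: unallocated vCPU on each of the remaining nodes.
    """
    free = list(other_nodes_free_cpu)  # copy so the caller's list is untouched
    # Place the largest pods first (first-fit decreasing).
    for request in sorted(candidate_pods, reverse=True):
        for i, spare in enumerate(free):
            if spare >= request:
                free[i] = spare - request
                break
        else:
            return False  # no remaining node had room for this pod
    return True


# The node's three pods fit into the spare capacity of two other nodes.
print(can_consolidate([1.0, 0.5, 0.5], [1.5, 0.6]))
# A 2-vCPU pod does not fit anywhere, so the node must stay.
print(can_consolidate([2.0, 0.5], [1.5, 0.6]))
```

When the check succeeds the node can be drained and removed, which is how allocation on the remaining nodes stays high.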

Future roadmap

Having tasted success, TR is eager to explore new possibilities and maximize the potential of Plexus.

Leveraging Plexus further

TR plans to leverage Plexus in modernization and data analytics initiatives, demonstrating the platform’s versatility. Plexus is seeing growth of at least 50% year over year in new application onboardings. The success with Amazon EKS has inspired TR to explore deploying more generative AI models on EKS clusters. Plexus plans to support more than 40 applications and achieve a 20% growth in EKS clusters by the end of 2024. By the end of 2025, the team aims to onboard more than 60 applications, resulting in a 20% growth in EKS clusters.

Generative AI and Plexus

Proven benefits and reliability make Plexus an obvious choice to run generative AI workloads at scale in production. The content modernization team is using Plexus as a core façade layer to generate contexts for similarity searches. This supports generative AI applications built using the Retrieval Augmented Generation (RAG) architecture. The Plexus team is also working with multiple application teams to deploy and run their future generative AI use cases and vector database workloads in production. Plexus is also adding support for GPU-based instances, which helps TR onboard AI workloads.

Enhancements and updates to Plexus

As technology evolves, so does Plexus. The Plexus team is working on a new version that can automate application onboarding and updates. Its design is intended to accommodate hundreds of clusters, with application teams self-serving their onboarding through GitOps, a process that is currently carried out by the Plexus operations team. These improvements help TR modernize its migrated applications at an accelerated rate and keep Plexus at the forefront of its digital strategy.

Conclusion

In conclusion, TR’s success with Plexus on Amazon EKS demonstrates the transformative power of Kubernetes for a large enterprise’s strategic digital initiatives. Plexus built on Amazon EKS has proven to be a catalyst for operational efficiency and cost optimization, showcasing the potential of homegrown platforms in the era of digital transformation. As businesses continue to evolve, the lessons learned from TR’s journey offer valuable insights for users seeking to carve their own path to success.
