Introduction
Freddie Mac has a mission to provide a stable US housing market. To support that mission, Freddie Mac identified the need for faster application delivery, scalable performance, increased resiliency, and cost optimization across their existing application portfolio, and undertook a migration from on-premises Kubernetes to Amazon Elastic Kubernetes Service (Amazon EKS). As a result of this migration, Freddie Mac saw an 80% reduction in application delivery time, from weeks to hours. Freddie also improved their disaster recovery (DR) return to operations (RTO) time from hours to minutes by using Amazon Web Services (AWS) DR mechanisms and auto scaling, with automation layered on top. In this post, we discuss the design and technical implementation, the key challenges Freddie Mac overcame, the lessons learned, and what lies ahead for Freddie Mac.
The Federal Home Loan Mortgage Corporation (Freddie Mac for short) was chartered by Congress in 1970 to keep money flowing to mortgage lenders in support of homeownership and rental housing. Their mission is to provide liquidity, stability, and affordability to the US housing market. To complete this mission, Freddie Mac buys mortgages and sells them on the secondary market to private investors, increasing the supply of money for mortgage lending on home purchases. Freddie Mac also supports affordable housing initiatives such as the preservation of affordable homeownership and rental opportunities through multifamily lending. In 2022, Freddie Mac provided $614 billion in liquidity to the housing market through more than 1,000 lenders. Freddie Mac financed housing for over 2.5 million families, including more than 1.8 million single-family units and 69,300 multifamily rental units.
Amazon EKS platform overview
The Freddie Mac platform team offers two distinct approaches for application teams to manage their deployments: multi-tenant and single-tenant clusters. In a multi-tenant cluster, multiple application teams share the same cluster, and their applications are isolated using separate namespaces. In a single-tenant cluster, each application team has their own cluster. Application teams submit tickets to use a multi-tenant cluster or to request a dedicated single-tenant cluster. Offering both options gives Freddie Mac the flexibility and scalability to accommodate their application teams’ diverse requirements and workloads.
Anil Razdan, Senior Director at Freddie Mac and an executive sponsor of the migration, was delighted with the benefits of Amazon EKS and shared the following: “My vision is to provide application teams a hardened, easy to use platform and Amazon EKS has done just that. It’s remarkable how Amazon EKS handles the complexity of running Kubernetes, allowing our team to focus on developing our applications rather than managing the infrastructure. With EKS, we’ve seen an 80% reduction in application delivery time.”
Multi-tenant EKS clusters
In a multi-tenant EKS cluster setup, each app team is allocated a dedicated namespace within the shared cluster. This approach promotes resource efficiency by optimizing the use of the cluster’s compute capacity across teams. Here’s a breakdown of the key components provided to each app team:
Dedicated Namespace: Each app team operates within their own namespace, providing isolation and preventing interference with other teams.
S3 Bucket: Teams have access to Amazon Simple Storage Service (Amazon S3) buckets for storing and managing application data and other artifacts.
Database Instance: Each team is given an instance of a managed relational database service, such as Amazon Aurora PostgreSQL. This database can be used to store and manage application data, which keeps each team’s data isolated and secure.
HashiCorp Vault Integration: To securely manage secrets and sensitive information, each app team is integrated with HashiCorp Vault, a centralized secrets management solution.
Continuous Delivery: For continuous delivery of applications, teams are provided with Spinnaker, an open-source, multi-cloud continuous delivery platform.
Security Groups and IAM Roles: Each component within the infrastructure is protected using security groups, which control traffic at the instance level, and AWS Identity and Access Management (IAM) roles, which manage access to AWS resources.
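The namespace-per-team isolation described above can be sketched in a few lines of Python. This is a minimal illustration, not Freddie Mac's actual configuration: the `team-<name>` naming convention, the `team` label, and the same-namespace-only NetworkPolicy are assumptions added for the example.

```python
import json


def tenant_manifests(team: str) -> list[dict]:
    """Build a Namespace and a same-namespace-only NetworkPolicy for one app team."""
    namespace = f"team-{team}"  # hypothetical naming convention
    return [
        {
            "apiVersion": "v1",
            "kind": "Namespace",
            "metadata": {"name": namespace, "labels": {"team": team}},
        },
        {
            # Assumed policy: pods accept ingress only from pods in the same namespace.
            "apiVersion": "networking.k8s.io/v1",
            "kind": "NetworkPolicy",
            "metadata": {"name": "allow-same-namespace", "namespace": namespace},
            "spec": {
                "podSelector": {},  # empty selector = every pod in the namespace
                "policyTypes": ["Ingress"],
                "ingress": [{"from": [{"podSelector": {}}]}],
            },
        },
    ]


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, so each document can be applied as-is.
    for manifest in tenant_manifests("payments"):
        print(json.dumps(manifest, indent=2))
```

Generating the manifests from one function keeps every tenant's namespace shaped the same way, which matters when many teams share one cluster.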
Single-tenant EKS clusters
For teams managing commercial-off-the-shelf (COTS) products, a smaller, dedicated EKS cluster is provided. This approach is ideal for scenarios where teams need dedicated resources or isolated environments for their COTS products. Single-tenant clusters mirror the multi-tenant configuration, with the following differences:
- Single-tenant clusters might or might not need databases, depending on COTS product requirements.
- Single-tenant clusters have only one dedicated S3 bucket and one namespace.
Cluster provisioning and application pipeline
Freddie Mac decided to use their existing knowledge of AWS CloudFormation for the Amazon EKS platform to accelerate project delivery. Each EKS cluster is provisioned with the following default configuration using AWS CloudFormation templates:
Self-managed Node Group: To provide a baseline level of compute resources, the cluster is initially provisioned with five nodes in a self-managed node group, which gives applications a reliable foundation for running and scaling.
Auto Scaling Groups: To enable automatic scaling of resources based on demand, two Auto Scaling groups (ASGs) are configured within the AWS CloudFormation template. These groups dynamically adjust the number of nodes in response to changes in resource use. One ASG backs the node group for critical system-level containers, and the other backs the node group for application nodes. These node groups and ASGs work with the Cluster Autoscaler add-on.
GitOps for Add-ons: After a cluster is provisioned, add-ons are installed using GitOps, with Argo CD as the GitOps tool. Freddie uses a single Argo CD cluster (hub) to install add-ons to the application clusters (spokes). The high-level flow is shown in the following image. The add-ons that Freddie uses include Cluster Autoscaler, Metrics Server, ExternalDNS, Filebeat, Metricbeat, Open Policy Agent (OPA), and Twistlock agents.
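To make the hub-and-spoke idea concrete, the following Python sketch generates one Argo CD `Application` per spoke cluster and add-on. It is an illustration under stated assumptions rather than Freddie Mac's setup: the repository URL, the `addons/<name>` path layout, the cluster names, the `default` project, and the `kube-system` target namespace are all hypothetical.

```python
import json


def addon_applications(spoke_clusters, addons, repo_url, revision="main"):
    """Build one Argo CD Application per (spoke cluster, add-on) pair."""
    applications = []
    for cluster in spoke_clusters:
        for addon in addons:
            applications.append({
                "apiVersion": "argoproj.io/v1alpha1",
                "kind": "Application",
                # Applications live on the hub, in the namespace Argo CD runs in.
                "metadata": {"name": f"{cluster}-{addon}", "namespace": "argocd"},
                "spec": {
                    "project": "default",
                    "source": {
                        "repoURL": repo_url,
                        "path": f"addons/{addon}",  # hypothetical repo layout
                        "targetRevision": revision,
                    },
                    # "name" refers to a spoke cluster registered with the hub.
                    "destination": {"name": cluster, "namespace": "kube-system"},
                    "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
                },
            })
    return applications


if __name__ == "__main__":
    apps = addon_applications(
        spoke_clusters=["multi-tenant-prod", "cots-a"],  # hypothetical names
        addons=["cluster-autoscaler", "metrics-server", "external-dns"],
        repo_url="https://git.example.com/platform/eks-addons.git",  # placeholder
    )
    print(json.dumps(apps[0], indent=2))
```

Because the hub owns every `Application`, adding a new spoke cluster only means adding one entry to the cluster list; the add-on definitions in Git stay unchanged.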
Finally, applications are installed using a CI/CD pipeline as shown in the following image:
AMI refresh
Due to Freddie Mac’s regulatory compliance requirements, additional software is installed on top of the Amazon EKS-optimized Amazon Machine Images (AMIs). Freddie developed an in-house solution, EARS (EKS AMI Refresh Service), to seamlessly refresh EKS clusters with the latest AMIs. Every time a new EKS-optimized AMI is released, EARS runs a series of automated steps to create the new AMI and rehydrate the cluster. This service plays a vital role in maintaining compliance with Freddie Mac’s information security requirements and enhancing the overall security posture of the cluster.
EARS is built using AWS serverless technologies, which makes it a highly scalable, automated, and efficient solution.
AWS CodePipeline: This fully managed continuous delivery service is used to generate customized Amazon EKS AMIs. It automates the build, test, and deployment of the AMIs, which keeps the image creation process consistent and reliable.
AWS Lambda: Lambda functions periodically check for the latest available Amazon EKS AMIs and compare them with the AMIs currently deployed on the nodes within EKS clusters.
AMI Refresh Mechanism: When a Lambda function detects that an EKS cluster is using an AMI that is out of compliance or out of date, it creates an Amazon Simple Queue Service (Amazon SQS) message and sends it to AWS Step Functions. Step Functions then invokes a Lambda function that refreshes the nodes with the latest AMI. This mechanism makes sure that the nodes within the cluster are seamlessly and efficiently updated with the latest, most secure AMI version.
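The comparison step at the heart of this flow can be sketched as plain Python. The function names, the message fields, and the sample AMI IDs below are hypothetical, and the sketch leaves out the AWS calls themselves: a real Lambda function would look up the node AMIs and publish each message through the AWS SDK.

```python
import json


def find_stale_clusters(latest_ami, cluster_node_amis):
    """Return {cluster: sorted stale AMI IDs} for clusters with any node off latest_ami."""
    stale = {}
    for cluster, node_amis in cluster_node_amis.items():
        outdated = sorted(set(node_amis) - {latest_ami})
        if outdated:
            stale[cluster] = outdated
    return stale


def build_refresh_messages(latest_ami, cluster_node_amis):
    """Build one JSON message body per stale cluster (hypothetical message schema)."""
    stale = find_stale_clusters(latest_ami, cluster_node_amis)
    return [
        json.dumps(
            {"cluster": cluster, "target_ami": latest_ami, "stale_amis": amis},
            sort_keys=True,
        )
        for cluster, amis in sorted(stale.items())
    ]


if __name__ == "__main__":
    fleet = {  # made-up inventory: cluster name -> AMI ID of each node
        "multi-tenant-prod": ["ami-new", "ami-old"],
        "cots-a": ["ami-new"],
    }
    for body in build_refresh_messages("ami-new", fleet):
        print(body)  # in a real Lambda, this body would be sent to the SQS queue
```

Keeping the comparison free of AWS calls makes it easy to unit test, and it lets the same logic serve both scheduled refreshes and on-request ones.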
OCP migration to Amazon EKS
After the Amazon EKS platform was built, Freddie migrated applications from the on-premises Red Hat OpenShift Container Platform (OCP) to EKS clusters on AWS. The migration used a hybrid approach: some data is required to stay on on-premises network-attached storage (NAS), while the applications move to the cloud. AWS Direct Connect links AWS to the on-premises data center so that applications can access that data. The platform team identified the gaps between the on-premises Kubernetes manifests and what Amazon EKS requires, and modified the manifests so that they could be deployed to Amazon EKS. The primary distinctions between the on-premises and cloud environments lie in the Kubernetes provider (OCP in the corporate data center, Amazon EKS in the cloud), on-premises data storage on NAS devices, and the separate teams overseeing the platforms. The OCP platform team manages the on-premises clusters, whereas the container orchestration platform team (COPS) manages Amazon EKS. The OCP team doesn’t use a standard Container Storage Interface (CSI) driver to provision persistent volumes (PVs) automatically, so PVs and persistent volume claims (PVCs) are created manually through Kubernetes manifest files. Because the migration followed a lift-and-shift strategy, the same approach was adopted in the AWS cloud. The team is currently automating PV provisioning by using the standard CSI drivers available as add-ons in Amazon EKS. The following diagram illustrates this configuration:
Challenges and learnings
Every large transformation comes with its own set of challenges and learnings. The following are some of the challenges and how we solved them.
IP Address Exhaustion: By default, the Amazon Virtual Private Cloud (Amazon VPC) CNI plugin assigns each pod a routable IPv4 address from the VPC CIDR block. In Amazon EKS, pods are first-class citizens on the network: each pod has its own unique IP address, which enables pod-to-pod communication on a single host, pod-to-pod communication across hosts, and communication from pods to other AWS services, to an on-premises data center, and to the internet. Because every pod consumes an address from the VPC CIDR range, this often exhausts the limited number of IP addresses available in the VPC. However, pod networking can be separated from the host network (the CIDR that the Amazon Elastic Compute Cloud (Amazon EC2) worker nodes are in). Freddie Mac adopted custom pod networking to solve the IP exhaustion challenge: Freddie expanded the VPCs by adding secondary VPC CIDR ranges, and pods are assigned IP addresses from a secondary CIDR rather than from the primary CIDR that worker nodes use.
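A quick calculation shows why a secondary CIDR helps. The sketch below uses Python's standard `ipaddress` module; the CIDR ranges, subnet sizes, Availability Zone name, and resource IDs are illustrative assumptions, not Freddie Mac's values. It relies on two facts: AWS reserves five addresses in every subnet, and the VPC CNI plugin's custom networking is configured with one `ENIConfig` resource per Availability Zone.

```python
import ipaddress
import itertools

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves the first four and the last address


def usable_ips(cidr: str) -> int:
    """Number of addresses AWS can hand out in a subnet with this CIDR."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def pod_subnets(secondary_cidr: str, new_prefix: int, count: int) -> list[str]:
    """Carve the first `count` pod subnets out of a secondary VPC CIDR."""
    network = ipaddress.ip_network(secondary_cidr)
    return [str(s) for s in itertools.islice(network.subnets(new_prefix=new_prefix), count)]


def eni_config(zone: str, subnet_id: str, security_group_ids: list[str]) -> dict:
    """ENIConfig that tells the VPC CNI which subnet to use for pods in one zone."""
    return {
        "apiVersion": "crd.k8s.amazonaws.com/v1alpha1",
        "kind": "ENIConfig",
        "metadata": {"name": zone},
        "spec": {"subnet": subnet_id, "securityGroups": security_group_ids},
    }


if __name__ == "__main__":
    print(usable_ips("10.0.1.0/24"))            # a small primary (node) subnet
    subnets = pod_subnets("100.64.0.0/16", 18, 3)  # assumed secondary CIDR
    print(subnets, [usable_ips(s) for s in subnets])
    print(eni_config("us-east-1a", "subnet-EXAMPLE", ["sg-EXAMPLE"]))  # placeholder IDs
```

With these example numbers, a /24 node subnet offers 251 addresses, while each /18 pod subnet carved from the secondary range offers 16,379. Pods draw from the larger pool and the nodes keep their primary addresses.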
Monthly AMI Refreshes: Amazon releases new AMIs with the latest security patches for each Amazon EKS version, on average once a month. Freddie needed to trigger an automated process when a new AMI is released to harden it and then upgrade the existing clusters. Some clusters needed to be updated on a schedule, and some needed to be upgraded on request from the application team. Freddie built an in-house AMI refresh process to solve this, which we described in the AMI refresh section. For users who are adopting Karpenter, this process is simplified using Drift. Refer to this Amazon Containers post for details.
Selecting Cluster Tenancy: The AWS team and Freddie debated whether to go with a large multi-tenant cluster or multiple single-tenant ones. Both have tradeoffs. Having a single cluster generally means easier management of user authentication, Kubernetes version upgrades, cluster visibility, node management, and application deployments. On the other hand, with multi-tenancy every tenant needs to be ready to upgrade on the same timeline, and additional features, such as network policy, are needed to isolate one app from another. One cluster per app gives application teams more flexibility: teams can have different upgrade timelines, security isolation is inherently present, and cost separation is simpler. However, for an organization of Freddie Mac’s size, the resulting number of clusters makes multi-cluster visibility and management a challenge for the platform team. Each organization needs to decide based on their own priorities and requirements. Freddie decided to use a large multi-tenant cluster for the majority of the apps, and the platform team allowed single-tenant clusters for certain apps.
Cost Optimization: Managing costs effectively is a crucial consideration for Freddie Mac’s Amazon EKS implementation. Cost optimization is not a one-and-done process, and the AWS team and Freddie look at cost optimization opportunities continuously. Some of the methods implemented at Freddie are as follows: enforcing assignment of pod requests and limits, rightsizing the Amazon EC2 nodes, reducing the log level from info to warning for certain applications, and using Reserved Instances. Freddie is also evaluating Karpenter to reduce cost further. For a comprehensive set of cost optimization techniques, refer to the Cost Optimization section in the Amazon EKS Best Practices Guide.
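As an illustration of the first method, the following Python sketch flags containers in a pod spec that lack resource requests or limits. It shows the kind of check an admission policy or CI step might perform; it is not Freddie Mac's policy. The function name is hypothetical, and requiring both CPU and memory in both sections is an assumption made for the example.

```python
def containers_missing_resources(pod_spec: dict) -> list[str]:
    """Return names of containers that lack CPU or memory requests or limits."""
    required = ("cpu", "memory")  # assumed policy: both must be set
    missing = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources", {})
        for section in ("requests", "limits"):
            values = resources.get(section, {})
            if any(key not in values for key in required):
                missing.append(container["name"])
                break  # one violation is enough to flag this container
    return missing


if __name__ == "__main__":
    pod_spec = {  # made-up pod spec for demonstration
        "containers": [
            {
                "name": "app",
                "resources": {
                    "requests": {"cpu": "250m", "memory": "256Mi"},
                    "limits": {"cpu": "500m", "memory": "512Mi"},
                },
            },
            {"name": "sidecar", "resources": {"requests": {"cpu": "50m", "memory": "64Mi"}}},
        ]
    }
    print(containers_missing_resources(pod_spec))
```

Explicit requests let the scheduler pack nodes based on declared needs, which is what makes rightsizing the Amazon EC2 nodes measurable.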
Scale the Learnings: After Freddie took the first couple of projects to production on Amazon EKS, they wanted to use those skills to build an enterprise-wide competency. This part is important because you don’t want other teams to repeat the mistakes that the Cloud Center of Excellence (CCOE) or the platform team already worked through. To scale the learnings, AWS and Freddie collaborated to provide the following:
- Well-architected patterns for other teams to consume
- A standard set of templates (such as Dockerfiles, Jenkinsfiles, and Helm charts) for app teams to use
- Periodic knowledge-sharing sessions, delivered jointly by AWS and Freddie, where successful teams shared their best practices
- Continuous dialogue between the AWS and Freddie teams, in which AWS shares Amazon EKS best practices from similar industries
Conclusion and what’s next for Freddie Mac
In this post, we demonstrated how Freddie Mac built a platform on Amazon EKS that allowed them to migrate and modernize their apps at a faster pace. Freddie Mac is constantly pushing the boundaries and is currently exploring Karpenter as the Kubernetes autoscaler to help them further optimize cost and reduce data plane management overhead. They are also exploring an Internal Developer Platform (IDP) to further enable developer self-service for infrastructure. Freddie Mac and AWS continue to partner to modernize and migrate regulated industry workloads on AWS. “The AWS partnership has made deploying with Amazon EKS not just easy but impactful, driving efficiency and innovation across our operations,” says Pardeep Chahal, Senior Tech Lead of Platform Engineering, Freddie Mac.