This post was coauthored by Ben Duffield and Eric Silverberg at Perry Street Software, with contributions from Adam Tucker, Piotr Wald, and Cristian Constantinescu of PSS.
Introduction
You just finished deploying that important change you spent weeks preparing, when you see this email subject in your inbox: Alarm: HTTPCode_Target_5XX_Count.
Ugh.
The code you have just deployed is causing a critical error in your application server. As dev ops professionals, we are told to just cut back over to the original code, and indeed this is very sensible advice. But buried in that advice are two discrete, time-consuming steps: first, removing malfunctioning code from an Application Load Balancer, and second, adding previously functioning code back into the Application Load Balancer. Both steps take precious seconds or minutes, during which your customers are unable to use your application, and, in the worst scenarios, data is being lost or corrupted.
If you think about it, code deployments are like dating. Imagine a cluster of application servers: deploying new code to 1% is the equivalent of chatting online, 2% is coffee, 5% is drinks, 10% dinner, 50% is meeting the parents, and 100% is marriage. Deployment strategies like blue/green are effectively advocating for complete commitments, followed by sudden divorces, with your application server software versions. It’s a relationship strategy that can work for some people, but not without a lot of heartache.
Perry Street Software (PSS) is home to SCRUFF and Jack’d, two of the world’s largest gay, bi, trans, and queer social dating apps on iOS and Android. Perry Street’s brands reach over 30 million members worldwide so members can connect, meet, hook up and express themselves on a platform that prioritizes privacy and security. PSS is involved in internet policy at a national level, and has engaged with Senators, Congresspeople and policymakers on Section 230 and platform regulation to help defend LGBTQ+ digital spaces.
For its entire history, PSS has had a lean ops team that supports an active user base 24 hours a day, 365 days a year. Because of the global nature of the dating app business, downtime, even in the early hours of the morning, is not an option. Finding ops deployment strategies that are simple to execute, simple to debug during failure, and able to mitigate worst-case scenarios is critically important.
History
PSS deployed its first Ruby server code in early 2010 using Capistrano, which is a remote server automation and deployment tool. Capistrano would make SSH connections, pull code from git, configure OS parameters and add or remove an Amazon Elastic Compute Cloud (Amazon EC2) instance from Elastic Load Balancers. Then and now, at a high level, PSS has three classes of servers: application servers, search servers and background workers. Each class of server has precise configuration requirements. As PSS’s infrastructure grew, their Capistrano scripts became increasingly complex for new developers to maintain and extend. Moreover, PSS lacked Continuous Integration / Continuous Deployment (CI/CD) pipelines, since these scripts needed to be executed on a bastion instance running inside of their Virtual Private Cloud (VPC).
Goals
By 2021, PSS’s original Capistrano deployment solution had served them well, but PSS knew they were overdue for an upgrade. PSS’s primary goal was a simplified code deployment process that would allow new developers to easily support their infrastructure. CI/CD systems had become the primary method by which modern software teams could achieve this goal, and PSS wanted the same for their system. Moreover, PSS wanted to containerize their production environment, given that they had done this for the development and test environments a number of years before and could therefore benefit from an existing base of Dockerfiles. At the same time, PSS didn’t want to give up on the benefits of Capistrano which, by virtue of its single-threaded, ssh-and-pull deployment strategy, implicitly provided them with an incremental deployment strategy, similar in practice to either linear or canary deploys.
For this project PSS only considered services fully integrated with AWS: Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon Elastic Container Service (Amazon ECS).
Amazon Elastic Kubernetes Service (Amazon EKS)
Kubernetes is a widely used open-source solution for automating the deployment and management of containerized applications. It was developed by Google in 2014 and is the most popular container orchestration solution with nearly 100,000 stars on GitHub. Despite its popularity, PSS’s team elected not to proceed with Kubernetes. Their team had little Kubernetes knowledge, and thus little knowledge of the failure modes and patterns of Amazon EKS, which would make supporting it on a small team hard.
Kubernetes has extensive support for service configuration, but, when running in the cloud, many critical services are fully managed, e.g. Memcached (Amazon ElastiCache for Memcached), Redis (Amazon ElastiCache for Redis) and SQL databases (Amazon Relational Database Service, Amazon Aurora). For this reason, orchestration solutions running locally will never 100% reflect what is running in production. PSS uses LocalStack in development and test to replicate AWS managed services, and has found orchestration solutions like Docker Swarm to be a simpler option.
Amazon Elastic Container Service (Amazon ECS)
Amazon ECS is a fully managed container orchestration service that makes it easy for you to deploy, manage, and scale containerized applications. Historically, it has had a better integration with AWS services and better support by Amazon. PSS was an early user of Amazon ECS and already had tasks running in AWS Fargate. AWS Fargate is a serverless, pay-as-you-go compute engine that lets you focus on building applications without managing servers. Ultimately, because PSS accepted there would be no consistency between production and test/development because of their use of managed services, they opted for a solution that would have the best possible support by AWS.
AWS Fargate versus Amazon EC2
When exploring their options, PSS needed to determine whether they should use Amazon ECS in Fargate mode or EC2 mode. PSS knew they had stable usage patterns throughout the day due to their global user base, so they did not need significant scaling capabilities.
Also critical to their evaluation was the fact that Amazon ECS with AWS Fargate did not support caching Docker images directly on the host itself, which led to slower startup times. PSS also identified that the cost discounts associated with all upfront pricing were significant, which ultimately led PSS to using Amazon ECS – EC2 mode.
Amazon ECS out of the box deployment strategies
Amazon ECS has two distinct strategies for deploying new Application Load Balancer webserver targets: blue/green deployments and rolling deployments.
Blue/Green deployments
In this approach a full replica of the running environment that equals or exceeds the current capacity is created. It is possible to configure a static percentage of traffic that is shifted to the new environment. Once the deployment is completed the old environment can be left running for a period of time to facilitate a rollback, but it is not possible to pause this at any stage.
This strategy does not support background workers as they do not register to a load balancer and read directly from an SQS queue.
Rolling deployments
Amazon ECS allows the configuration of minimum and maximum healthy percentages, so that existing tasks are stopped and new ones started until all tasks run the new version. It is possible to have an auto scaling capacity provider launch new instances to provide more than 100% capacity, but this requires a new EC2 instance to be launched and configured, adding 1 to 2 minutes to each deploy. This applies to both application servers and background workers.
Similar to blue/green, this also does not support the ability to pause a deployment after the initial traffic shift or to have an extended canary, and it will run to full completion unless a rollback is triggered.
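To make the minimum and maximum healthy settings concrete, the sketch below computes the window of task counts a rolling deployment has to stay within. It assumes the documented Amazon ECS rounding behavior (the lower limit rounds up, the upper limit rounds down). The function name and the sample numbers are illustrative and are not taken from PSS's configuration.

```python
def rolling_deploy_bounds(desired_count, minimum_healthy_percent, maximum_percent):
    """Return (min_running, max_running) task counts during a rolling deploy."""
    # Lower limit: tasks that must stay running, rounded up (integer ceiling).
    min_running = -(-desired_count * minimum_healthy_percent // 100)
    # Upper limit: total tasks allowed at once, rounded down.
    max_running = desired_count * maximum_percent // 100
    return min_running, max_running


# Hypothetical service with 20 tasks, allowed to run between 100% and 110%.
low, high = rolling_deploy_bounds(20, 100, 110)
print(f"keep at least {low} tasks running, never exceed {high}")
```

With a minimum of 100%, the scheduler can only replace tasks by first starting extra ones, which is why spare instance capacity (or a capacity provider that launches it) matters for deploy time.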
Circuit breakers
It is possible to trigger rollbacks of a deployment by configuring a circuit breaker that will automatically roll back if a task fails to launch or fails to become stable. PSS’s application supports dozens of endpoints, any one of which could fail. Because of this, a stable deployment is hard to define: PSS could create a ping endpoint to confirm that the application has started, but that would not guarantee that all endpoints are behaving as expected.
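For reference, a circuit breaker is enabled through the service's deployment configuration. The sketch below only builds that structure as a plain dictionary, using the field names from the Amazon ECS API. The helper name, the default percentages, and the cluster and service names in the comment are placeholders, not PSS's settings.

```python
def circuit_breaker_config(minimum_healthy_percent=100, maximum_percent=200):
    """Build a deploymentConfiguration dict with the circuit breaker enabled."""
    return {
        "deploymentCircuitBreaker": {
            "enable": True,    # fail a deployment whose tasks cannot become stable
            "rollback": True,  # then return to the last completed deployment
        },
        "minimumHealthyPercent": minimum_healthy_percent,
        "maximumPercent": maximum_percent,
    }


config = circuit_breaker_config()
# With boto3 and AWS credentials available, this could be applied as
# (cluster and service names are placeholders):
#   boto3.client("ecs").update_service(
#       cluster="example-cluster",
#       service="example-service",
#       deploymentConfiguration=config,
#   )
print(config)
```

Note that this only reacts to tasks that fail to start or stabilize. It does not detect an endpoint that starts cleanly and then returns errors, which is the gap described above.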
Canary deployment requirements
PSS ships a monolith application that has over 100 endpoints, so health is a moving target that is mostly derived from a combination of sources such as Airbrake and telemetry from Amazon CloudWatch metrics: ALB, Database, Redis, Memcached. Depending on how frequently an endpoint is triggered, it can take seconds or minutes for an issue to be identified, some of which may only become noticeable under production load. There is often some element of developer discretion, especially when launching a new endpoint. PSS also lacks a staging environment that can replicate production load, including warm caches, capacity and the many combinations of client usage patterns.
To manage the risk and the impact of a flawed or breaking change to their users, PSS have used canary deploys since the very beginning, and did not want to abandon this strategy while modernizing.
Solution overview
Figure 1: Overview of CI/CD Pipeline with Amazon ECS
A new deployment is triggered via a commit in GitHub. That commit triggers an action that builds a Docker image for both the Advanced RISC Machine (ARM) and AMD64 architectures. Because PSS has three different classes of servers (application servers, search servers, and background workers), the upload of the image definitions file triggers three pipelines which all perform the same four steps:
- A manual approval
- Deploy to a service that has a capacity of 1
- A second manual approval, and
- A full deploy
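The control flow of these four steps can be modeled in a few lines. This is a plain-Python illustration of the gating logic only. The real pipeline runs in AWS CodePipeline, and the function and stage names here are made up for the example.

```python
def run_pipeline(approve_canary, approve_full, deploy):
    """Run the four stages in order and stop at the first rejected approval."""
    executed = []
    if not approve_canary():           # step 1: manual approval
        return executed
    executed.append("approve-canary")
    deploy("canary")                   # step 2: deploy to the capacity-1 service
    executed.append("deploy-canary")
    if not approve_full():             # step 3: second manual approval
        return executed
    executed.append("approve-full")
    deploy("full")                     # step 4: deploy to the full-capacity service
    executed.append("deploy-full")
    return executed


# Example: the canary is approved, but the reviewer rejects the full rollout.
deployed = []
steps = run_pipeline(lambda: True, lambda: False, deployed.append)
print(steps, deployed)
```

In this model a rejected second approval leaves the new code running only on the canary service, which is the state a reviewer would then have to fix or roll back.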
Walkthrough
Step 1: Manual approval
AWS CodePipeline allows PSS to include a Manual Approval action provider. This helps PSS prevent accidental or unauthorized pushes to production, whether by careless or malicious developers.
Step 2: Deploy to low-capacity Amazon ECS service
This is the crux of the canary deployment. By creating an Amazon ECS service with a capacity of 1, or 10% for their background workers, PSS is able to achieve the benefits of their old canary deployment system in the context of Amazon ECS. If there is a problem, only a limited number of application containers are affected. This limited-capacity service is registered to a load balancer that also has a full-capacity service registered to it.
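The blast radius of the canary can be estimated with simple arithmetic. The sketch below assumes the load balancer spreads requests evenly across all registered targets, which is a simplification, and the task counts are hypothetical rather than PSS's actual fleet size.

```python
from fractions import Fraction


def canary_share(canary_tasks, full_tasks):
    """Fraction of requests that reach canary tasks under even distribution."""
    return Fraction(canary_tasks, canary_tasks + full_tasks)


# Hypothetical fleet: 1 canary task next to 49 tasks in the full-capacity service.
share = canary_share(1, 49)
print(f"canary receives about {float(share):.1%} of requests")
```

The same function covers the background workers if the 10% and 90% services are expressed as task counts, for example `canary_share(10, 90)`.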
Step 3: Manual approval
PSS has a second Manual Approval action provider that allows them time to verify the health of the deployment by reviewing CloudWatch Dashboards, Airbrake and APM tooling (Datadog and New Relic). After reviewing various metrics across their toolchain, and when things are going as planned, PSS proceed to step 4 where they deploy their application to a high-capacity Amazon ECS service.
Step 4: Deploy to high-capacity Amazon ECS service
The final step is to deploy to their Amazon ECS service that has full capacity, or the remaining 90% for their background workers. PSS do use the rolling-deploy capabilities of Amazon ECS, and turn over 10% of their containers at a time.
Notes on PSS’s implementation
Tag immutability
PSS uses tag immutability to ensure that the expected code is running when containers exit in-between releases. If you tag images as latest, rollback would be harder, because you would have to rebuild your old image or re-tag your old image as latest.
As a benefit of not using latest, PSS have the Amazon ECS agent configured to only pull the shared image once: (ECS_IMAGE_PULL_BEHAVIOR: once). This speeds up task launching for services with hundreds of containers.
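One common way to get unique, immutable tags is to tag each image with the commit SHA it was built from. The sketch below assumes that convention. The registry hostname, repository name, and SHA are placeholders, and the helper is hypothetical.

```python
import re


def image_uri(registry, repository, git_sha):
    """Build an image reference tagged with a commit SHA instead of 'latest'."""
    # Accept only abbreviated or full hexadecimal SHAs so that mutable tags
    # such as 'latest' are rejected early.
    if not re.fullmatch(r"[0-9a-f]{7,40}", git_sha):
        raise ValueError(f"not a commit SHA: {git_sha!r}")
    return f"{registry}/{repository}:{git_sha}"


# Placeholder registry, repository and SHA for illustration only.
print(image_uri("registry.example.com", "app-server", "0a1b2c3"))
```

Because a given tag always refers to the same image, an agent that has already pulled it has no reason to pull it again, which is what makes the pull-once setting safe.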
Task definitions and Terraform
Task definitions in Terraform state track a specific version of a task definition (not latest). Deployments via AWS CodePipeline will create a new version of the task definition with the new Amazon ECR tagged image and update the Amazon ECS service to use this new version. This causes the Terraform state and the infrastructure state to diverge, which makes it hard to track actual configuration changes to the task definitions.
There is no best practice for supporting this use case, so PSS’s task definitions reference the SHA of their release branch in GitHub and they use a script to re-import Terraform state configuration changes. You can see this script at this link.
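The script itself is not reproduced here. As a rough illustration of the drift being described, the sketch below compares the task definition revision recorded in state with the revision a service is actually running. The data shapes, family names, and revision numbers are hypothetical.

```python
def find_drift(state_revisions, live_revisions):
    """Map each task family to (state_rev, live_rev) where the two differ."""
    drift = {}
    for family, live_rev in live_revisions.items():
        state_rev = state_revisions.get(family)  # None if not tracked in state
        if state_rev != live_rev:
            drift[family] = (state_rev, live_rev)
    return drift


# Hypothetical revisions: the pipeline has deployed "app" twice since the
# last Terraform apply, while "worker" is unchanged.
state = {"app": 41, "worker": 17}
live = {"app": 43, "worker": 17}
print(find_drift(state, live))
```

A report like this shows which families a re-import step would need to touch before the next Terraform plan is meaningful.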
Future enhancements
PSS has many ideas for how to improve the system. Other than incorporating canary deployments natively into Amazon ECS, PSS would love to see better support for queue workers. Because queue workers read directly from an Amazon Simple Queue Service (Amazon SQS) queue, and new workers are being launched every few seconds, it would be ideal to configure what percentage of new workers are launched using the new image vs old image in a given time interval.
Currently no good image cleanup solution exists. Amazon Elastic Container Registry (Amazon ECR) allows retention rules for fixed periods of time, but some images can remain in use for longer than that. There is no way to specify both an age and an in-use flag for images. A custom solution would need to identify images that are in use, either by asking the Amazon ECS agent or by checking recent task definitions.
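The selection rule such a custom solution needs is small: delete an image only if it is both older than the retention period and not in use. The sketch below shows that rule on in-memory data. Gathering the push timestamps and the in-use tags from AWS is left out, and every name and value here is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def images_to_delete(pushed_at_by_tag, in_use_tags, max_age, now):
    """Return tags that are older than max_age and not referenced by any task."""
    cutoff = now - max_age
    return sorted(
        tag
        for tag, pushed_at in pushed_at_by_tag.items()
        if pushed_at < cutoff and tag not in in_use_tags
    )


# Hypothetical registry contents and usage data.
now = datetime(2023, 6, 1, tzinfo=timezone.utc)
pushed = {
    "0a1b2c3": now - timedelta(days=120),  # old, but still running somewhere
    "4d5e6f7": now - timedelta(days=95),   # old and unused
    "8a9b0c1": now - timedelta(days=3),    # recent and unused
}
print(images_to_delete(pushed, {"0a1b2c3"}, timedelta(days=90), now))
```

Only the old, unused tag is selected. The old tag that is still in use survives, which is the case a purely age-based retention rule gets wrong.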
Lastly, rollbacks aren’t smooth. To achieve a rollback, you have to go into each of the services manually and click the old task definition to revert. Given all of the services that PSS actually run, they have 10 different services they may need to manually roll back. In most cases it is best practice to add a new commit and continue to deploy forward, though some changes have a significant impact and rollback is the quicker and best solution.
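A scripted rollback would first need to work out, per service, which task definition to return to. The sketch below builds such a plan from a deployment history. The history format, service names, and revisions are hypothetical, and applying the plan is only indicated in a comment.

```python
def rollback_plan(deploy_history):
    """Map each service to the task definition it ran before the latest deploy.

    deploy_history maps a service name to its deployed task definitions,
    oldest first. Services with no earlier deployment are skipped.
    """
    plan = {}
    for service, revisions in deploy_history.items():
        if len(revisions) >= 2:
            plan[service] = revisions[-2]
    return plan


# Hypothetical history for three of the services.
history = {
    "app-canary": ["app:41", "app:42"],
    "app-full": ["app:41", "app:42"],
    "search-full": ["search:9"],
}
plan = rollback_plan(history)
# Each entry could then be applied with boto3 (cluster name is a placeholder):
#   ecs.update_service(cluster="example-cluster", service=name,
#                      taskDefinition=task_definition)
print(plan)
```

Producing the plan separately from applying it keeps a human in the loop, in the same spirit as the manual approval steps in the pipeline.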
Assessing impact of PSS’s move to CI/CD
Comparing Capistrano to AWS CodeBuild buildspecs
PSS’s old Capistrano code would deregister Virtual Machines (VMs) from load balancers, use SSH to check out code, and re-register them with load balancers. The Capistrano code had no conception of what or how these machines were configured, and AMI configuration was a manually documented process that would be updated every few years. The Capistrano logic was also many hundreds of lines split across files. Capistrano would be run by a bastion instance with limited per-user permission capabilities.
Today PSS use AWS CodeBuild buildspecs to build their Docker images. These buildspecs are approximately 50 lines. They use Terraform to manage their infrastructure, which has effectively replaced their wiki documentation for building an image by hand. Deployments are able to take advantage of AWS Identity and Access Management (AWS IAM) based roles and policies. It’s also easier for new employees to do a deployment because all steps are point-and-click versus through an SSH terminal.
Deployment times: faster for new capacity, equivalent for each release
Today, when PSS need to deploy new capacity, it takes 2 to 3 minutes, all automated, to launch and register a new Amazon EC2 instance. Once registered, it takes another 2 to 3 minutes to pull images and start containers. Every deployment, however, requires PSS to build an image for Amazon ECR; because PSS are building multi-architecture images, this can add an additional 5 minutes to the deployment times.
Previously this would be a manual process in multiple AWS windows to launch and configure an Amazon ECS image, and adding new capacity would take >10 minutes and could be error-prone.
Consistency across environments
PSS is using the same Dockerfile now to build images for Development, Test, and Production. When doing future software upgrades, PSS can do them with more confidence that they have not introduced unanticipated regressions, and can spend less time supporting bespoke configurations for dev and test environments.
Conclusion
This post showed how PSS simplified the code deployment process while maintaining their deployment workflow and not slowing down deployment times. PSS now have better security controls in place that reduce access to running instances, and have containerized their production environment, which simplified the migration to more efficient and cost-effective AWS Graviton instance types.