Etsy’s Service Platform on Cloud Run cuts deployment time from days to under an hour


Introduction

Etsy, a leading ecommerce marketplace for handmade, vintage, and unique items, has a passion for delivering innovative and seamless experiences for customers. Like many fast-growing companies, Etsy needed to scale their teams, technologies, and tools to keep pace with their business growth. Indeed, between 2012 and 2021, their gross merchandise sales increased by more than 1,400% to $13.5 billion.

As part of Etsy’s efforts to keep pace with this growth, the company migrated all their infrastructure from traditional data centers to Google Cloud. This shift not only marked a significant technological milestone, but also prompted Etsy to rethink its service development approach. The journey led to the creation of “ESP” (“Etsy’s Service Platform”), a customized platform built on Google Cloud Run that streamlines the development, deployment, and management of microservices.

This blog post delves into Etsy’s experience building the service platform, explains how Cloud Run helped them accomplish their vision, highlights lessons learned, and shares how their platform continues to evolve.

The need for change and architectural vision

As Etsy grew, so did the demand for our engineering organization to support richer functionality and higher traffic volume in our marketplace. Our migration to GCP in 2018 enabled Etsy engineers to explore and leverage Google Cloud-based service platforms. However, this explosion of technical creativity also gave rise to new challenges, including duplicated scaffolding and code, and unsupported infrastructure with uncertain ownership.

To address these challenges, Etsy assembled a squad of architects to craft a vision detailing what future service development at Etsy would look like. The goal was clear: create a platform that decouples service writing from infrastructure, liberating developers from the burden of backend complexities and allowing them to quickly and safely deploy new services.  

Transforming vision into reality

The resulting architectural vision became the blueprint for ESP, Etsy's Service Platform, and a newly formed squad took on the challenge of transforming that vision into reality. The first step was assembling a team capable of bridging the gap between infrastructure and application development. Made up of seasoned engineers with diverse expertise, the team brought a rich blend of skills to the table.

Recognizing the importance of aligning with our future platform customers, the team collaborated closely with Etsy architecture and engineering. The Ads Platform Team, already engaged in service development, played a pivotal role by agreeing to embed one of their senior engineers in the service platform team. Together, they delivered a Minimum Viable Platform (MVP) to support the deployment of a new Ads Platform service as the ESP pilot.

Choosing Cloud Run for accelerated development 

A successful service platform, according to our architectural vision, would streamline the developer experience by decoupling infrastructure and automating its provisioning. The team recognized that our potential customers from the larger engineering organization also needed a platform that integrated into their workflow with as little friction as possible. To achieve this, the service platform team chose to focus on Etsy-specific aspects: developer experience and language support, CI/CD, integration with existing services, observability, service catalog, security, and compliance.

The decision to leverage Google Cloud services, especially Cloud Run, was strategic. While alternatives like GKE were enticing, the team wanted to deliver value quickly. Cloud Run’s robust and intuitive design allowed the team to focus on core platform functionality, letting Cloud Run handle the more complex and time-consuming aspects of running containerized services.

The Toolbox: A Closer Look

To provide a consistent and efficient developer and operational experience, ESP relies on a carefully selected toolbox:

  • Developer Interface: A custom CLI tool for streamlined developer interactions.

  • Protocols: gRPC and protobuf for standardized communication.

  • Language Support: Go, Python, Node, PHP, Java, Scala.

  • CI/CD: GitHub Actions for a smooth integration and deployment pipeline.

  • Observability: Leveraging OpenTelemetry (OTel) on Google Cloud services and Google Cloud Monitoring and Logging, along with Prometheus and Alertmanager.

  • Client Library: ESP-generated clients are registered in Artifactory.

  • Service Catalog: Utilizing Backstage for centralized service visibility.

  • Runtime: Cloud Run, chosen for its simplicity and compatibility.
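To make the runtime choice more concrete, here is a minimal sketch of the contract a container has to meet on Cloud Run: it listens for requests on the port given in the `PORT` environment variable (8080 when unset). ESP services use gRPC according to the list above; this sketch uses a plain HTTP handler from the Python standard library only to stay self-contained. The `/healthz` path and the helper names `resolve_port` and `make_server` are illustrative assumptions, not part of ESP.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def resolve_port(environ=None):
    """Return the port to listen on; Cloud Run injects it as PORT."""
    env = os.environ if environ is None else environ
    return int(env.get("PORT", "8080"))


class HealthHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers GET /healthz and returns 404 otherwise."""

    def do_GET(self):
        if self.path == "/healthz":
            body = b"ok\n"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)


def make_server(environ=None):
    # Bind to all interfaces so the platform's proxy can reach the container.
    return HTTPServer(("0.0.0.0", resolve_port(environ)), HealthHandler)
```

Calling `make_server().serve_forever()` from the container's entry point would start the service. Everything else (scaling, TLS termination, routing) is left to the platform, which is the kind of decoupling the service platform team was after.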

Navigating Challenges

The path to creating the service platform was not without obstacles. The VPC connector became overloaded, and some services required fine-tuning to optimize resource allocation. However, these challenges led to platform-level improvements that benefit future adopters.
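As a rough illustration of the sizing arithmetic that this kind of resource tuning can involve (this is not Etsy's method, and the numbers are made up), Little's law relates request rate and average latency to the number of requests in flight. Dividing that by a per-instance concurrency limit gives a lower bound on the number of instances. The function name and parameters are hypothetical.

```python
import math


def min_instances(requests_per_second, avg_latency_seconds, concurrency_per_instance):
    """Lower bound on instances needed, via Little's law (L = lambda * W)."""
    if concurrency_per_instance <= 0:
        raise ValueError("concurrency_per_instance must be positive")
    # Average number of requests being processed at any moment.
    in_flight = requests_per_second * avg_latency_seconds
    # Each instance handles at most `concurrency_per_instance` at once;
    # at least one instance is needed to serve any traffic.
    return max(1, math.ceil(in_flight / concurrency_per_instance))


# Example with assumed values: 400 req/s at 250 ms average latency,
# 20 concurrent requests per instance.
print(min_instances(400, 0.25, 20))
```

An estimate like this ignores traffic bursts and cold starts, so it is a starting point for tuning, not a replacement for measurement.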

ESP's design prioritized flexibility to accommodate our diverse technology landscape. While the team possessed expertise in various technologies, creating a one-size-fits-all platform supporting multiple service and client languages across diverse use cases was challenging. We decided to initially focus on a core feature set and add incremental capabilities and workarounds based on user feedback.

As ESP matured, several lessons shaped both its day-to-day operations and its future evolution:

  • Sandbox Feature: A "sandbox" environment accelerated iteration, enabling developers to launch development versions of new services on Cloud Run in under five minutes, complete with CI/CD and observability.

  • Familiar Observability Tools: ESP integrated with our existing tools like PromQL and Grafana, streamlining workflows for engineers.

  • Security Considerations: While ESP favored TLS and layer 7 authentication using Google IAM, collaboration with the Google Serverless Networking team ensured secure connectivity with our legacy applications.

  • Supporting AI/ML Innovation: During a company-wide hackathon, ESP's adaptability shone as a service interfacing with Google's Vertex AI was rapidly deployed.

  • Real-World Success: The Ads Platform service expanded to three additional systems as client support in more languages rolled out. Cloud Run's auto-scaling easily handled the increased load.
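The IAM-based layer 7 authentication mentioned in the list above can be sketched from the caller's side. On Cloud Run, a calling service typically obtains a Google-signed ID token for the target service from the metadata server and sends it as a bearer token. This is a sketch under assumptions, not ESP's client code: the helper names are hypothetical and the service URL in the usage note is a placeholder. Only the metadata server endpoint and the `Metadata-Flavor` header come from Google's documented interface.

```python
import urllib.parse
import urllib.request

# Documented metadata server endpoint for fetching an identity token.
METADATA_IDENTITY_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/identity"
)


def identity_token_request(audience):
    """Build the metadata-server request for an ID token scoped to `audience`."""
    query = urllib.parse.urlencode({"audience": audience})
    return urllib.request.Request(
        f"{METADATA_IDENTITY_URL}?{query}",
        headers={"Metadata-Flavor": "Google"},
    )


def fetch_identity_token(audience, timeout=5.0):
    """Fetch the token; only works where the metadata server is reachable."""
    request = identity_token_request(audience)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def authorized_request(url, token):
    """Attach the ID token so the receiving service's IAM check can verify it."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
```

Inside a Cloud Run service, a caller would pass the target service's URL (for example a placeholder such as `https://example-service.a.run.app`) as the audience to `fetch_identity_token`, then send the result of `authorized_request` with `urllib.request.urlopen`.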

Conclusion and Future Outlook

ESP enables our engineers to be bold, fast, and safe, and adoption continues to grow steadily throughout the organization. Customer requests for workloads beyond the serverless model have spurred collaboration with Google and our internal GKE team. The goal is to extend ESP's tooling to support an expanding class of services while maintaining a consistently high level of operational and developer experience.

The journey to the pilot, the challenges overcome, and the future outlook highlight the dynamic and iterative nature of building our service platform. ESP stands as a testament to our ability to adapt, innovate, and empower Etsy’s engineering community to meet the ever-growing needs of our marketplace and business.
