Apollo24|7: Migrating a complex microservices application to Google Cloud with zero downtime


Apollo 24|7 is a digital arm of Apollo Hospitals, a healthcare company headquartered in Chennai, India. It is one of India’s largest multidisciplinary healthcare systems that provides services such as online doctor consultations, home diagnostic tests, and online pharmacy, with a goal of making healthcare accessible and affordable to everyone.

In this blog post, we explain how we migrated a critical application that is used 24x7 across the country, including 97 services and 40+ SQL databases, to Google Cloud with zero downtime.

Accelerating growth with improved performance, security, and reliability

As a company that was launched right before the pandemic, Apollo 24|7 leadership faced a new set of growth challenges when COVID-19 hit. To keep up with demand, we needed blazing fast development across functional and non-functional areas. Apollo 24|7 has a mix of VMs and container-based deployments. The application includes various SQL and NoSQL databases, Redis, VMs, Kubernetes, load balancers, WAF, and integration with various third-party public endpoints.

To continue growing post-pandemic, our engineering leadership saw an opportunity to take another look at our platform with a view to improving performance, security, and reliability. We wanted to introduce modern automation and Infrastructure as Code (IaC) to improve robustness and scalability. There was also an increased focus on overall cost efficiency, to help provide affordable healthcare to the masses. With millions of active users on our application, we needed to ensure there was no downtime so that the user experience would not be impacted.

To ensure a smooth implementation of our requirements, we worked closely with Searce, an award-winning Google Cloud partner. Collaboration among the Apollo 24|7, Searce, and Google Cloud teams on development, testing, deployment, and monitoring throughout the project contributed greatly to the success of the migration.

Searce worked alongside our teams as an extension, providing valuable expertise and support throughout the project.

A full-stack migration with zero downtime

The goal was to migrate the entire application to Google Cloud without any downtime. This included quality assurance (QA), pre-production and production environments.

Here’s a step-by-step breakdown of what we did:

  • Started with a DORA survey, followed by a CAST assessment
  • Created migration waves, dependency graphs, and remediated vulnerable libraries
  • Established an enterprise-grade landing zone on Google Cloud and enabled security controls for posture management
  • Deployed VMs, GKE cluster, Cloud SQL, and Redis instances
  • Created CI/CD pipelines for all the services
  • Deployed the services and all their dependencies on Google Cloud
  • Enabled ISV services via Google Cloud Marketplace
  • Set up change data capture (CDC) from the source databases (DBs) to Cloud SQL
  • Split traffic for each service between the old location and Google Cloud before doing the final cutover to Google Cloud
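To illustrate how migration waves can be derived from a dependency graph, here is a minimal Python sketch. It is not the tooling used in this migration. The service names are hypothetical, and it assumes one possible ordering rule: a service moves only after everything it depends on has moved.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def plan_waves(dependencies):
    """Group services into migration waves.

    `dependencies` maps a service name to the set of services it depends on.
    Every service in a wave depends only on services from earlier waves.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        # Everything whose dependencies have already been marked done.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical services, for illustration only.
example = {
    "orders-api": {"orders-db", "cache"},
    "payments-api": {"orders-api"},
    "orders-db": set(),
    "cache": set(),
}

for number, wave in enumerate(plan_waves(example), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

A circular dependency surfaces as an exception from `prepare()`, which is a useful signal that two services may need to move together in the same wave.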

The QA environment was directly deployed on Google Cloud after creating CI/CD pipelines, Terraform scripts and various other necessary resources.

As we moved on to the next environments, we used a pre-production environment that was a replica of the production environment but with lower capacity, since it carries less traffic. We used it to simulate a production migration, which helped us create the right standard operating procedures (SOPs) for each service in the application. We also mapped the dependencies of each service in detail, split traffic between the old location and Google Cloud, and tested the functionality end-to-end by sending 1-10% of traffic to Google Cloud, depending on the traffic volume of each service. We set up CDC for databases between the source and the target on Google Cloud, and completed the cutover by shifting all of the traffic to Google Cloud.
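The idea behind sending a small percentage of traffic to the new environment can be sketched in a few lines of Python. This is only an illustration of percentage-based splitting; the function name and the hashing scheme are assumptions, not the mechanism used in this migration. In practice the split would typically be configured at a load balancer, gateway, or DNS layer.

```python
import hashlib


def route(request_id: str, gcp_percent: float) -> str:
    """Return "gcp" for roughly gcp_percent% of request IDs, else "source".

    Hashing the ID keeps the decision stable: the same ID always lands on
    the same side for a given percentage.
    """
    if not 0 <= gcp_percent <= 100:
        raise ValueError("gcp_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return "gcp" if bucket < gcp_percent * 100 else "source"


# Hypothetical ramp: start at 1%, then 10%, then full cutover at 100%.
# Setting the percentage back to 0 sends everything to the old location.
for percent in (1, 10, 100):
    hits = sum(route(f"req-{i}", percent) == "gcp" for i in range(10_000))
    print(f"{percent}% target -> {hits} of 10000 requests routed to Google Cloud")
```

Because the percentage is a single value, rolling back is the same operation as ramping up, just in the other direction.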

We learned many lessons from the pre-production environment and applied them during the production migration. Even so, there were some challenges on the day of the migration due to Identity and Access Management (IAM) issues in a few services that were integrated with third-party services such as payments. During the partial deployment of the services on GKE, pods of the services were talking to DBs at the source location over the internet and/or a VPN tunnel. As a result, bandwidth and NAT gateway port allocation needed to be monitored very closely to avoid packet drops. The benefit of using traffic splitting was that we could divert all traffic away from Google Cloud if any problems occurred. This helped us achieve a truly zero-downtime migration of the application.
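To show why NAT port allocation deserves attention in this kind of setup, here is a rough capacity calculation in Python. It assumes Cloud NAT with a fixed number of ports per node, and it uses the documented figure of 64,512 usable source ports per NAT IP address. The node count and ports-per-node values are hypothetical.

```python
# Assumption: Cloud NAT, where each NAT IP address offers 64,512 usable
# source ports per protocol (65,536 minus the first 1,024).
USABLE_PORTS_PER_NAT_IP = 64_512


def nat_ips_needed(node_count: int, ports_per_node: int) -> int:
    """Smallest number of NAT IPs that covers every node's port allocation."""
    total_ports = node_count * ports_per_node
    return -(-total_ports // USABLE_PORTS_PER_NAT_IP)  # integer ceiling


def max_connections_to_one_endpoint(ports_per_node: int) -> int:
    """Upper bound on concurrent connections from one node to a single
    destination IP, port, and protocol: one source port per connection."""
    return ports_per_node


# Hypothetical cluster: 50 nodes, 4,096 ports reserved per node.
print(nat_ips_needed(node_count=50, ports_per_node=4096))
print(max_connections_to_one_endpoint(ports_per_node=4096))
```

The second function matters most when many pods on one node connect to the same database endpoint, because every such connection consumes a port from that node's allocation.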


The key highlights of this migration are listed below:

  • Migrated 97 services spanning three GKE clusters with both gateway and ingress controllers
  • Deployed 22 Redis instances, Cloud Functions, Pub/Sub, Kafka
  • Moved static content to Cloud CDN
  • Implemented Cloud Armor for WAF and DDoS protection, along with Security Command Center Premium
  • Selected Marketplace services, including Aiven Managed Kafka and MongoDB
  • Achieved zero downtime using production traffic splitting between the old location and Google Cloud
  • Migrated 40+ database instances (MySQL, PostgreSQL) to Cloud SQL using CDC
  • Transferred 2+ TB of objects to Google Cloud

Improved agility and security, while reducing operational costs

At Apollo 24|7, we have successfully migrated our applications to Google Cloud, which has brought a number of benefits, such as improved performance and reduced latency when accessing GKE and Cloud SQL. Moving away from monolithic code let us improve our architecture, and we reduced costs by adopting committed use discounts (CUDs) and leveraging Google Cloud’s per-second billing, on-demand resources, and custom-sized VMs.
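The cost reasoning behind combining committed use discounts with on-demand capacity can be sketched as simple arithmetic. Every number below (vCPU counts, hourly rate, discount, peak hours) is hypothetical and chosen only to make the calculation concrete. None of them are Apollo 24|7 figures or published Google Cloud prices.

```python
HOURS_IN_MONTH = 730  # common approximation for a billing month


def blended_monthly_cost(baseline_vcpus, peak_vcpus, peak_hours,
                         on_demand_rate, cud_discount):
    """Commit to the steady baseline, pay on demand only for peak capacity."""
    committed = baseline_vcpus * HOURS_IN_MONTH * on_demand_rate * (1 - cud_discount)
    burst_vcpus = max(peak_vcpus - baseline_vcpus, 0)
    burst = burst_vcpus * peak_hours * on_demand_rate
    return committed + burst


def always_on_cost(peak_vcpus, on_demand_rate):
    """Provision for peak around the clock at the on-demand rate."""
    return peak_vcpus * HOURS_IN_MONTH * on_demand_rate


# Hypothetical inputs: 100 vCPUs steady, 160 at peak for 200 hours a month,
# 0.03 currency units per vCPU-hour, 30% discount on the committed part.
blended = blended_monthly_cost(100, 160, 200, 0.03, 0.30)
static = always_on_cost(160, 0.03)
print(f"blended: {blended:.2f}  always-on: {static:.2f}")
```

The gap between the two figures comes from two sources: the discount on the committed baseline, and not paying for peak capacity during off-peak hours.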

Adopting Infrastructure as Code (IaC) using Terraform helped us deploy services efficiently with far less room for error, improving agility and performance. The deployment also improved security, as it identified gaps in existing IAM policies. This project helped us conduct a thorough cleanup based on Google’s security principles.
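As an example of the kind of IAM gap such a review can surface, here is a small Python sketch that scans a policy for two common problems: broad basic roles and public members. It assumes the policy is a dictionary in the standard IAM policy JSON shape (a list of `bindings`, each with a `role` and `members`). The rules and the sample principals are illustrative assumptions, not the checks performed in this project.

```python
# Basic roles that grant wide permissions across a whole project.
BROAD_ROLES = {"roles/owner", "roles/editor"}
# Special identifiers that make a resource accessible beyond the organization.
PUBLIC_MEMBERS = {"allUsers", "allAuthenticatedUsers"}


def find_iam_gaps(policy):
    """Return (role, member, reason) tuples for bindings worth reviewing."""
    findings = []
    for binding in policy.get("bindings", []):
        role = binding.get("role", "")
        for member in binding.get("members", []):
            if member in PUBLIC_MEMBERS:
                findings.append((role, member, "publicly accessible"))
            elif role in BROAD_ROLES:
                findings.append((role, member, "basic role is too broad"))
    return findings


# Hypothetical policy, for illustration only.
sample_policy = {
    "bindings": [
        {"role": "roles/editor", "members": ["user:dev@example.com"]},
        {"role": "roles/storage.objectViewer", "members": ["allUsers"]},
        {"role": "roles/cloudsql.client",
         "members": ["serviceAccount:app@example.iam.gserviceaccount.com"]},
    ]
}

for role, member, reason in find_iam_gaps(sample_policy):
    print(f"{member} has {role}: {reason}")
```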

Way forward

Apollo 24|7 is also expanding the partnership on AI-powered solutions beyond the Clinical Decision Support System (CDSS) built in collaboration with Google Cloud Consulting AI teams. Currently, we are introducing Med-PaLM 2 generative AI models to power Ask Apollo and beyond.

Google and Apollo 24|7 are committed to continuing this transformative partnership for healthcare in India. Google Cloud has proven its value in Apollo 24|7's core applications, and we are now looking to expand usage to other areas such as marketing, supply chain, and customer service.
