At Halodoc, we continuously strive to optimize the performance and reduce the cost of our compute infrastructure. We had heard a lot about “Graviton2”, the new custom-built ARM64 based processors from AWS. Online reviews suggested that Graviton2 based instances offer 40% better performance at 20% lower cost compared to Intel/AMD based instances.
This blog explains how we migrated Kubernetes workloads from AMD64 nodes to Graviton2 nodes. The learnings from our migration journey can help others plan better and complete their own migration faster.
Our compute nodes were Intel/AMD based, and a significant number of them were used in our Kubernetes clusters (EKS). So, we targeted the AMD64 based Kubernetes workloads for migration to Graviton2 nodes first.
Within our Kubernetes cluster, we run various types of applications, namely Dropwizard, Node.js, Tomcat and Go applications. We chose to migrate the Dropwizard applications first, as the majority of our applications are Dropwizard based.
We initially thought that simply building ARM64 based images would be enough to migrate to Graviton2. We tried building Go applications for Graviton2, but the builds failed because compatible libraries were missing.
We tried building Dropwizard applications using “docker buildx”, but we were not able to run the resulting applications on Graviton2 because of an incompatible architecture.
We observed that auto scaling of Graviton2 nodes was not happening as per resource requirements.
We were not able to see container metrics with our existing NewRelic infra-agent for Kubernetes.
Challenges we faced at the start of the migration journey:
1. Based on our initial analysis, we thought that creating a multi-arch image with “docker buildx” would enable an easy migration with no modification to our CI/CD process (we use GitLab, Jenkins, Ansible, helm charts and Kubernetes for CI/CD).
In reality, we found that images built on the AMD64 architecture did not work on Graviton2 nodes.
We evaluated two options:
a. Using AWS CodePipeline to build Graviton2 based images and push them to ECR (Elastic Container Registry).
b. Creating a Graviton2 Jenkins slave.
We decided to go with the Graviton2 Jenkins slave option since it required minimal changes to our CI/CD process.
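To illustrate the architecture mismatch described above, here is a minimal sketch of a check that tells which CPU architectures an image actually provides. It assumes the JSON shape that `docker manifest inspect` prints for a multi-arch image (a manifest list); the sample data, digests and function names are hypothetical and not taken from our pipeline.

```python
import json

# Hypothetical, trimmed-down manifest list in the shape printed by
# `docker manifest inspect <image>` for a multi-arch image.
SAMPLE_MANIFEST_LIST = """
{
  "schemaVersion": 2,
  "manifests": [
    {"digest": "sha256:aaa", "platform": {"architecture": "amd64", "os": "linux"}},
    {"digest": "sha256:bbb", "platform": {"architecture": "arm64", "os": "linux"}}
  ]
}
"""


def image_architectures(manifest_json: str) -> set:
    """Return the CPU architectures listed in a manifest list.

    A plain single-architecture manifest has no "manifests" key,
    so the result is an empty set in that case.
    """
    doc = json.loads(manifest_json)
    return {
        entry["platform"]["architecture"]
        for entry in doc.get("manifests", [])
        if "platform" in entry
    }


def runs_on(manifest_json: str, node_arch: str) -> bool:
    """True if the manifest list contains an image for node_arch."""
    return node_arch in image_architectures(manifest_json)


if __name__ == "__main__":
    print(sorted(image_architectures(SAMPLE_MANIFEST_LIST)))
    print(runs_on(SAMPLE_MANIFEST_LIST, "arm64"))
```

A check like this could run in CI before a deployment, so that an AMD64-only image is rejected before it ever reaches a Graviton2 node.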
2. The second challenge was that we use the vault-k8s agent as an Init Container for managing secrets, and we could not find an ARM64 based image for the vault-k8s agent. Before the Vault integration, we used a shell script for this purpose, and we had no intention of going back to that approach.
We solved the issue by using Vault version 1.6.2, which is multi-arch, together with an annotation:
- Created a base image for the vault init agent for ARM64 from Vault v1.6.2.
- Used the vault agent proxy annotation in our helm charts.
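As an illustration only, the sketch below assembles pod annotations of the kind the Vault Agent Injector reads. The annotation keys `vault.hashicorp.com/agent-inject`, `vault.hashicorp.com/agent-image` and `vault.hashicorp.com/role` exist in vault-k8s, but treating `agent-image` as the annotation that selects the ARM64 agent image is an assumption, and the registry path and role name are hypothetical.

```python
import json


def vault_agent_annotations(agent_image: str, role: str) -> dict:
    """Build pod annotations for the Vault Agent Injector.

    agent_image: image used for the injected agent containers
                 (assumed here to be a multi-arch or ARM64 Vault image).
    role:        Vault Kubernetes auth role for the application.
    """
    return {
        "vault.hashicorp.com/agent-inject": "true",
        "vault.hashicorp.com/agent-image": agent_image,
        "vault.hashicorp.com/role": role,
    }


if __name__ == "__main__":
    # Hypothetical registry path and role name, for illustration only.
    annotations = vault_agent_annotations(
        agent_image="registry.example.com/base/vault:1.6.2",
        role="my-app",
    )
    print(json.dumps(annotations, indent=2))
```

In a helm chart, values like these would typically sit under the pod template's `metadata.annotations`.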
3. The third challenge was that we use NewRelic for our APM. Unfortunately, no NewRelic agent with ARM64 support was available. If we moved our workloads to Graviton2 without one, we would not get any APM metrics, which would impact our monitoring and alerting system.
Initially, we installed the newrelic-infra agent as a Linux process on the EKS worker nodes to provide container related metrics, but with no luck: it did not capture container metrics, and only node infra metrics were available.
Luckily, we got a beta image for ARM64, which solved our monitoring and alerting issues. We maintained two daemonsets, one for AMD64 nodes and one for ARM64 nodes.
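The two-daemonset setup can be sketched as follows. This is a minimal example that assumes the nodes carry the standard `kubernetes.io/arch` label; the daemonset names, container name and image references are hypothetical placeholders, and a real agent manifest would need more fields (host mounts, license key, and so on).

```python
import json

ARCH_LABEL = "kubernetes.io/arch"  # well-known Kubernetes node label


def infra_daemonset(arch: str, image: str) -> dict:
    """Return a minimal DaemonSet manifest pinned to one CPU architecture."""
    name = f"newrelic-infra-{arch}"  # hypothetical naming scheme
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Schedule only on nodes of the matching architecture.
                    "nodeSelector": {ARCH_LABEL: arch},
                    "containers": [{"name": "agent", "image": image}],
                },
            },
        },
    }


if __name__ == "__main__":
    # Placeholder image references, one per architecture.
    for arch, image in [
        ("amd64", "registry.example.com/newrelic-infra:amd64"),
        ("arm64", "registry.example.com/newrelic-infra:arm64-beta"),
    ]:
        print(json.dumps(infra_daemonset(arch, image), indent=2))
```

Because each daemonset selects a disjoint set of nodes, both can coexist in a mixed cluster during the migration.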
Our Kubernetes cluster was running version 1.16 before the migration, and we upgraded it to version 1.17. During the process, we also upgraded the ALB ingress controller, Metrics Server and Cluster Autoscaler to much newer versions to support multi-arch nodes. Otherwise, these component upgrades are not mandatory.
We tested the waters in stage first. Once validated, we migrated the workloads to production. We use self-managed node groups in our EKS cluster, so we created a new Auto Scaling Group for Graviton2 nodes with a different node label to facilitate easy migration.
1. Creating base images for ARM64.
Our Java apps are Dropwizard based, so we created ARM64 base images for them. An application pod contains these containers (Alpha is the app container)
2. Updating the helm charts.
We updated the images to their ARM64 versions and added the annotation and vault proxy to our helm charts.
We updated the node selector to force new deployments to be placed on Graviton2 nodes.
We used fluent-bit for log collection.
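The node selector update mentioned in step 2 can be sketched as a patch body. This is a minimal example, assuming the standard `kubernetes.io/arch` label is used for selection; a custom label on the Graviton2 Auto Scaling Group would work the same way with a different key and value. The function name is hypothetical.

```python
import json


def node_selector_patch(label_key: str, label_value: str) -> dict:
    """Build a strategic-merge patch that pins a Deployment's pods
    to nodes carrying the given label."""
    return {
        "spec": {
            "template": {
                "spec": {
                    "nodeSelector": {label_key: label_value},
                }
            }
        }
    }


if __name__ == "__main__":
    # Assumption: selection by architecture label.
    patch = node_selector_patch("kubernetes.io/arch", "arm64")
    # The JSON output can be passed to `kubectl patch deployment <name> --patch`.
    print(json.dumps(patch))
```

In a helm chart, the same key and value would normally live in the values file and be rendered into the pod template's `nodeSelector`.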
3. Restricting Graviton2 based Jenkins jobs to run only on the Graviton2 node.
4. Fixing test case failures. We observed test case failures in a few applications, mainly because compatible libraries were missing. We worked with the developers to find alternative libraries that fixed those issues.
Validation of Stage migration:
Our approach was to validate the applications in stage before migrating them to production, since we wanted to avoid any surprises in production. Our testers performed thorough testing of each application before providing sign-off. We withheld some application migrations due to test case failures (we use SonarQube and JaCoCo for test coverage), and worked with developers and testers to find alternative libraries to get through the quality gates.
During the application migrations, we found that CPU requirements at application startup were a little higher than on AMD64 based worker nodes. Where the app container kept restarting because of CPU starvation, we increased the CPU "limit" by around 10%. After startup, CPU utilisation reduces over time.
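The arithmetic behind such a bump is simple, but Kubernetes CPU quantities come in two forms ("500m" or "1.5"), so a small helper is useful. The following is a minimal sketch with hypothetical function names; it rounds up to the next whole millicore and always returns the millicore form.

```python
def parse_cpu_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as "500m" or "1.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def bump_cpu_limit(quantity: str, percent: int = 10) -> str:
    """Increase a CPU limit by `percent`, rounding up to a whole millicore."""
    millicores = parse_cpu_millicores(quantity)
    # Integer ceiling division avoids floating point rounding surprises.
    bumped = (millicores * (100 + percent) + 99) // 100
    return f"{bumped}m"


if __name__ == "__main__":
    for limit in ["500m", "1", "1.5"]:
        print(limit, "->", bump_cpu_limit(limit))
```

The 10% default mirrors the increase described above; the right value for a given service depends on its startup profile.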
We also observed deployment failures caused by running the wrong Jenkins jobs and by ARM64 based applications being scheduled on AMD64 nodes.
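A pre-deployment guard can catch this class of failure early. The sketch below is one possible check, not the process we ran: it assumes the image architecture is known at deploy time and that pods are pinned with the standard `kubernetes.io/arch` node selector. The function name is hypothetical.

```python
ARCH_LABEL = "kubernetes.io/arch"


def placement_error(image_arch: str, pod_spec: dict):
    """Return a description of the mismatch between the image architecture
    and the pod's node selector, or None if they agree."""
    selected = pod_spec.get("nodeSelector", {}).get(ARCH_LABEL)
    if selected is None:
        return f"no {ARCH_LABEL} nodeSelector; pod may land on any architecture"
    if selected != image_arch:
        return f"image is {image_arch} but nodeSelector targets {selected}"
    return None


if __name__ == "__main__":
    good = {"nodeSelector": {ARCH_LABEL: "arm64"}}
    bad = {"nodeSelector": {ARCH_LABEL: "amd64"}}
    print(placement_error("arm64", good))
    print(placement_error("arm64", bad))
    print(placement_error("arm64", {}))
```

Failing the pipeline when this returns a message stops an ARM64 image from being scheduled onto an AMD64 node, where the container would fail to start.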
To avoid slowing down developer velocity, as soon as an application was migrated successfully in stage, we focused on migrating that application to production rather than moving on to other applications in stage.
We have seen good performance improvements after moving to Graviton2. Here are a few examples:
1. On AMD64 worker nodes:
On Graviton2 worker nodes:
Observation: The performance of the service improved when it was moved to Graviton2. The average web transaction time decreased by 46%.
2. On AMD64 worker nodes:
On Graviton2 worker nodes:
Observation: The graph shows a performance improvement after moving to Graviton2. The /v2/coupons API response time improved by 48%, and the /v1/coupons API response time improved by 56%.
3. On AMD64 worker nodes:
On Graviton2 worker nodes:
Observation: The average web transaction time decreased by 48%.
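For reference, percentages like these are relative reductions in response time. The sketch below shows the calculation; the input values are made-up round numbers for illustration and are not measurements from our services.

```python
def percent_reduction(before_ms: float, after_ms: float) -> float:
    """Relative reduction of a timing, as a percentage of the original value."""
    if before_ms <= 0:
        raise ValueError("before_ms must be positive")
    return (before_ms - after_ms) / before_ms * 100


if __name__ == "__main__":
    # Hypothetical example: a transaction that took 200 ms now takes 100 ms.
    print(round(percent_reduction(200.0, 100.0), 1))
```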
In this article, we have explained how we migrated our workloads from AMD64 based nodes to Graviton2 based nodes, which gave us a better price-to-performance ratio.
We are always looking out for top engineering talent across all roles for our tech team. If challenging problems that drive big impact enthral you, do reach out to us at firstname.lastname@example.org
Halodoc is the number 1 all around Healthcare application in Indonesia. Our mission is to simplify and bring quality healthcare across Indonesia, from Sabang to Merauke. We connect 20,000+ doctors with patients in need through our Tele-consultation service. We partner with 3500+ pharmacies in 100+ cities to bring medicine to your doorstep. We've also partnered with Indonesia's largest lab provider to provide lab home services, and to top it off we have recently launched a premium appointment service that partners with 500+ hospitals and allows patients to book a doctor appointment inside our application. We are extremely fortunate to be trusted by our investors, such as the Bill & Melinda Gates Foundation, Singtel, UOB Ventures, Allianz, GoJek and many more. We recently closed our Series B round and in total have raised USD 100 million for our mission. Our team works tirelessly to make sure that we create the best healthcare solution personalised for all of our patients' needs, and we are continuously on a path to simplify healthcare for Indonesia.