Saving AWS Expenses by Optimizing Halodoc Infrastructure

aws Nov 27, 2020

This blog details how Halodoc reduced its AWS bill by thousands of dollars, roughly 20% of the monthly bill, over multiple quarters, thereby saving a significant amount on an annualized basis. We provide an overview of this cost-optimization journey based on our own experience handling AWS costs and share some strategies that worked well for us. One of the best ways to start reducing AWS costs is to get access to the billing console and look at the monthly invoice. Review each line item, starting with the services that cost the most.
Based on our review, we identified the following services for cost optimization:

Elastic Compute Cloud
Relational Database Service
Transfer Family
CloudFront Discounted Pricing

Below we discuss in detail the cost-saving strategies we used for each of these services.

Elastic Compute Cloud

Migrate EC2 workloads to new-generation families: New-generation EC2 instance families provide better performance and cost savings compared to older-generation families. Here is the cost comparison of workloads across generations.

Note: We have considered only large instances of the three instance families most used at Halodoc. Suppose you run 50 instances of each family: just by migrating them to the newer generation, you can save around (50 * 8.20 + 50 * 3.66 + 50 * 24.30) = $1808 per month.
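
The arithmetic in the note can be checked with a few lines of Python. This is only a sketch: the three per-instance savings are the figures quoted above, while the family labels are placeholders, because the note does not name the families.

```python
# Per-instance monthly savings (USD) quoted in the note above.
# The keys are placeholder labels, not real instance family names.
MONTHLY_SAVING_PER_INSTANCE = {
    "family_a": 8.20,
    "family_b": 3.66,
    "family_c": 24.30,
}


def fleet_monthly_saving(instances_per_family: int) -> float:
    """Total monthly saving when every family runs the same number of instances."""
    total = sum(
        saving * instances_per_family
        for saving in MONTHLY_SAVING_PER_INSTANCE.values()
    )
    # Round to cents so floating-point noise does not leak into the result.
    return round(total, 2)


print(fleet_monthly_saving(50))  # 1808.0
```

Changing the argument shows how the saving scales with fleet size, which helps when deciding whether a migration is worth the effort for a smaller fleet.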

Utilizing Spot instances: To save more, we run 100% Spot instances in our Stage and Preprod environments. For high availability and to handle the frequent termination of Spot instances, we use a Spot fleet (multiple instance types in the same ASG), which provisions highly available resources from the fleet, together with the EC2 spot termination handler, which automatically drains a terminating node and recreates its resources on other nodes. Using Spot instances on Stage and Preprod reduced our cost by 10% per month.
Terminate all stopped EC2: Terminate all stopped instances. Merely stopping an instance does not avoid the cost of its attached EBS volumes and Elastic IP.
Clean-up of old AMIs: Regularly clean up old AMIs from your AWS account. For the initial clean-up, we sorted the AMIs by creation date, picked those older than a year, and removed them by running a shell script. To automate AMI cleanup going forward, we wrote a Lambda function in Python that enforces a retention period of at most three days for all production workloads. For the most critical servers we take hourly backups, but the retention period is the same.
Right-sized EC2 instances: For right-sizing, we captured 15 days of resource-utilization data, recommended the right size for each server based on that data, and downgraded the servers as per the recommendations.
Stop noncritical EC2 instances: Stop the instances that are not required to run or serve 24*7. Every product-based IT company has testers doing performance testing, which requires giant EC2 instances with high configurations. To avoid keeping these giant servers available when nobody is using them, we wrote an Ansible script that triggers an alert and stops those instances after office hours. Sometimes, depending on workload, testers need to run tests after office hours, so for those scenarios we have given them a Jenkins job to start and stop the instances as required.
Unused load balancers: We cleaned up all unused load balancers. For this, we wrote a shell script to list the LBs that do not have any listeners attached. We then analyzed the data, discussed it with the team, and proceeded to delete the unused load balancers.
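
Two of the clean-up steps above come down to selecting resources by a simple rule: AMIs older than a retention window, and load balancers with no listeners. The sketch below illustrates only that selection logic in Python. It is not the actual Lambda function or shell script; the function names and sample data are hypothetical, and the input shapes are assumed to mirror what the EC2 `describe_images` call returns (a `CreationDate` string per image) and a name-to-listeners mapping you would build from the load balancer APIs.

```python
from datetime import datetime, timedelta, timezone


def stale_amis(images, retention_days, now=None):
    """Return the ImageIds whose CreationDate is older than the retention window.

    `images` is assumed to look like describe_images()["Images"], where
    CreationDate is a string such as "2020-11-27T08:15:00.000Z".
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    stale = []
    for image in images:
        created = datetime.strptime(image["CreationDate"], "%Y-%m-%dT%H:%M:%S.%fZ")
        created = created.replace(tzinfo=timezone.utc)
        if created < cutoff:
            stale.append(image["ImageId"])
    return stale


def load_balancers_without_listeners(listeners_by_lb):
    """Return the names of load balancers that have no listeners attached."""
    return sorted(name for name, listeners in listeners_by_lb.items() if not listeners)


# Hypothetical sample data, for illustration only.
images = [
    {"ImageId": "ami-old", "CreationDate": "2020-11-20T00:00:00.000Z"},
    {"ImageId": "ami-new", "CreationDate": "2020-11-26T12:00:00.000Z"},
]
now = datetime(2020, 11, 27, tzinfo=timezone.utc)
print(stale_amis(images, retention_days=3, now=now))  # ['ami-old']

listeners_by_lb = {"lb-a": [{"Port": 443}], "lb-b": []}
print(load_balancers_without_listeners(listeners_by_lb))  # ['lb-b']
```

Keeping the selection separate from the deletion makes it easy to review the candidate list with the team first, as described above, before anything is actually deregistered or deleted.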

Transfer Family

Let's discuss the billing structure of AWS Transfer Family Service. With the AWS Transfer Family, you not only pay for the protocols you have enabled for access to your endpoint but also for the amount of data transferred over each of the protocols. You select the protocols, identity provider, and endpoint configuration to enable transfers over the chosen protocols. You are billed on an hourly basis for each of the protocols enabled to access your endpoint, until the time you delete it. You are also billed based on the amount of data (Gigabytes) uploaded and downloaded over each of the protocols.

A simple strategy to save the cost of multiple SFTP endpoints is to have one common SFTP endpoint with multiple domain aliases for different partners.

Here is the architecture before implementing the cost optimization strategy:

Here is the architecture after implementing the cost optimization strategy. We migrated the different SFTP endpoints into a common SFTP endpoint, which resulted in significant cost savings: more than a couple of thousand dollars per month.
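
To see why consolidation helps, the hourly part of the bill can be modelled in a few lines. This is a rough sketch under stated assumptions: the hourly rate and endpoint count below are illustrative placeholders, not Halodoc's figures or a confirmed AWS price, and a 30-day month is assumed. The per-gigabyte data charge is left out because the same data is transferred before and after consolidation.

```python
HOURS_PER_MONTH = 24 * 30  # simplifying assumption: 30-day month


def endpoint_hours_cost(endpoints, hourly_rate_usd, protocols_per_endpoint=1):
    """Monthly cost of keeping protocol-enabled endpoints running."""
    return endpoints * protocols_per_endpoint * hourly_rate_usd * HOURS_PER_MONTH


def consolidation_saving(endpoints_before, hourly_rate_usd, protocols_per_endpoint=1):
    """Monthly saving from replacing many endpoints with one shared endpoint."""
    before = endpoint_hours_cost(endpoints_before, hourly_rate_usd, protocols_per_endpoint)
    after = endpoint_hours_cost(1, hourly_rate_usd, protocols_per_endpoint)
    return round(before - after, 2)


# Hypothetical inputs: 10 partner endpoints at an assumed 0.30 USD per hour.
print(consolidation_saving(endpoints_before=10, hourly_rate_usd=0.30))  # 1944.0
```

Because the hourly charge is per enabled protocol per endpoint, the saving grows linearly with the number of endpoints that are folded into the shared one.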

Relational Database Service

We stopped non-prod RDS instances that were not required to run 24*7. To achieve this, we wrote an automated Python script that stops these RDS instances based on their tags, and scheduled jobs to shut the instances down outside business hours. We also enabled our dev teams to start these instances on demand.
Some RDS instances with multiple replicas were also running in Multi-AZ mode, which was not required because their replicas were already spread across different zones.
We right-sized the RDS instances based on the last 6 months of utilization data and query load.
We deleted older snapshots that were created long ago and were no longer required. To do so, we tagged snapshots according to whether they were still needed and wrote an automated script to delete snapshots based on those tags.
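
The tag-based shutdown described above can be sketched as a small selection function. This is an illustration rather than the actual script: the tag key `auto-stop`, the office hours, and the weekend rule are all assumptions, and the input is assumed to mirror the `DBInstances` entries returned by the RDS `describe_db_instances` call.

```python
from datetime import datetime, time

BUSINESS_START = time(9, 0)   # assumed office hours, not from the article
BUSINESS_END = time(19, 0)


def is_outside_business_hours(now):
    """True on weekends or outside the assumed weekday office hours."""
    if now.weekday() >= 5:  # Saturday or Sunday
        return True
    return not (BUSINESS_START <= now.time() < BUSINESS_END)


def instances_to_stop(instances, now, tag_key="auto-stop", tag_value="true"):
    """Pick running instances that carry the opt-in tag, but only after hours.

    Each entry is expected to have DBInstanceIdentifier, DBInstanceStatus
    and an optional TagList of {"Key": ..., "Value": ...} dicts.
    """
    if not is_outside_business_hours(now):
        return []
    selected = []
    for db in instances:
        tags = {t["Key"]: t["Value"] for t in db.get("TagList", [])}
        if db["DBInstanceStatus"] == "available" and tags.get(tag_key) == tag_value:
            selected.append(db["DBInstanceIdentifier"])
    return selected


# Hypothetical inventory, for illustration only.
inventory = [
    {"DBInstanceIdentifier": "dev-db", "DBInstanceStatus": "available",
     "TagList": [{"Key": "auto-stop", "Value": "true"}]},
    {"DBInstanceIdentifier": "prod-db", "DBInstanceStatus": "available",
     "TagList": []},
]
print(instances_to_stop(inventory, datetime(2020, 11, 27, 22, 0)))  # ['dev-db']
print(instances_to_stop(inventory, datetime(2020, 11, 27, 11, 0)))  # []
```

In a real job, each returned identifier would be passed to the RDS stop call. Since RDS automatically restarts an instance that has been stopped for seven consecutive days, a job like this needs to run on a recurring schedule rather than once.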

CloudFront Discounted Pricing

Go for the CloudFront discounted pricing plan, which can significantly reduce CloudFront costs. To enable the discounted pricing plan, you need to contact AWS and commit to a certain minimum traffic per month (based on usage). This is similar to the cost-saving model of Reserved EC2 Instances.

Conclusion

Maintaining a balance between performance and price is one of the biggest challenges in planning, building and maintaining infrastructure. With the above-mentioned strategies, we reduced our AWS bill by 20%.

Join us

We are always looking out for top engineering talent across all roles for our tech team. If challenging problems that drive big impact enthral you, do reach out to us at careers.india@halodoc.com

About Halodoc

Halodoc is the number 1 all-around healthcare application in Indonesia. Our mission is to simplify and bring quality healthcare across Indonesia, from Sabang to Merauke. We connect 20,000+ doctors with patients in need through our Tele-consultation service. We partner with 3500+ pharmacies in 100+ cities to bring medicine to your doorstep. We've also partnered with Indonesia's largest lab provider to provide lab home services, and to top it off we have recently launched a premium appointment service that partners with 500+ hospitals and allows patients to book a doctor appointment inside our application. We are extremely fortunate to be trusted by our investors, such as the Bill & Melinda Gates Foundation, Singtel, UOB Ventures, Allianz, GoJek and many more. We recently closed our Series B round and in total have raised USD 100 million for our mission. Our team works tirelessly to make sure that we create the best healthcare solution personalised for all of our patients' needs, and we are continuously on a path to simplify healthcare for Indonesia.

Kailash Singh Adhikari
