At Halodoc, we have been using Airflow as a scheduling tool since 2019. Airflow plays a key role in our data platform: most of our data consumption and orchestration is scheduled through it. We currently use Airflow to schedule over 350 DAGs and 2,500 tasks, and as the business grows, we continuously onboard new data sources and add new DAGs to the Airflow server.
Current Airflow Setup at Halodoc
The Airflow cluster and its components (WebServer, Scheduler and Worker) are hosted on EC2 instances and are managed collaboratively by Data Engineers and Site Reliability Engineers (SRE). The Airflow cluster architecture is described in our previous blog here, and how we leverage Airflow on our Data Platform is described here.
Challenges with the current setup (self-managed cluster)
- Cluster Maintenance
a) Version upgrade
The Airflow community is very active and releases new features and bug fixes in quick succession. Upgrading the Airflow version is a cumbersome process: every upgrade consumes significant engineering effort, from upgrading and testing the new version to checking all the library dependencies.
b) Security

Hosting Airflow on EC2 instances needs critical attention in terms of security, from routing the networks to opening ports for the applications. Sometimes we even need to patch Python versions that have security vulnerabilities.
c) Deploying or adding nodes in the cluster
When we need to add or remove a node, we have to manually provision a new instance and attach it to the existing cluster. This creates a dependency on SRE whenever the cluster needs to scale.
d) Configuration management
With a self-managed EC2 cluster, any configuration change has to be applied to every EC2 instance. This becomes cumbersome as the number of nodes increases. If a config update is missed or wrongly applied on any node, the cluster or the DAGs might behave abnormally and be difficult to debug.
e) Cost

The current self-managed cluster setup is static: the nodes run all the time. This has had a huge cost implication as the cluster size grew, since we pay for the infra even when our worker nodes sit idle. Furthermore, the worker nodes are not auto-scalable.
Motivation to evaluate Amazon MWAA
Knowing our challenges and pain points, we started looking for an alternative solution that could address at least some of them.
With the release of MWAA by AWS in November 2020, we got an opportunity to explore and evaluate whether the managed service could cater to most of our needs and solve some of the above challenges. A few reasons why we selected MWAA for our Proof of Concept (PoC):
- Managed service provided by AWS
Halodoc's infra is mostly hosted on the AWS cloud, so leveraging MWAA as a service is an easy and obvious choice for us.
- Autoscaling for worker nodes available in MWAA
We use the Celery executor in our self-managed Airflow cluster, and enabling auto scaling for the Celery worker nodes there is very hard. MWAA, however, provides out-of-the-box auto scaling for Celery executors, which was a compelling factor for us to evaluate MWAA from a cost-saving perspective.
- Easy configuration setup
Airflow configuration in MWAA is easy to manage and update. One can simply update the config from the MWAA console and see the changes reflected in the environment.
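As a rough sketch of what this looks like programmatically (the environment name and option values below are hypothetical, not our production settings), the same configuration update can also be driven through boto3's `mwaa` client:

```python
# Sketch: updating Airflow configuration options on an MWAA environment
# via boto3. Environment name and option values are hypothetical.

def build_config_options(overrides):
    """MWAA expects Airflow config overrides as a flat dict of
    "section.key" -> string value, so coerce every value to str."""
    return {key: str(value) for key, value in overrides.items()}

def update_airflow_config(env_name, overrides):
    import boto3  # imported lazily; the call requires AWS credentials
    mwaa = boto3.client("mwaa")
    # update_environment triggers a rolling update of the environment,
    # so the change takes a while to be reflected.
    return mwaa.update_environment(
        Name=env_name,
        AirflowConfigurationOptions=build_config_options(overrides),
    )
```

For example, `update_airflow_config("my-mwaa-env", {"core.parallelism": 64})` would roll out a parallelism change without touching any individual node, in contrast to the per-instance edits our EC2 setup required.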
- Easy interaction with other AWS services
With MWAA, it is very easy to integrate with other services like EMR when services are within the same VPC.
- Cluster set-up
Setting up the MWAA environment was an easy process; with just a few settings one can spin up an MWAA cluster.
- Airflow Version
MWAA comes with version 1.10.11 by default, and new versions will become available with future upgrades. Though Airflow 2.0 has already been announced, there is still no ETA from AWS for its availability in MWAA.
- DAG/Plugins Deployment
MWAA reads the DAGs and custom plugins from S3; all DAG files must be uploaded to S3.
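A DAG deployment step therefore reduces to syncing local files into that bucket. As a minimal sketch (bucket and prefix names are hypothetical, and in practice a CI/CD job or `aws s3 sync` can do the same):

```python
# Sketch: pushing local DAG .py files to the S3 bucket MWAA reads from.
# Bucket and prefix names are hypothetical.
import os

def dag_key(path, local_dir, prefix="dags"):
    """Compute the S3 key for a local DAG file, preserving the
    directory layout under the dags/ prefix."""
    return f"{prefix}/{os.path.relpath(path, local_dir)}"

def upload_dags(local_dir, bucket, prefix="dags"):
    import boto3  # imported lazily; the call requires AWS credentials
    s3 = boto3.client("s3")
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            if name.endswith(".py"):
                path = os.path.join(root, name)
                s3.upload_file(path, bucket, dag_key(path, local_dir, prefix))
```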
- Cost

Cost was the most important aspect of our MWAA evaluation. Though Airflow supports scaling the worker nodes up and down, we had to evaluate the cost of the whole environment (as per the AWS pricing for MWAA in the Singapore region: https://aws.amazon.com/managed-workflows-for-apache-airflow/pricing).
Though MWAA environment costs are high, for those running thousands of tasks in a prod environment, MWAA would be cheaper than on-premise infra. The cost also depends on the kind of workload running on the on-premise cluster: if the DAGs run on a daily or weekly basis, one can save a huge amount with MWAA. Since we have a varied set of DAGs scheduled at different intervals (hourly, daily, weekly and monthly), we expect to achieve only minimal cost savings for now. At the same time, however, we expect to maximise our cost savings as our number of DAGs increases in the future.
- Monitoring & Alerting
MWAA is integrated with CloudWatch for metrics and error monitoring, and publishes various environment- and DAG-level metrics to CloudWatch. With the help of CloudWatch alarms, we were able to define rules and send notifications to SNS whenever a value crosses an alarm threshold. In addition, we have AWS Lambda functions that consume the events from SNS and send alerts to a Slack channel.
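A sketch of the Lambda function in that pipeline might look like the following: it consumes the CloudWatch alarm payload delivered via SNS and posts a short message to a Slack incoming webhook (the webhook URL is a placeholder, and the exact message format is an illustrative assumption):

```python
# Sketch: Lambda handler forwarding CloudWatch alarm notifications
# from SNS to Slack. The webhook URL is a placeholder.
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX"  # placeholder

def format_alarm_message(sns_message):
    """CloudWatch delivers the alarm payload as a JSON string inside
    the SNS record; condense it into one Slack-friendly line."""
    alarm = json.loads(sns_message)
    return (f"{alarm['AlarmName']} is {alarm['NewStateValue']}: "
            f"{alarm.get('NewStateReason', '')}")

def handler(event, context):
    # An SNS-triggered Lambda receives one or more records per event.
    for record in event["Records"]:
        text = format_alarm_message(record["Sns"]["Message"])
        req = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=json.dumps({"text": text}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
```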
MWAA sends a lot of metrics to CloudWatch. A few metrics we consider important enough to alert on in real time are:
a) Queued tasks
An alert is sent to Slack whenever tasks queue up in SQS and the executor is not able to execute them.
b) Scheduler Heartbeat
The scheduler is the most important component of Airflow; if it is not healthy, none of the tasks will get executed. We have configured alerts to fire whenever the scheduler is unhealthy.
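As a sketch of how such an alarm can be defined with boto3 (the environment name, SNS topic ARN and thresholds below are hypothetical; the `AmazonMWAA` namespace and `SchedulerHeartbeat` metric follow the MWAA CloudWatch metrics documentation, but should be verified against your environment):

```python
# Sketch: a CloudWatch alarm on the MWAA SchedulerHeartbeat metric.
# Environment name, topic ARN and thresholds are hypothetical.

def heartbeat_alarm_params(env_name, sns_topic_arn):
    """Alarm parameters: fire when the scheduler reports no heartbeats
    for two consecutive 5-minute periods."""
    return {
        "AlarmName": f"{env_name}-scheduler-heartbeat",
        "Namespace": "AmazonMWAA",
        "MetricName": "SchedulerHeartbeat",
        "Dimensions": [
            {"Name": "Function", "Value": "Scheduler"},
            {"Name": "Environment", "Value": env_name},
        ],
        "Statistic": "Sum",
        "Period": 300,
        "EvaluationPeriods": 2,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        # No data from the scheduler is itself an unhealthy signal.
        "TreatMissingData": "breaching",
        "AlarmActions": [sns_topic_arn],
    }

def create_heartbeat_alarm(env_name, sns_topic_arn):
    import boto3  # imported lazily; the call requires AWS credentials
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**heartbeat_alarm_params(env_name, sns_topic_arn))
```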
Challenges or gaps with MWAA
Though AWS provides good support for Airflow via MWAA, it currently lacks some functionality and comes with some challenges:
- Error visibility
Currently, all DAGs are read from S3. Whenever a new DAG is deployed, it is very hard to identify errors; one has to rely on CloudWatch logs for error visibility. MWAA currently lacks an API-style response for errors in the DAGs.
- Long deployment times for custom plugins and libraries

All custom plugins need to be zipped and uploaded to an S3 location to be read by Airflow. On every plugin deployment, the MWAA environment needs to be updated, which can often take 20-30 minutes. This increases deployment time and slows down DAG validation.

All the libraries needed by Airflow DAGs or custom plugins must be listed in the requirements.txt file and uploaded to S3. Every time a single library is added or updated, the MWAA environment has to be updated as well.
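A plugin/requirements redeployment can be sketched as the two steps below: upload the new artifacts, then pin the environment to the new S3 object versions (the bucket must have versioning enabled; environment and bucket names are hypothetical):

```python
# Sketch: redeploying plugins.zip and requirements.txt to MWAA and
# pinning the environment to the new S3 object versions.
# Environment and bucket names are hypothetical.

def build_update_kwargs(env_name, plugins_version, requirements_version):
    """Arguments for mwaa.update_environment that pin the environment
    to specific S3 object versions of the deployed artifacts."""
    return {
        "Name": env_name,
        "PluginsS3Path": "plugins.zip",
        "PluginsS3ObjectVersion": plugins_version,
        "RequirementsS3Path": "requirements.txt",
        "RequirementsS3ObjectVersion": requirements_version,
    }

def deploy(env_name, bucket):
    import boto3  # imported lazily; the calls require AWS credentials
    s3 = boto3.client("s3")
    versions = {}
    for name in ("plugins.zip", "requirements.txt"):
        with open(name, "rb") as f:
            resp = s3.put_object(Bucket=bucket, Key=name, Body=f)
        versions[name] = resp["VersionId"]  # requires bucket versioning
    mwaa = boto3.client("mwaa")
    # Triggers an environment update, which in our experience can take
    # 20-30 minutes to complete.
    mwaa.update_environment(**build_update_kwargs(
        env_name, versions["plugins.zip"], versions["requirements.txt"]))
```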
- Limited support for Airflow CLI commands

MWAA still doesn't support most of the Airflow CLI commands. Even airflow list_dags, one of the most useful commands for identifying errors in DAGs, is not supported by MWAA.
- No access to metadata

MWAA doesn't provide direct access to its metadata database. Though we don't need this immediately, if the DAGs ever have to be migrated to on-premise or other cloud services, it would be a very cumbersome process, i.e. one would have to manually clone the DAGs to the other service.
- CloudWatch logging only

Currently, MWAA only supports logging via CloudWatch; one cannot configure S3 or Elasticsearch logging.
In this blog post, we shared some learnings from our evaluation of MWAA and walked through the challenges with our self-managed Airflow cluster. Along the way, we also called out our evaluation criteria and identified a number of challenges and gaps in MWAA in its current form.
Since MWAA is a new service from AWS, it still needs improvements in upcoming versions. Despite the missing functionality, we have migrated some of our DAGs to MWAA and are running them in parallel, with the aim of migrating all DAGs after validation and proper testing. We hope AWS will address these challenges in their release of version 2.0, which will enable us to leverage new Airflow 2.0 features such as the HA scheduler and the REST API, among others.
Scalability, reliability and maintainability are the three pillars that govern what we build at Halodoc Tech. We are actively looking for engineers at all levels and if solving hard problems with challenging requirements is your forte, please reach out to us with your resumé at email@example.com.
Halodoc is the number 1 all-around healthcare application in Indonesia. Our mission is to simplify and bring quality healthcare across Indonesia, from Sabang to Merauke.
We connect 20,000+ doctors with patients in need through our Tele-consultation service. We partner with 2500+ pharmacies in 100+ cities to bring medicine to your doorstep. We've also partnered with Indonesia's largest lab provider to provide lab home services, and to top it off we have recently launched a premium appointment service that partners with 500+ hospitals that allows patients to book a doctor appointment inside our application.
We are extremely fortunate to be trusted by our investors, such as the Bill & Melinda Gates Foundation, Singtel, UOB Ventures, Allianz, Gojek and many more. We recently closed our Series B round and in total have raised USD $100 million for our mission.
Our team works tirelessly to make sure that we create the best healthcare solution personalised for all of our patients' needs, and we are continuously on a path to simplify healthcare for Indonesia.