Evolution of Batch Data Pipeline at Halodoc

data-enginnering Apr 17, 2020

Data is the lifeblood for every tech company, more so in the case of Halodoc where we handle sensitive healthcare data of millions of users. In order to serve our users and internal stakeholders effectively, one of our primary requirements was to create a robust data pipeline to ensure seamless movement of data across all of our services. In this blog post we will provide some insights on data pipeline evolution at Halodoc, the challenges we faced & some key take aways in modernising the data pipeline.

What is Batch Processing?
Batch processing is an efficient way of processing large amounts of datasets. The process involves collecting data for a certain window/interval, performing some transformation (like aggregation) and sinking resulting data to a destination.

Need for a batch pipeline:

  • Most of our data sources have transactional data. We needed a scheduled batch processing pipeline that migrates data to our data warehouse periodically.
  • For quicker decision-making, data generated from various business units like Teleconsultation, Pharmacy Delivery, Appointments and Insurance needed to be combined/transformed for analysis. In order to store all the relevant data in our data warehouse (or any single location), the batch pipeline was required.

Architecture of Early Batch Pipeline

The early data pipeline at Halodoc comprised of different types of data sources, data migration tools and the data warehouse as shown above. Data from various sources is loaded into the Amazon Redshift data warehouse using multiple migration tools.

The architecture involves:

  1. Extract - Data from various sources is extracted using migration tools like Pentaho, DMS, and Glue. The different data sources included transactional data sources (Amazon RDS & DynamoDB), ad-hoc flat files (in csv and xlsx format), third party analytic tools (AppsFlyer, Google Analytics, Mixpanel etc.).
  2. Load - The second step involved loading the data to S3. S3 can store data in CSV or JSON formats. The Redshift copy command is used to unload data from S3 to Redshift. These jobs which perform an extraction to unloading data to redshift are known as replication jobs. A replication job is responsible for the migration of data to a data warehouse without any transformation. We replicate all our RDS data sources to Redshift as per our business requirements.
  3. Transform - Once the data is available on Redshift, we deploy Pentaho jobs to perform complex joins & transformations and store the results back to Redshift. These are nothing but materialised views (combined and aggregated data suited to better cater to business needs) which help the Business Intelligence team execute their queries more efficiently in tools such as Looker and Metabase.

The purpose of having multiple migration tools early on was:

  • Pentaho: An open-source tool for batch processing of data.
  • AWS DMS: Used for populating data in real-time to Redshift (i.e CDC - ChangeData Capture). DMS Documentation
  • AWS Glue: Glue Jobs are written when Redshift Spectrum has to be used to query the data stored in S3. This is useful when we need to query huge amounts of data (a rare occurrence). Glue Documentation

What made us modernise our batch pipeline

Pentaho is an open-source tool that we used initially to set up our data pipelines. With the exponential increase in the volume of data being processed, we saw a sharp increase in our execution times. Upon further investigation, some of the limitations we uncovered were:

  • Pentaho was installed only in one EC2 instance without a high availability cluster. So, there was a chance that a single point of failure might occur and that scalability of the cluster would be hard to achieve.
  • Due to its drag and drop features to create a job, it was not possible to create a CI/CD for Pentaho jobs.
  • Alerting was not set on a job level.
  • Only ELT jobs could be written using Redshift resources.

AWS DMS
DMS tasks were responsible for real-time data ingestion to Redshift. As Redshift is optimised for batch updates, we decided to separate the real-time pipeline. More on this can be found here - Velocity: Real-Time Data Pipeline at Halodoc

AWS Glue
Glue as a managed ETL tool was very expensive.

After discussing internally we decided to replace our migration tools with more modern tools that could support both ELT and ETL use cases.

So, from that point on, we began our research on the several migration tools available in the market. Our decision making parameters for the new tool were:

  • Support for a High Availability(HA) and High Scalability
  • Ease of creation of task/jobs
  • Monitoring and alerting at the job level
  • Data validation
  • Support for most of the data sources that Halodoc uses
  • Built-in scheduler
  • Cost implications - preferably open sourced
  • Stateful transformation (ETL/ELT)

After some exploration, we identified a couple of data migration tools that suited our use cases such as Apache Nifi, Apache Beam, Luigi, Apache Airflow and Apache Spark. We selected Apache Nifi and Apache Airflow for our PoC since they were able to cater to most of our use cases. We ran the PoCs on our staging environment for a couple of weeks.

Some of our key findings were:

It was a hard decision to select one of the tools based on the above comparison. After spending some more time using the tools, we finally decided to go ahead with Apache Airflow - a tool for defining, scheduling, executing and monitoring workflows. It operates on the concept of DAG. A DAG (Directed Acyclic Graph) is a collection of all the tasks you want to run, organised to reflect their relationships and dependencies.

Some of the interesting features that nudged our decision to choose Airflow were:

  • Airflow provides the features to create a custom operator and plugins which help templatize the DAGs to make it easy for us to create/deploy new DAGs.
  • More control over the job and can be tailored as per the need (Nifi/Pentaho as a drag and drop feature restricted us from modifying their features).
  • It can be used for both data migration and orchestration of other cron jobs.
  • It can be used for advanced data use cases such as deploying ML models.
  • SLA at job level can be defined and notifications of a miss could be sent to our slack/mail.
  • Easy to backfill the data if the job fails.
  • CI/CD can be created.

Architecture of Current Batch Pipeline

This is what our current architecture looks like. Multiple migration tools like Pentaho, DMS, Glue were replaced by a single tool ie. Apache Airflow. We have both ETL or ELT jobs deployed on airflow. How we achieved ETL/ELT is:

ETL: The use case of ETL was that our business teams needed data in near real-time i.e 15/20 mins interval. As data in Redshift was already delayed by 2-3 hours, we couldn’t fulfil the request with older pipelines. Also, the challenge was to join different schemas relying on separate RDS instances. So, we came up with an approach to perform transformation using Airflow memory and to load the data to Redshift via S3. The steps involved:

  • Data is extracted from various sources using an Airflow operator.
  • All data is converted into pandas data frames using the DbApiHook.
  • These data frames are joined, transformed and then loaded into Redshift via S3. When we used the pandas data frames, we noticed that the DAG execution time was only a few seconds long. These DAGs are deployed on a 10/15 mins interval achieving the business needs and requests within their expected SLA.
Pictorial Representation of ETL Job

ELT: ELT jobs are mostly written when the data is needed on a daily or weekly basis. Transformations involving vast amounts of data cannot be achieved using Airflow memory. These are written as ELT jobs where we utilise the Redshift resource for data processing. Once the data is processed we store it back to S3 to ensure that the unload to Redshift happens using the Redshift copy command instead of the insert command.

  • Data from various sources are loaded to Redshift via S3 periodically.
  • A complex SQL query with transformation logic is written using PostgresOperator and stored back to Redshift via S3.
Pictorial Representation of ELT Job

Challenges faced when migrating existing jobs to Airflow

Biggest challenge while migrating jobs from Pentaho to Airflow was Business Continuity. Our business was growing and already we had many data jobs that need to be maintained and enhanced. Also there were continuous requests for new jobs. We had to fulfil both the requirements: business needs and migration of existing jobs to Airflow.
Some of the key challenges were:

  • Migrating existing Pentaho and DMS tasks to Airflow.
  • The team was a new to Airflow so there was a learning curve when it came to utilising its features for our use cases.
  • Managing the Airflow cluster.
  • Making sure that any new requirement is addressed using Airflow only.
  • Creating custom operators and plugins.

Modernising Batch Data Pipeline - Key takeaways

Halodoc data pipeline evolution journey has been very exciting and enabled us to modernise our data pipeline for future data needs. There were many learnings along this entire journey and here are some of the key takeaways:

  1. Evaluation process for new tools - Helps us to choose the right tool meeting our specific needs.
  2. Balancing business continuity with tool migration activity - Talk to the stakeholders like BI, business etc. and communicate the need for new tools.
  3. Need for automating the pipelines - reusable templates for repetitive jobs, job deployments using CI/CD etc.    

Summary

In this blog post we provided a brief overview of data pipeline evaluation at Halodoc. We also provided some insights on the tool evaluation process and some of the key challenges/takeaways in this journey.

Scalability, reliability and maintainability are the three pillars that govern what we build at Halodoc Tech. We are actively looking for data engineers/architects and  if solving hard problems with challenging requirements is your forte, please reach out to us with your resumé at careers.india@halodoc.com.


About Halodoc

Halodoc is the number 1 all around Healthcare application in Indonesia. Our mission is to simplify and bring quality healthcare across Indonesia, from Sabang to Merauke.
We connect 20,000+ doctors with patients in need through our Tele-consultation service. We partner with 1500+ pharmacies in 50 cities to bring medicine to your doorstep. We've also partnered with Indonesia's largest lab provider to provide lab home services, and to top it off we have recently launched a premium appointment service that partners with 500+ hospitals that allows patients to book a doctor appointment inside our application.
We are extremely fortunate to be trusted by our investors, such as the Bill & Melinda Gates foundation, Singtel, UOB Ventures, Allianz, Gojek and many more. We recently closed our Series B round and In total have raised USD$100million for our mission.
Our team work tirelessly to make sure that we create the best healthcare solution personalised for all of our patient's needs, and are continuously on a path to simplify healthcare for Indonesia.