Lake House Architecture @ Halodoc: Data Platform 2.0

data-platform Jan 6, 2022

At Halodoc, we are always committed to simplifying healthcare services for our end-users, and as the company grows, we continuously build and deliver new features. What we built two years ago might not support the amount of data we manage today, so we decided to revamp our data platform architecture. In our previous blog, we talked about the challenges of the existing platform and why we needed to adopt a Lake House Architecture to support the business and give stakeholders easy access to data.

In this blog, we will go over our new architecture, the components involved, and the strategies we adopted to build a scalable Data Platform.

Let's first have a look at the high-level architecture of the new and revamped Data Platform 2.0.

We divided our architecture into 4 layers:

  1. Data Ingestion/Extraction Layer
    This layer is concerned with ingesting data into the raw zone, from where it can later be consumed and offloaded into the processed zone. Most clickstream capture tools offer built-in data ingestion services from their products, which makes it easy to onboard that data into the raw zone for further processing. For transactional data sources like MySQL and Postgres, we adopted a CDC-based approach for data extraction. Since our infra is mostly hosted on AWS, we chose the Database Migration Service (DMS) to perform CDC-based migration.
  2. Processing Layer
    Here, we do not perform any heavy transformation; rather, we convert raw data into HUDI datasets. Source data is ingested in different formats (CSV, JSON) and needs to be converted into a columnar format (e.g. Parquet) and stored in the Data Lake for efficient data processing. Data types are typecast for data lake compatibility, and timestamps are adjusted to the WIB time zone (see the sketch just after this list).
  3. Transformation Layer
    One of the big challenges in data engineering is processing a huge amount of data efficiently while keeping costs under control. We selected Apache Spark for processing since it supports distributed data processing and can scale from gigabytes to terabytes without much hassle. The transformation layer generates the data models in the data warehouse, which become the base for the reporting tables that serve dashboards and other reporting use cases.
  4. Reporting Layer
    The reporting layer mostly aggregates data from the dimension and fact tables and provides views on top of them for downstream users to consume. Most of the dashboards are built on top of these reporting tables and materialized views, which reduces the computation cost of repeatedly joining different tables for recurring reporting use cases.
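As an illustration of the processing layer, here is a minimal PySpark sketch of the raw-to-processed conversion. The paths, column names, and types are hypothetical placeholders and only stand in for the values that, in practice, come from our control table.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("raw-to-processed").getOrCreate()

# hypothetical raw-zone path and schema; real values are driven by the control table
raw_df = spark.read.json("s3://raw-zone-bucket/appointments/2022/01/06/")

processed_df = (
    raw_df
    # typecast for data lake compatibility
    .withColumn("appointment_id", F.col("appointment_id").cast("long"))
    # shift timestamps from UTC to WIB (Asia/Jakarta)
    .withColumn("created_at", F.from_utc_timestamp("created_at", "Asia/Jakarta"))
    .withColumn("created_date", F.to_date("created_at"))
)

# store as Parquet in the processed zone, partitioned by date
(processed_df.write
    .mode("append")
    .partitionBy("created_date")
    .parquet("s3://processed-zone-bucket/appointments/"))
```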

Once we had split our platform into different layers, the next challenge was to select components that could support most of our downstream use cases. When we look at data engineering tools/products in the market, we can easily find a huge selection. Instead of onboarding or licensing 3rd party tools, we decided to build in-house solutions leveraging the AWS cloud and open source projects.

Let's deep dive more into the components used in the above platform.

Components Involved:

  1. DMS
    DMS stands for Database Migration Service. It is an AWS service that helps perform CDC (Change Data Capture) on databases like MySQL, Postgres, and many more. We leverage DMS to read the binary logs from MySQL and store the raw data in S3. We have automated DMS resource creation with the help of a Flask server and boto3. We can easily onboard new tables into the raw zone using parameters configured in our control table.
  2. S3 - raw zone
    All CDC data captured by DMS is stored in S3 in the raw zone under the proper partitions. No data cleaning is performed in this layer. Data is appended as a new file whenever an insert or update happens in the source system. The raw zone is crucial for performing any back-filling of the datasets when needed.
    This zone also stores the data ingested from clickstream tools or any other data sources. The raw zone acts as a base layer for the processing zone to consume.
  3. EMR - HUDI + PySpark
    Apache HUDI is used to support UPSERT operations on data sitting in the Data Lake. We run PySpark jobs at scheduled intervals that read data from the raw zone, process it, and store it in the processed zone. The processed zone replicates the behaviour of the source system: only UPSERT operations happen here, and the data is converted into HUDI datasets (a minimal write sketch follows this list).
  4. S3 - processed zone
    The S3 processed zone is the Data Lake at Halodoc. We store both mutable and immutable datasets. HUDI is used to maintain the mutable datasets, while immutable datasets like CSV or JSON data are transformed into a columnar format (Parquet) and stored in this zone. This layer also maintains and corrects the partitions so that datasets can be queried efficiently.
  5. Glue Data Catalog
    The AWS Glue Data Catalog is used to register the tables, which are then available to query via Athena for ad hoc analysis.
  6. Athena
    Athena is a serverless query engine that supports querying data sitting in S3. Users leverage Athena for ad hoc analysis of datasets in the data lake.
  7. Redshift
    Redshift is used as the Data Warehouse to build the data models. All the reporting/BI use cases are served by Redshift. We created two layers in Redshift: one layer stores the data models for PD, CD, Appointments, Insurance, and Labs, containing facts and dimensions; on top of it, a reporting layer framework performs aggregations and joins to create the reporting tables that are accessible via BI tools. We also maintain materialized views in these layers, and we have implemented SCD type 1 and SCD type 2 in our data models to capture historic changes in our datasets.
  8. MWAA
    MWAA (Amazon Managed Workflows for Apache Airflow) is used to orchestrate the workflows.
  9. CloudWatch and EFK
    A combination of CloudWatch and EFK is integrated to build a centralized logging, monitoring, and alerting system.
  10. DynamoDB
    DynamoDB is used in the platform to store failed events to be published to the control table. A reprocessing framework processes the failed events and pushes them to the control table at a scheduled frequency.
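To make point 3 (EMR - HUDI + PySpark) above more concrete, here is a minimal sketch of a HUDI upsert from a PySpark job. The table name, record key, precombine field, and paths are hypothetical placeholders; the actual values are driven by our control table, and option names can vary slightly across HUDI versions.

```python
# minimal HUDI upsert sketch; assumes `spark` is a SparkSession with the HUDI bundle available
raw_df = spark.read.parquet("s3://raw-zone-bucket/appointments/")  # hypothetical CDC batch

hudi_options = {
    "hoodie.table.name": "appointments",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",      # or COPY_ON_WRITE for higher-latency tables
    "hoodie.datasource.write.recordkey.field": "appointment_id",
    "hoodie.datasource.write.precombine.field": "updated_at",   # latest record wins on upsert
    "hoodie.datasource.write.partitionpath.field": "created_date",
}

(raw_df.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://processed-zone-bucket/appointments/"))
```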

Why a CDC-based Approach?

At Halodoc, when we started our data engineering journey, we went with timestamp-based data migration, i.e. we relied on modified timestamps to migrate data from source to target. We served with this pipeline for almost 2 years. As the business grew, there was exponential growth in our datasets, which forced us to scale our migration instances to bigger clusters to support the high volume of data.

The problems were:

  • High cost due to the increased migration cluster size needed for the high volume of data generated at the source.
  • Data quality issues when modified-timestamp columns were not updated due to backend issues.
  • Schema changes were difficult to handle in the target.

With the CDC-based approach, we started reading transactional data by enabling binlog (binary logs) in MySQL and WAL (write-ahead log) in Postgres. Ingesting a new file for every event change would have been a costly operation, since it would result in a lot of S3 PUT operations. To balance the cost, we set DMS to read and pull the binlog every 60 seconds, so new files are written via DMS every minute. The CDC-based approach also solves the problem of high volume growth, since instead of hourly-interval data we now migrate at minute-level intervals.
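As a rough sketch of how the 60-second pull can be expressed when creating the S3 target endpoint with boto3 (the bucket name, role ARN, and identifiers below are placeholders, not our actual configuration):

```python
import boto3

dms = boto3.client("dms")

# hypothetical S3 target endpoint for the raw zone
dms.create_endpoint(
    EndpointIdentifier="raw-zone-s3-target",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-access",  # placeholder role
        "BucketName": "raw-zone-bucket",                                         # placeholder bucket
        "IncludeOpForFullLoad": True,   # keep the insert/update/delete operation column in CDC files
        "CdcMaxBatchInterval": 60,      # flush captured changes to S3 at most every 60 seconds
    },
)
```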

Using HUDI - key takeaways

HUDI provides built-in features to support open data lakes. While onboarding and integrating HUDI into our platform, we faced the challenges below and tried to solve them.

  1. Retaining the maximum commits in HUDI datasets
    HUDI cleans/removes older committed files based on the configs set. By default, the number of commits retained is 10, and this has to be set correctly based on one's workload. Since we were running most of our transactional table migrations at 5-minute intervals, we set hoodie.cleaner.commits.retained to 15, which gives ETL jobs 75 minutes to complete. Even compaction and clustering add commits, so one has to analyze and set the cleaner policy so that incremental queries can run without breaking.
  2. Identifying which tables to partition
    Having data partitioned in the data lake reduces the amount of data scanned and makes queries performant. At the same time, over-partitioning the lake reduces read query performance, since many files have to be merged to process the data. We chose daily partitions as the minimum granularity for our data lake and planned to archive historical data to other storage layers like Glacier or low-cost S3 storage classes.
  3. Choosing the right storage type
    HUDI currently supports 2 storage types, viz. MoR (Merge on Read) and CoW (Copy on Write). One has to choose the storage type carefully based on the use case and workload. We chose MoR for tables that need lower data-latency access and CoW for tables that can tolerate more than 2 hours of data latency.
  4. Different views of MoR datasets
    MoR supports _ro and _rt views, where _ro stands for read optimized view and _rt stands for real time view. Based on the use case, one has to identify which table to query. We selected the _ro view for our ETL workloads since the data latency in the Data Model was around 1 hour.
    Reports built on top of the data lake query the _rt tables to get the latest view of the datasets.
  5. Indexing in HUDI
    Indexing is very useful in HUDI for maintaining UPSERT operations and read query performance. There are global and non-global indexes. We went with the default bloom index and chose a static column for the index, i.e. a non-global index. We relied on the HUDI commit time for incremental data fetches, which also lets late-arriving data be processed into the data lake without any manual intervention (a sketch follows this list).
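Tying the cleaner setting and the commit-time based incremental fetch together, here is a hedged sketch of how an ETL job might pull only the data committed since its last run. The paths and the checkpointing of the last commit time are hypothetical, and option names can differ slightly between HUDI versions.

```python
# writer side: retain enough commits for downstream incremental readers
hudi_write_options = {
    "hoodie.cleaner.commits.retained": "15",   # ~75 minutes of 5-minute commits
}

# reader side: incremental query starting from the last processed commit time
# (assumes `spark` is an active SparkSession; the checkpoint value is illustrative)
last_commit_time = "20220106093000"

incremental_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", last_commit_time)
    .load("s3://processed-zone-bucket/appointments/")
)
```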

Why Framework Driven?

Most of our previous implementation was pipeline-driven, meaning we manually built pipelines for each data source to serve the business use cases. In Platform 2.0, we made a slight change to our implementation model and adopted framework-driven pipelines. We started building a framework for each layer, such as a data ingestion framework, a data processing framework, and a reporting framework. Each framework is dedicated to performing certain tasks with predefined inputs. Adopting a framework-driven approach reduces redundant code and streamlines the onboarding process for new tables in the data lake.
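A simplified, hypothetical illustration of what "framework-driven with predefined inputs" means in practice; the config fields and function bodies below are placeholders, not our actual framework code:

```python
from dataclasses import dataclass

@dataclass
class TableConfig:
    source_db: str
    table_name: str
    record_key: str
    partition_key: str
    load_type: str  # e.g. "cdc" or "full-load"

def ingest(cfg: TableConfig) -> None:
    # in the real ingestion framework this would create/refresh the DMS resources
    print(f"ingesting {cfg.source_db}.{cfg.table_name} ({cfg.load_type})")

def process(cfg: TableConfig) -> None:
    # in the real processing framework this would run the PySpark/HUDI job
    print(f"upserting {cfg.table_name} keyed on {cfg.record_key}")

# onboarding a new table becomes a new config row, not a new hand-written pipeline
for cfg in [TableConfig("appointments_db", "appointments", "appointment_id", "created_date", "cdc")]:
    ingest(cfg)
    process(cfg)
```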

Benefit of having a Control Plane in Tabular format

In our platform, the Control Plane is a critical component that stores metadata and makes it easy to onboard new tables into the data lake and data warehouse. It stores the necessary configuration required to enable data migration. For building any product, metadata plays a vital role in automating and controlling the flow of pipelines. We had different options to choose from: YAML, DynamoDB, or an RDBMS.

We selected RDS for the below reasons:

  • Easy to perform analytics on top of the metadata, like the number of active pipelines.
  • Easy to onboard new tables or data models.
  • Easy to build an API layer with the help of Python Flask (see the sketch after this list).
  • Audits can be done easily.
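For instance, a minimal Flask endpoint over the control table could expose the active pipelines for auditing; the DSN, table, and column names below are hypothetical:

```python
from flask import Flask, jsonify
from sqlalchemy import create_engine, text

app = Flask(__name__)
engine = create_engine("mysql+pymysql://user:password@control-db:3306/metadata")  # placeholder DSN

@app.route("/pipelines/active")
def active_pipelines():
    # hypothetical control table listing onboarded tables and their migration parameters
    query = text("SELECT table_name, source_db, target_path FROM pipeline_control WHERE is_active = 1")
    with engine.connect() as conn:
        rows = conn.execute(query).mappings().all()
    return jsonify([dict(r) for r in rows])

if __name__ == "__main__":
    app.run(port=5000)
```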

Data Security

Being in the healthcare domain, security has always been a top priority for our Data Platform. We host almost all of our infra in private subnets and have enabled Lake Formation to govern access to the Data Lake. We also use AWS encryption for data at rest. This provides secure storage for the data lake and the overall data platform.

Automation

Automation has helped a lot in reducing the engineering effort of building and maintaining the platform. In Platform 2.0, most of our pipelines are automated with Jenkins and APIs.

  • We have automated DMS resource creation by deploying a Flask server and using boto3 to create the resources (see the sketch after this list).
  • Almost all of our infra/resources are created via Terraform. SRE played an important role in setting up most of our Data Platform infra.
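A minimal sketch of the kind of boto3 call this automation performs, driven by rows in the control table; all ARNs, identifiers, and table mappings below are placeholders:

```python
import json
import boto3

dms = boto3.client("dms")

# hypothetical table-mapping rule generated from a control-table row
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-appointments",
        "object-locator": {"schema-name": "appointments_db", "table-name": "appointments"},
        "rule-action": "include",
    }]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="appointments-cdc-task",
    SourceEndpointArn="arn:aws:dms:region:123456789012:endpoint:source-placeholder",
    TargetEndpointArn="arn:aws:dms:region:123456789012:endpoint:raw-zone-placeholder",
    ReplicationInstanceArn="arn:aws:dms:region:123456789012:rep:instance-placeholder",
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps(table_mappings),
)
```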

Logging, Monitoring and Alerting

Though our infra is robust, fault tolerant, and highly scalable, unexpected errors can sometimes bring parts of the infra down. To identify and resolve these issues, we have enabled monitoring and alerting for every component involved in our data platform using CloudWatch and the EFK (Elasticsearch, Fluent Bit, and Kibana) stack.
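As one hedged example of the CloudWatch side, an alarm on DMS CDC source latency could be created like this; the metric choice, thresholds, task identifier, and SNS topic are illustrative only:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="dms-cdc-source-latency-high",             # hypothetical alarm name
    Namespace="AWS/DMS",
    MetricName="CDCLatencySource",
    Dimensions=[{"Name": "ReplicationTaskIdentifier", "Value": "appointments-cdc-task"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=600,                                        # seconds; illustrative threshold
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:region:123456789012:data-platform-alerts"],  # placeholder topic
)
```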

Workflow Orchestration

Any data platform requires scheduling capabilities to run batch data pipelines. Since we had already been using Airflow for workflow orchestration in our previous platform, we went ahead with the same orchestration tool, now on MWAA. MWAA has already helped a lot in reducing maintenance effort and cost. We explained what we evaluated in MWAA in our previous blog.
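A minimal Airflow DAG sketch for one of these batch pipelines; the DAG id, schedule, and callable are hypothetical and only illustrate how the frameworks above are wired into MWAA:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_hudi_upsert(**_context):
    # placeholder: in practice this submits the PySpark/HUDI job (e.g. as an EMR step)
    pass

with DAG(
    dag_id="appointments_processed_zone",   # hypothetical DAG name
    schedule_interval="*/5 * * * *",        # aligns with the ~5 minute table migrations above
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    PythonOperator(task_id="hudi_upsert", python_callable=run_hudi_upsert)
```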

Summary

In this blog, we looked at the Lake House architecture, all the components involved in building Platform 2.0, and our key takeaways from using HUDI as a Data Lake. Now that we have built the foundational pieces of Data Platform 2.0, we plan to concentrate on the following aspects of the platform next:

  1. Data Quality -> To maintain data checks and data consistency across the data stores.
  2. Data Lineage -> To provide end-to-end visibility into data transformations.
  3. Self-Service Platform for the BI team -> To reduce the dependency on the DE team for onboarding reporting tables.
  4. Handling late-arriving dimensions -> To maintain consistency across our data models and process late-arriving dimension keys from the lake to the warehouse.

Join us

Scalability, reliability and maintainability are the three pillars that govern what we build at Halodoc Tech. We are actively looking for engineers at all levels and if solving hard problems with challenging requirements is your forte, please reach out to us with your resumé at careers.india@halodoc.com.


About Halodoc

Halodoc is the number 1 all around Healthcare application in Indonesia. Our mission is to simplify and bring quality healthcare across Indonesia, from Sabang to Merauke. We connect 20,000+ doctors with patients in need through our Tele-consultation service. We partner with 3500+ pharmacies in 100+ cities to bring medicine to your doorstep. We've also partnered with Indonesia's largest lab provider to provide lab home services, and to top it off we have recently launched a premium appointment service that partners with 500+ hospitals that allows patients to book a doctor appointment inside our application. We are extremely fortunate to be trusted by our investors, such as the Bill & Melinda Gates Foundation, Singtel, UOB Ventures, Allianz, GoJek, Astra, Temasek and many more. We recently closed our Series C round and in total have raised around USD 180 million for our mission. Our team works tirelessly to make sure that we create the best healthcare solution personalised for all of our patients' needs, and we are continuously on a path to simplify healthcare for Indonesia.

Jitendra Shah

Data Engineer by profession. Building data infra using open source tools and cloud services.