Data Platform 2.0 - Part I

data-platform Oct 5, 2021

Data platforms have revolutionized how companies store, analyse and use data — but to use them more efficiently, they need to be reliable, highly performant and transparent. Data plays an important role in making business decisions and evaluating the performance of a product or feature at Halodoc. As Data Engineers at the biggest online healthcare company in Indonesia, one of our major challenges is to democratize data across the organization. The Data Engineering (DE) team at Halodoc has been maintaining and processing a high volume and variety of data with the existing tools and services since its inception, but as the business has grown, our data volume has grown exponentially and requires more compute to process it.

As modern data platforms gather data from many disparate, disconnected and diverse systems, they are prone to data collection issues such as duplicate records and missed updates. To resolve these problems, we conducted a thorough evaluation of our data platform and realized that architectural debt accumulated over time caused most of the data irregularities. All major functions of our data platform (extraction, transformation and storage) had issues that led to these quality concerns.
Our existing data platform served us well for the last couple of years, but it was not scalable enough to meet all the growing business needs.

A look back at how we evolved:

In the old data platform, most of the data was migrated to Redshift at regular intervals from various data sources. Once the data was loaded into Redshift, ELT operations were performed to build DWH or data mart tables that serve various business use cases.
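For illustration, here is a minimal sketch of what one such ELT step looked like, assuming hypothetical connection details, schemas and table names; the transformation itself runs entirely on Redshift compute.

```python
# Minimal sketch of the old ELT flow: data has already been migrated into
# Redshift, and a transformation runs inside the cluster to build a mart table.
# Connection details, schemas and table/column names are hypothetical.
import psycopg2

REDSHIFT_DSN = "host=redshift.example.internal port=5439 dbname=analytics user=etl_user password=secret"

BUILD_MART_SQL = """
DROP TABLE IF EXISTS mart.daily_orders;
CREATE TABLE mart.daily_orders AS
SELECT order_date,
       COUNT(*)    AS total_orders,
       SUM(amount) AS gross_revenue
FROM   staging.orders          -- raw data previously migrated into Redshift
GROUP  BY order_date;
"""

def run_elt():
    # The transformation executes on Redshift compute, which is one reason
    # storage and compute scaling were coupled in the old platform.
    with psycopg2.connect(REDSHIFT_DSN) as conn:
        with conn.cursor() as cur:
            cur.execute(BUILD_MART_SQL)

if __name__ == "__main__":
    run_elt()
```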

The data platform was able to cater to most of our needs, but as the business, the volume of data and the number of data use cases grew, we started facing multiple challenges in serving business needs.

Let’s first list down these issues:

  1. Storage and Compute tightly coupled
    We mostly relied on an ELT-based approach, where the Redshift compute layer is heavily utilised for any data transformations. Our Redshift cluster comprises multiple dc2.large instances, where storage and compute are tightly coupled. Whenever we need to scale storage, we pay for compute as well. This raises both scaling and cost problems.
  2. High Data Latency
    The data latency in the current pipeline was more than 3-4 hours, since the data was first loaded into Redshift and ELT operations were then performed at several intervals. Since the pandemic, our business and product teams have wanted to analyse data at lower latency so that they can make key business decisions faster.
  3. No Data Governance
    There was no proper data governance implemented in the existing data platform. Groups were created in Redshift and users were assigned to each group based on their role. This provides some control over datasets, but not at a granular level such as column- or row-level access control.
  4. Lack of visibility into which dashboards are built on which datasets
    Since all the data mart/DWH tables were created for specific use cases as and when users requested them from the DE team, there were multiple tables holding duplicate data. Since we did not follow a data model (star or snowflake schema), it became very hard to maintain the relationships among tables in Redshift.
  5. Missing SCD management
    SCD stands for Slowly Changing Dimensions. SCD is very important when someone wants to know the historic value of a data point. In the current data mart, there was no proper SCD implementation. In our case, attributes like the price of a medicine or the category of a doctor are important to track over time.
  6. Data movement via Airflow memory
    At Halodoc, most of the data flow happens via Airflow. All the batch data processing jobs are scheduled on Airflow, where data movement happens through Airflow memory, which becomes a bottleneck as data volume increases. Since Airflow is not a distributed data processing framework, it is better suited for workflow management. Quite a few of the ETL jobs were written in Python to serve micro-batch pipelines at 15-minute intervals and were scheduled in Airflow (a sketch of such a job is shown after this list).
  7. Missing Data Catalog
    A data catalog is very important for any data platform, as it provides meta information about the data. A data catalog was missing in the existing platform for tables that were directly migrated to Redshift; it was created only for data stored in S3. This made it hard for our end users to retrieve information about the tables available in Redshift.
  8. No integrated Data Lineage
    Today, if someone is interested in knowing the source and the stages of transformation of a target data table, we don't have data lineage to show them. Data lineage is important for understanding the data flow and data transformations, and it makes it easier to debug the data if wrong information is generated at the target.
  9. Missing Framework-driven platform
    For every use case, we mostly built data pipelines end-to-end. Most of the code was repeated across multiple data pipelines. Software engineering principles were missing in our data engineering tasks. Because of this, it was hard to decouple the components of each layer and create an abstraction layer that automates the entire framework end-to-end.
  10. No automated Schema Evolution
    Schema evolution is very important when you deal with relational data. Changes in the source system need to be reflected in the target system without any pipeline failure. Today, we do this manually: we have set up a process where the DBA informs DE about schema changes, and DE takes responsibility for making the changes in the target system. We wanted an automated way of doing this, as the DBA team can miss informing the DE team.
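As mentioned in point 6 above, here is a minimal sketch of one such 15-minute micro-batch job. The DAG id, connection strings and table names are hypothetical, but the pattern of pulling the whole batch into the Airflow worker's memory with pandas before writing it out is what made Airflow memory a bottleneck.

```python
# Hedged sketch of a 15-minute micro-batch job scheduled on Airflow.
# The entire batch is materialised in the worker's memory with pandas,
# so throughput is bounded by a single worker rather than a distributed engine.
from datetime import datetime

import pandas as pd
from sqlalchemy import create_engine
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical connection strings for a source service database and Redshift.
SOURCE_URI = "mysql+pymysql://reader:secret@source-db.internal/appdb"
TARGET_URI = "postgresql+psycopg2://etl:secret@redshift.internal:5439/analytics"

def move_batch(**_):
    src = create_engine(SOURCE_URI)
    dst = create_engine(TARGET_URI)
    # The whole micro-batch is loaded into Airflow worker memory here.
    df = pd.read_sql(
        "SELECT * FROM orders WHERE updated_at >= NOW() - INTERVAL 15 MINUTE", src
    )
    # ...and written out row by row from that same worker.
    df.to_sql("orders_staging", dst, schema="staging", if_exists="append", index=False)

with DAG(
    dag_id="orders_micro_batch",
    start_date=datetime(2021, 1, 1),
    schedule_interval="*/15 * * * *",
    catchup=False,
) as dag:
    PythonOperator(task_id="move_orders_batch", python_callable=move_batch)
```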

With these limitations in our data platform, we realized that we had come to the end of the road with the first generation of our data platform. It was at this point that we decided to take a step back and think about what we needed from our data platform. We were not afraid to build a system from the ground up if we had to.

The Data Engineering team started evaluating and revamping the existing architecture into a new data platform that addresses or mitigates most of the above limitations. We came across the LakeHouse architecture, which plays a vital role in achieving scalability with a cost-efficient solution while also handling massive volumes of data. Hence, we started working on the LakeHouse architecture for building our revamped Data Platform 2.0.

Why we adopted a LakeHouse strategy

A LakeHouse approach is basically a combination of a data lake and a data warehouse, where you can seamlessly move data between the lake and the warehouse while security compliance governs who has access to which datasets.

At Halodoc, we wanted to build a scalable solution where we could independently scale the storage and compute as required. We listed down the following as core capabilities we wanted our data infrastructure to have:

  1. Decoupled storage and compute (highly scalable).
  2. Can store all types of data: structured, semi-structured and unstructured.
  3. Can act as a single source of truth for data across the organisation.
  4. Ability to store/query mutable as well as immutable data.
  5. Easy integration with distributed processing engines like Spark or Hive.

In the new architecture, we leveraged S3 as the data lake, since it can scale storage indefinitely. Since we planned to store mutable data in S3 as well, the next challenge was to keep that mutable S3 data updated. We evaluated a couple of frameworks, including Apache Iceberg, Delta Lake and Apache Hudi, that provide the capability to update mutable data. Since Apache Hudi comes bundled with EMR, it was easy for us to start building a data lake on top of Hudi.
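To make this concrete, below is a minimal, hedged sketch of how a Spark job on EMR might upsert a mutable micro-batch into a Hudi table on S3. The bucket paths, table name and key/partition columns are hypothetical; the hoodie.* settings are standard Hudi write options.

```python
# Hedged sketch: upserting mutable records into a Hudi table on S3 from Spark.
# Paths, table name and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-upsert-sketch")
    # Hudi requires the Kryo serializer; on EMR the Hudi jars are pre-installed.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Incoming micro-batch containing updates to existing records (mutable data).
updates_df = spark.read.parquet("s3://example-bucket/staging/orders/")

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
}

(
    updates_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")  # append mode is used for incremental upserts into an existing table
    .save("s3://example-bucket/lake/orders/")
)
```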

Why Apache HUDI?

  • Upsert operation on flat files.
  • Captures the history of updates through its commit timeline.
  • ACID properties.
  • Supports different storage types (CoW and MoR).
  • Supports various ways of querying data: read-optimized, snapshot and incremental queries (see the query sketch after this list).
  • Time travel over the datasets.
  • Pre-installed on EMR; zero effort in creating it and getting it up and running.
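As a rough illustration of the query modes listed above, the sketch below shows how they are typically selected through the Hudi Spark datasource options. The S3 path and the begin-instant timestamp are placeholders.

```python
# Hedged sketch of Hudi query modes via the Spark datasource.
# The table path and commit instant are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-query-sketch").getOrCreate()
table_path = "s3://example-bucket/lake/orders/"

# Snapshot query: the latest view of the table (default query type).
snapshot_df = spark.read.format("hudi").load(table_path)

# Read-optimized query: reads only compacted base files (relevant for MoR tables).
read_optimized_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "read_optimized")
    .load(table_path)
)

# Incremental query: only records changed after a given commit instant,
# which enables incremental downstream pipelines.
incremental_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20211001000000")
    .load(table_path)
)

snapshot_df.show(5)
```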

Challenges in setting up the platform

  • Most of the components used in the new architecture were new to the team, hence there was a learning curve to get hands-on and productionise the system.
  • Building the centralised logging, monitoring and alerting system.
  • Supporting regular business use cases in parallel with revamping the architecture.

Summary

In this blog, we walked you through some of the challenges and limitations we faced in our existing data platform, as well as some of its missing functionality. In upcoming blogs, we will discuss the LakeHouse architecture in more detail, how we are using Apache Hudi, and some of the challenges we faced while releasing this new platform.
As we move forward, we keep adding new features to our platform to make it a more robust and reliable data platform.

Join us

Scalability, reliability and maintainability are the three pillars that govern what we build at Halodoc Tech. We are actively looking for engineers at all levels, and if solving hard problems with challenging requirements is your forte, please reach out to us with your resumé at careers.india@halodoc.com.


About Halodoc

Halodoc is the number one all-around healthcare application in Indonesia. Our mission is to simplify and bring quality healthcare across Indonesia, from Sabang to Merauke. We connect 20,000+ doctors with patients in need through our Tele-consultation service. We partner with 3500+ pharmacies in 100+ cities to bring medicine to your doorstep. We've also partnered with Indonesia's largest lab provider to provide lab home services, and to top it off we have recently launched a premium appointment service that partners with 500+ hospitals, allowing patients to book a doctor appointment inside our application. We are extremely fortunate to be trusted by our investors, such as the Bill & Melinda Gates Foundation, Singtel, UOB Ventures, Allianz, GoJek, Astra, Temasek and many more. We recently closed our Series C round and in total have raised around USD 180 million for our mission. Our team works tirelessly to make sure that we create the best healthcare solution personalised for all of our patients' needs, and we are continuously on a path to simplify healthcare for Indonesia.

Jitendra Shah

Along with Joinal Ahmed

Data Engineer by profession. Building data infra using open source tools and cloud services.