Velocity: Real-Time Data Pipeline at Halodoc

data-engineering Feb 4, 2020

The Halodoc backend consists of multiple microservices that interact with each other via REST APIs. Each microservice owns its data and relies on those APIs to retrieve data from other microservices. Business domains like Pharmacy Delivery, Tele Consultations, Appointments, and Insurance have their own microservices. But from Halodoc's operations point of view, a single transaction by a user might span multiple business domains.

Since querying data from multiple sources would slow down our operations, we set up our data platform to copy the data from individual databases to AWS Redshift via DMS jobs. The DMS jobs replicate any changed data to Redshift in real-time. The data in Redshift is used by the operations team for various purposes like BI reporting (through PowerBI/Looker), real-time monitoring of business metrics (Metabase dashboards) and ad-hoc queries (Metabase Questions). But as the business grew, our backend data sources increased rapidly. This started impacting the performance of the Redshift cluster and substantially increased query response times.

When the data team analysed the root cause, we found that Redshift, like most data warehouse systems, is optimised for batch updates, especially when read traffic is low. Hence, instead of DMS jobs (which copy the data in real-time), the data team planned to use ETL jobs that copy the data to Redshift in batches separated by multiple hours. The non-availability of real-time data in Redshift was not a concern for reporting, but it was a blocker for the real-time dashboards. This drove the need for a separate real-time data pipeline, and Velocity was born.

Architecture

Data from individual microservices is currently published to SNS (a lightweight eventing system between microservices). AWS Lambda functions enrich this event data (by calling the REST APIs of the microservices) and push the enriched events into Kafka topics. Custom jobs deployed on the Flink cluster read the events from Kafka, perform aggregations and store the results in Elasticsearch indices. Kibana displays the data through dashboards.
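To make the enrichment step concrete, here is a minimal sketch of a Lambda handler that reads an SNS record, enriches it by calling a microservice REST API and publishes the result to Kafka. The handler class, topic name, endpoint URL and broker address are illustrative assumptions, not Halodoc's actual implementation.

```java
// Hypothetical sketch of the enrichment step: an AWS Lambda handler that reads an SNS
// event, enriches it via a microservice REST API, and publishes it to Kafka.
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.SNSEvent;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Properties;

public class EnrichAndPublishHandler implements RequestHandler<SNSEvent, Void> {

    private static final HttpClient HTTP = HttpClient.newHttpClient();
    private static final KafkaProducer<String, String> PRODUCER = buildProducer();

    @Override
    public Void handleRequest(SNSEvent event, Context context) {
        for (SNSEvent.SNSRecord record : event.getRecords()) {
            String rawEvent = record.getSNS().getMessage();
            String enriched = enrich(rawEvent);
            // Topic per business domain is an assumption for this sketch.
            PRODUCER.send(new ProducerRecord<>("velocity.pharmacy.orders", enriched));
        }
        PRODUCER.flush();
        return null;
    }

    // Calls a (hypothetical) microservice REST API to attach extra attributes to the event.
    private String enrich(String rawEvent) {
        try {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://internal-api.example.com/orders/enrich"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(rawEvent))
                    .build();
            return HTTP.send(request, HttpResponse.BodyHandlers.ofString()).body();
        } catch (Exception e) {
            throw new RuntimeException("Enrichment call failed", e);
        }
    }

    private static KafkaProducer<String, String> buildProducer() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        return new KafkaProducer<>(props);
    }
}
```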

Velocity comprises the following components:

  1. Apache Kafka – Kafka has become a common denominator in most open-source stream processors as the de-facto Storage Layer for storing and moving potentially large volumes of data in a streaming fashion with low latency.
  2. Apache Flink – Open-source platform that provides data distribution, communication, state management and fault tolerance for distributed computations over data streams.
  3. Elasticsearch – Open source data store primarily optimised for search, but of late has become very popular as a Serving Layer Store for operational and business metrics.
  4. Kibana/Grafana – Open source visualisation frameworks that connect to the Elasticsearch data store and act as Serving Layers.

Kafka as an Event Store

We decided to choose Kafka as the event store in Velocity for the following reasons:

  • Kafka is a distributed publish-subscribe messaging system.
  • Kafka can handle high-velocity real-time data.
  • Kafka provides high availability and fault tolerance. It replicates data across brokers so that the data remains available even if a broker fails (see the topic-creation sketch below).
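
As a concrete illustration of the replication point above, a Velocity topic might be created with a replication factor greater than one. The broker address, topic name and partition/replica counts below are assumptions for illustration, not Halodoc's actual configuration.

```java
// Minimal sketch: creating a replicated Kafka topic so the event stream survives a
// broker failure. All names and counts here are illustrative assumptions.
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateVelocityTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions for parallelism, replication factor 3 so the topic tolerates
            // the loss of a broker.
            NewTopic topic = new NewTopic("velocity.pharmacy.orders", 6, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```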

Flink as Stream Processing Engine

Apache Flink was chosen over other stream processing frameworks like Apache Storm, Apache Spark Streaming and Kafka Streams for the following reasons (a minimal job sketch follows the list):

  • Flink is a true stream processing framework (it does not cut the stream into micro-batches) and processes events at consistently high throughput with low latency.
  • Flink’s kernel (core) is a streaming runtime which also provides distributed processing, fault tolerance, etc.
  • Built-in state management for intermediate stream processing data. It supports storing state data on multiple backends – Hadoop HDFS, AWS S3, Google Cloud Storage, etc.
  • Provides multiple deployment options – from single JVM to multi-node cluster, from native deployment to resource manager based deployments (Kubernetes, Mesos, YARN).
  • In-built integration support for popular sources (Kafka, AWS Kinesis, RabbitMQ) and sinks (Elasticsearch, Cassandra, Kafka), etc.
  • In-built support for different telemetry tools for monitoring & alerting – JMX, InfluxDB, Prometheus, etc.
  • Used for large-scale stream processing in companies like Gojek, Netflix, Alibaba, Lyft, etc.
Flink Dashboard
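
To show what such a job looks like, here is a minimal sketch of a Flink streaming job (using the Flink 1.9/1.10-era DataStream API) that reads enriched events from Kafka and counts orders per city in one-minute tumbling windows. The topic, consumer group, broker address and the assumption that each event is a JSON string with a "city" field are illustrative, not Halodoc's actual job.

```java
// Sketch of a Velocity-style Flink job: Kafka source -> per-city windowed count.
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.util.Properties;

public class OrdersPerCityJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka source: reads the enriched events published by the Lambda enrichment step.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka-broker:9092");
        props.setProperty("group.id", "velocity-orders-per-city");
        DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer<>("velocity.pharmacy.orders", new SimpleStringSchema(), props));

        // Jackson's ObjectMapper is Serializable, so it can be captured by the map function.
        ObjectMapper mapper = new ObjectMapper();

        // Count orders per city in one-minute tumbling processing-time windows.
        events
                .map(json -> {
                    String city = mapper.readTree(json).path("city").asText("unknown");
                    return Tuple2.of(city, 1L);
                })
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .keyBy(value -> value.f0)
                .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
                .sum(1)
                // In Velocity the aggregate would go to an Elasticsearch index (see the
                // sink sketch in the Elasticsearch section); print() stands in for it here.
                .print();

        env.execute("velocity-orders-per-city");
    }
}
```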

Elasticsearch as Serving Data Store

For storing time-series metrics, InfluxDB and Prometheus are among the popular choices. But since they did not scale well for our growing data and workload, Elasticsearch became an attractive alternative: Elasticsearch clusters can easily be scaled up or down when needed and are managed by our DevOps team. Further, Elasticsearch is supported as a first-class data store in visualisation tools like Kibana and Grafana. A sketch of how a Flink job could write its output to Elasticsearch follows the list of benefits below.

Some key benefits:

  • Elasticsearch is a near real-time search platform. There is only a slight delay between the time a document is indexed and when it becomes searchable.
  • High Availability and fault tolerance concepts can be easily implemented with Elasticsearch clusters.
  • Since Elasticsearch is already used for other use cases like medicine/doctor search and centralised logging, both developers and DevOps are familiar with it.
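
As referenced above, here is a sketch of how the windowed counts from the earlier Flink job could be written to Elasticsearch using Flink's elasticsearch7 connector (Flink 1.10-era API). The host, index name and document fields are assumptions for illustration.

```java
// Sketch of an Elasticsearch sink for the (city, count) aggregates; attach it with
// .addSink(OrdersPerCityEsSink.build()) in place of .print() in the job sketch above.
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink;
import org.apache.http.HttpHost;
import org.elasticsearch.client.Requests;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public final class OrdersPerCityEsSink {

    public static ElasticsearchSink<Tuple2<String, Long>> build() {
        ElasticsearchSinkFunction<Tuple2<String, Long>> indexFn =
                new ElasticsearchSinkFunction<Tuple2<String, Long>>() {
                    @Override
                    public void process(Tuple2<String, Long> element,
                                        RuntimeContext ctx,
                                        RequestIndexer indexer) {
                        // Each window result becomes one document in the metrics index.
                        Map<String, Object> doc = new HashMap<>();
                        doc.put("city", element.f0);
                        doc.put("order_count", element.f1);
                        doc.put("indexed_at", System.currentTimeMillis());
                        indexer.add(Requests.indexRequest()
                                .index("velocity-orders-per-city")
                                .source(doc));
                    }
                };

        ElasticsearchSink.Builder<Tuple2<String, Long>> builder = new ElasticsearchSink.Builder<>(
                Collections.singletonList(new HttpHost("es-node", 9200, "http")), indexFn);
        // Flush small batches quickly so the Kibana dashboards stay near real-time.
        builder.setBulkFlushMaxActions(100);
        return builder.build();
    }

    private OrdersPerCityEsSink() {}
}
```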

Kibana as Visualisation Tool

Before choosing Kibana as the visualisation tool for Velocity, we ran POCs on various other tools in the market. But with Elasticsearch as our choice of data store, Kibana is arguably the best option for data visualisation. Some of the tools we tried were Redash, Sisense, Knowi and even Looker, but all of them had limitations for our use cases. Some of the key features that made Kibana the better choice for us:

  • It supports most of the visualisations needed by our operations teams
  • Kibana has very granular time filter options and the refresh rate is fast
  • Supports RBAC via OAuth or the built-in X-Pack plugin
  • Support for Elasticsearch as a data source
  • Ease of use: Kibana UI is user-friendly and can be learned in a couple of days

Velocity Monitoring & Alerting

At Halodoc, we use Prometheus to collect real-time monitoring metrics from Flink and Elasticsearch, and Grafana to visualise them. We have defined alert rules for Flink and Elasticsearch that send alerts over Slack so we can react in time.

Flink Alert Rules:

  • When the Flink job manager is down
  • When a Flink task manager is down
  • When any Flink job fails

Elasticsearch Alert Rules:

  • When the Elasticsearch cluster is down
  • When any node in the cluster (master or data) fails
  • When heap usage stays high for 15 minutes
  • When disk usage reaches 80%

Challenges

  • Many of the technologies/frameworks used in Velocity, like Kafka, Flink and ZooKeeper, were new to the tech team, so there was an initial learning curve.
  • We did not know the right sizing of the Flink cluster upfront and had to monitor it continuously.
  • We had to migrate Elasticsearch from the AWS managed service to a self-hosted cluster on EC2, as the RBAC feature was not available in the managed service.

Impact after Velocity was Delivered

  • Ops teams now rely on Velocity dashboards for real-time use cases and can make decisions based on real-time data (e.g. proactively adding more doctors of a certain specialisation when demand increases during the day).
  • With all real-time use cases migrated to Velocity, our ad-hoc reporting tool, Metabase, now has near-zero downtime.

Velocity - Next Steps

  • Automating Flink job deployments using CI/CD.
  • Moving the Flink cluster to Kubernetes.
  • Building more dashboards as per business needs.

Scalability, reliability and maintainability are the three pillars that govern what we build at Halodoc Tech. We are actively looking for data engineers/architects; if solving hard problems with challenging requirements is your forte, please reach out to us with your resumé at careers.india@halodoc.com.

About Halodoc

Halodoc is the number one all-around healthcare application in Indonesia. Our mission is to simplify and bring quality healthcare across Indonesia, from Sabang to Merauke.
We connect 20,000+ doctors with patients in need through our tele-consultation service. We partner with 1500+ pharmacies in 50 cities to bring medicine to your doorstep. We've also partnered with Indonesia's largest lab provider to offer lab home services, and to top it off we have recently launched a premium appointment service that partners with 500+ hospitals and allows patients to book a doctor appointment inside our application.
We are extremely fortunate to be trusted by our investors, such as the Bill & Melinda Gates Foundation, Singtel, UOB Ventures, Allianz, Gojek and many more. We recently closed our Series B round and in total have raised USD 100 million for our mission.
Our team works tirelessly to make sure that we create the best healthcare solution personalised for all of our patients' needs, and we are continuously on a path to simplify healthcare for Indonesia.

Jitendra Shah

Data Engineer by profession. Building data infra using open source tools and cloud services.