Halodoc’s Layered Data Validation Strategy for Building Trust in the Lakehouse

Data isn't just a commodity; it's a critical asset that demands accuracy, consistency, and completeness, because even small discrepancies can cascade into incorrect analytics and business decisions. We have been leveraging a Lakehouse architecture for several years to manage our data at scale. However, having a robust architecture is only half the battle. The real challenge is ensuring that the data flowing through that architecture is accurate, consistent, and reliable at every stage of the pipeline.

We operate with a philosophy that "assumed quality is no quality." To guarantee trust, we’ve implemented a comprehensive, multi-layered validation pipeline that catches issues before they ever reach a dashboard or a patient-facing feature.

Why This Approach?

There are several existing frameworks for data validation, including Great Expectations and Deequ, which provide automated checks and data quality enforcement. While these tools are useful, we built a custom framework tailored to our needs. Our solution integrates directly into our pipeline, leverages AI to generate validation queries, and provides full logging and alerts. This approach gives us flexibility, early detection of issues, and complete end-to-end control, ensuring every dataset is trustworthy before it’s consumed downstream.

High-Level Architecture

Data Movement and Validation Layer Overview

Before diving into the validation logic, let's look at the data journey. Our pipeline moves data through several stages of transformation:

  1. Source: Amazon Relational Database Service (RDS) databases (Transactional systems).
  2. Processed: Amazon EMR (running Spark and Hudi) processes data into our Data Lake (Processed Zone S3), cataloged via Glue and accessible via Athena.
  3. DWH/PL: Data moves to Amazon Redshift to serve as our Data Warehouse (DWH) and Presentation Layer (PL).
  4. Internal RDS Serving Zone: Processed data is written back to RDS for internal-facing applications, including reconciliation systems, reporting services, and audit workflows for internal teams.

Each hop in this journey represents a potential point of failure, such as a schema mismatch, a dropped record, or a transformation bug. To mitigate this, we introduced specific validation checkpoints at every critical transition.

High-Level Data Validation Pipeline

The validation pipeline is designed to be source-agnostic and configuration-driven. Multiple source systems (Source 1, Source 2) feed data into a centralized processing hub. Rather than hardcoding validation logic per pipeline, all rules and execution parameters are defined in a Control Table / Template.

The control table specifies what to validate (tables, columns, join keys, metrics, thresholds) and how to validate it. Based on this configuration, the processing hub dynamically executes the appropriate validation logic—running equivalent queries across different sources or generating join-based checks when validating related datasets. The outcome of each validation run is then written to a Validation Result store, which serves as the single source of truth for data quality metrics, monitoring, and alerting.
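As an illustration, each check can boil down to a single row in the result store. The sketch below shows one possible shape for that table; the column names are hypothetical, not our exact schema.

-- Hypothetical shape of the Validation Result store; all names are illustrative.
CREATE TABLE validation_result (
    run_id        BIGINT,         -- identifier of the validation run
    rule_id       INT,            -- reference to the rule in the control table
    source_name   VARCHAR(128),   -- e.g. the RDS or Athena table being checked
    target_name   VARCHAR(128),   -- e.g. the Redshift or internal RDS table
    source_value  DECIMAL(20,4),  -- metric computed on the source side
    target_value  DECIMAL(20,4),  -- metric computed on the target side
    status        VARCHAR(16),    -- 'PASS' or 'FAIL'
    executed_at   TIMESTAMP       -- when the check ran
);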

This design allows us to scale validation across many datasets and sources while keeping the execution logic generic, reusable, and easy to evolve as new data sources or validation rules are introduced.

Layer 1: The "Pulse Check" (RDS vs. Data Lake)

The first layer of defense is ensuring that what we have in our source systems matches what lands in our Data Lake. We call this the "Pulse Check"; its objective is to verify high-level data consistency between the source (RDS) and the destination (Data Lake).

The Relativity of "Count"

A naive approach would be to simply run SELECT COUNT(*) FROM source and SELECT COUNT(*) FROM target and compare the results. However, in a high-velocity environment, this approach is destined to fail.

  • Source Velocity: By the time the COUNT query on RDS finishes (Time T1), the database has already accepted N new transactions.
  • Pipeline Latency: The data in the Processed Zone (S3) reflects the state of RDS at Time T1 - delta, where delta is the cumulative latency of DMS replication, S3 ingestion overhead, and the Spark/Hudi processing job.

The Solution: Time-Bound Validation

To overcome the relativity of moving data, we don't count "everything." Instead, we freeze the window of comparison. We execute a targeted count based on the creation timestamp:
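The sketch below shows the idea; the table name and the window are illustrative, and the actual query is templated by the pipeline. The same statement is run against the source RDS and against Athena, and the two counts are compared.

-- Hypothetical Layer 1 "Pulse Check": run the same time-bounded count on RDS and
-- on Athena, then compare the two results. Table name and window are illustrative.
SELECT COUNT(*) AS record_count
FROM   orders
WHERE  created_at >= TIMESTAMP '2025-01-01 00:00:00'
  AND  created_at <  TIMESTAMP '2025-01-01 23:00:00';  -- stable window, safely behind "now"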

By defining a specific created_at window (typically a stable period in the recent past), we eliminate the noise of incoming transactions and pipeline lag.

  • Implementation: Orchestrated by MWAA (Managed Workflows for Apache Airflow) running on Kubernetes.
  • Outcome: If the RDS count and the Athena count for that specific time window do not match, an alert is immediately sent to our internal chat.

This layer ensures that we are not missing any data during ingestion or initial processing.

Layer 2: AI-Assisted Structural Validation (Processed → DWH)

At Validation Layer 2, checks focus on data completeness and structural correctness. These validations ensure that data ingested from upstream systems has not been corrupted, partially loaded, or structurally altered during transformation. AI assists in generating validation queries, helping engineers quickly create checks without writing every query manually.

In practice, this means validating:

  • Row counts across systems (Data Lake vs. DWH)
  • Critical column completeness (e.g., primary keys or required attributes are not null)
  • Time-bounded consistency to account for ingestion latency

To support this, the same validation logic is executed against multiple engines (Athena and Redshift), and the results are compared. AI analyzes the transformation SQL to generate baseline validation queries automatically, while engineers review and approve them before execution. Below is an example of a validation query.
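The sketch below illustrates the idea, assuming a table named appointments with key column appointment_id; the actual queries are derived from our transformation SQL and approved by engineers before they run.

-- Hypothetical Layer 2 check: the same query is executed on Athena (Processed Zone)
-- and on Redshift (DWH), and the resulting metrics are compared by the pipeline.
SELECT
    COUNT(*)                                                AS row_count,
    COUNT(DISTINCT appointment_id)                          AS distinct_key_count,
    SUM(CASE WHEN appointment_id IS NULL THEN 1 ELSE 0 END) AS null_key_count
FROM appointments
WHERE created_at >= TIMESTAMP '2025-01-01 00:00:00'
  AND created_at <  TIMESTAMP '2025-01-02 00:00:00';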

Queries like this one, generated with AI assistance and reviewed by engineers, compare row counts and key completeness between the Processed Zone S3 (source) and Redshift (target) to confirm that no data was lost or corrupted during ingestion.

Layer 3: AI-Assisted Warehouse Verification (DWH → PL)

The PL is where business meaning is applied—metrics, categorizations, and reporting logic. As a result, validation at this layer focuses on business correctness, not structural parity.

Typical checks include:

  • Detecting duplicate business keys
  • Enforcing non-null constraints on critical business attributes
  • Validating domain rules (e.g., allowed values for status or classification columns)

Here, AI again assists by analyzing the PL aggregation SQL and generating corresponding business validation rules. Since both the DWH and PL reside within Redshift, these validations are executed directly as SQL queries in the warehouse and scheduled as part of the transformation workflows. Engineers remain in the loop to review and approve the generated logic.
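The checks below are a sketch of the kinds of rules produced at this layer, using an assumed schema and assumed column names (pl.fact_orders, order_id, order_status); the real rules are derived from our PL aggregation SQL.

-- Hypothetical Layer 3 business checks, executed directly in Redshift.
-- Schema, columns, and allowed values are illustrative.

-- 1. Duplicate business keys in the Presentation Layer
SELECT order_id, COUNT(*) AS occurrences
FROM   pl.fact_orders
GROUP  BY order_id
HAVING COUNT(*) > 1;

-- 2. Non-null constraints on critical business attributes
SELECT COUNT(*) AS null_violations
FROM   pl.fact_orders
WHERE  order_id IS NULL OR order_status IS NULL;

-- 3. Domain rule: status must be one of the allowed values
SELECT COUNT(*) AS invalid_status_rows
FROM   pl.fact_orders
WHERE  order_status NOT IN ('CREATED', 'PAID', 'COMPLETED', 'CANCELLED');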

This code snippet represents Layer 3 validation, where the focus shifts from structure to business meaning. It shows how specific domain rules—such as checking for duplicate business keys or enforcing non-null constraints—are executed directly within the warehouse to guarantee data reliability for downstream reporting.

Layer 4: Data Reconciliation / Internal Team Validation

In the final validation layer, processed data is moved into internal systems for reconciliation, reporting, or auditing by other teams. Each dataset is defined in a control table that specifies the source, target, date filters, and any additional conditions.

Since schemas often differ between our Data Lake and operational databases, direct comparison isn't straightforward. To address this, we rely on a configuration-driven Mapping Table that defines the relationships between source and target columns. The pipeline uses this map to dynamically generate validation queries, comparing the Processed Zone S3 source against the Amazon RDS target. Any batch where the values diverge beyond an acceptable threshold is immediately flagged for review. Below is an example of the control table configuration.
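The sketch below shows one possible shape for such a configuration; the column names and the sample rule are hypothetical, not our production schema.

-- Hypothetical control/mapping table that drives Layer 4 reconciliation.
CREATE TABLE validation_control (
    rule_id        INT,
    source_system  VARCHAR(64),    -- e.g. 'processed_zone_s3'
    source_table   VARCHAR(128),
    target_system  VARCHAR(64),    -- e.g. 'internal_rds'
    target_table   VARCHAR(128),
    join_keys      VARCHAR(256),   -- business keys used to align records
    column_mapping VARCHAR(1024),  -- source_column:target_column pairs
    date_filter    VARCHAR(256),   -- e.g. 'created_at >= current_date - 1'
    threshold_pct  DECIMAL(5,2),   -- allowed divergence before alerting
    is_active      BOOLEAN
);

-- One illustrative rule: reconcile order amounts between the lake and an internal RDS table.
INSERT INTO validation_control VALUES
(1, 'processed_zone_s3', 'orders', 'internal_rds', 'recon_orders',
 'order_id', 'order_amount:amount,order_status:status',
 'created_at >= current_date - 1', 0.50, TRUE);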

The control table is the brain of the validation pipeline. Engineers define validation parameters such as source tables, target tables, and date filters, which lets the team manage rules via configuration rather than hard-coding checks into the pipeline.

Visibility & Resolution

Data quality isn't useful if no one sees it.

  • Monitoring: We use Metabase to visualize our quality metrics over time, allowing us to spot trends and drill into the detailed results.
  • Alerting: All layers are integrated with our Chat system. When a validation fails, the on-call engineer gets a ping with the specific table and error, enabling rapid resolution.

Observed Benefits

Implementing these validation layers has delivered several measurable improvements:

  • Reduced data incidents: Silent upstream changes are caught before they impact downstream systems.
  • Increased trust: Analytics and product teams can rely on data without manual verification.
  • Clear traceability: Each validation failure is logged and alerted with context, making investigation straightforward.
  • Faster issue resolution: Validation layers catch problems early, reducing downtime and incorrect analytics.

Summary

  1. Layers Matter: Single-point validation is insufficient for modern Data Lakes. You need checks at ingestion, transformation, and egress.
  2. AI is an Accelerator: Using LLMs to generate validation logic has massively reduced the manual burden on our engineering team while increasing test coverage.
  3. Trust is Earned: By validating and verifying every hop, we ensure that Halodoc's data remains a trusted source of truth for the business and our patients.
  4. Clinical & Operational Impact: This multi-layered strategy ensures that every dashboard and patient interaction is based on verified data, reducing risk and boosting confidence in clinical decisions.

Join us

Scalability, reliability, and maintainability are the three pillars that govern what we build at Halodoc Tech. We are actively looking for engineers at all levels, and if solving hard problems with challenging requirements is your forte, please reach out to us with your resume at careers.india@halodoc.com.

About Halodoc

Halodoc is the number one all-around healthcare application in Indonesia. Our mission is to simplify and deliver quality healthcare across Indonesia, from Sabang to Merauke. 

Since 2016, Halodoc has been improving health literacy in Indonesia by providing user-friendly healthcare communication, education, and information (KIE). In parallel, our ecosystem has expanded to offer a range of services that facilitate convenient access to healthcare, starting with Homecare by Halodoc as a preventive care feature that allows users to conduct health tests privately and securely from the comfort of their homes; My Insurance, which allows users to access the benefits of cashless outpatient services in a more seamless way; Chat with Doctor, which allows users to consult with over 20,000 licensed physicians via chat, video or voice call; and Health Store features that allow users to purchase medicines, supplements and various health products from our network of over 4,900 trusted partner pharmacies. To deliver holistic health solutions in a fully digital way, Halodoc offers Digital Clinic services including Haloskin, a trusted dermatology care platform guided by experienced dermatologists.

We are proud to be trusted by global and regional investors, including the Bill & Melinda Gates Foundation, Singtel, UOB Ventures, Allianz, GoJek, Astra, Temasek, and many more. With over USD 100 million raised to date, including our recent Series D, our team is committed to building the best personalized healthcare solutions and we remain steadfast in our journey to simplify healthcare for all Indonesians.