Enabling Data Intelligence: Data Profiling Framework at Halodoc

Data Engineering May 29, 2026

How we built a scalable, config-driven data profiling system to bring visibility, trust, and self-serve intelligence to our data pipelines.

Data is only as good as our understanding of it. As datasets grow into the hundreds of tables and stretch across multiple storage systems, even basic questions, like whether a column is trustworthy, how two tables relate, or how a dataset evolves over time, become surprisingly hard to answer. At that scale, ad-hoc SQL exploration stops being viable, and data quality shifts from a nice-to-have into a core engineering concern.

This post explores what data profiling really means, why it matters for any organisation working with large, distributed data, and the common challenges teams face when trying to maintain trust in their data. It then walks through how we approached this problem at Halodoc by building an Airflow-native Data Profiling Framework, a three-pronged system covering column-level profiling, join intelligence, and source table analysis, designed to give analysts and engineers a shared, reliable view of their data.

The Problem We Faced

Before this framework existed, BIA (Business Intelligence & Analysis) and DE (Data Engineering) teams regularly hit the same walls:

Problem	Impact
No column-level visibility	Analysts guessed nullability, uniqueness, and data ranges
Manual `SELECT COUNT(*)` checks	Time consuming, inconsistent, not tracked
Join failures discovered late	Pipeline breakages found in production
No historical baseline	Impossible to detect regressions or drift
DE bottleneck	Every data question required an engineer

Teams were writing exploratory SQL queries repeatedly, with no reuse and no memory. A BIA analyst joining two tables for the first time had no way of knowing if those tables had been reliably joined before or which column to use.

What Is Data Profiling?

Data profiling is the process of examining datasets to collect statistics and metadata that describe their structure, content, and quality. Done well, it answers questions like:

Is this column reliable enough to join on?
Which values exist in this categorical field?
How does data volume grow day over day?
Are two tables actually joinable? How have they been joined before?

At its core, profiling computes four classes of checks across a dataset:

Check Type	What It Tells You
Null Checks	% of missing values per column
Uniqueness	Distinct count, usability as a key
Distribution	Value frequencies, min/max, mean, median
Schema/Structure	Data types, length ranges, empty strings

Together, these checks form the foundation for everything downstream, from trust in dashboards to confidence in joins.

Framework Architecture

The framework was designed as a modular, Airflow-native system where each profiler solves a different aspect of data understanding. While the profilers operate independently, they follow a shared orchestration and storage pattern, allowing results to be centrally queried and visualized through Metabase dashboards.

At a high level, the architecture consists of three profiling components: column-level profiling, join intelligence extraction, and source table analysis.

Design Decisions & Trade-offs

Compute: Kubernetes Pods

Profiling queries are pushed down to the database engine (Redshift or Athena)
Kubernetes pods keep it lightweight and isolated

Batch Over Streaming

Data profiling is inherently retrospective. Understanding data that already exists does not require real-time processing
On demand DAG triggers (no schedule) let teams profile only when needed

Storage: Redshift + S3

Results land in Redshift for queryability
Athena query results are unloaded to S3 as Parquet files and then loaded into Redshift staging for further processing
Staging table pattern prevents partial writes to production tables

Config-Driven, Not Hardcoded

All profiling parameters (tables, date ranges, filters, group columns) are passed via a trigger config
No code changes needed to profile a new table, just trigger the DAG with a JSON config

Parallel Execution via Dynamic Task Mapping

Both profilers use dynamic fan out via Airflow's expand and map pattern
Each table gets its own pod, fully isolated, one failure does not block others

SQLGlot vs SQLLineage: Why SQLGlot?

For the Join Intelligence profiler, we needed to parse existing SQL files and extract join relationships. Two serious candidates:

Dimension	SQLLineage	SQLGlot
Primary Purpose	Lineage extraction (table level)	Full SQL parsing & AST traversal
Join Column Extraction	Not designed for it	Native — `Column`, `Join`, `EQ` nodes
Dialect Support	Limited	Spark, Redshift, Postgres, Trino, BigQuery
AST Access	Abstracted away	Full programmatic AST access
Error Handling	Opaque	`ParseError`, `TokenError` — catchable
Speed	Slower on large files	Fast, pure Python
Fallback Control	None	We control dialect fallback logic

Why SQLGlot? We needed to extract which columns two tables join on and not just which tables reference each other. SQLLineage's column lineage is transformation oriented and not designed to extract join predicates. SQLGlot's AST gives us direct access to Join nodes, ON conditions, and Column references.

We also implemented a two pass dialect strategy: try the detected dialect first (Spark or Redshift based on keyword scoring), then fall back to the opposite, then fall back to regex extraction.

Key Features

Column-Level Profiler

The Column-Level Profiler generates profiling SQL dynamically based on the source engine and pushes execution directly to Redshift or Athena, avoiding any movement of raw data outside the warehouse. It supports three profiling modes:

Type 1 — Column Health: computes null percentage, distinct count, min/max values, length ranges, and numeric statistics using dynamically generated UNION ALL queries so the table is scanned only once.

Type 2 — Value Distribution: generates GROUP BY queries across string columns to compute value frequencies and percentage distribution for categorical analysis.

Type 3 — Sample Records: uses ROW_NUMBER() OVER (PARTITION BY ...) to extract one representative row per group for raw data inspection.

All outputs are first written into staging tables and then inserted into final result tables with a unique run_id for safe reruns and historical tracking. Optional filters such as date ranges and custom WHERE clauses allow profiling to be scoped to specific time windows.

Join Intelligence

The Join Intelligence profiler extracts real join relationships directly from production SQL files stored in S3. Before parsing, SQL files pass through a normalization pipeline that cleans encoding issues, replaces template macros, and standardizes dialect-specific syntax.

The profiler then:

detects the likely SQL dialect
parses SQL using SQLGlot AST traversal
extracts tables, join columns, join types, and operators
resolves aliases across nested CTEs and subqueries

If parsing fails, the workflow retries with an alternate dialect before falling back to regex-based extraction. Every extracted relationship is tagged with:

source file
detected dialect
extraction method (sqlglot_primary, fallback, regex)

The extracted joins are batch inserted into Redshift to build a searchable relationship graph for downstream exploration.

Source Table Analyser

The Source Table Analyser profiles source systems safely without performing expensive full-table scans on transactional databases.

Instead of reading raw data directly, it primarily relies on information_schema and engine metadata to collect:

table size
estimated row count
column count
index information

The analyser automatically discovers timestamp columns and uses indexed time ranges for bounded sampling when available. Non-indexed tables fall back to metadata-only estimation to avoid unsafe scans.

Historical Hudi metrics such as file sizes, partition counts, and compaction efficiency are then combined with source metadata to generate:

partition recommendations
Spark executor sizing suggestions
compaction priority
partition explosion warnings

All recommendations and health signals are persisted into Redshift and surfaced through dashboards for operational visibility.

Implementation Highlights

The framework was designed to be operationally safe, scalable, and easy to extend. Since profiling workloads run across multiple tables and engines, the orchestration layer focused heavily on validation, isolation, and predictable execution.

Airflow DAG Structure

Every DAG follows a validation first pattern before any compute is triggered:

start → validate_config → decide_execution_path → profiling_tasks → end

The validation layer checks required parameters, profiling mode, engine compatibility, and date filters. Invalid configs fail immediately with a clear error rather than consuming warehouse or Kubernetes resources. A Branch Operator then routes execution to the correct profiler based on mode.

Kubernetes Pod Fan Out

Each table expands into its own isolated Kubernetes pod at runtime using Airflow Dynamic Task Mapping:

run_profiler = KubernetesPodOperator.partial(
    task_id="run_profiler",
    deferrable=True,
    reattach_on_restart=False,
).expand(
    arguments=table_configs.map(...)
)

deferrable=True releases the Airflow worker while pods run so no worker slots are held. reattach_on_restart=False prevents ghost pod reattachment on scheduler restart. One table failure does not block others and retries remain fully isolated.

Pushdown Compute

All heavy computation runs directly inside Redshift or Athena. No raw data moves through Airflow workers. Only aggregated profiling results are transferred back.

Staging Table Safety Pattern

For Athena to Redshift flows, a staging pattern guarantees atomic and re-runnable writes:

Athena UNLOAD → S3 (Parquet/GZIP) → COPY into staging → INSERT into production with run_id

DELETE FROM staging_table;

COPY staging_table
FROM 's3://bucket/path/'
FORMAT AS PARQUET;

INSERT INTO production_table
SELECT *, :run_id FROM staging_table;

This prevents partial writes, keeps reruns idempotent, and ensures production tables always contain complete profiling snapshots. The run_id on every row enables historical comparison across runs.

Sample Outputs

Type 1 — Column Health

This output provides a high-level health summary for every column in a table. It helps answer questions such as:

Can this column be safely used as a join key?
Does the column contain excessive nulls?
Are there unexpected value ranges or outliers?
How unique is the data?

For example:

order_id has 0% nulls and 1,000,000 distinct values, indicating it behaves like a reliable primary key.
status has only 7 distinct values, making it a good candidate for categorical analysis.
amount shows a wide numeric range (0 → 9.9M), which can help detect abnormal spikes or data quality issues.

column_name	data_type	null_pct	distinct_count	mean_val	min_val	max_val	total_records
order_id	varchar	0.0	1000000	—	ORD-0001	ORD-9999	1000000
status	varchar	0.002	7	—	cancelled	pending	1000000
amount	double	0.05	84320	145230.5	0.0	9900000.0	1000000

Type 2 — Value Distribution

This output focuses on categorical distribution analysis. It is particularly useful for understanding business state distributions, skewed datasets, and dominant values.

In this example:

completed orders dominate the dataset (71.2%)
pending and cancelled form smaller but operationally important segments
sudden changes in these percentages over time can indicate pipeline regressions or business anomalies

This profiling mode is especially useful for:

dashboard validation
anomaly detection
business KPI monitoring
categorical feature understanding for ML workflows

column_name	value	count_record	percentage
status	completed	712000	0.712
status	pending	180000	0.18
status	cancelled	108000	0.108

Join Relationships

The Join Intelligence profiler converts production SQL into a searchable relationship graph between datasets.

This output helps analysts understand:

which tables are commonly joined together
which columns are trusted join keys
how datasets are connected across domains
where join inconsistencies may exist

For example:

orders.customer_id → users.id indicates a standard customer enrichment pattern
consultations.doctor_id → doctors.id shows healthcare-domain relationships extracted directly from production queries

Because these joins are extracted from real SQL workloads instead of manually curated metadata, they reflect actual production usage patterns.

left_table	left_column	right_table	right_column	join_type	query_file
orders	customer_id	users	id	LEFT	sql-query/oms/order_summary.sql
consultations	doctor_id	doctors	id	INNER	sql-query/care/consult_agg.sql

Performance & Scalability

The framework was designed to scale across hundreds of tables while keeping Airflow, warehouses, and source systems operationally safe.

All profiling queries are pushed down directly to Redshift or Athena, so computation happens inside the warehouse itself instead of moving raw data through Airflow workers. Only aggregated profiling results are transferred back.

To scale horizontally, the framework uses Kubernetes pod fan-out where each table runs in its own isolated pod.

KubernetesPodOperator.partial(...)
    .expand(arguments=table_configs)

This provides:

linear scalability across tables
isolated retries and failures
predictable resource allocation

For Join Intelligence, the framework skips known non-production paths such as backup and one-time activity folders before parsing SQL files. Extracted joins are committed in configurable batches to avoid long-running Redshift transactions.

if len(batch) >= commit_size:
    connection.commit()

For the Source Table Analyser, scalability focuses on protecting OLTP systems. The analyser uses indexed timestamp columns for bounded sampling and falls back to metadata-only estimation when indexes are unavailable, avoiding expensive full-table scans.

Together, these optimizations allow the framework to process large-scale profiling workloads efficiently while keeping infrastructure usage predictable and safe.

Real-World Impact

Area	Before	After
Column Understanding	Writing the queries, executing them, and reviewing the results to understand a single table took close to an hour	Full column profile ready in around 10 minutes, self serve via Metabase with no SQL needed
Join Discovery	Finding the right join key was hit and trial, taking anywhere from 30 minutes to an hour and often still causing data loss	Join candidates found in under 10 minutes by querying the relationships table directly
Quality Baseline	No baseline existed, regressions went undetected for days or weeks until something broke	Historical runs tracked per execution, drift visible immediately on the next profiler run
Partition Strategy	Engineers manually reviewed table sizes and growth trends before deciding, often taking half a day	Recommendations generated automatically in a single analyser run, no manual investigation needed
Compaction Issues	Discovered only during production incidents, by which point resolution took hours	Flagged proactively on every run before they have a chance to impact any job

Challenges & Learnings

SQL diversity is harder than expected

Production SQL files contained Jinja macros, smart quotes, BOM characters, half written BETWEEN clauses, and dialect mixed queries
Solution: a layered normalization pipeline before any parsing attempt

AST extraction needs fallback layers

No single parser handles all real world SQL perfectly
Solution: primary dialect → opposite dialect → regex — with method tracked per file for observability

Staging table concurrency

Multiple pods writing to the same staging table caused race conditions
Solution: DELETE + INSERT pattern with per run isolation

Information schema row counts are estimates

Row counts from the database's information schema are approximate for InnoDB tables
Solution: flagged in output metadata; indexed time range COUNT used as the primary source when available

Partition explosion is a silent killer

Some tables had 50k+ partitions due to misconfigured granularity
Solution: explicit threshold check (>10,000 partitions) with a warning flag in output

Conclusion

The Halodoc Data Profiling Framework turned data understanding from a manual, engineer-dependent process into a self serve, automated system. By combining:

Column-level profiling for data quality and trust
Join relationship mining via SQLGlot AST analysis
Source table intelligence for informed ingestion decisions

We gave engineers visibility into production data quality, surfacing issues before they silently propagate downstream.

Data profiling is not a one-time activity, it is infrastructure. The earlier you build it into your pipelines, the more trust your data platform earns.

References

Join us

Scalability, reliability, and maintainability are the three pillars that govern what we build at Halodoc Tech. We are actively looking for engineers at all levels and if solving hard problems with challenging requirements is your forte, please reach out to us with your resumé at careers.india@halodoc.com.

About Halodoc

Halodoc is the number one all-around healthcare application in Indonesia. Our mission is to simplify and deliver quality healthcare across Indonesia, from Sabang to Merauke.
Since 2016, Halodoc has been improving health literacy in Indonesia by providing user-friendly healthcare communication, education, and information (KIE). In parallel, our ecosystem has expanded to offer a range of services that facilitate convenient access to healthcare, starting with Homecare by Halodoc as a preventive care feature that allows users to conduct health tests privately and securely from the comfort of their homes; My Insurance, which allows users to access the benefits of cashless outpatient services in a more seamless way; Chat with Doctor, which allows users to consult with over 20,000 licensed physicians via chat, video or voice call; and Health Store features that allow users to purchase medicines, supplements and various health products from our network of over 4,900 trusted partner pharmacies. To deliver holistic health solutions in a fully digital way, Halodoc offers Digital Clinic services including Haloskin, a trusted dermatology care platform guided by experienced dermatologists.
We are proud to be trusted by global and regional investors, including the Bill & Melinda Gates Foundation, Singtel, UOB Ventures, Allianz, GoJek, Astra, Temasek, and many more. With over USD 100 million raised to date, including our recent Series D, our team is committed to building the best personalized healthcare solutions — and we remain steadfast in our journey to simplify healthcare for all Indonesians.