Optimizing Data Storage at Halodoc Through Automated Redundancy Cleanup

At Halodoc, our mission is to build a scalable, cost-efficient data platform that continues to evolve alongside our expanding ecosystem. With growing data volumes and a diverse set of services and pipelines, ensuring optimal resource usage has become a key focus area for our Data Platform team.

In this blog, we highlight how we improved data storage efficiency and system observability by automating redundancy cleanup across our datalake architecture. Through a combination of Airflow DAGs, AWS policies, and internal tools, we were able to identify unused data assets, track unutilized services, and enforce lifecycle policies, resulting in measurable cost savings and long-term platform hygiene.

Why Was This Cleanup Necessary?

As our data platform evolved and expanded, we began to observe a number of inefficiencies across storage and compute layers that were impacting performance, maintainability, and cost:

  • A growing number of tables were onboarded for temporary reports or one-time use cases, but never decommissioned — leading to cluttered metadata and unnecessary scans.
  • Legacy DMS tasks and endpoints, created for migrations or backfills, remained active despite being unused.
  • S3 buckets began accumulating a high volume of redundant or stale files, such as logs, raw exports, and temporary job outputs.
  • Event-based source systems were generating frequent small files, degrading Spark job performance and inflating S3 request costs.

Since it was not practical to manually inspect each table, bucket, or endpoint, we introduced tooling and automation to address these issues systematically.

To identify storage-heavy S3 buckets, we used AWS S3 Storage Lens, which provided aggregated metrics such as:

  • Total storage per bucket
  • Number of objects
  • Growth rates
  • Small file distributions

This allowed us to quickly pinpoint high-impact areas across hundreds of buckets without manual inspection, helping us prioritize cleanup efforts effectively.
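Storage Lens itself is consumed mainly through its console dashboard and scheduled metric exports rather than an API we call from pipelines. As a lightweight programmatic complement, a similar per-bucket size check can be built on CloudWatch's standard daily S3 storage metrics. The snippet below is a minimal sketch of that idea; the bucket names are hypothetical and this is not the Storage Lens API.

```python
# Hedged sketch: rank candidate buckets by their latest daily BucketSizeBytes
# metric from CloudWatch (AWS/S3 namespace). Bucket names are illustrative.
import datetime as dt
import boto3

cloudwatch = boto3.client("cloudwatch")

def latest_bucket_size_bytes(bucket: str) -> float:
    """Return the most recent daily BucketSizeBytes datapoint for STANDARD storage."""
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/S3",
        MetricName="BucketSizeBytes",
        Dimensions=[
            {"Name": "BucketName", "Value": bucket},
            {"Name": "StorageType", "Value": "StandardStorage"},
        ],
        StartTime=dt.datetime.utcnow() - dt.timedelta(days=2),
        EndTime=dt.datetime.utcnow(),
        Period=86400,           # S3 storage metrics are published once per day
        Statistics=["Average"],
    )
    datapoints = response["Datapoints"]
    return max(datapoints, key=lambda p: p["Timestamp"])["Average"] if datapoints else 0.0

# Example usage: sort a candidate list of buckets by size, largest first.
buckets = ["example-raw-exports", "example-job-logs"]   # hypothetical names
sizes = {name: latest_bucket_size_bytes(name) for name in buckets}
for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {size / 1e9:.1f} GB")
```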

Identifying Unused Tables in the Datalake

We observed that over time, many data tables stopped receiving updates and lost relevance as reporting needs changed. However, these tables remained in our metadata catalog, contributing to storage and processing overhead.

Methodology

We developed an Airflow DAG to automate the identification and flagging of such tables (a simplified sketch appears after the list below). It uses metadata-driven rules to:

  • Find tables with no new data in over 12 months
  • Validate that the table has no downstream dependencies
  • Update the table's metadata to mark it as inactive
  • Log the change to an audit table
  • Notify relevant stakeholders via alerts
  • Add a failsafe to monitor inactive tables for unexpected data landings using the incremental file tracker, so they can be re-enabled if needed
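
The snippet below is a simplified sketch of the flagging logic under stated assumptions: the metadata_store module and its list_tables, has_downstream_dependencies, and mark_inactive helpers are hypothetical stand-ins for our internal catalog client, and the schedule and threshold values are illustrative rather than those of the production DAG.

```python
# Minimal sketch of the stale-table flagging DAG; helper names are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

STALE_AFTER = timedelta(days=365)   # "no new data in over 12 months"

def flag_stale_tables(**_):
    # Hypothetical internal catalog client; not a public library.
    from metadata_store import list_tables, has_downstream_dependencies, mark_inactive

    cutoff = datetime.utcnow() - STALE_AFTER
    for table in list_tables():
        if table.last_data_landing < cutoff and not has_downstream_dependencies(table.name):
            # Marks the table inactive, writes an audit record, and notifies owners.
            mark_inactive(table.name)

with DAG(
    dag_id="flag_stale_tables",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    PythonOperator(task_id="flag_stale_tables", python_callable=flag_stale_tables)
```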

Impact
  • Reduced metadata clutter and improved discoverability
  • Fewer unnecessary scans in reporting jobs
  • Full visibility and traceability of deactivated tables

Monitoring Unused DMS Endpoints and Failed Tasks

As our platform evolved, legacy migrations and backfill processes introduced numerous AWS Database Migration Service (DMS) resources — particularly replication tasks and endpoints — that were no longer actively in use. These dormant resources often remained unnoticed over time, quietly accumulating in the system.

While these unused endpoints and stalled tasks may seem harmless, they can lead to increased cloud costs due to persistent resource allocation.

Methodology

We deployed a DAG to:

  • Identify idle endpoints not linked to any active replication tasks
  • Detect failed or stalled tasks that haven't run successfully for a set period
  • Alert engineering teams for review and cleanup

Under the hood, the DAG uses boto3 (the AWS SDK for Python) to interact with the AWS DMS APIs programmatically. Specifically, we filter on attributes such as replication task status and task-to-endpoint associations to retrieve only the resources of interest, avoiding unnecessary listing and processing of all DMS artifacts.

This approach allowed us to:

  • Query DMS replication tasks and endpoints across environments.
  • Filter out those marked as "failed", "stopped", or "inactive".
  • Cross-check task associations to determine if an endpoint is orphaned.

All this is done in a lightweight and cost-effective manner, without relying on any external tooling beyond boto3.
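
As a concrete illustration, a minimal version of this check can be written directly against the boto3 DMS client, as sketched below. The set of statuses treated as inactive is illustrative, and the production DAG additionally handles environment selection and alerting.

```python
# Hedged sketch: list DMS replication tasks and endpoints with boto3 paginators,
# then flag stalled tasks and endpoints that no active task references.
import boto3

dms = boto3.client("dms")   # region/profile selection omitted for brevity

tasks, endpoints = [], []
for page in dms.get_paginator("describe_replication_tasks").paginate(WithoutSettings=True):
    tasks.extend(page["ReplicationTasks"])
for page in dms.get_paginator("describe_endpoints").paginate():
    endpoints.extend(page["Endpoints"])

INACTIVE_STATUSES = {"failed", "stopped"}   # illustrative set of states to review

# Tasks stuck in an inactive state are surfaced for engineering review.
stalled_tasks = [t["ReplicationTaskIdentifier"] for t in tasks
                 if t["Status"] in INACTIVE_STATUSES]

# An endpoint is considered orphaned if no task outside the inactive states uses it.
active_endpoint_arns = {
    arn
    for t in tasks
    if t["Status"] not in INACTIVE_STATUSES
    for arn in (t["SourceEndpointArn"], t["TargetEndpointArn"])
}
orphaned_endpoints = [e["EndpointIdentifier"] for e in endpoints
                      if e["EndpointArn"] not in active_endpoint_arns]

print("Stalled tasks:", stalled_tasks)
print("Orphaned endpoints:", orphaned_endpoints)
```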

Impact
  • Lowered resource usage and operational noise
  • Proactive detection of data movement issues
  • More efficient monitoring and response workflows

Applying S3 Lifecycle Policies for Storage Efficiency

We discovered several buckets storing outdated logs, staging data, and one-time-use files — some of which had not been touched in months or even years.

Methodology

Using S3 Storage Lens, we:

  • Identified buckets with high object counts and large volumes
  • Flagged those with disproportionately small files
  • Classified them based on criticality and retention needs
  • Applied S3 lifecycle rules to auto-expire or transition low-value objects

| Bucket / Use Case | Action Taken | Estimated Savings |
| --- | --- | --- |
| Log Buckets | Lifecycle rules to delete logs after X days | Moderate |
| Staging Data Buckets | Lifecycle rules to auto-delete after X days | High |
| Script Storage Buckets | Removed redundant or version-control files | Moderate |
| Query Result Buckets | Cleanup-only lifecycle after use | Low |
| Redundant RDS Config Databases | Deleted non-operational instances | Moderate |
| Dev/Stage CloudWatch Logs | Turned off non-essential logs | Low |
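
As an example, a rule along the lines of the log-bucket entry above can be applied with boto3 as sketched below; the bucket name, prefix, and day counts are placeholders, not the values we use in production.

```python
# Hedged sketch: transition log objects to a cheaper storage class, then expire
# them after a retention window. All names and durations are illustrative.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-log-bucket",                      # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-logs",
                "Filter": {"Prefix": "logs/"},        # scope the rule to log objects
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
                "Expiration": {"Days": 90},           # delete after the retention window
            }
        ]
    },
)
```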

Impact
  • Significant S3 cost reductions
  • Improved performance in dev/stage environments
  • Faster queries and less API usage due to reduced file counts

Classifying Data Assets for Intelligent Cleanup

To avoid disrupting critical operations, we built a classification framework to help us differentiate between useful, redundant, and obsolete assets.

| Asset Type | Classification | Notes |
| --- | --- | --- |
| System Logs | Non-active | Retain only recent logs, auto-delete older |
| Redundant Folder Structures | Redundant | Cleaned up from all environments |
| Historical Staging Buckets | Non-active | Used only for past data exports |
| Replicated Export Buckets | Redundant | Kept only one regional copy |
| Temp Processing Zones | Temporary | Used only during ETL job handoffs |
| Legacy Config Databases | Redundant | Deprecated, replaced by centralized config |
| Unused System Configs | Redundant | Removed from orchestration layer |
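
One way to make such a classification actionable is to keep it as simple configuration that cleanup jobs consult before touching an asset. The structure below is a hypothetical illustration, not our actual schema.

```python
# Hypothetical illustration: classification rules consumed by cleanup automation.
from dataclasses import dataclass

@dataclass(frozen=True)
class AssetRule:
    asset_type: str
    classification: str   # "non-active", "redundant", or "temporary"
    action: str           # what the automation is allowed to do

RULES = [
    AssetRule("system_logs", "non-active", "retain recent logs, auto-delete older"),
    AssetRule("temp_processing_zones", "temporary", "delete after ETL handoff"),
    AssetRule("legacy_config_databases", "redundant", "decommission"),
]

def allowed_action(asset_type: str) -> str:
    """Cleanup jobs look up the permitted action; unknown assets default to manual review."""
    return next((r.action for r in RULES if r.asset_type == asset_type), "review manually")
```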

Handling Small Files from Event-Based Sources

Systems generating high-frequency event data often emit very small files, which:

  • Increase scan times for Spark jobs
  • Inflate S3 API requests
  • Reduce the effectiveness of transitions to cheaper S3 storage classes

Solution

We introduced a PySpark-based job (sketched after this list) that:

  • Periodically merges small files
  • Applies compression
  • Stores the output in a cost-effective S3 storage class
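
A stripped-down version of the compaction step might look like the sketch below; the paths, file format, and output file count are placeholders, and transitioning the compacted output to a cheaper storage class is handled separately by lifecycle rules.

```python
# Hedged sketch: compact one day's worth of small event files into a few
# larger, compressed Parquet files. Paths and the file count are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("small-file-compaction").getOrCreate()

SOURCE = "s3a://example-events-bucket/raw/dt=2025-01-01/"        # hypothetical path
TARGET = "s3a://example-events-bucket/compacted/dt=2025-01-01/"  # hypothetical path

# Read the many small event files for a single partition.
events = spark.read.json(SOURCE)

# Rewrite as a handful of larger files with columnar compression. In practice
# the output file count would be derived from the input size rather than fixed.
(events.coalesce(8)
       .write
       .mode("overwrite")
       .parquet(TARGET, compression="snappy"))
```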

For sources occupying a significant portion of the platform footprint (e.g., ~12%), we also initiated deeper audits to assess ongoing value.

Summary

Through a combination of automation, analytics, and policy enforcement, we successfully optimized data storage and improved system hygiene:

  • 🧹 Automated Table Cleanup: Identified and archived stale tables, reducing clutter and metadata scan times
  • 🔍 DMS Resource Monitoring: Removed unused endpoints and surfaced failed tasks with actionable alerts
  • 🧊 Lifecycle Policies: Applied to logs, staging data, and temporary zones, reducing data volume and API costs
  • 🗃️ Asset Classification: Helped clean up redundant folders, config entries, and outdated jobs across environments
  • 🪄 Small File Compaction: Reduced overhead and improved query performance from event-based data sources

These changes led to:

  • 📉 16% reduction in production S3 size and 40% reduction in stage S3 size
  • 📉 6.28% reduction in S3 cost (prod and stage)
  • 📉 32% reduction in Redshift warehouse storage and 8% reduction in cost

These efforts continue to reinforce Halodoc's vision of building a fast, lean, and scalable data platform to support modern healthcare.

Join us

Scalability, reliability, and maintainability are the three pillars that govern what we build at Halodoc Tech. We are actively looking for engineers at all levels, and if solving hard problems with challenging requirements is your forte, please reach out to us with your resume at careers.india@halodoc.com.

About Halodoc

Halodoc is the number one all-around healthcare application in Indonesia. Our mission is to simplify and deliver quality healthcare across Indonesia, from Sabang to Merauke. Since 2016, Halodoc has been improving health literacy in Indonesia by providing user-friendly healthcare communication, education, and information (KIE). In parallel, our ecosystem has expanded to offer a range of services that facilitate convenient access to healthcare, starting with Homecare by Halodoc as a preventive care feature that allows users to conduct health tests privately and securely from the comfort of their homes; My Insurance, which allows users to access the benefits of cashless outpatient services in a more seamless way; Chat with Doctor, which allows users to consult with over 20,000 licensed physicians via chat, video or voice call; and Health Store features that allow users to purchase medicines, supplements and various health products from our network of over 4,900 trusted partner pharmacies. To deliver holistic health solutions in a fully digital way, Halodoc offers Digital Clinic services including Haloskin, a trusted dermatology care platform guided by experienced dermatologists. We are proud to be trusted by global and regional investors, including the Bill & Melinda Gates Foundation, Singtel, UOB Ventures, Allianz, GoJek, Astra, Temasek, and many more. With over USD 100 million raised to date, including our recent Series D, our team is committed to building the best personalized healthcare solutions — and we remain steadfast in our journey to simplify healthcare for all Indonesians.