Securing File Uploads: How We Built a Multi-Layer Validation System at Halodoc

Introduction

At Halodoc, file uploads power critical healthcare workflows — from prescriptions and lab reports to insurance documents. With over 300k uploads processed daily, a single malicious file can expose sensitive data, disrupt services, or exhaust resources. Simple extension-based checks fail to catch threats like embedded JavaScript in PDFs or compression bombs in archives.

To address this, we built a multi-layer file validation framework aligned with the OWASP File Upload Cheat Sheet using Java, Apache Tika, and PDFBox. This validation operates as part of our broader security infrastructure — alongside network-level protections, access controls, encryption, and runtime monitoring — and is designed to augment, not replace, these existing measures.

Understanding the Threat Landscape

Before diving into our solution, it's important to understand why file uploads are such an attractive target for attackers—and why traditional defenses often fail.

The Limits of Basic Validation

Many applications rely on simple checks: verify the file extension, trust the Content-Type header, or limit file size. These methods are easy to implement but leave significant gaps:

  • Extensions lie — a file named invoice.pdf can contain anything; the extension is just a label
  • Headers are client-controlled — the Content-Type header is set by the sender and trivially spoofed
  • Size limits aren't enough — a 1 KB file can still contain a malicious script or formula

Real-World Threat Examples

File-based attacks come in many forms. Here are some common patterns that security teams encounter:

Disguised Executables: An attacker renames a harmful file to look like a document. Without content-based detection, the system accepts it based on its claimed type.

Embedded Scripts in Documents: PDFs and Office files can contain active content—scripts that execute when the document is opened. A seemingly innocent invoice could carry hidden code.

Archive-Based Attacks: Compressed files can be weaponized in multiple ways:

  • Compression bombs — Small files that expand to massive sizes, exhausting system resources
  • Path traversal — Archive entries with names like ../../../etc/config attempting to write outside intended directories
  • Nested archives — Deeply nested files designed to evade scanning

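Formula Injection in Spreadsheets: CSV and Excel files can contain cells that a spreadsheet application evaluates as formulas when the file is opened. A cell beginning with a character such as =, +, -, or @ can trigger unintended actions, such as invoking external commands or sending data to an attacker-controlled URL.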

Polyglot Files: Advanced attacks use files that are valid in multiple formats simultaneously—appearing as an image to one system but executing as code in another context.

The Core Problem

The reality is clear: what a file claims to be and what it actually is are often two different things.

To truly secure file uploads, we need to look beyond surface-level checks and validate files at multiple levels—from filename to content structure.

Our Multi-Layer Validation Framework

Our file validation operates as an additional security layer on top of existing infrastructure-level protections. We designed multiple independent validation dimensions that work together — each targeting a different class of threat. A file must satisfy all validation checks before being accepted into the system.

Tech Stack at a Glance

Capability                | Library / Tool      | Purpose
Content-Type Detection    | Apache Tika         | MIME type detection from file content (magic bytes)
PDF Scanning              | Apache PDFBox       | Document structure & active content inspection
CSV Scanning              | Apache Commons CSV  | Cell-level formula injection detection
Office & Archive Scanning | Java ZipInputStream | Compression bomb, path traversal & macro detection

Filename Validation

The first validation dimension ensures the filename is safe and doesn't contain patterns that could be exploited, such as path traversal attempts or injection characters.
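As a rough illustration of this kind of check, here is a minimal sketch in plain Java. The class name FilenameValidator, the allowed character set, and the 255-character limit are assumptions made for the example, not the actual rules used in production.

```java
import java.util.regex.Pattern;

/** Illustrative filename check; the policy below is an assumption for this sketch. */
final class FilenameValidator {
    // Assumed policy: letters, digits, space, dot, underscore, hyphen; 1 to 255 characters.
    private static final Pattern SAFE_CHARS = Pattern.compile("[A-Za-z0-9 ._-]{1,255}");

    static boolean isSafe(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        // Reject path separators and parent-directory references (path traversal).
        // Note: this also rejects harmless names such as "a..b.pdf"; that is a deliberate strict default here.
        if (name.contains("/") || name.contains("\\") || name.contains("..")) {
            return false;
        }
        // Reject null bytes and other control characters.
        for (int i = 0; i < name.length(); i++) {
            if (Character.isISOControl(name.charAt(i))) {
                return false;
            }
        }
        // Reject hidden files and names with a trailing dot or space.
        if (name.startsWith(".") || name.endsWith(".") || name.endsWith(" ")) {
            return false;
        }
        return SAFE_CHARS.matcher(name).matches();
    }
}
```

With this policy, FilenameValidator.isSafe("report.pdf") is accepted, while a name such as "../../etc/passwd" is rejected by the traversal check before the character allowlist is even consulted.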

Content-Based Type Detection

Instead of trusting what the client declares, we analyze the file's actual content to determine its true type. We use Apache Tika for MIME type detection based on file content (magic bytes). The detected type is then compared against both the claimed type and an explicit allowlist.
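The comparison logic can be sketched without any external dependency. The example below is a simplified stand-in for Apache Tika, not Tika itself: it recognizes only four well-known magic-byte signatures, and the class name ContentTypeCheck and the allowlist contents are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Set;

/** Simplified stand-in for library-based detection: a few hard-coded magic-byte signatures. */
final class ContentTypeCheck {
    // Hypothetical allowlist for this sketch.
    private static final Set<String> ALLOWED = Set.of("application/pdf", "image/png", "image/jpeg");

    /** Returns a MIME type based on the first bytes of the file, ignoring name and headers. */
    static String detect(byte[] head) {
        if (startsWith(head, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(head, new byte[] {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A})) {
            return "image/png";
        }
        if (startsWith(head, new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF})) {
            return "image/jpeg";
        }
        if (startsWith(head, new byte[] {'P', 'K', 0x03, 0x04})) {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    /** Accepts only if the detected type is allowlisted AND equals the type the client claimed. */
    static boolean isAccepted(byte[] head, String claimedType) {
        String detected = detect(head);
        return ALLOWED.contains(detected) && detected.equals(claimedType);
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The point of the two-part condition in isAccepted is that a mismatch in either direction is a rejection: an executable claiming to be a PDF fails the equality test, and a genuine ZIP honestly declared as a ZIP still fails because it is not on the allowlist.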

Resource Limit Checks

Certain files can be crafted to consume excessive resources during processing. We enforce strict limits to prevent resource exhaustion, including file size caps per MIME type and, for archive files, checks on entry count and compression ratio.
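A sketch of how such limits might be expressed is shown below. All numbers (10 MB for PDFs, 1,000 entries, a 100:1 ratio) and the class name ResourceLimits are hypothetical values chosen for the example, not the configured production limits.

```java
import java.util.Map;

/** Illustrative resource limits; every number here is a hypothetical value for the sketch. */
final class ResourceLimits {
    private static final Map<String, Long> MAX_BYTES = Map.of(
            "application/pdf", 10L * 1024 * 1024,
            "image/png", 5L * 1024 * 1024,
            "text/csv", 2L * 1024 * 1024);

    private static final int MAX_ARCHIVE_ENTRIES = 1_000;
    private static final long MAX_COMPRESSION_RATIO = 100;

    /** Size cap per detected MIME type; types without a configured cap are rejected (fail secure). */
    static boolean withinSizeLimit(String mimeType, long sizeBytes) {
        Long cap = MAX_BYTES.get(mimeType);
        return cap != null && sizeBytes >= 0 && sizeBytes <= cap;
    }

    /** Archive limits: bounded entry count and bounded uncompressed-to-compressed ratio. */
    static boolean withinArchiveLimits(int entryCount, long compressedBytes, long uncompressedBytes) {
        if (entryCount < 0 || entryCount > MAX_ARCHIVE_ENTRIES) {
            return false;
        }
        if (compressedBytes <= 0) {
            return false;
        }
        // Multiplication instead of division avoids rounding the ratio down.
        return uncompressedBytes <= compressedBytes * MAX_COMPRESSION_RATIO;
    }
}
```

Under these assumed limits, a 420 KB PDF passes the size check, while an archive that is 1,000 bytes on disk but expands to 1,000,000 bytes exceeds the 100:1 ratio and is rejected.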

Content Security Scanning

This validation dimension performs a deep inspection of file contents based on the detected file type. Each scanner is specialized to understand the structure and potential threats specific to that format. The scanner framework routes each detected MIME type to a dedicated scanner, ensuring only relevant validations execute.
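The routing idea can be sketched as a registry keyed by MIME type. The interface and class names below (ContentScanner, ScannerRegistry) are hypothetical, and rejecting types that have no registered scanner is an assumption made here in the spirit of failing secure.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** A format-specific scanner: returns a rejection reason, or empty if nothing suspicious was found. */
interface ContentScanner {
    Optional<String> scan(byte[] content);
}

/** Routes each detected MIME type to exactly one scanner. */
final class ScannerRegistry {
    private final Map<String, ContentScanner> scanners = new HashMap<>();

    void register(String mimeType, ContentScanner scanner) {
        scanners.put(mimeType, scanner);
    }

    Optional<String> scan(String detectedMimeType, byte[] content) {
        ContentScanner scanner = scanners.get(detectedMimeType);
        if (scanner == null) {
            // Assumption for this sketch: no scanner means the type cannot be vetted, so reject.
            return Optional.of("No scanner registered for type: " + detectedMimeType);
        }
        return scanner.scan(content);
    }
}
```

Because the lookup is a single map access, adding a new format means registering one more scanner rather than touching the existing ones.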

Below are illustrative examples of how some of our format-specific scanners work. These are simplified representations — the actual implementations include additional checks and optimizations.

PDF Scanner

PDFs are complex documents that can contain active content. Our scanner uses Apache PDFBox to analyze the document structure and detect potentially harmful elements such as embedded scripts or automatic execution triggers.
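To make the idea concrete without depending on PDFBox, the sketch below uses a much cruder technique: it searches the raw bytes for action-related PDF names. This is a deliberately simplified stand-in and not the structure-aware approach described above. It would miss names hidden inside compressed object streams or written with #XX hex escapes, which is one reason a real parser is preferable. The class name and the list of names are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Naive raw-byte search for action-related PDF names; a stand-in for real structural parsing. */
final class NaivePdfTokenScan {
    // Names associated with active content: JavaScript actions, auto-run triggers, launch actions.
    private static final String[] SUSPICIOUS_NAMES = {"/JavaScript", "/JS", "/OpenAction", "/AA", "/Launch"};

    static List<String> findSuspiciousNames(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary content survives the conversion.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        List<String> found = new ArrayList<>();
        for (String name : SUSPICIOUS_NAMES) {
            if (containsName(raw, name)) {
                found.add(name);
            }
        }
        return found;
    }

    // A name counts only when followed by whitespace, a delimiter, or end of input,
    // so "/JS" does not match inside a longer name such as "/JSFoo".
    private static boolean containsName(String raw, String name) {
        int from = 0;
        while (true) {
            int idx = raw.indexOf(name, from);
            if (idx < 0) {
                return false;
            }
            int end = idx + name.length();
            if (end == raw.length() || isDelimiter(raw.charAt(end))) {
                return true;
            }
            from = idx + 1;
        }
    }

    private static boolean isDelimiter(char c) {
        return Character.isWhitespace(c) || "()<>[]{}/%".indexOf(c) >= 0;
    }
}
```

A non-empty result would be treated as grounds for rejection. A structure-aware scanner reaches the same verdict more reliably because it resolves the actual catalog and page action entries instead of pattern-matching text.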

Image Scanner

Images can contain hidden data in metadata fields or be crafted as polyglot files that are valid in multiple formats. Our scanner validates image structure integrity and checks for anomalous content.
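One structural check of this kind can be sketched for PNG files: walk the chunk sequence and reject anything that carries extra bytes after the final IEND chunk, a common place to append a second payload. The sketch below covers only that one check for that one format. It does not verify chunk CRCs or inspect metadata, and the class name and reason strings are assumptions for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

/** Walks PNG chunks and flags a bad signature, malformed chunks, or data after IEND. */
final class PngStructureCheck {
    private static final byte[] SIGNATURE = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

    static Optional<String> check(byte[] data) {
        if (data.length < SIGNATURE.length) {
            return Optional.of("Not a PNG: too short");
        }
        for (int i = 0; i < SIGNATURE.length; i++) {
            if (data[i] != SIGNATURE[i]) {
                return Optional.of("Not a PNG: bad signature");
            }
        }
        ByteBuffer buf = ByteBuffer.wrap(data); // big-endian by default, which matches PNG
        buf.position(SIGNATURE.length);
        // Each chunk is: 4-byte length, 4-byte type, data, 4-byte CRC (CRC is not verified here).
        while (buf.remaining() >= 12) {
            int length = buf.getInt();
            byte[] type = new byte[4];
            buf.get(type);
            if (length < 0 || length > buf.remaining() - 4) {
                return Optional.of("Malformed PNG chunk");
            }
            buf.position(buf.position() + length + 4); // skip chunk data and CRC
            if ("IEND".equals(new String(type, StandardCharsets.US_ASCII))) {
                return buf.hasRemaining()
                        ? Optional.of("Trailing data after IEND (possible polyglot)")
                        : Optional.empty();
            }
        }
        return Optional.of("PNG has no IEND chunk");
    }
}
```

An empty result means the chunk layout is well formed and ends exactly at IEND. Any returned string is a candidate rejection reason.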

CSV Scanner

Spreadsheet files can contain formulas that execute commands when opened — a technique known as CSV injection or formula injection. Our scanner normalizes cell content and checks for dangerous patterns.
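The normalize-then-check step might look like the following sketch. It uses a naive comma split in place of Apache Commons CSV, so quoted cells that contain commas are not handled correctly here. The class name, the set of trigger characters (=, +, -, @), and the exemption for plain numbers are assumptions for illustration.

```java
import java.util.regex.Pattern;

/** Cell-level formula check with naive CSV splitting; a stand-in for a real CSV parser. */
final class CsvFormulaCheck {
    // Assumption: plain signed numbers such as -2.5 are legitimate data, not formulas.
    private static final Pattern PLAIN_NUMBER = Pattern.compile("[+-]?\\d+(\\.\\d+)?");

    /** Removes surrounding whitespace (including tabs and carriage returns) and one pair of quotes. */
    static String normalize(String cell) {
        String s = cell.trim();
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
            s = s.substring(1, s.length() - 1).trim();
        }
        return s;
    }

    /** True if the normalized cell starts with a character that spreadsheets treat as a formula prefix. */
    static boolean isDangerous(String cell) {
        String s = normalize(cell);
        if (s.isEmpty() || PLAIN_NUMBER.matcher(s).matches()) {
            return false;
        }
        return "=+-@".indexOf(s.charAt(0)) >= 0;
    }

    static boolean lineHasFormula(String line) {
        for (String cell : line.split(",", -1)) {
            if (isDangerous(cell)) {
                return true;
            }
        }
        return false;
    }
}
```

Normalizing first matters because padding or quoting is an easy way to hide the leading character: a cell written as `  =1+1` or `"=1+1"` is caught just like `=1+1`. Rules this strict can also flag legitimate values that happen to start with + or -, so they usually need tuning against real data.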

Document Scanner (Office Files)

Modern Office files (.docx, .xlsx, .pptx) are ZIP-based archives in OOXML format. We validate their archive structure and check for embedded macros by inspecting the ZIP entries directly.
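A minimal sketch of the macro check using java.util.zip is shown below. It relies on the fact that OOXML packages store VBA projects in a part named vbaProject.bin (for example word/vbaProject.bin or xl/vbaProject.bin). The class name is hypothetical, and a fuller check would also need to consider content types and relationship parts.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Looks for a VBA project part among the ZIP entries of an OOXML file. */
final class OoxmlMacroCheck {
    static boolean containsMacros(byte[] officeFile) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(officeFile))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Entry names are compared case-insensitively; only names are read, not entry contents.
                String name = entry.getName().toLowerCase(Locale.ROOT);
                if (name.endsWith("vbaproject.bin")) {
                    return true;
                }
            }
            return false;
        } catch (IOException e) {
            // Callers are expected to treat an unreadable package as a rejection.
            throw new UncheckedIOException(e);
        }
    }
}
```

Only entry names are inspected, so the check stays cheap even for large documents.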

Archive Scanner

Compressed files require special attention to prevent extraction-based attacks. Our scanner validates archive structure and inspects contained entries for dangerous content.
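The sketch below shows one way to combine these checks for ZIP input using java.util.zip. It counts the bytes actually produced while reading each entry instead of trusting the sizes declared in the archive headers. The class name and all limits are hypothetical, and nested archives are not handled.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Optional;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Streams through a ZIP and rejects on traversal names, too many entries, or excessive expansion. */
final class ZipArchiveCheck {
    // Hypothetical limits for this sketch.
    private static final int MAX_ENTRIES = 1_000;
    private static final long MAX_RATIO = 100;
    private static final long MAX_TOTAL_BYTES = 50L * 1024 * 1024;

    static Optional<String> check(byte[] archive) {
        long total = 0;
        int entries = 0;
        byte[] buffer = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (++entries > MAX_ENTRIES) {
                    return Optional.of("Too many archive entries");
                }
                if (isTraversal(entry.getName())) {
                    return Optional.of("Path traversal in entry name: " + entry.getName());
                }
                int n;
                // Stop as soon as a limit is crossed, so a bomb is never fully expanded.
                while ((n = zip.read(buffer)) > 0) {
                    total += n;
                    if (total > MAX_TOTAL_BYTES || total > archive.length * MAX_RATIO) {
                        return Optional.of("ZIP bomb detected: suspicious compression ratio");
                    }
                }
            }
        } catch (IOException e) {
            return Optional.of("Unreadable archive");
        }
        return entries == 0 ? Optional.of("Empty or non-ZIP input") : Optional.empty();
    }

    /** True for absolute paths or any ".." path segment, with backslashes treated as separators. */
    static boolean isTraversal(String name) {
        String normalized = name.replace('\\', '/');
        if (normalized.startsWith("/")) {
            return true;
        }
        for (String part : normalized.split("/")) {
            if (part.equals("..")) {
                return true;
            }
        }
        return false;
    }
}
```

Nothing is written to disk in this sketch. The decompressed bytes are counted and discarded, which keeps the check independent of wherever the archive would eventually be extracted.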

Real-World Validation Walkthrough

To see how these validation dimensions operate in practice, consider the following scenario.

The file in question: report.pdf, 420 KB, declares Content-Type: application/pdf.

Opening the raw PDF reveals the threat hidden in its structure:

/OpenAction << /S /JavaScript /JS (app.alert("exploit");) >>

This is a JavaScript auto-execute trigger — the moment a vulnerable PDF reader opens the file, the script fires.

How the validation responds:

  • Filename Check: report.pdf passes — no path traversal, no null bytes, no reserved characters
  • Content-Type Detection: Header bytes confirm a genuine PDF, and the detected MIME type matches the declared type. Passes.
  • Resource Limits: 420 KB is within the configured PDF size ceiling. Passes.
  • Content Security Scan: The PDF scanner parses the document tree, walking all page actions and catalog entries. It finds an /OpenAction entry whose action type (/S) is /JavaScript, which matches a known threat pattern. Fails.

Result: Rejected

The upload is blocked before it reaches permanent storage, and a structured security event is recorded.

False Positive Handling

A strict multi-layer system will occasionally flag legitimate files — particularly in healthcare, where documents from third-party labs or insurance providers may use non-standard structures.

How we handle this:

  • Structured rejection reasons: Every blocked upload returns a specific reason (e.g., "PDF contains JavaScript", "ZIP bomb detected: suspicious compression ratio", or "Polyglot or script content detected in image") rather than a generic error, so support teams can triage quickly.
  • Support-driven escalation: When a legitimate file is rejected, users raise the issue through our support channel. The support team escalates it to the engineering team, who evaluate the rejection reason against the file's actual content and adjust validation rules if necessary.
  • Iterative rule tuning: Each false positive investigation feeds back into our validation logic. If a legitimate document pattern is being flagged, we refine the detection rules to reduce false positives without weakening security coverage.
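One way to keep rejection reasons structured is to model them as an enum that pairs a stable code with a human-readable message. The sketch below reuses the three example messages quoted above. The enum name, the code names, and the JSON shape are assumptions for illustration.

```java
/** Rejection reasons as stable codes plus readable messages; names and JSON shape are hypothetical. */
enum RejectionReason {
    PDF_JAVASCRIPT("PDF contains JavaScript"),
    ZIP_BOMB("ZIP bomb detected: suspicious compression ratio"),
    IMAGE_POLYGLOT("Polyglot or script content detected in image");

    private final String message;

    RejectionReason(String message) {
        this.message = message;
    }

    String message() {
        return message;
    }

    /** Renders a small response body; the messages above contain no characters that need JSON escaping. */
    String toJson() {
        return String.format(
                "{\"status\":\"REJECTED\",\"code\":\"%s\",\"message\":\"%s\"}", name(), message);
    }
}
```

A stable code gives support and engineering something to search and aggregate on during triage, while the message stays free to be reworded.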

Security Framework Integration Architecture

Our file security validation integrates seamlessly with both synchronous and asynchronous upload workflows, ensuring consistent protection across all use cases.

Synchronous Validation Flow

Used when real-time feedback is required. The file is validated before being persisted.

Benefits: Immediate feedback, simple error handling, no orphaned files

Asynchronous Validation Flow

Used for large files or high-throughput scenarios. Files go to temporary storage first and are validated before promotion.

Benefits: Reduced upload latency, scalable for large files, backend resource optimization
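The validate-before-promotion step can be sketched with local directories standing in for temporary and permanent storage. This is an assumption for illustration only: the real storage backend is not described here, and a production version would stream large files rather than read them fully into memory. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.Predicate;

/** Staged upload: a file in temporary storage is promoted only after validation succeeds. */
final class StagedUpload {
    /**
     * Validates the staged file. On success it is moved into acceptedDir;
     * on failure it is deleted so nothing lingers in staging.
     */
    static boolean promoteIfValid(Path staged, Path acceptedDir, Predicate<byte[]> validator) {
        try {
            byte[] content = Files.readAllBytes(staged);
            if (!validator.test(content)) {
                Files.deleteIfExists(staged);
                return false;
            }
            Files.createDirectories(acceptedDir);
            Files.move(staged, acceptedDir.resolve(staged.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The key property is that the accepted location only ever receives files that have already passed validation. Everything else stays in, or is removed from, the staging area.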

Performance at Scale

Security should never come at the cost of user experience. Our validation framework is optimized for the scale of a healthcare platform.

Latency Performance

Key Insight: The majority of validations complete in under 10 ms, with the 95th percentile completing in under 13 ms.

These measurements reflect typical production workloads across our standard file size distribution. Validation latency varies based on file size, type, and the depth of scanning required — larger or more complex files may take longer.

The latency results reflect deliberate design choices:

  • Fail-fast validation: Lightweight checks execute first. Invalid or disallowed files are rejected before deeper inspection begins
  • Type-aware routing: Each detected MIME type is routed to a dedicated scanner, ensuring only relevant validations execute with no redundant cross-format work
  • Resource boundaries: Strict file size limits and defensive handling prevent abnormal inputs from degrading system stability under load
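The fail-fast ordering described above can be sketched as a pipeline that runs named checks in the order they were added and stops at the first failure. The class name FailFastPipeline and the check names used in the usage note are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Runs checks in insertion order and stops at the first one that returns a rejection reason. */
final class FailFastPipeline {
    // LinkedHashMap preserves insertion order, so cheap checks added first also run first.
    private final Map<String, Function<byte[], Optional<String>>> checks = new LinkedHashMap<>();

    FailFastPipeline add(String name, Function<byte[], Optional<String>> check) {
        checks.put(name, check);
        return this;
    }

    /** Returns "checkName: reason" for the first failing check, or empty if every check passes. */
    Optional<String> validate(byte[] content) {
        for (Map.Entry<String, Function<byte[], Optional<String>>> e : checks.entrySet()) {
            Optional<String> reason = e.getValue().apply(content);
            if (reason.isPresent()) {
                return Optional.of(e.getKey() + ": " + reason.get());
            }
        }
        return Optional.empty();
    }
}
```

If a cheap "size" check is added before an expensive "deep-scan" check, an oversized file is rejected by the first and the second never runs, which is where most of the latency savings for invalid files would come from.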

Key Security Principles We Follow

Building this system reinforced several important principles:

  1. Never Trust Client Input — Always validate server-side, regardless of what the client sends.
  2. Layered Validation — Multiple validation dimensions each target different threat classes. While no single check catches everything, together they significantly reduce risk.
  3. Fail Secure — When in doubt, reject the file rather than risk accepting something harmful.
  4. Type-Specific Validation — Different file types have different threats; use specialized scanners.
  5. Balance Security and Usability — Optimize validation to be thorough yet fast enough for real-time use.
  6. Validation ≠ Security — Upload validation is one essential component of a secure file handling pipeline, alongside access controls, encryption, runtime monitoring, and incident response

Tradeoffs and Design Considerations

Designing a multi-layer file upload validation system required balancing security depth with real-world usability and operational efficiency.

  • Security vs Latency: Deep inspection increases processing cost, but early-exit ordering ensures deep inspection only runs when necessary
  • Strictness vs Usability: Conservative rules may reject edge-case files; we prioritize secure defaults in healthcare workflows
  • Specialization vs Simplicity: Format-specific scanners add complexity but provide more precise threat detection per format

These tradeoffs helped shape a system that is secure, scalable, and predictable under real-world workloads.

Conclusion

In healthcare, every file carries trust—trust that patient data is protected, that documents are authentic, and that systems are secure. By implementing a multi-layer file upload validation framework, we've built a foundation that protects healthcare documents at scale while maintaining the speed and reliability our users depend on.

Securing file uploads at scale requires more than checking extensions. It requires understanding threats, building defense in depth, and constantly evolving. At Halodoc, we're committed to this journey—because protecting healthcare data is at the heart of what we do.

Join Us

Scalability, reliability, and maintainability are the three pillars that govern what we build at Halodoc Tech. We are actively looking for engineers at all levels, and if solving hard problems with challenging requirements is your forte, please reach out to us with your resumé at careers.india@halodoc.com.

About Halodoc

Halodoc is the number one all-around healthcare application in Indonesia. Our mission is to simplify and deliver quality healthcare across Indonesia, from Sabang to Merauke.
Since 2016, Halodoc has been improving health literacy in Indonesia by providing user-friendly healthcare communication, education, and information (KIE). In parallel, our ecosystem has expanded to offer a range of services that facilitate convenient access to healthcare, starting with Homecare by Halodoc as a preventive care feature that allows users to conduct health tests privately and securely from the comfort of their homes; My Insurance, which allows users to access the benefits of cashless outpatient services more seamlessly; Chat with Doctor, which allows users to consult with over 20,000 licensed physicians via chat, video or voice call; and Health Store features that allow users to purchase medicines, supplements and various health products from our network of over 4,900 trusted partner pharmacies. To deliver holistic health solutions in a fully digital way, Halodoc offers Digital Clinic services, including Haloskin, a trusted dermatology care platform guided by experienced dermatologists.
We are proud to be trusted by global and regional investors, including the Bill & Melinda Gates Foundation, Singtel, UOB Ventures, Allianz, GoJek, Astra, Temasek, and many more. With over USD 100 million raised to date, including our recent Series D, our team is committed to building the best personalised healthcare solutions, and we remain steadfast in our journey to simplify healthcare for all Indonesians.