Turning CI Noise into Insights: Our Build Monitoring Architecture

Modern engineering teams live and breathe automation. But when you have hundreds of builds and deployments running daily, something interesting happens. You automate everything, except understanding what actually happened.

At some point, we realised that although our CI/CD pipelines were doing their job perfectly, the observability of those pipelines was almost nonexistent. Builds were running, failing, retrying, succeeding — but the insights were buried across Jenkins job pages, console logs, and scattered notifications.

So we decided to solve the problem the SRE way: build a monitoring system for the build system itself.

The Problem: CI/CD Visibility Was Fragmented

Our infrastructure runs hundreds of builds every day across multiple services and pipelines.

Each build produces valuable metadata, such as:

  • Build status (success/failure)
  • Failure reason
  • Log links
  • Execution timestamp

But this information lived inside Jenkins job history, which meant:

  • No centralised visibility across pipelines
  • Hard to identify failure trends
  • Many operational questions like the ones below were surprisingly difficult to answer:
    • What is the overall CI success rate across services?
    • Which pipelines or services are most unstable?
    • Are failures increasing over time, or improving?
    • Which stage of the pipeline fails most frequently?

Answering these questions meant manually navigating Jenkins job histories, console logs, and notifications. As the number of pipelines grew, this process quickly became tedious, inefficient, and error-prone.

Clearly, that wasn’t scalable. We wanted a system that could:

  1. Capture build metadata automatically
  2. Store it centrally
  3. Visualise trends and insights
  4. Generate weekly reports

All without introducing additional operational overhead.

The Solution

We implemented a Build Monitoring System that captures build metadata after every CI execution and pushes it into a centralised MySQL datastore for visualisation.

The core idea was simple: treat build data like metrics.

Instead of leaving build information inside Jenkins, we export it and make it observable:

  • Git triggers builds
  • Jenkins executes pipelines
  • Build metadata is pushed to a database
  • Grafana visualises trends, and scripts generate reports

Why a Custom Monitoring Approach?

While exploring solutions, we evaluated existing Jenkins plugins and monitoring integrations that provide build analytics and reporting. However, most of these tools were designed to store and analyse complete build logs and historical job data, which did not fully align with our requirements.

Instead of relying on a plugin approach, we implemented a lightweight custom solution that exports only the essential metadata from each build.

Key Advantages of This Approach:

Reduced Jenkins Storage Usage: Jenkins job histories can quickly grow in size when storing large numbers of builds along with console logs and artifacts. By exporting only selected metadata fields into a central database, we avoid retaining excessive historical data inside Jenkins itself. This allows us to keep Jenkins lean and operationally efficient, while still maintaining long-term visibility into build activity through the monitoring system.

Full Control Over the Data Model: A custom implementation allowed us to define exactly what metadata should be captured, how failures are categorised, and how the data is structured. This flexibility enabled us to evolve the monitoring system as pipeline requirements change.

Platform-Level Visibility: Most Jenkins plugins provide insights per job or per pipeline, whereas our centralised approach allows us to observe build activity across the entire platform — spanning multiple services, environments, and project types.

By treating build executions as structured data, we created a monitoring system that is customised to our workflows while remaining scalable.

Architecture Overview

The architecture was intentionally designed to remain lightweight and non-intrusive to existing pipelines. Instead of introducing new CI tooling or complex observability, we extended the existing Jenkins workflow to export structured build metadata after every execution. This ensured minimal operational overhead while enabling full visibility into build activity across the platform.

[Architecture diagram]

Failure Categorisation: One of the most impactful improvements in the monitoring system was introducing structured failure categorisation.

Traditionally, build failures are treated as a single status: FAILED. While this tells us that something went wrong, it provides little insight into where or why the failure occurred.

To make failure analysis more actionable, we categorised failures based on the CI/CD stage where the issue occurred.

This allows teams to quickly identify whether failures originate from code quality issues, security checks, application issues, infrastructure problems, or deployment pipelines.

This dramatically improves troubleshooting speed and helps identify systemic issues across the CI/CD pipeline. The monitoring database is also integrated with MCP (Model Context Protocol) through halodoc-copilot, allowing engineers to retrieve build insights using natural language prompts, making debugging faster and more accessible.

| Category | Meaning |
|---|---|
| Validation | Pre-check failures like linting or config validation |
| Security | Security scans or vulnerability checks |
| Compliance | Policy enforcement failures |
| Build | Compilation or dependency issues |
| Artifact | Artifact packaging or publishing failures |
| Test coverage | Unit test or coverage threshold failures |
| Deployment | Infrastructure or deployment issues |
| PostDeployment | Smoke tests or runtime validation failures |
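As a rough sketch of how this classification can work, the snippet below maps a failing stage name to one of the categories above. The stage names and the mapping itself are illustrative assumptions, not our exact implementation:

```python
# Illustrative mapping from a failing pipeline stage to a failure
# category. Stage names here are hypothetical examples.
STAGE_CATEGORIES = {
    "lint": "Validation",
    "config-check": "Validation",
    "secret-scan": "Security",
    "policy-check": "Compliance",
    "compile": "Build",
    "package": "Artifact",
    "unit-tests": "Test coverage",
    "deploy": "Deployment",
    "smoke-tests": "PostDeployment",
}

def categorise_failure(failed_stage: str) -> str:
    """Return the failure category for a failed stage; default to Build."""
    return STAGE_CATEGORIES.get(failed_stage.lower(), "Build")
```

Keeping the mapping in one table makes it cheap to evolve as new pipeline stages are added.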

For example, if most failures occur under the Security category, it may indicate the need for improved secret management or developer awareness. If failures cluster under Deployment, the issue may lie in infrastructure stability. And if failures concentrate in the Build category, the problem likely lies in the application code itself.

This simple classification transformed raw failure logs into meaningful operational insights.

Step 1 — Capture Build Entities

At the end of each Jenkins pipeline, we extract key entities. Capturing this structured metadata turns every pipeline execution into a queryable, analysable data point, enabling both real-time dashboards and historical trend analysis:

| Field Name | Description |
|---|---|
| Id | Unique identifier for the build record |
| CreatedAt | Time when the record was created |
| StartTime | Build execution start time |
| EndTime | Build completion time |
| Duration | Total build duration in seconds |
| BuildNumber | Jenkins build number |
| UserName | User or system triggering the build |
| ServiceName | Service associated with the build |
| Namespace | Logical service grouping |
| Environment | Deployment environment |
| ProjectType | Type of project (backend/frontend/app) |
| BranchName | Git branch used for the build |
| BuildStatus | Build result (SUCCESS / FAILED / ABORTED) |
| Reason | Failure reason if build fails |
| Category | Failure category (Build / Security / Test / Deployment) |

This metadata is collected directly within the Jenkins pipeline using environment variables and build context information. Jenkins exposes several useful parameters such as build number, branch name, execution timestamps, triggering source, and build result. At the end of the pipeline execution, a script extracts these values and structures them into a payload. This payload is then sent to the monitoring database through a database insert operation, ensuring every build execution is recorded consistently without adding overhead to the pipeline.
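A minimal sketch of that capture step might look like the following. `BUILD_NUMBER` and `BRANCH_NAME` are standard Jenkins environment variables; `SERVICE_NAME` and `DEPLOY_ENV` are hypothetical variables assumed to be injected by the pipeline itself:

```python
import os
from datetime import datetime, timezone

def build_payload(status: str, reason: str = "", category: str = "") -> dict:
    """Assemble the metadata record at the end of a pipeline run."""
    return {
        # Standard Jenkins environment variables
        "BuildNumber": int(os.environ.get("BUILD_NUMBER", "0")),
        "BranchName": os.environ.get("BRANCH_NAME", "unknown"),
        # Hypothetical variables injected by the pipeline (assumptions)
        "ServiceName": os.environ.get("SERVICE_NAME", "unknown"),
        "Environment": os.environ.get("DEPLOY_ENV", "unknown"),
        "BuildStatus": status,
        "Reason": reason,
        "Category": category,
        "CreatedAt": datetime.now(timezone.utc).isoformat(),
    }
```

The resulting dictionary is what gets written to the monitoring database in the next step.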

Step 2 — Push Data to Database

Once the build pipeline completes, a script collects the relevant metadata and sends it to a centralised database. We only export structured build metadata, keeping the system efficient and focused on observability. This ensures every pipeline execution is captured as a consistent, queryable record. Over time, this creates a reliable dataset for analysing build performance and stability trends across services.
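Conceptually, the insert step looks like this. Production uses MySQL, but the sketch below uses Python's built-in sqlite3 so it is self-contained; the table and column names mirror the schema above and are assumptions:

```python
import sqlite3

# sqlite3 stands in for MySQL so the example runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE IF NOT EXISTS build_metrics (
    Id INTEGER PRIMARY KEY AUTOINCREMENT,
    CreatedAt TEXT,
    ServiceName TEXT,
    BuildNumber INTEGER,
    BuildStatus TEXT,
    Reason TEXT,
    Category TEXT
)
""")

def record_build(conn: sqlite3.Connection, payload: dict) -> None:
    """Insert one build record; parameterised to avoid SQL injection."""
    conn.execute(
        "INSERT INTO build_metrics "
        "(CreatedAt, ServiceName, BuildNumber, BuildStatus, Reason, Category) "
        "VALUES (:CreatedAt, :ServiceName, :BuildNumber, :BuildStatus, :Reason, :Category)",
        payload,
    )
    conn.commit()

record_build(conn, {
    "CreatedAt": "2024-05-01T10:00:00Z",
    "ServiceName": "payments",
    "BuildNumber": 7,
    "BuildStatus": "FAILED",
    "Reason": "unit tests failed",
    "Category": "Test coverage",
})
```

One parameterised insert per build keeps the pipeline overhead negligible.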

Step 3 — Visualisation with Grafana

After the metadata is stored, Grafana connects to the database and queries this data to build visual dashboards. These dashboards provide a real-time overview of CI/CD activity across the platform, making it easy to understand build behaviour at a glance. Key insights include daily build volume, success versus failure ratios, failure trends, and unstable pipelines. This visual layer transforms raw build data into actionable insights for engineering teams.

  • Total builds per day
  • Success vs failure ratio
  • Most unstable pipelines
  • Failure trends over time

[Summary data dashboard]
[Project-level insights dashboard]
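These dashboard panels boil down to simple aggregation queries. The sketch below runs illustrative versions of two of them (total builds per day, success vs failure ratio) against a tiny in-memory dataset; Grafana would issue equivalent SQL against MySQL:

```python
import sqlite3

# Tiny stand-in dataset; real data lives in the central MySQL store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE build_metrics (CreatedAt TEXT, ServiceName TEXT, BuildStatus TEXT)"
)
conn.executemany(
    "INSERT INTO build_metrics VALUES (?, ?, ?)",
    [
        ("2024-05-01", "payments", "SUCCESS"),
        ("2024-05-01", "payments", "FAILED"),
        ("2024-05-01", "search", "SUCCESS"),
        ("2024-05-02", "search", "SUCCESS"),
    ],
)

# Panel 1: total builds per day
daily = conn.execute(
    "SELECT CreatedAt, COUNT(*) FROM build_metrics "
    "GROUP BY CreatedAt ORDER BY CreatedAt"
).fetchall()

# Panel 2: success vs failure ratio
ratio = dict(conn.execute(
    "SELECT BuildStatus, COUNT(*) FROM build_metrics GROUP BY BuildStatus"
).fetchall())
```

Because the metadata is already structured, every panel is a one-line `GROUP BY` rather than a log-parsing job.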

Step 4 — Weekly Reports

Beyond real-time dashboards, we also generate aggregated weekly summaries using database queries. These reports provide a higher-level view of CI/CD health, highlighting key reliability metrics week over week. Weekly reporting helps engineering teams and leadership track improvements over time and quickly identify areas that require attention. It also provides a consistent snapshot of the platform's build reliability.

These reports give our tech teams a high-level view of CI reliability.
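As an illustration, a weekly summary can be a single aggregation over the same table. The per-service success-rate query below is a hedged sketch (sqlite3 standing in for MySQL, date filtering omitted), not our exact report:

```python
import sqlite3

# Stand-in dataset: nine successes and one failure for one service.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE build_metrics (ServiceName TEXT, BuildStatus TEXT)")
conn.executemany(
    "INSERT INTO build_metrics VALUES (?, ?)",
    [("payments", "SUCCESS")] * 9 + [("payments", "FAILED")],
)

# Per-service success rate; a real report would add a WHERE clause
# restricting CreatedAt to the last seven days.
report = conn.execute(
    """
    SELECT ServiceName,
           COUNT(*) AS total_builds,
           ROUND(100.0 * SUM(BuildStatus = 'SUCCESS') / COUNT(*), 1) AS success_pct
    FROM build_metrics
    GROUP BY ServiceName
    """
).fetchall()
```

The same query, grouped by Category instead of ServiceName, yields the failure-breakdown section of the report.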

Results

After implementing the system, we observed several improvements.

Instant Visibility: Engineering teams can now understand CI/CD health across all pipelines through centralised dashboards, eliminating the need to manually browse Jenkins jobs.

Faster Root Cause Identification: By categorising failures and storing structured metadata, engineers can quickly identify the failure cause. This significantly reduces investigation time during build incidents. After implementing the system and analysing the collected data, our production build success rate improved from 70% to 92%.

Data-Driven Build System Improvements: We can now measure build stability trends, CI reliability, and deployment success rates. CI/CD became observable instead of mysterious.

Leadership-Level Reporting: Weekly aggregated reports provide leadership with a high-level overview of CI reliability, enabling better visibility into engineering productivity and deployment health.

Conclusion

CI/CD pipelines are the backbone of modern software delivery. But without proper visibility, they quickly become a black box.

By treating build executions as observable data, we transformed our pipeline from a collection of Jenkins jobs into a measurable, monitorable system.

When CI systems run hundreds of builds every day, the question is no longer “Did the build pass?”
The real question becomes: “What patterns are our builds revealing about the health of our engineering platform?”

Observability doesn’t just apply to user-facing production systems — it applies to the systems that build them as well.

Join us

Scalability, reliability and maintainability are the three pillars that govern what we build at Halodoc Tech. We are actively looking for engineers at all levels, and if solving hard problems with challenging requirements is your forte, please reach out to us with your resumé at careers.india@halodoc.com.

About Halodoc

Halodoc is the number one all-around healthcare application in Indonesia. Our mission is to simplify and deliver quality healthcare across Indonesia, from Sabang to Merauke.
Since 2016, Halodoc has been improving health literacy in Indonesia by providing user-friendly healthcare communication, education, and information (KIE). In parallel, our ecosystem has expanded to offer a range of services that facilitate convenient access to healthcare, starting with Homecare by Halodoc as a preventive care feature that allows users to conduct health tests privately and securely from the comfort of their homes; My Insurance, which allows users to access the benefits of cashless outpatient services in a more seamless way; Chat with Doctor, which allows users to consult with over 20,000 licensed physicians via chat, video or voice call; and Health Store features that allow users to purchase medicines, supplements and various health products from our network of over 4,900 trusted partner pharmacies. To deliver holistic health solutions in a fully digital way, Halodoc offers Digital Clinic services including Haloskin, a trusted dermatology care platform guided by experienced dermatologists.
We are proud to be trusted by global and regional investors, including the Bill & Melinda Gates Foundation, Singtel, UOB Ventures, Allianz, GoJek, Astra, Temasek, and many more. With over USD 100 million raised to date, including our recent Series D, our team is committed to building the best personalized healthcare solutions — and we remain steadfast in our journey to simplify healthcare for all Indonesians.