Enhancing Self-Service Capabilities for Table Onboarding in Redshift

Data Engineering Dec 30, 2024

At Halodoc, Data Engineers are responsible for managing all data pipelines from the raw data layer to the presentation layer. The presentation layer is a schema used by the Analytics team to perform data analysis and create visualisations in tools like Looker or Metabase to enable the business teams in data-driven decision-making.

The presentation layer is highly dynamic, with frequent adjustments to table logic, columns, and structures to keep pace with evolving business needs and strategic goals. This dynamism leads to increased dependencies between the Data Engineering and Analytics teams, particularly when modifications such as new columns, changes to business logic, or even new tables are required.

We aim to reduce these dependencies and empower the Analytics team with a more autonomous onboarding process. This blog highlights our strategy for automating and streamlining the workflow to simplify table onboarding and reduce the dependency on the Data Engineering team when the Analytics team makes changes to the presentation layer.

Background

In our previous workflow, whenever a new business initiative or metric required tracking, the Analytics team would request the Data Engineering team to onboard new columns or even create new tables in the presentation layer. This onboarding process involved multiple steps, with the Data Engineering team fully responsible for preparing and deploying the following components:

  • Airflow DAG Creation: Create the DAG file that orchestrates the data pipeline for the table.
  • Target Table DDL Definition: Specify the data schema and structure for the new table or column.
  • DAG Configuration: Set supporting parameters such as scheduling, table dependencies, and other configuration details.
  • SQL Transformation Preparation: SQL scripts to handle data transformations, including queries to clean, aggregate, and format raw data into the desired structure. These transformations align the data with business requirements.

This onboarding procedure limited the Analytics team’s flexibility, preventing them from making quick changes independently and meeting immediate analytical needs. As data demands from the business continued to grow, this manual onboarding process proved unsustainable at scale, adding significant overhead to the Data Engineering workload.

Approach and Implementation

To give a comprehensive understanding of the automated table onboarding process, the following is a detailed explanation of each step involved in the application workflow illustrated in the diagram.

Application Workflow

Before running the Jenkins job, the Analytics team pushes the SQL file to the repository, where it is peer-reviewed within the team to ensure the SQL script is optimized.

Analytics Workflow

Input Parameters (Jenkins)

  • The automation process begins with Jenkins, which was chosen due to its established use within the organization for deployment tasks. Jenkins provides an input parameter interface that functions much like a web application, allowing users to enter the required parameters.
Jenkins Input Parameters

Clone Automation Script

  • The automation script, written in Python, is cloned from its repository so that the latest version is always used. It contains the logic and instructions to execute the subsequent steps in the Jenkins process.
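For illustration, here is a minimal sketch of how such a script might receive the Jenkins parameters; the argument names (--table-name, --schema-json, --dag-config-json) are assumptions for this example, not the actual interface of our script.

```python
import argparse
import json


def parse_args() -> argparse.Namespace:
    # Parse the parameters passed in from the Jenkins job (names are illustrative).
    parser = argparse.ArgumentParser(description="Presentation-layer table onboarding")
    parser.add_argument("--table-name", required=True, help="Target table in the presentation layer")
    parser.add_argument("--schema-json", required=True, help="Column definitions as a JSON string")
    parser.add_argument("--dag-config-json", required=True, help="Airflow settings (schedule, dependencies) as JSON")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    schema = json.loads(args.schema_json)          # e.g. list of column definitions
    dag_config = json.loads(args.dag_config_json)  # e.g. schedule and table dependencies
    print(f"Onboarding {args.table_name} with {len(schema)} column(s)")
```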

Input Validation

  • Before progressing further, the input parameters are validated to confirm that they meet all requirements. This step is essential to prevent human errors such as DDL mistakes or incorrect configurations. The validation includes checks to ensure that the input table does not already exist in the system and that all provided configurations are accurate and complete. If validation fails, the process halts and an error message is generated, clearly indicating the issue and prompting the user to take corrective action (a minimal sketch of such a check follows below).
Validation Script
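For context, here is a minimal sketch of what such a validation could look like, assuming a psycopg2 connection to Redshift; the function name, query, and required configuration keys are illustrative assumptions rather than our exact script.

```python
import psycopg2

REQUIRED_CONFIG_KEYS = {"schedule", "dependencies"}  # assumed required keys


def validate_inputs(conn, schema_name: str, table_name: str, dag_config: dict) -> None:
    # Ensure the DAG configuration carries every required key.
    missing = REQUIRED_CONFIG_KEYS - dag_config.keys()
    if missing:
        raise ValueError(f"Missing configuration keys: {sorted(missing)}")

    # Ensure the target table does not already exist in Redshift.
    with conn.cursor() as cur:
        cur.execute(
            "SELECT 1 FROM information_schema.tables "
            "WHERE table_schema = %s AND table_name = %s",
            (schema_name, table_name),
        )
        if cur.fetchone():
            raise ValueError(f"Table {schema_name}.{table_name} already exists")


# Example usage (connection details are placeholders):
# conn = psycopg2.connect(host="redshift-host", dbname="analytics", user="etl", password="***", port=5439)
# validate_inputs(conn, "presentation", "daily_consultations", {"schedule": "0 1 * * *", "dependencies": []})
```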

Push Airflow DAGs File to Repositories

  • Once input validation passes, we use dynamic DAGs, as explained in this blog, to reduce DAG parsing time. A new DAG file is pushed to the repository only if the number of tables in each existing DAG exceeds the threshold of 20 tables per DAG; if the count is still within the limit, this step is skipped (a simplified sketch of the allocation logic follows below).
Dynamic DAG Allocation Script
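A simplified sketch of this allocation decision, assuming the current table-to-DAG counts are available as a dictionary; the function and DAG names are illustrative, not our actual script.

```python
MAX_TABLES_PER_DAG = 20  # threshold described above


def needs_new_dag_file(table_counts: dict) -> bool:
    # A new DAG file is pushed to the repository only when every existing DAG is full.
    return all(count >= MAX_TABLES_PER_DAG for count in table_counts.values())


def assign_dag(table_counts: dict) -> str:
    # Pick the first DAG that still has capacity; otherwise allocate a new DAG id.
    for dag_id in sorted(table_counts):
        if table_counts[dag_id] < MAX_TABLES_PER_DAG:
            return dag_id
    return f"presentation_layer_dag_{len(table_counts) + 1}"


# Example: two full DAGs force a new file, so assign_dag returns a new id.
counts = {"presentation_layer_dag_1": 20, "presentation_layer_dag_2": 20}
print(needs_new_dag_file(counts), assign_dag(counts))
```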

Insert Config

  • Following the push to the repository, the DAG configuration is inserted, which includes essential Airflow settings such as the schedule and table dependencies. This configuration must be provided in JSON format to reduce the number of input forms. Below is an example of the configuration that will be passed to Jenkins:
Airflow Config
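As a purely illustrative example of the kind of JSON payload a user might pass to Jenkins (the actual field names in our configuration may differ), it could look like the following, loaded here in Python:

```python
import json

# Illustrative only: field names and values are assumptions, not our exact schema.
dag_config = json.loads("""
{
  "dag_id": "presentation_layer_dag_1",
  "schedule": "0 1 * * *",
  "dependencies": ["raw.consultations", "raw.users"],
  "owner": "analytics"
}
""")
print(dag_config["schedule"])  # "0 1 * * *"
```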

Execute Redshift DDL

  • The system executes the Redshift DDL commands to create or modify tables based on the input specifications. Note that the table schema must be provided in JSON format, similar to the configuration (an illustrative example follows below). This step is critical for structuring the data to align with analytical needs.
Table Creation JSON
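To illustrate the idea, here is a rough sketch of turning a JSON column specification into a Redshift CREATE TABLE statement; the JSON shape and helper name are assumptions, not the exact format we use.

```python
import json

# Illustrative only: the JSON shape below is an assumption.
schema_json = """
{
  "schema": "presentation",
  "table": "daily_consultations",
  "columns": [
    {"name": "consultation_date", "type": "DATE"},
    {"name": "doctor_id", "type": "BIGINT"},
    {"name": "total_consultations", "type": "INTEGER"}
  ]
}
"""


def build_create_table(spec: dict) -> str:
    # Render each column as "name TYPE" and wrap them in a CREATE TABLE statement.
    cols = ",\n    ".join(f'{col["name"]} {col["type"]}' for col in spec["columns"])
    return f'CREATE TABLE {spec["schema"]}.{spec["table"]} (\n    {cols}\n);'


print(build_create_table(json.loads(schema_json)))
```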

Deploy DAG and the SQL to S3

  • The DAG file and corresponding SQL script are deployed to an S3 bucket, providing centralized storage that facilitates efficient access by Airflow (a minimal sketch of this step follows below).
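A minimal sketch of this deployment step using boto3; the bucket name and key layout are placeholders, not our actual setup.

```python
import boto3

s3 = boto3.client("s3")


def deploy_artifacts(dag_path: str, sql_path: str, table_name: str) -> None:
    bucket = "example-airflow-bucket"  # placeholder bucket name
    # Upload the generated DAG file and its SQL transformation so Airflow can pick them up.
    s3.upload_file(dag_path, bucket, f"dags/{table_name}.py")
    s3.upload_file(sql_path, bucket, f"sql/presentation/{table_name}.sql")
```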

Additionally, for adding columns, we have a Jenkins job that follows the same approach. The process for adding columns is simpler: we only need to update the logic in the SQL script and provide a few inputs in Jenkins, such as the table name and the new column schema, to alter an existing table.

Add Column JSON
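For illustration, here is a possible shape for the add-column input and the ALTER TABLE statements derived from it; the field names are assumptions, not the exact format of our job.

```python
import json

# Illustrative only: a possible shape for the add-column input.
add_column_json = """
{
  "schema": "presentation",
  "table": "daily_consultations",
  "columns": [
    {"name": "hospital_id", "type": "BIGINT"}
  ]
}
"""

spec = json.loads(add_column_json)
for col in spec["columns"]:
    # Redshift adds one column per ALTER TABLE statement.
    print(
        f'ALTER TABLE {spec["schema"]}.{spec["table"]} '
        f'ADD COLUMN {col["name"]} {col["type"]};'
    )
```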

Conclusion

This automation has eliminated 100% of the tickets and dependencies on the Data Engineering team for creating new tables or adding new columns in the presentation layer, enabling a self-service process for the Analytics team. Deploying a new table now takes approximately 3 minutes, excluding the time required to write the SQL script. Previously, the process involved manually preparing several code components and running multiple CI/CD steps, taking around 60 minutes.

Jenkins Job

References

DataWarehouse’s Presentation Layer in Halodoc
The presentation layer facilitates analysts in creating reports on BI tools and performing analysis. It is a single source of truth for related data and improves cost efficiency, because BI tools only access the presentation layer and do not perform transformations.
Dynamic DAG Generation in Airflow: Best Practices and Use Cases
In this blog, we have documented the importance of reducing top-level code in DAGs, reducing DAG parsing time, and reducing the total number of DAG Python files.
Demystifying ETL Cross-job Dependencies with Apache Airflow
This blog explains how to resolve ETL Cross-job dependencies with Apache Airflow

Join us

Scalability, reliability and maintainability are the three pillars that govern what we build at Halodoc Tech. We are actively looking for engineers at all levels, and if solving hard problems with challenging requirements is your forte, please reach out to us with your resumé at careers.india@halodoc.com.


About Halodoc

Halodoc is the number one all-around healthcare application in Indonesia. Our mission is to simplify and bring quality healthcare across Indonesia, from Sabang to Merauke. We connect 20,000+ doctors with patients in need through our Tele-consultation service. We partner with 3500+ pharmacies in 100+ cities to bring medicine to your doorstep. We've also partnered with Indonesia's largest lab provider to provide lab home services, and to top it off, we have recently launched a premium appointment service that partners with 500+ hospitals that allow patients to book a doctor appointment inside our application. We are extremely fortunate to be trusted by our investors, such as the Bill & Melinda Gates Foundation, Singtel, UOB Ventures, Allianz, GoJek, Astra, Temasek, and many more. We recently closed our Series D round and in total have raised around USD 100+ million for our mission. Our team works tirelessly to make sure that we create the best healthcare solution personalised for all of our patient's needs, and are continuously on a path to simplify healthcare for Indonesia.

Firda Ramadhan Putra

Data Engineer at Halodoc