At Halodoc, we use AWS Redshift as a data warehouse, it is the single source of truth for analytics and reporting. In the past few years we have faced various challenges while building and maintaining the data warehouse using Redshift.
This blog covers the optimisation techniques that have been followed at Halodoc to solve various problems.
We have done optimisation at storage, compute and cluster level. Here are the summary of 10 performance tuning techniques :
- Choose the right distribution style and sort key
- Choose the right column compression
- Archive unused data
- Use Spectrum for infrequently used data
- Configure WLM Queues
- Monitoring Redshift Cluster
- Take action on Redshift advisor
- Run Vacuum Sort
- Choose the right Cluster type
- Use copy command
We will go on details one by one.
1. Choose the Right distribution style and sort key:
Distribution style and sort key are very important in Redshift. Choosing the right distribution style and sort key can improve the query performance.
Amazon Redshift distributed rows to each node based on distribution style. The goal in selecting the distribution style is to have the rows distributed evenly throughout the node for parallel processes. There is 4 distribution style in Redshift:
- EVEN: Data distributed in round-robin across nodes.
- KEY: Data with the same value will be placed in the same slice
- ALL: Data is distributed in all nodes, and choose ALL for a smaller dataset that ranges between 1 to 10K.
- AUTO: Redshift automatically decides the best distribution key based on the usage pattern. In general : Small tables are distributed with ALL strategies, and large tables are distributed with EVEN strategy.
Sort key in Redshift determines how data to be sorted based on the key defined. Choosing the right sort key can improve query performance because it allows the query planner to scan fewer data when doing a filter, join, group, and order operations. There are 2 types of sort keys :
- Compound sort key
Data sorted based on order listed in table sort key definition.
- Interleaved sort key
Multiple columns can be defined as interleaved sort key and it gives equal weight to each column, if the query has fixed columns in the filter clause, applying interleaved sort key is best suited to improve performance.
Distribution style and sort key are defined at table creation, but they can be altered later after table creation. Example of alter statement :
ALTER TABLE tablename ALTER DISTSTYLE ALL;
ALTER TABLE tablename ALTER SORTKEY (column_list);
Note: There are restrictions when altering distribution style and sort key. Tables with interleaved sort key can not be altered for both Alter distyle and Alter Sort key.
2. Choose the right column compression
Amazon Redshift is a columnar database, which means each Data Blocks store the value of a single column for multiple rows. Column compression conserves storage space and reduces the disk I/O utilization because it scans fewer data blocks on the disk. When creating a Redshift table sometimes we miss out on providing column compression strategy, it causes the data stored without being compressed and increases the size of the table. To do that we must search for columns that are not being compressed, and we can alter the column compression using alter statement :
ALTER TABLE tablename ALTER COLUMN column_name ENCODE new_encode_type
Note : The column chose as the 1st sort key should not be encoded, and columns with Interleaved Sort key can not be altered.
3. Archive unused data
With growing data in our data warehouse, we need to manage old data efficiently. Some tables or some data from tables are not used by the user and this data can be archived. The idea of the framework is to unload that data from Redshift to S3 and delete it from Redshift. We have scheduled the archival framework to run in off peak hours. So we built a framework that can handle:
- Define data that need to archive
- The first step is to analyze and gather the information on data that can be archived. For example, data from services that have been decommissioned, or transactional data from more than 2 years ago. We can put the detailed information in a config table, such as table name and filter condition.
- Unload data to Amazon S3
- Based on information generated in the previous step, a SQL select statement will be generated and utilizing unload command, data from Redshift will be loaded to S3.
- Delete data from Redshift
- After the data is uploaded to S3, data from Redshift will be deleted using the same filter condition from the previous step.
- Run Vacuum Delete
- Vacuum Delete is used to reclaim storage space after data is deleted from the table.
- In the future, if the user needs to select the data we can enable the glue crawler and create an external schema in Redshift.
4.Use Spectrum for infrequently used data
Using Amazon Spectrum, we can perform SQL query in Redshift from the data stored in S3. At Halodoc, the Amazon Spectrum is used to store 3rd party data that are rarely used by the user. Amazon Spectrum charges for every data pulled from S3, so it is recommended to store the compressed data and also in columnar Parquet format. Below is the step we follow on using spectrum:
- Store compressed data in S3 with Parquet format, each table is stored in a different folder.
- Using AWS Glue, create a data catalog database and schedule a crawler and point the s3 folder path to crawl.
- Create an external schema in Redshift from the data catalog.
5.Configure WLM Queues
Amazon Redshift Workload Management(WLM) enables users to manage query priority within the workload. The goal when using WLM is, a query that runs in a short time won't get stuck behind a long-running and time-consuming query. WLM can be configured on the Redshift management Console. At Halodoc we also set workload query priority and additional rules based on the database user group that executes the query. There are 3 user groups we created based on our user consuming data from Redshift:
Users that execute ETL/ELT load to Redshift. Majorly application writing or copying data to Redshift are assigned to these group.
Used by BI team and reporting dashboard.
Used by analytics team to perform ad hoc query
We set up rules to give high query priority for Transform users, and normal for both BI and Analytics user groups. And also the additional rule is to abort long-running queries that runs by BI and Analytics user groups. By doing this some query will be aborted and the query needs to be optimized.
6.Monitor Redshift Cluster
Monitoring the Redshift cluster is essential to track the performance of the cluster. There are some ways to perform monitoring and alerting Redshift cluster :
- Set an alarm (and send it to a Slack channel). We utilized AWS CloudWatch Alarm and sent it to the Slack channel. There are 3 alarms set up to monitor the Redshift cluster.
- CPU usage alert is sent when CPU usage reaches a percentage threshold after some period of time.
- Disk usage alert is sent when disk usage reaches a percentage threshold.
- DB Connection Alarm is sent when concurrent database connections reach the threshold limit.
- Alerting 5 top long-running queries and sending the query to the slack channel. This is to monitor scripts running on Redshift and will take action to optimize the query. We use Airflow to schedule a query in Redshift to return 5 top long-running query and send it as an alert in the slack channel. We also reach out to the query owner and together we work to optimize the query.
7. Take action on Redshift Advisor
We also frequently monitor the Redshift Advisor. Amazon Redshift Advisor offers specific recommendations based on the performance and usage metric of our cluster. The given advice includes :
- Improve Query Performance with Distribution Keys.
- Run VACUUM SORT on Tables.
- Improve Query Performance and Optimize Storage with Compression Encoding.
- Initialize Interleaved Sort Key.
8. Run VACUUM SORT on Tables
Vacuum Sort is used to sort the data in the table based on the sort key mentioned. We scheduled a vacuum sort command on Airflow and made it run daily. When new data is inserted into the table it needs to be sorted based on the sort key defined that's where VACUUM SORT is useful. This activity is used to maintain the performance of the table. We do not run vacuum sort command on all tables, only table with Percentage of unsorted rows in the table greater than 10 %.
Below query is used to get the table list :
SELECT "schema", "table" FROM svv_table_info where unsorted > 10 .
9. Choose the Right cluster type
Amazon Redshift data warehouse is a collection of computing resources called nodes and organized into a group called a Cluster. Each Cluster has a leader node and one or more compute nodes. There are 3 cluster types in Redshift DS2, DC2, and RA3. DS2 nodes enable you to create large data warehouses using hard disk drives (HDDs), while DC2 instances use SSD to store the data. If you have a compute-intensive requirement in a data warehouse, then the DC2 instance is a suitable cluster type. If you want a cluster that can scale storage capacity automatically without manual intervention and no need to add additional compute resources, RA3 is the suitable cluster type.
10. Use copy command
When doing a migration from another database to Redshift, it is recommended to upload the data to S3 and then using a copy command insert the data to Redshift. Below consideration should be taken while writing data into Redshift.
- Split files based on the number of slices in the Redshift and then run the COPY operation to achieve parallel processing.
- Make sure files are compressed and better to use columnar data format while copying data to Redshift to reduce the I/O operation.
Redshift optimization is a continuous process that needs to be done. In general, applying best practice technique and monitoring the cluster performance is a must. This blog mentions some of the techniques that we follow to maintain the Redshift cluster performance. One important note to follow, we need to consistently take action on each alert that is set up on the Redshift Cluster.
Scalability, reliability and maintainability are the three pillars that govern what we build at Halodoc Tech. We are actively looking for engineers at all levels and if solving hard problems with challenging requirements is your forte, please reach out to us with your resumé at email@example.com.
Halodoc is the number 1 all around Healthcare application in Indonesia. Our mission is to simplify and bring quality healthcare across Indonesia, from Sabang to Merauke. We connect 20,000+ doctors with patients in need through our Tele-consultation service. We partner with 3500+ pharmacies in 100+ cities to bring medicine to your doorstep. We've also partnered with Indonesia's largest lab provider to provide lab home services, and to top it off we have recently launched a premium appointment service that partners with 500+ hospitals that allow patients to book a doctor appointment inside our application. We are extremely fortunate to be trusted by our investors, such as the Bill & Melinda Gates Foundation, Singtel, UOB Ventures, Allianz, GoJek, Astra, Temasek and many more. We recently closed our Series C round and In total have raised around USD$180 million for our mission. Our team works tirelessly to make sure that we create the best healthcare solution personalised for all of our patient's needs, and are continuously on a path to simplify healthcare for Indonesia.