RDS MySQL performance monitoring at Halodoc

Dec 29, 2019

Healthy backend resources are one of the most important components of a great product. At Halodoc, monitoring and gathering the relevant metrics for all our resources is a vital practice. This enables near real-time alerting that helps us take appropriate measures to counter any issues.

We use CloudWatch alarms for general RDS monitoring. RDS Enhanced Monitoring, Performance Insights and the Percona Monitoring & Management (PMM) tools are used for performance troubleshooting. All our critical RDS servers have Enhanced Monitoring and Performance Insights enabled. In parallel, we are also quickly adopting PMM.

In this blog, we will give an overview of using Enhanced Monitoring, Performance Insights and PMM to troubleshoot high CPU usage on RDS MySQL.

RDS provides features like Enhanced Monitoring and Performance Insights, which, once enabled, expose OS and MySQL wait metrics for troubleshooting performance issues. PMM is an open-source platform for managing and monitoring MySQL and MongoDB.

Enhanced Monitoring

Enhanced Monitoring (EM) produces finer-grained RDS metrics, down to a one-second interval, to capture and troubleshoot transient issues. It also displays the metrics at the sub-component level. Using EM graphs, we can see the current list of running processes (similar to the output of the top command), the CPU and memory breakdown, the load average and so on.
For example, to troubleshoot high CPU issues, the following EM graphs are helpful.

CPU Total - This is the total percentage of the CPU in use.

CPU System - The percentage of CPU in use by the OS kernel. This can be high in cases such as:

(a) The OS is too busy scheduling processes (context switching) because the number of active processes is much higher than the number of vCPUs. We may see higher values in the Tasks Running and Load Average graphs. To limit the number of processes, we can (see the sketch after this list):

  1. Limit the maximum number of connections.
  2. Make sure innodb_thread_concurrency is not set too high.
  3. Lower innodb_read_io_threads and innodb_write_io_threads if they are set to high values.
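For reference, the current values of these settings can be checked from any MySQL session; a minimal sketch:

    -- Inspect the current values of the variables mentioned above.
    -- On RDS these are changed through the DB parameter group rather than
    -- SET GLOBAL (the master user lacks the SUPER privilege), and
    -- innodb_read_io_threads / innodb_write_io_threads also need a restart.
    SHOW GLOBAL VARIABLES
    WHERE Variable_name IN ('max_connections', 'innodb_thread_concurrency',
                            'innodb_read_io_threads', 'innodb_write_io_threads');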

(b) The OS is busy swapping memory pages because free memory is low. This can be confirmed by high Swap In/Out metrics and kswapd processes consuming high CPU in the OS process list. To reduce memory contention, we can:

1. Limit the number of connections.
2. Check if innodb_buffer_pool_size is set too high. It is recommended to be set at about 75% of the available memory.
3. Check if read_buffer_size, sort_buffer_size or join_buffer_size is set too high.
4. Use sys.memory_by_thread_by_current_bytes to find any query with high memory usage, and check whether it uses stored procedures or performs expensive sorts/joins, as shown in the sketch below.
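Item 4 can be run directly against the sys schema (MySQL 5.7+). A minimal sketch, assuming the Performance Schema memory instruments are enabled:

    -- Top 10 threads by currently allocated memory (the view is already
    -- sorted by current allocation, largest first).
    SELECT thread_id, user, current_allocated, total_allocated
    FROM sys.memory_by_thread_by_current_bytes
    LIMIT 10;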

CPU User - For MySQL RDS, this represents the underlying RDS infrastructure processes. If it is high, it is likely that an RDS maintenance activity is going on, and we may need to contact AWS Support.

CPU Nice - For MySQL RDS, this represents CPU used by user processes or application threads. If this is high, we can use the OS Process List to find the offending thread ID and join it to performance_schema.threads on thread_os_id to identify the MySQL thread. We can then use sys.processlist to check the SQL statement being executed.
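For example, if the OS process list shows a MySQL thread with OS thread ID 12345 (a hypothetical value) burning CPU, the lookup might look like this:

    -- Map an OS thread ID from the EM process list to the SQL it is running.
    SELECT p.conn_id, p.user, p.current_statement
    FROM performance_schema.threads AS t
    JOIN sys.processlist AS p ON p.thd_id = t.thread_id
    WHERE t.thread_os_id = 12345;  -- hypothetical OS thread ID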

CPU Wait - The percentage of CPU used while waiting for I/O access. If this is high, check whether Read+Write IOPS/throughput is reaching the EBS volume limit or whether the disk queue depth is high. The likeliest culprits are SELECT queries that do not use indexes or that scan non-partitioned tables. Performance Insights is really helpful in determining which query is experiencing the most I/O-related waits. The query's explain plan can then be checked to determine the next course of action.
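As a complement to Performance Insights, the sys schema can surface such queries directly; a minimal sketch using the statements_with_full_table_scans view (MySQL 5.7+):

    -- Statements that avoided indexes, ordered by rows examined per execution.
    SELECT query, exec_count, no_index_used_count, rows_examined_avg
    FROM sys.statements_with_full_table_scans
    ORDER BY rows_examined_avg DESC
    LIMIT 10;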

Steal - The percentage of CPU in use by other virtual machines. This is relevant when using a T2/T3 instance. A high value means the CPU burst credits are exhausted and the instance is being throttled.

Performance Insights

We can use Performance Insights to identify the queries that are running on the instance in a particular interval and causing a high load. If Performance Insights is enabled, session information is collected, aggregated and displayed in the dashboard as the Average Active Sessions chart. The chart displays the Max CPU value as a line, so we can see whether active sessions are exceeding it. The Max CPU value is determined by the number of vCPU (virtual CPU) cores of the DB instance.

If the load in the Average Active Sessions chart is often above the Max CPU line and the primary wait state is CPU, the system CPU is overloaded. In that case, we may want to throttle connections to the instance, tune SQL queries with a high CPU load, or consider a larger instance class.
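When the console is not at hand, a rough point-in-time approximation of active sessions can be taken from MySQL itself (Performance Insights aggregates this continuously, so this is only a spot check):

    -- Count sessions doing work right now, excluding idle connections
    -- and this monitoring query's own connection.
    SELECT COUNT(*) AS active_sessions
    FROM information_schema.processlist
    WHERE command <> 'Sleep'
      AND id <> CONNECTION_ID();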

As only one query is responsible for most of the load in the above case, we can start troubleshooting by looking at its execution plan and checking whether it uses the correct indexes and join order.

Performance Insights also helps us determine which user or application host is causing the high CPU or other wait events.

Using PMM

The Percona Monitoring and Management (PMM) tool can be launched from an AWS AMI or a Docker image to collect monitoring data from multiple RDS sources, which can then be used for performance troubleshooting. It gathers OS metrics from CloudWatch and MySQL metrics from the Performance Schema and other tables, and retains them to help with future analyses.

To troubleshoot the high CPU issue, we can look at the overall CPU utilisation under the OS Metrics section.

Here, most of the CPU is consumed by user/application operations (nice) and by performing I/O (wait).

As the application is consuming the most CPU, we can use PMM Query Analytics (QAN) to check the top queries.

Here, Load is the amount of time that the database server spent executing the query during the selected time range.

Count is the number of requests made to the server during the selected time range.

And Latency is the average amount of time that it took the database server to retrieve and return the data.

Here it is evident that Query #1 accounts for almost all (96.4%) of the total load.

Clicking on the top query revealed that it had scanned a large number of rows, which could be resolved with a suitable index.

Next Steps

Once we have identified the offending query, we can analyse it further using EXPLAIN. The EXPLAIN statement provides information about how MySQL executes a statement: how many rows are examined and filtered, whether an index is used, and the join order. This information helps us devise the correct indexing and partitioning strategy, and can reveal issues with table/index statistics. More information can be found in the MySQL documentation.
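A minimal sketch of this workflow; the table, columns and index below are illustrative, not from the actual incident:

    -- Inspect how MySQL plans to execute a hypothetical slow query.
    EXPLAIN
    SELECT id, status
    FROM orders
    WHERE customer_id = 42
      AND created_at > '2019-12-01';

    -- If the plan shows type = ALL (full table scan) with a large rows
    -- estimate, a composite index on the filter columns is a likely fix.
    ALTER TABLE orders ADD INDEX idx_customer_created (customer_id, created_at);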

If the query optimiser seems to be making a wrong decision, we can use the ANALYZE TABLE statement to rebuild the table and index statistics. ANALYZE TABLE requires a flush lock, so if there are long-running statements or transactions still using the table, subsequent statements and transactions must wait for those operations to finish before the flush lock is released.
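A minimal sketch, reusing the hypothetical orders table from above:

    -- Rebuild table/index statistics; schedule this in a quiet window,
    -- since ANALYZE TABLE takes a flush lock.
    ANALYZE TABLE orders;

    -- The refreshed statistics can be sanity-checked via index cardinality.
    SHOW INDEX FROM orders;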

We can use these tools to diagnose other types of performance issues as well. It is always recommended to first test any solution in an identical Dev/Test environment before implementing it in Production.

In this blog, we have given a bird's-eye view of RDS performance monitoring at Halodoc. RDS performance monitoring is a critical and complex task, and we continuously strive to make sure that our monitoring is simple, comprehensive and cost-effective.

Scalability, reliability and maintainability are the three pillars that govern what we build at Halodoc Tech. We are actively looking for DevOps engineers, and if solving hard problems with challenging requirements is your forte, please reach out to us with your resumé at careers.india@halodoc.com.

About Halodoc

Halodoc is the number one all-around healthcare application in Indonesia. Our mission is to simplify and bring quality healthcare across Indonesia, from Sabang to Merauke.
We connect 20,000+ doctors with patients in need through our tele-consultation service. We partner with 1,500+ pharmacies in 50 cities to bring medicine to your doorstep. We have also partnered with Indonesia's largest lab provider to offer lab home services, and to top it off we have recently launched a premium appointment service that partners with 500+ hospitals, allowing patients to book doctor appointments inside our application.
We are extremely fortunate to be trusted by our investors, such as the Bill & Melinda Gates Foundation, Singtel, UOB Ventures, Allianz, Gojek and many more. We recently closed our Series B round and in total have raised USD 100 million for our mission.
Our team works tirelessly to make sure that we create the best healthcare solution personalised for all of our patients' needs, and we are continuously on a path to simplify healthcare for Indonesia.

Aditya Sharma

I work at Halodoc as a DevOps DBA, and my team is responsible for managing 100+ RDS instances.