15 SQL Query Optimisation Techniques

Optimisation Dec 28, 2020

Database is an essential software component for any business. It contains organisational data and allows businesses to build features to serve their customers. Here at Halodoc, we extensively use Relational Database Management Systems(RDBMS) as well NoSQL databases for our specific needs.

We have setup active and passive monitoring systems for checking slow queries in the database. We have enabled slow log and we have a performance insight monitor for passive monitoring. For active monitoring, we have alerts on CPU, memory and application behaviour. We have also created custom scripts to trigger alerts when query execution time reaches a particular threshold. These scripts also terminate queries based on predefined criteria.

This blog provides query optimisation techniques for designing good logic and extracting maximum performance from the database. Query logic impacts database server, application server, network, IO and end-user experience.

Query optimisation has many benefits, some of which are listed below.

Minimize production issues: Every slow query requires high CPU, memory, and IOPS. We have seen most production database issues are caused by non-optimised queries.
Performance issues: Slow query means slower response time of applications using the query, which results in poor end-user experience. It is very important to test logic/query with sufficient data before running it in production.
Save infra cost: Unoptimised query requires more CPU, IOPS and memory. Additional load can be put on the same server if queries are optimised.

Basic Considerations for Logic Design

Understand the strengths and limitations of the database systems

With clustering like Kubernetes it is relatively easy to scale application servers but to scale and load balance the database server it is customary to use a "source-replica" structure. This structure enables efficient write-loads at the source and read-loads at replicas.

Creating and maintaining replicas and scaling it up and down is expensive and not as feasible as in application server.

In summary, avoid embedding complex business logic in queries. Try to create light-weight SQL queries - avoiding complex joins and avoid long-running transactions.
Consideration for Data type

The datatype of any column is an important part of the table design and has a significant impact on performance.

For example, using Varchar(36) for storing fixed-length data (UUID) will incur a 20% query executing penalty when compared to using char(36) for fixed-length data. Similarly using char on variable column can increase size of data and size of the index.

Designing table with a basic understanding of datatype like varchar(1000) vs text can give performance benefits. An unnecessary large datatype increases data on the disk and reduces the performance of the indexes .
Index at the time of table design not after production issue:-

The index is a very critical part of table design. In Startups new features go live very frequently. It is very important to create indexes at the time of the creation of the table or before new logic goes into production.
It is also important not to create too many indexes on any table as it can slow down transactions.

Use Explain Plan for Query analysing

Explain Plan is a great utility of MySQL to check query behaviour. It works for select, insert, update and delete query. There are lot of insights we can get from Explain Plan. Below are few important points.

Check index is available or not
Index is used or not.
How many rows are getting scanned for providing results.
Determine ordering operation using filesort or else.
Length of an index.
Determine access type.
Use SHOW WARNINGS after explain give some additional information.

In explain plan example below query executing plan by format=json

explain format=json select user_name from USER where phone_no=12345 order by user_name;

{

"query_block": {

"select_id": 1,

"cost_info": {

"query_cost": "217738.90"

"ordering_operation": {

"using_filesort": true,

"cost_info": {

"sort_cost": "71842.30"

"table": {

"table_name": "USER",

"access_type": "ALL",

"possible_keys": [

"phone_idx"

"rows_examined_per_scan": 718423,

"rows_produced_per_join": 71842,

"filtered": "10.00",

"cost_info": {

"read_cost": "131528.14",

"eval_cost": "14368.46",

"prefix_cost": "145896.60",

"data_read_per_join": "3M"

"used_columns": [

"id",

"user_name",

"phone_no"

"attached_condition": "(`log_for_period_tracker`.`USER`.`phone_no` = 12345)"

}

mysql> show warnings;

+---------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

| Level | Code | Message |

| Warning | 1739 | Cannot use range access on index 'phone_idx' due to type or collation conversion on field 'phone_no' |

| Note | 1003 | /* select#1 */ select `log_for_period_tracker`.`USER`.`user_name` AS `user_name` from `log_for_period_tracker`.`USER` where (`log_for_period_tracker`.`USER`.`phone_no` = 12345) order by `log_for_period_tracker`.`USER`.`user_name` |

We can understand how the query gets executed at the database in the following points.

Index is available in “possible_keys”.
Table doing full table scan as “access_type”=”ALL”
Index is not used from show warnings.
Filesort is used for “ordering_operation”.
Row scanning is high from “rows_examined_per_scan”.

Now we can optimise query by changing phone_no='12345' and additional optimisation for order by filesort a composite index on phone_no, user_name can be added as shown in below explain plan.

mysql> explain format=json select user_name from USER where phone_no='12345' order by user_name;

{

"query_block": {

"select_id": 1,

"cost_info": {

"query_cost": "1.20"

"ordering_operation": {

"using_filesort": false,

"table": {

"table_name": "USER",

"access_type": "ref",

"possible_keys": [

"phone_idx",

"phone_no_user_name_idx"

"key": "phone_no_user_name_idx",

"used_key_parts": [

"phone_no"

"key_length": "18",

"ref": [

"const"

"rows_examined_per_scan": 1,

"rows_produced_per_join": 1,

"filtered": "100.00",

"using_index": true,

"cost_info": {

"read_cost": "1.00",

"eval_cost": "0.20",

"prefix_cost": "1.20",

"data_read_per_join": "48"

"used_columns": [

"id",

"user_name",

"phone_no"

"attached_condition": "((`log_for_period_tracker`.`USER`.`phone_no` <=> '12345'))"

}

Query optimisation techniques and Common mistakes

Avoid Using Different Character Encoding:

It requires implicit conversion while making join on tables with different charset. Most of time optimiser does on perform proper implicit conversion of charset and not utilise index. We should use the same encoding across tables to utilise the better indexing feature of the DB.

Both tables below uses different CHARSETS

Table USER

Table CUSTOMER

CREATE TABLE `USER` (

`id` int(11) NOT NULL AUTO_INCREMENT,

`user_name` varchar(25),
`phone_no` varchar(15),

PRIMARY KEY (`id`),

KEY `name_idx` (`user_name`),
KEY `phone_idx` (`phone_no`)

) ENGINE=InnoDB DEFAULT CHARSET=latin1

CREATE TABLE `CUSTOMER` (

`id` int(11) NOT NULL AUTO_INCREMENT,

`customer_name` varchar(25),

`order_date` datetime,
`mobile_no` bigint,

PRIMARY KEY (`id`),

KEY `name_idx` (`customer_name`),

KEY `mobile_idx` (`mobile_no`)

) ENGINE=InnoDB DEFAULT CHARSET=utf8

In the example below a.user_name = b.customer_name compares with different encoding and hence it may not utilise indexes even-though index is present and datatype are same for user_name and customer_name.

select a.id,a.user_name,b.customer_name
from USER a inner join CUSTOMER b
on a.user_name = b.customer_name

It requires to alter table to fix charset or need to covert charset as show in below query

select a.id,a.user_name,b.customer_name
from USER a inner join CUSTOMER b
on convert(a.user_name using utf8)= b.customer_name

2. Understanding of comparing columns datatypes:

In the above tables, phone_no is stored as VARCHAR in the first table and mobile_no is BIGINT in the second table. When two columns with different datatypes get compared in a query it might not utilise the index.

As example in below:

select a.user_name,a.phone_no, b.customer_name
from USER as a
inner join CUSTOMER as b
on a.phone_no = b.mobile_no

It is required to alter one of the column's datatype so they match.

3. Understanding of column datatype comparing with values:

In queries, one should avoid comparing different datatype column and value. When a query compares different datatypes, it uses implicit datatype conversion.

For example in the query below, mobile_no is a BIGINT datatype but query comparing as CHAR value.

select customer_name from CUSTOMER where mobile_no=’9876543210’

For better utilisation of indexes, comparison should be done as the same data type as an integer to integer or varchar to varchar.

4. Avoid using Function-based clause in where condition:

When a query uses a function in the where or join clause on the column, it would not utilise the index.

where date(mail_sent) ='2018-01-29'
where lower(user_name)='abc'
where concat(first_name,' ',last_name)='jon bon'
inner join lower(first_name)=lower(user_name)
date(sent_time)=date(now)

One way to optimise query below

select count(email_sent) from EMAIL_TABLE where date(sent_time)=date(now())

is by changing date(mail_sent)=date(now) as following

select count(email_sent)

from EMAIL_TABLE

where sent_time between

str_to_date(concat(current_date,' 00:00:00'),'%Y-%m-%d %H:%i:%s')

and

str_to_date(concat(current_date,' 23:59:59'),'%Y-%m-%d %H:%i:%s'))

The function-based index is available only certain database version like MySQL 8. But if this kind of use of the function is unavoidable in MySQL 5.7 we can create a computed column and create an index on that.

5. Avoid using DISTINCT and GROUP BY at the same time:

The query below has performance overhead when it uses Distinct and Group By together, GROUP BY itself makes rows distinct. Using GROUP BY as in the below query also has one drawback, it will give random values for those columns(b.customer_name) that do not have a group function.

select distinct a.user_name,b.customer_name,count(a.id)
from USER a inner join CUSTOMER b
on a.user_name = b.customer_name

group by a.user_name

There is no need to use DISTINCT and GROUP BY together. This query could be re-written as follows

select a.user_name,group_concat(b.customer_name),count(a.id)
from USER a inner join CUSTOMER b
on a.user_name = b.customer_name

group by a.user_name

6. Avoid using UNION and DISTINCT at the same time:

Union itself makes distinct records, so we need not use DISTINCT with UNION as shown in the query below.

select distinct user_name from USER
UNION
select distinct customer_name from CUSTOMER

7. Avoid selecting unnecessary columns:

Selecting unnecessary columns would be a waste of memory, CPU cycle and network. Every query should select only the required columns for better query performance.

As example query below is select all records

select * from CUSTOMER where id = 123;

it is better to select column name instead of * or extra columns.

select customer_name from CUSTOMER where id = 123;

8. Avoid using Long-running transaction:

When a DML query is executed on a large data range, during the execution time the transaction may lock rows or may lock the full table.
During this lock time the database disallows concurrent transactions and other DML queries would be in a wait state, this similar to small downtime on the table access.
Breaking long running transactions into smaller transaction can be more optimised while making an UPDATE or Delete query.

delete from CUSTOMER where order_date between ‘1990-01-01’ and ‘2020-12-01’
update CUSTOMER set mobile_no=18273278291

where upper(customer_name)=’abc’;

In the above queries the 1st transaction is deleting huge data because of long date range and the 2nd query is not utilising index and responding slower and locking the table during execution time.

9. Avoid subquery where possible:

MySQL and many databases perform slower when there is a subquery in SQL statement query. Subquery creates temp tables to store data and sometimes it creates temp tables on the disk thereby slowing the query execution.
We should avoid such queries as much as possible as it is not scalable.

select a.user_name, b.customer_name

from USER a

join (select customer_name from CUSTOMER where order_date = ‘2020-10-02’) b

on a.user_name= b.customer_name

10. Avoid using Order by in the Subquery:

Using Order By in subquery gives overhead as it unnecessarily file sort subquery data.

select user_name

from USER

where user_name IN (select customer_name from CUSTOMER where order_date > ‘2020-10-02’ order by order_date)

Such Order By makes sense when limiting rows is required.

select user_name

from USER

where user_name IN (select customer_name from CUSTOMER where order_date > ‘2020-10-02’ order by order_date desc limit 10)

11. Avoid using Subquery in IN and NOT IN:

MySQL and many databases perform slower when there is a subquery in SQL statement query. Subquery creates temp tables to store data and sometimes it creates temp tables on the disk thereby slowing the query execution.

In the query below NOT IN (select id from t2) creates a temp table.

select user_name

from USER

where id NOT IN (select id from CUSTOMER where order_date = ‘2020-10-02’)

We can optimise the above query in two ways.

1. select group_concat(id) from CUSTOMER where order_date = ‘2020-10-02’
2. select user_name from USER where id NOT IN ( id get from 1st query)

We can optimise by selecting id in 1st query and use the result in 2nd query as above.
Or we can change the above subquery to LEFT JOIN as below

select a.user_name

from USER a

left join CUSTOMER b on a.user_name=b.customer_name and b.order_date=‘2020-10-02’

where b.customer_name is null

12. Avoid joining tables with OR condition:

Joining tables with OR condition may convert the query into a cross join like query and this query would take a long time and high resources to execute.

For example the query below written with OR

select a.user, b.employee_name, b.manager_name
from EVENTS a
join EMPLOYEE b on
a.user=b.employee_name OR a.user =b.manager_name

can be optimised by making changes to query as shown below by changing OR to UNION ALL.

select a.user, a.event_name, b.employee_name, b.manager_name
from EVENTS a
join EMPLOYEE b on a.user=b.employee_name

UNION ALL

select a.user, b.employee_name, b.manager_name
from EVENTS a
join EMPLOYEE b on a.user =b.manager_name

13. Avoid join with not equal condition:

When a query joins with NOT EQUAL operator it searches all rows and uses a full table scan. Joining tables with != is a bad query type join. We should rewrite the query to use different logic does not use != operator.

select a.user, b.employee_name, b.manager_name
from SALE a join EMPLOYEE b
on a.user != b.employee_name

14. Understanding Transaction and Deadlock:

When a query uses different range indexes and executes simultaneously then it increases the chance of a deadlock.

In the example query below, columns sale_id, item and date have indexes. When these types of queries are executed concurrently, the 1st query may take a lock on those rows that are required to complete the transaction for the 2nd query, and similarly the 2nd query can acquire a lock on those rows required for the 1st query to complete the transaction.

We should avoid these kinds of situations while making DML queries.
As example:

update SALE set name=’sold’ where sale_id='123' and item='abc' and date = '2020-10-03'.

One way to optimise this transaction is to divide the query into two parts: get the primary key by 1st query and put the result in the 2nd query.

select group_concat(primaryKeyId) from SALE where sale_id='123' and item='abc' and date = '2020-10-03'.
Update SALE set name=’sold’ where primaryKeyId in (selectedValues)

15. Avoid Full-Text Search:

When a query searches for keywords with a wildcard in the beginning it does not utilise an index.

select user_name

from USER

where user_name like ‘%abc%’

It is recommended to search using like ‘keyword%’ instead of searching using like ‘%keyword%’ so as to utilise the indexes as shown below. For a full-text search, it is advisable to use the Full-Text Index.

select user_name

from USER

where user_name like ‘abc%’

Conclusion:

We have covered some of the important query optimisation techniques that we have learn over the years and is known to produce good results. There are various other possible optimisation options for queries like usage of where instead of having, use of pagination etc..,

If you diligently follow the query optimisation techniques detailed above, you could benefit from improved query performance, lesser production issues and save cost by minimising resources.

Join us
We are always looking out for top engineering talent across all roles for our tech team. If challenging problems that drive big impact enthral you, do reach out to us at careers.india@halodoc.com

About Halodoc
Halodoc is the number 1 all around Healthcare application in Indonesia. Our mission is to simplify and bring quality healthcare across Indonesia, from Sabang to Merauke. We connect 20,000+ doctors with patients in need through our Tele-consultation service. We partner with 3500+ pharmacies in 100+ cities to bring medicine to your doorstep. We've also partnered with Indonesia's largest lab provider to provide lab home services, and to top it off we have recently launched a premium appointment service that partners with 500+ hospitals that allow patients to book a doctor appointment inside our application. We are extremely fortunate to be trusted by our investors, such as the Bill & Melinda Gates Foundation, Singtel, UOB Ventures, Allianz, GoJek and many more. We recently closed our Series B round and In total have raised USD$100million for our mission. Our team works tirelessly to make sure that we create the best healthcare solution personalised for all of our patient's needs, and are continuously on a path to simplify healthcare for Indonesia.

Recommended for you

Capacity Planning

Structured Capacity Planning and Automation to Optimise Java Services

4 months ago • 9 min read

web automation

Boost Web Automation Speed: Parallel Testing with Cucumber & TestNG

4 months ago • 4 min read

NewRelic custom integration for MySQL RDS

Halodoc's Journey to ISO/IEC 27001:2022 - Key Steps in Adopting New Security Requirements

Optimizing Apache Hudi Workflows: Automation for Clustering, Resizing & Concurrency

Upgrading to Dropwizard 4.0.x: A Complete Migration Guide

15 SQL Query Optimisation Techniques

Basic Considerations for Logic Design

Use Explain Plan for Query analysing

Query optimisation techniques and Common mistakes

select a.user_name,a.phone_no, b.customer_name
from USER as a
inner join CUSTOMER as b
on a.phone_no = b.mobile_no

Tags

Himanshu Singh

Recommended for you

Structured Capacity Planning and Automation to Optimise Java Services

Boost Web Automation Speed: Parallel Testing with Cucumber & TestNG

NewRelic custom integration for MySQL RDS

Halodoc's Journey to ISO/IEC 27001:2022 - Key Steps in Adopting New Security Requirements

Optimizing Apache Hudi Workflows: Automation for Clustering, Resizing & Concurrency

Upgrading to Dropwizard 4.0.x: A Complete Migration Guide

Basic Considerations for Logic Design

Use Explain Plan for Query analysing

Query optimisation techniques and Common mistakes

select a.user_name,a.phone_no, b.customer_namefrom USER as a inner join CUSTOMER as bon a.phone_no = b.mobile_no

Tags

Himanshu Singh

Recommended for you

Securing the Cloud : How Cloud Security Platforms Handle Threats and Misconfigurations In Real… Time

Structured Capacity Planning and Automation to Optimise Java Services

Boost Web Automation Speed: Parallel Testing with Cucumber & TestNG

select a.user_name,a.phone_no, b.customer_name
from USER as a
inner join CUSTOMER as b
on a.phone_no = b.mobile_no