ProxySQL: Harnessing the Power of Proxies for Database Optimization
At Halodoc, we give top priority to identifying and protecting the Personally Identifiable Information (PII) of our patients and partners. This PII data may include name, phone number, email id, credit card details, address, etc.. Along with masking this PII data from our non-application users like developers, data analysts, etc.., we also wanted to segregate database query traffic efficiently by routing read & write queries to replica and primary instances respectively.
Current database access setup
We have integrated all of our RDS instances with Vault, so that every time developer needs access to any of the databases, he will be able to generate dynamic short lived read only credentials for that database with a Jenkins job, which will call Vault in the backend and generate the credentials. Once the developer gets the database credentials over the email, he can use the short lived credentials and connect to RDS and execute the query.
Current challenges
- Blocking developer's read query on primary instance was a challenge. There are chances, developers may connect to primary instance instead of replica. Running select query can also be problematic if it is not optimised, if it doesn't have proper where clause and running such queries on primary instance can cause an outage. One solution would be to use a different AWS Security Group for primary and replica. Since we have 100+ RDS servers and the use case of these RDS will be different from one another, maintaining multiple security groups was not a solution or option we wanted to explore or maintain.
- Restricting developers from accessing the PII data or hiding the PII data was a challenge when application logic was not written to encrypt PII data before storing in the database and decrypting it before reading it, which required a lot of effort and changes across the system.
- Distributing the database queries among the replica instances to balance the load was challenging.
There were 3 challenges we were trying to solve and we started our proof-of-concept using an open-source tool like MariaDB MaxScale. MaxScale did go well but due to limitations on the maximum number of servers that can be handled on the community edition was limited to 2. Since we had 100+ database servers (AWS RDS) this was not an option which we could deploy across our servers. So we tried another open-source tool available in the market MySql ProxySQL which met all our expectations.
- Supports multiple server access.
- Supports multiple users access.
- Supports data masking.
- Supports query routing.
MySQL Proxy Server
It is a high performance proxy service developed by MySQL. The ProxySQL sits on top of the database. All the queries to an RDS instance will be routed via ProxySQL which then rewrite/reroute the query to the actual database server based on preconfigured checks and constraints. It will also act as a database load balancer.
We solved these challenges with ProxySQL
- We have set default_host_group for developers to use replica instance on ProxySQL mysql_users table, so the developer will be connected to the replica instance as soon as the developer logs in and updated AWS Security Group to allow connection only from ProxySQL CIDR for non-app users.
- We identified all PII columns and wrote query rules to mask the PII information and inserted these query rules into the ProxySQL mysql_query_rules table. So whenever the developer runs any select statement on the PII column, the data will be masked and also we have written a query rule for restricting operation/filtering on these PII columns.
Example: To mask bank_account_number column
Example: To restrict operation on bank_account_number column. Like sum(bank_account_number)
Example: To restrict filtering based on bank_account_number column. Like where bank_account_number = 182028676878
- When there is more than 1 replica instance, we add all the replicas with the same hostgroup_id on ProxySQL mysql_servers table, so ProxySQL can distribute the query load on a round-robin basis.
- So end to end configuration of ProxySQL, loading new changes and user integration is declarative, automated, and immediate without any manual intervention.
Connecting ProxySQL, K8s, Jenkins, RDS & Vault
- We deployed ProxySQL on our Kubernetes(K8s) using Helm Chart in cluster mode with a few customizations to the helm chart as per our need. We added startupProbe on the K8s pod to call a reload config script after the pod is started to auto-load all the configurations related to servers, admin variables, global variables, and query rules to run time and save it to disk. We also added auto pod restart on configuration change using K8s pod annotation reloader.stakater.com.
- The Jenkins job is connected with Vault to generate dynamic RDS credentials. Once the credentials are generated, the credentials will be injected into ProxySQL mysql_users table along with the time stamp of creation and expiration of credentials in the comment column, which can be used later to clean up expired users from ProxySQL. And at the end, Jenkins runs a query on ProxySQL Server for loading the users to run time and save it to disk.
- Developers can now connect to the RDS server via the ProxySQL Load Balancer endpoint and ProxySQL port, which will check authentication, runs all the query rules, and finally pass the query to the RDS server in the background.
Benefits
- We are able to use our primary and replica database instances effectively after distributing the query load.
- We are able to protect PII information.
- We are able to reroute read queries only to replica instances.
Summary
Based on the above benefits we've been able to increase productivity of our database users and enhance our security by masking PII data of our customers.
References
- https://proxysql.com/documentation/
- https://www.vaultproject.io/use-cases/database-credential-rotation-with-vault
Join us
Scalability, reliability, and maintainability are the three pillars that govern what we build at Halodoc Tech. We are actively looking for engineers at all levels and if solving hard problems with challenging requirements is your forte, please reach out to us with your resumé at careers.india@halodoc.com.
About Halodoc
Halodoc is the number 1 all around Healthcare application in Indonesia. Our mission is to simplify and bring quality healthcare across Indonesia, from Sabang to Merauke. We connect 20,000+ doctors with patients in need through our Tele-consultation service. We partner with 4000+ pharmacies in 100+ cities to bring medicine to your doorstep. We've also partnered with Indonesia's largest lab provider to provide lab home services, and to top it off we have recently launched a premium appointment service that partners with 500+ hospitals that allow patients to book a doctor appointment inside our application. We are extremely fortunate to be trusted by our investors, such as the Bill & Melinda Gates Foundation, Singtel, UOB Ventures, Allianz, GoJek, Astra, Temasek and many more. We recently closed our Series C round and in total have raised around USD$180 million for our mission. Our team works tirelessly to make sure that we create the best healthcare solution personalised for all of our patient's needs, and are continuously on a path to simplify healthcare for Indonesia.