Vault - High Availability and Scalability


News about data breaches, leaked customer information and stolen passwords for critical infrastructure is becoming all too common. Many of these incidents stem from mismanagement of credentials: unencrypted passwords, secrets pushed to git repositories, or secrets hard-coded within the application, leaving no room for rotation.

This has led to increasing demand for Secrets Management tools such as:

  • AWS Secrets Manager
  • HashiCorp Vault
  • Kubernetes Secrets
  • AWS Parameter Store
  • Confidant

At Halodoc, we analyzed the various tools mentioned above and decided to move ahead with HashiCorp Vault due to the multiple features it offers.

HashiCorp Vault

HashiCorp Vault is an open-source tool that provides a secure, reliable way to store and distribute secrets such as API keys, access tokens and passwords. Software like Vault is critically important when deploying applications that require the use of secrets or sensitive data. Vault provides high-level policy management, secret leasing, audit logging, and automatic revocation to protect sensitive information, and can be operated through a UI, CLI, or HTTP API. Every secret stored in Vault has an associated lease, which enables key-usage auditing, key rolling and automatic revocation. Vault also provides multiple revocation mechanisms to give operators a clear “break glass” procedure in case of a potential compromise. It offers a unified interface to store secrets while providing tight access control and recording a detailed audit log, and it can be deployed to practically any environment without requiring special hardware such as physical HSMs (Hardware Security Modules).
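
For example, with the key/value (KV) secrets engine, storing and reading a secret takes only a couple of CLI calls. A minimal sketch, assuming a KV v2 engine mounted at kv/ and placeholder address, token and values:

# Point the CLI at the Vault cluster (address and token are placeholders)
export VAULT_ADDR="http://vault.example.com"
export VAULT_TOKEN="***********"

# Store a secret under kv/stage/test
vault kv put kv/stage/test db_username="user" db_password="password"

# Read it back; applications can do the same over the HTTP API
# (GET /v1/kv/data/stage/test)
vault kv get kv/stage/test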

High Availability

Vault can run in a high availability (HA) mode to protect against outages by running multiple Vault servers. Vault is typically bound by the IO limits of the storage backend rather than the compute requirements. Certain storage backends, such as Consul, provide additional coordination functions that enable Vault to run in an HA configuration while others provide a more robust backup and restoration process.

When running in HA mode, Vault servers have two additional states: active and standby. Within a Vault cluster, only a single instance is active and handles all requests (reads and writes), while all standby instances redirect requests to the active instance.
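
The role of each node can be checked by running vault status against it. A rough sketch (output abridged; the exact fields vary by Vault version):

# Check the HA role of a node
vault status
#   HA Enabled              true
#   HA Mode                 standby
#   Active Node Address     http://vault.example.com
# On the active node, "HA Mode" reports "active" instead.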

Architecture

Cluster Design Considerations

The primary design goal is to make Vault highly available (HA) and minimize downtime.

We started with a small proof of concept, running Vault in development mode on a local machine, in order to understand how Vault works.

Running Vault in HA mode requires a storage backend, and HashiCorp recommends running a multi-node Consul cluster. However, we did not want to bear the pain of managing a Consul cluster; we wanted a managed and generic storage backend solution.

We decided to use a MySQL (RDS) database as the backend because we wanted a generic solution that is available in any cloud, enabling a seamless migration to another cloud in the future.
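
The MySQL storage backend only needs a database and a user with privileges on it; given sufficient privileges, Vault will create its own table on first start. A minimal sketch of the one-time preparation, assuming an RDS admin user and the placeholder endpoint and credentials used in the configuration below:

# Create the database and the user that Vault's mysql backend will use
mysql -h vault-XXXXXXXXX.ap-southeast-1.rds.amazonaws.com -u admin -p -e "
  CREATE DATABASE IF NOT EXISTS vault;
  CREATE USER IF NOT EXISTS 'user'@'%' IDENTIFIED BY 'password';
  GRANT ALL PRIVILEGES ON vault.* TO 'user'@'%';"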

For the final setup, we decided to use EC2 instances behind an internal ALB. With this setup the ALB sees the active node as up and the standby node as down (as mentioned in the diagram, an unsealed standby returns HTTP status 429), which fits perfectly, since requests are routed only to the active Vault instance. HA in Vault, in our case, is intended to offer high availability, not to increase capacity: one Vault instance serves all requests while the rest are hot standbys. If the active node fails, one of the standby nodes automatically becomes active.

Implementation

Primary Node:

# Enable the web UI
ui = true

# Listen on all interfaces on port 8200; TLS is disabled on this listener
listener "tcp" {
  address          = "0.0.0.0:8200"
  tls_disable      = "true"
}

# MySQL (RDS) storage backend with HA coordination enabled
storage "mysql" {
  ha_enabled = "true"
  address = "vault-XXXXXXXXX.ap-southeast-1.rds.amazonaws.com:3306"
  username = "user"
  password = "password"
  database = "vault"
  # Address advertised to clients for redirection to the active node
  # (redirect_addr is the legacy name of api_addr)
  redirect_addr = "http://vault.example.com"
  api_addr = "http://vault.example.com"
  # Address used for server-to-server traffic within the cluster (port 8201)
  cluster_addr = "http://vault-internal.*com.:8201"
}

Secondary Node:

ui = true
listener "tcp" {
  address          = "0.0.0.0:8200"
  tls_disable      = "true"
}

storage "mysql" {
  ha_enabled = "true"
  address = "vault-XXXXXXXXX.ap-southeast-1.rds.amazonaws.com:3306"
  username = "user"
  password = "password"
  database = "vault"
  redirect_addr = "http://vault.example.com"
  api_addr = "http://vault.example.com"
  cluster_addr = "http://vault-internal.*com.:8201"
}
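
With both nodes configured as above, bringing the cluster up follows the standard Vault initialization flow. A rough sketch (the file path is an assumption, not our exact setup):

# Start Vault with the configuration above on each node (typically via systemd)
vault server -config=/etc/vault.d/vault.hcl

# Initialize the cluster once, through the load balancer or the active node
export VAULT_ADDR="http://vault.example.com"
vault operator init

# Unseal each node with the required number of unseal keys
vault operator unseal    # repeat until the unseal threshold is reached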

Target Group:

Health Check:
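
The health check relies on the behaviour described earlier: Vault's /v1/sys/health endpoint returns HTTP 200 on the active node and HTTP 429 on an unsealed standby, so a target group that treats only 200 as healthy sends traffic to the active node alone. A quick way to verify this from a bastion host (node IPs are placeholders):

# Active node: health endpoint returns 200
curl -s -o /dev/null -w "%{http_code}\n" http://10.0.1.10:8200/v1/sys/health   # 200

# Unsealed standby node: returns 429, so the ALB marks it unhealthy
curl -s -o /dev/null -w "%{http_code}\n" http://10.0.1.11:8200/v1/sys/health   # 429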

Scalability

As we onboard more teams and applications to Vault, usage is going to increase, so we have to figure out how to scale Vault accordingly.

We learned that horizontal scaling of Vault is not really a solution: one would simply end up adding more standby nodes to the cluster. We also cannot use the standby nodes for read-only traffic, as that feature (performance standbys) is only available in the Vault Enterprise edition.

Since Vault is typically bound by the IO limits of the storage backend rather than by compute requirements, we need to think through how to scale up the storage backend, because almost every request ends up hitting it.

As we are using MySQL RDS in Multi-AZ mode, we are assured about the backend scalability. Additionally, we can scale up the Vault nodes themselves with higher open-file limits and more memory so they can handle more requests.
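
Raising the open-file limit for Vault is typically done at the service level. A minimal sketch for a systemd-managed Vault node (the service name and limit value are examples, not our exact settings):

# Add a systemd drop-in that raises the file-descriptor limit for the vault service
mkdir -p /etc/systemd/system/vault.service.d
cat > /etc/systemd/system/vault.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=65536
EOF

# Reload systemd and restart Vault to pick up the new limit
systemctl daemon-reload
systemctl restart vault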

Right-sizing

At Halodoc, we have 50+ microservices running in production. The microservices communicate with each other in an event-driven fashion using AWS SNS and Lambda. As a result, we have 30+ Lambda functions hitting the Vault cluster concurrently to fetch credentials, at around 10K RPM.

Since Vault cannot be scaled horizontally for capacity, it was inevitable for us to right-size the Vault nodes to support roughly 20K concurrent transactions (at least 2x the current workload).

We were also curious to find out how many concurrent transactions each instance type can support, so we decided to analyze the capability of a single Vault node by load testing the Vault cluster.

We started load testing with a t2.small instance; the findings are below.

t2.small:

We started with 1K requests at a concurrency level of 1K, and it went through without any issue.

root@Jmeter-test-server:/home/ubuntu# ab -n1000 -c1000 -H "X-Vault-Token: ***********" https://example.com/v1/kv/data/stage/test
This is ApacheBench, Version 2.3 <$Revision: 1807734 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/


Concurrency Level:      1000
Time taken for tests:   2.421 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      848000 bytes
HTML transferred:       695000 bytes
Requests per second:    413.08 [#/sec] (mean)
Time per request:       2420.823 [ms] (mean)
Time per request:       2.421 [ms] (mean, across all concurrent requests)
Transfer rate:          342.08 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        5  810 140.6    817     936
Processing:   106  753 431.3    631    1504
Waiting:      106  753 431.3    630    1504
Total:        118 1563 524.4   1470    2394

We then incrementally increased the number of requests along with the concurrency level to check the performance.

We started getting failures the moment we reached 4K requests with a concurrency level of 4K.

root@Jmeter-test-server:/home/ubuntu# ab -n4000 -c4000 -H "X-Vault-Token: ******" https://example.com/v1/kv/data/stage/test
This is ApacheBench, Version 2.3 <$Revision: 1807734 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Concurrency Level:      4000
Time taken for tests:   13.670 seconds
Complete requests:      4000
Failed requests:        116
   (Connect: 0, Receive: 0, Length: 116, Exceptions: 0)
Non-2xx responses:      116
Total transferred:      3328780 bytes
HTML transferred:       2716548 bytes
Requests per second:    292.62 [#/sec] (mean)
Time per request:       13669.687 [ms] (mean)
Time per request:       3.417 [ms] (mean, across all concurrent requests)
Transfer rate:          237.81 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        5 2946 531.1   2991    3425
Processing:    65 3914 2837.3   3425   10424
Waiting:       65 3914 2837.3   3425   10424
Total:         86 6860 3098.3   6384   13575

Percentage of the requests served within a certain time (ms)
  50%   6384
  66%   7455
  75%  10464
  80%  10779
  90%  11274
  95%  11571
  98%  13492
  99%  13541
 100%  13575 (longest request)

Observation:

From the above exercise, we observed that the t2.small instance is not sufficient to handle 20k requests concurrently.

t2.medium:

We carried out the same load testing on a t2.medium instance, with some tuning of the system configuration, and found the details below.

root@Jmeter-test-server:/home/ubuntu# ab -n20000 -c20000 -H "X-Vault-Token: *******" https://example.com/v1/kv/data/prod/test
This is ApacheBench, Version 2.3 <$Revision: 1807734 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Concurrency Level:      20000
Time taken for tests:   57.060 seconds
Complete requests:      20000
Failed requests:        0
   (Connect: 0, Receive: 0, Length: 6714, Exceptions: 0)
Total transferred:      12941698 bytes
HTML transferred:       9868420 bytes
Requests per second:    350.51 [#/sec] (mean)
Time per request:       57059.741 [ms] (mean)
Time per request:       2.853 [ms] (mean, across all concurrent requests)
Transfer rate:          221.49 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:    13423 23495 2326.6  24425   25549
Processing:  1395 10077 4504.0  10911   30811
Waiting:     1395 10074 4503.1  10911   30811
Total:      19596 33572 4651.4  34200   54335

Percentage of the requests served within a certain time (ms)
  50%  34200
  66%  36299
  75%  36776
  80%  37437
  90%  38912
  95%  39735
  98%  41028
  99%  48842
 100%  54335 (longest request)

Observation:

We observed that the t2.medium instance is capable of supporting 20K requests at a concurrency level of 20K with acceptable response times. We also observed that the CPU, memory and system load were well within limits.

Conclusion

Through this exercise, we identified that HashiCorp Vault is the best fit for our use cases, and we arrived at an HA setup with the right sizing for our usage by performing a proof of concept and load testing. We are excited to have taken a step forward in the right direction and are hoping for more improvements in the future.

Join us?
We are always looking to hire for all roles in our tech team. If challenging problems that drive big impact enthral you, do reach out to us at careers.india@halodoc.com


About Halodoc

Halodoc is the number one all-around healthcare application in Indonesia. Our mission is to simplify and bring quality healthcare across Indonesia, from Sabang to Merauke.
We connect 20,000+ doctors with patients in need through our teleconsultation service, we partner with 1500+ pharmacies in 50 cities to bring medicine to your doorstep, we partner with Indonesia's largest lab provider to provide lab home services, and to top it off we have recently launched a premium appointment service that partners with 500+ hospitals, allowing patients to book a doctor appointment inside our application.
We are extremely fortunate to be trusted by our investors, such as the Bill & Melinda Gates Foundation, Singtel, UOB Ventures, Allianz, Gojek and many more. We recently closed our Series B round and in total have raised USD 100 million for our mission.
Our team works tirelessly to make sure that we create the best healthcare solution personalized for all of our patients' needs, and we are continuously on a path to simplify healthcare for Indonesia.