Kubernetes Disaster Recovery with Velero

kubernetes Jun 8, 2021

A container-centric approach to hosting applications has several advantages over the traditional VM approach with respect to performance, cost, flexibility, etc. Kubernetes is one of the most widely used tools for managing containerised workloads and services. Backup and restore of data is a crucial concern for any organisation, and Kubernetes, as a platform for containerised applications, supports various tools for data backup and for recovery during a disaster.

At Halodoc, we use Velero to back up Kubernetes workloads. Velero is a convenient backup tool for Kubernetes clusters that compresses Kubernetes objects and backs them up to object storage. It also takes snapshots of the cluster's Persistent Volumes using your cloud provider's block storage snapshot features, and then restores the cluster's objects and Persistent Volumes to the previous state during disaster recovery. Velero supports both on-demand and scheduled backups.

In this blog, we talk about our experience using Velero as a Disaster Recovery tool for Kubernetes workloads.

Overview of Kubernetes disaster recovery process using Velero

Backup and Restore workflow:

Whenever we run a Velero backup command, the Velero CLI makes a call to the Kubernetes API server to create a backup object. The backup controller then validates the backup object (for example, whether it is a cluster backup or a namespace backup) and makes a call to the API server to query the data to be backed up. Once it has collected that data, it starts the backup process: the backup controller makes a call to S3 and stores the backup there as a tar file. Whenever a backup process is initiated, the Slack notifier controller sends an alert with the backup status (InProgress, Completed or Failed) to the respective Slack channel.

Similarly, whenever we run a restore command, the Velero CLI makes a call to the Kubernetes API server to restore from a backup object. Based on the restore command executed, the Velero restore controller makes a call to S3 and initiates the restore from that particular backup object. The Slack notifier controller sends an alert whenever a restore process is initiated.
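To make the "backup object" in the workflow above more concrete, here is a minimal sketch of what such an object looks like as a manifest. It assumes Velero is installed in the default velero namespace with a backup storage location named default; the function name render_backup_manifest and the backup name used in the usage comment are hypothetical and only serve as an illustration.

```shell
# Print a Velero Backup manifest for a whole-cluster backup.
# $1 = backup name (hypothetical), $2 = TTL (defaults to 72h0m0s)
render_backup_manifest() {
  name="$1"
  ttl="${2:-72h0m0s}"
  cat <<EOF
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: ${name}
  namespace: velero
spec:
  includedNamespaces:
    - "*"
  storageLocation: default
  ttl: ${ttl}
EOF
}

# Usage (not executed here; requires kubectl access to the cluster):
#   render_backup_manifest manual-backup-1 | kubectl apply -f -
#   kubectl get backups.velero.io -n velero
```

Creating the object this way and running the CLI backup command should both end up with a Backup resource that the backup controller picks up and processes.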

How is the Disaster-Recovery process implemented in Halodoc using Velero?

We explain the disaster recovery process in two steps:
I. Velero configuration: the preparations required for the disaster recovery process.
II. Backup and Restore process: the steps to follow to back up and restore data for disaster recovery.

I. Velero Configurations:

1. Install the Velero command-line client on the machine from which Velero commands will be executed:
i. Download the latest tarball for your client platform from the link below:
https://github.com/vmware-tanzu/velero/releases/latest (Example: velero-v1.3.2-linux-amd64.tar.gz)
ii. Extract the tarball with the below command:
tar -xvf velero-v1.3.2-linux-amd64.tar.gz -C /tmp
iii. Move the extracted velero binary to /usr/local/bin using the below command:
sudo mv /tmp/velero-v1.3.2-linux-amd64/velero /usr/local/bin
iv. Verify the installation with the below command:
velero version
The output should show the client version that was installed.
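As a supplement to installation steps i to iv above, the download, extract and move steps can be scripted. The sketch below is one possible way to do it and is not the script used at Halodoc: the function names are hypothetical, and it assumes the release tarballs follow the naming shown in the example above (velero-<version>-<os>-<arch>.tar.gz).

```shell
# Build the download URL of a Velero release tarball.
# $1 = version tag (e.g. v1.3.2), $2 = OS (default linux), $3 = arch (default amd64)
velero_tarball_url() {
  version="$1"
  os="${2:-linux}"
  arch="${3:-amd64}"
  echo "https://github.com/vmware-tanzu/velero/releases/download/${version}/velero-${version}-${os}-${arch}.tar.gz"
}

# Download, extract and install the client binary (needs curl, tar and sudo).
install_velero_client() {
  version="$1"
  os="${2:-linux}"
  arch="${3:-amd64}"
  tarball="/tmp/velero-${version}-${os}-${arch}.tar.gz"
  curl -fsSL "$(velero_tarball_url "$version" "$os" "$arch")" -o "$tarball" &&
    tar -xzf "$tarball" -C /tmp &&
    sudo mv "/tmp/velero-${version}-${os}-${arch}/velero" /usr/local/bin/ &&
    velero version --client-only
}

# Usage (not executed here):
#   install_velero_client v1.3.2
```

The --client-only flag keeps the version check from contacting a cluster, which is useful before the server side has been installed.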

2. Create an S3 bucket to store the cluster backup files.
This bucket holds the Kubernetes backup tarballs written during backup.
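The bucket can be created from the AWS console or from the AWS CLI. A minimal CLI sketch follows; the function name create_velero_bucket is hypothetical, and the bucket name and region in the usage comment are placeholders rather than values from our setup. The branch on us-east-1 is there because that region does not accept a LocationConstraint.

```shell
# Create the S3 bucket that will hold Velero backups.
# $1 = bucket name, $2 = AWS region
create_velero_bucket() {
  bucket="$1"
  region="$2"
  if [ "$region" = "us-east-1" ]; then
    # us-east-1 is the default location and rejects an explicit constraint
    aws s3api create-bucket --bucket "$bucket" --region "$region"
  else
    aws s3api create-bucket --bucket "$bucket" --region "$region" \
      --create-bucket-configuration "LocationConstraint=$region"
  fi
}

# Usage (not executed here; requires configured AWS credentials):
#   create_velero_bucket <bucket-name> <aws-region>
```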
3. Create an IAM user and set its permissions:
i. Create an IAM user, for example velero-prod.
ii. Create a set of access keys for the user velero-prod and attach a policy to it so that it has access to the bucket and the EKS cluster to execute the backup.

iii. These user credentials are used while installing Velero on the EKS cluster.
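For reference, a policy along the lines of the one documented by the velero-plugin-for-aws project is sketched below. Treat it as an assumption rather than the exact policy attached at Halodoc: it grants object access on the backup bucket plus the EC2 permissions needed for volume snapshots. The function name write_velero_policy and the output file name are hypothetical.

```shell
# Write an IAM policy document for the Velero user to a file.
# $1 = backup bucket name, $2 = output file (default velero-policy.json)
write_velero_policy() {
  bucket="$1"
  outfile="${2:-velero-policy.json}"
  cat > "$outfile" <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeVolumes",
        "ec2:DescribeSnapshots",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:CreateSnapshot",
        "ec2:DeleteSnapshot"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": ["arn:aws:s3:::${bucket}/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::${bucket}"]
    }
  ]
}
EOF
}

# Usage (not executed here; requires IAM permissions):
#   write_velero_policy <bucket-name>
#   aws iam create-user --user-name velero-prod
#   aws iam put-user-policy --user-name velero-prod --policy-name velero \
#     --policy-document file://velero-policy.json
#   aws iam create-access-key --user-name velero-prod
```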
4. Install the Velero server on the EKS clusters (both the production and the DR cluster).
  This is required to run backup and restore commands against the EKS cluster.
  velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.0.1 \
  --bucket $BUCKET_NAME \
  --backup-location-config region=$AWS_REGION \
  --snapshot-location-config region=$AWS_REGION \
  --secret-file ./velero-credentials
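The file passed to --secret-file above holds the access keys of the IAM user in the standard AWS credentials-file format. A minimal sketch for generating it follows; the function name write_velero_credentials is hypothetical and the key values in the usage comment are placeholders.

```shell
# Write an AWS credentials file for Velero.
# $1 = access key id, $2 = secret access key, $3 = output file (default ./velero-credentials)
write_velero_credentials() {
  key_id="$1"
  secret="$2"
  outfile="${3:-./velero-credentials}"
  (
    # Restrict permissions on a newly created file to the current user
    umask 077
    cat > "$outfile" <<EOF
[default]
aws_access_key_id=${key_id}
aws_secret_access_key=${secret}
EOF
  )
}

# Usage (not executed here):
#   write_velero_credentials <AWS_ACCESS_KEY_ID> <AWS_SECRET_ACCESS_KEY>
```

Since this file contains long-lived keys, it is worth deleting it from the workstation once velero install has stored it in the cluster.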

5. Verify the Velero server installation on EKS by running the below command:
kubectl get pods -n velero

The velero version command now reports the server version in addition to the client version.

6. Configure velero-backup-notification: this is a simple Kubernetes controller which sends Slack notifications whenever backups or restores are performed by Velero.
We need to configure it in both the production and the DR EKS cluster so that we are notified every time a backup or restore happens.
i. velero-backup-notification is available in the repo mentioned below:
https://github.com/vitobotta/velero-backup-notification/blob/master/README.md

ii. Helm definition for velero-backup-notification in Halodoc:

iii. Confirm that the velero-backup-notification controller is running with the below command:
kubectl get pods -n velero

II. Backup and Restore process:

1. Steps to be performed in the production EKS cluster (the EKS cluster of which the backup is taken):
i. Switch to the production cluster:
kubectl config use-context <prod-cluster-name>
ii. Verify that Velero is running with the below command:
kubectl get pods -n velero

iii. Schedule the backup with the below command:
velero create schedule backup-prod-new --schedule="0 22 * * *" --ttl 72h0m0s
The schedule runs a backup every day at 22:00, and the --ttl 72h0m0s flag retains the backup files of the last three days.
We can list existing schedules with the below command:
velero get schedules

iv. Confirm that the backup has completed with the below command:
velero get backup
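Besides waiting for the next scheduled run, a backup can also be triggered on demand from an existing schedule, which is handy for testing the setup or before a risky change. The sketch below is an illustration under the assumption that the schedule backup-prod-new from step iii exists; the function name backup_from_schedule is hypothetical.

```shell
# Trigger an immediate backup using the settings of an existing schedule
# and block until it finishes.
# $1 = schedule name
backup_from_schedule() {
  schedule="$1"
  velero backup create --from-schedule "$schedule" --wait
}

# Usage (not executed here; requires a cluster with Velero installed):
#   backup_from_schedule backup-prod-new
#   velero backup describe <backup-name>   # inspect status and contents
#   velero backup logs <backup-name>       # inspect logs if it failed
```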

2. Steps to be performed in the DR EKS cluster (the EKS cluster where data has to be restored):
i. Switch to the DR cluster:
kubectl config use-context <dr-prod-cluster-name>

ii. Verify that Velero is running with the below command:
kubectl get pods -n velero

iii. Ensure Velero is configured in the DR EKS cluster with access to the S3 bucket.

iv. (Optional) Integrate the DR EKS cluster with the production Vault (EC2 instances). This step is applicable only if the production cluster is integrated with Vault.

v. (Optional) Check that, in the production Vault, new roles are in place for the pods that run in the DR EKS cluster.

vi. Ensure CoreDNS, Cluster Autoscaler and the ALB Ingress Controller are up and running with the necessary configurations and versions in the DR EKS cluster.

vii. Any Vault roles created for the production EKS cluster need to be created for the DR EKS cluster as well.

3. DR execution steps:
i. Verify the cluster context using kubectl and ensure that it is pointing to the DR cluster with the command below:
kubectl config current-context

ii. Restore the most recent stable backup using the Velero client restore command below:
velero restore create --from-backup <SCHEDULE NAME>-<TIMESTAMP>
iii. Observe that the pods (applications) come up and that load balancer endpoints are created by the controller.
iv. The number of worker nodes increases as the workloads are restored, because the DR EKS cluster is configured with an auto-scaling group with spot fleet.

v. Observe the pod logs and ensure there are no transaction failures with the command below:
kubectl logs pod/<pod name> -n <namespace>
vi. Once the restoration is done in the DR EKS cluster, the new load balancer objects are created with different URLs. These newly created endpoints therefore have to be updated in Route53, CloudFront, Lambda, etc. (every service which consumes these endpoints) so that traffic from the internet is routed via the new load balancer objects to the DR EKS cluster pods.
vii. The approximate recovery time is 60 to 120 minutes.
viii. The restore is also reported in the Slack channel (screenshot of the Slack notification).
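Returning to the restore command in step ii of the DR execution steps: backups created by a schedule are named <SCHEDULE NAME>-<TIMESTAMP>, so the most recent one can be picked by sorting the names. The helper below is a minimal sketch; the function name latest_backup is hypothetical, and it assumes the timestamp suffix is 14 digits (YYYYMMDDHHMMSS). It only selects a name, so the backup status should still be checked before restoring.

```shell
# Read backup names on stdin and print the newest one that belongs
# to the given schedule (names look like <schedule>-<YYYYMMDDHHMMSS>).
# $1 = schedule name
latest_backup() {
  schedule="$1"
  grep -E "^${schedule}-[0-9]{14}\$" | sort | tail -n 1
}

# Usage (not executed here; requires access to the DR cluster):
#   kubectl get backups.velero.io -n velero \
#     -o jsonpath='{range .items[?(@.status.phase=="Completed")]}{.metadata.name}{"\n"}{end}' \
#     | latest_backup backup-prod-new
#   velero restore create --from-backup <name printed above>
```

Because the timestamp is fixed-width, a plain lexical sort is enough to order the backups chronologically.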


In this blog, we have explained how to back up EKS cluster resources using Velero and restore them from backup during disaster recovery. We have also described how to implement an alerting mechanism for every backup and restore process using a Kubernetes controller.





Join us

We are always looking out for top engineering talent across all roles for our tech team. If challenging problems that drive big impact enthral you, do reach out to us at careers.india@halodoc.com

About Halodoc

Halodoc is the number 1 all-around healthcare application in Indonesia. Our mission is to simplify and bring quality healthcare across Indonesia, from Sabang to Merauke. We connect 20,000+ doctors with patients in need through our tele-consultation service. We partner with 3500+ pharmacies in 100+ cities to bring medicine to your doorstep. We've also partnered with Indonesia's largest lab provider to provide lab home services, and to top it off we have recently launched a premium appointment service that partners with 500+ hospitals and allows patients to book a doctor appointment inside our application. We are extremely fortunate to be trusted by our investors, such as the Bill & Melinda Gates Foundation, Singtel, UOB Ventures, Allianz, GoJek and many more. We recently closed our Series C round and in total have raised USD 180 million for our mission. Our team works tirelessly to make sure that we create the best healthcare solution personalised for all of our patients' needs, and we are continuously on a path to simplify healthcare for Indonesia.