Building a Reliable Message Reprocessing System Using SQS
At Halodoc, one of our major offerings is instant consultation with a doctor of the customer's choice (subject to the doctor's availability). A user selects a doctor from the list of available doctors; once the doctor accepts the consultation, the user can chat with the doctor (text, audio and video) and get an instant prescription for the consultation.
The problem we were facing
In the case described above, when the user pays for the consultation, the doctor receives a push notification on their application asking them to accept the consultation request. With daily consultations on our platform growing steadily, we faced challenges in sending notifications to doctors reliably and with minimal latency.
We also have a built-in algorithm that favours doctors with a good acceptance rate by moving them towards the top of the list, and penalises those who miss many consultations, to ensure a good user experience. When a doctor does not receive the notification for a consultation, the system degrades their score unnecessarily.
We were using push notifications (FCM for Android and APNs for Apple devices) to send consultation requests to doctors. Neither FCM nor APNs provides a synchronous delivery acknowledgement, so even when the API response indicates success, one can never be sure that the push notification was actually delivered. At times, our doctors cited not receiving notifications as the reason for their low acceptance rate.
How we solved the problem
We have our own home-grown live messaging system, Live Connect. Since we already use it for many of our critical use-cases, we decided to use it on top of APNs and FCM, for the following reasons:
- Live Connect supports web-hooks for different message states (delivered, offline-sent, etc.)
- Live Connect persists the connection status of the client (connected/disconnected)
For these reasons we decided to use Live Connect as the first channel for sending push notifications: with its delivery web-hooks integrated, we can reliably confirm whether each notification was delivered.
Still, how do we handle undelivered push notifications?
Here we evaluated multiple options: Kafka, time-based scripts and SQS. Because SQS has a built-in feature for delayed message processing, it was the unanimous choice. Now, for every message processed for push notification, we also add the message to SQS with a configured delay.
What is SQS?
SQS is a fully managed, highly scalable, distributed message queuing service offered by AWS, which applications can use to publish and consume messages at scale. For our use-case, the key advantage of SQS over other messaging systems such as Kafka and RabbitMQ is its built-in delay feature (delay queues), which is very useful for critical systems where every message must be processed.
Delay Timeout
Using a delay timeout, we can postpone the delivery of new messages to consumers for a number of seconds, for example when the consumer application needs additional time before it can process messages. If you create a delay queue, any messages that you send to the queue remain invisible to consumers for the duration of the delay period. The default (and minimum) delay for a queue is 0 seconds; the maximum is 15 minutes.
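As a rough illustration of the delay period, the sketch below sends a single message with a per-message delay. It assumes a standard (non-FIFO) queue and a client object that behaves like `boto3.client("sqs")`; the helper names `clamp_delay` and `send_delayed` are hypothetical and not part of our codebase.

```python
MAX_DELAY_SECONDS = 900  # SQS upper bound for a delay: 15 minutes


def clamp_delay(seconds: int) -> int:
    """Keep a requested delay inside the range SQS accepts (0 to 900 seconds)."""
    return max(0, min(int(seconds), MAX_DELAY_SECONDS))


def send_delayed(sqs_client, queue_url: str, body: str, delay_seconds: int) -> dict:
    """Send one message that stays invisible to consumers for the delay period.

    sqs_client is assumed to expose send_message() the way a boto3 SQS client does.
    """
    return sqs_client.send_message(
        QueueUrl=queue_url,
        MessageBody=body,
        DelaySeconds=clamp_delay(delay_seconds),
    )
```

With a real client this would be called as `send_delayed(boto3.client("sqs"), queue_url, payload, 30)`, where the queue URL and the 30-second delay are placeholder values.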
How SQS solves the Use-Case
While processing a push notification, we also push each message to the delay queue with the configured delay in seconds. The message in the delay queue is processed once the delay has elapsed, by which time its delivery status is known. If the ondeliver web-hook has not received a delivery callback, the message is considered undelivered and is reprocessed by the consumer of the delay queue, which in this case is an AWS Lambda function.
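The consumer side of this flow could look roughly like the sketch below. It is not our production Lambda: the `message_id` field and the `is_delivered` and `resend` callables are hypothetical stand-ins for the delivery-status lookup and the push-notification sender. The only thing taken from AWS is the event shape, where each SQS record carries the message in its `body` field.

```python
import json


def find_undelivered(event: dict, is_delivered) -> list:
    """Return the parsed bodies of queued messages that have no delivery callback."""
    undelivered = []
    for record in event.get("Records", []):
        message = json.loads(record["body"])
        # is_delivered is a hypothetical lookup fed by the ondeliver web-hook.
        if not is_delivered(message["message_id"]):
            undelivered.append(message)
    return undelivered


def make_handler(is_delivered, resend):
    """Build a Lambda-style handler from two injected callables."""
    def handler(event, context=None):
        retried = find_undelivered(event, is_delivered)
        for message in retried:
            resend(message)  # hypothetical: send the notification again
        return {"retried": [m["message_id"] for m in retried]}
    return handler
```

Injecting the two callables keeps the handler easy to test locally without touching SQS or the notification channels.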
Conclusion
After successfully implementing this use-case, our PN delivery rates have improved and we seldom receive complaints from doctors about not receiving notifications.
We plan to use SQS to create a generic pipeline in our communication micro-service, with multiple consumers of defined priority and delay queues that ensure at least one of the consumers is able to process each message. For example, with multiple SMS providers, if one provider fails to send a message, the delay queue processor can send it using another provider.
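One possible shape for that provider fallback is sketched below. This is only an assumption about how such a pipeline might work, not a description of an existing system: the `provider_index` and `text` fields and the `attempt_send` helper are hypothetical, and each provider is modelled as a plain callable that raises on failure.

```python
def attempt_send(message: dict, providers: list):
    """Try the provider selected by the message's provider_index (default 0).

    Returns a (status, follow_up) pair. On failure, follow_up is the message
    to place on the delay queue so that the next provider is tried later.
    """
    index = message.get("provider_index", 0)
    if index >= len(providers):
        return "exhausted", None  # every provider has already been tried
    try:
        providers[index](message["text"])
        return "sent", None
    except Exception:
        # Hand over to the next provider in priority order.
        return "retry", {**message, "provider_index": index + 1}
```

The delay queue processor would enqueue `follow_up` whenever the status is `"retry"`, so each provider gets one attempt per delay period.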
Join Us
We are always looking out for top engineering talent across all roles for our tech team. If challenging problems that drive big impact enthral you, do reach out to us at careers.india@halodoc.com
About Halodoc
Halodoc is the number 1 all-around healthcare application in Indonesia. Our mission is to simplify and bring quality healthcare across Indonesia, from Sabang to Merauke. We connect 20,000+ doctors with patients in need through our tele-consultation service. We partner with 3500+ pharmacies in 100+ cities to bring medicine to your doorstep. We've also partnered with Indonesia's largest lab provider to provide lab home services, and to top it off we have recently launched a premium appointment service that partners with 500+ hospitals and allows patients to book a doctor's appointment inside our application. We are extremely fortunate to be trusted by our investors, such as the Bill & Melinda Gates Foundation, Singtel, UOB Ventures, Allianz, GoJek and many more. We recently closed our Series B round and in total have raised USD 100 million for our mission. Our team works tirelessly to make sure that we create the best healthcare solution personalised for all of our patients' needs, and we are continuously on a path to simplify healthcare for Indonesia.