Asynchronous communication platform - Eragon

Microservice architecture is a design paradigm in which an application is built as a set of loosely coupled, fine-grained services that communicate over lightweight protocols.
At Halodoc, we have embraced the microservices architecture since the very beginning.

When we break an application's functionality into fine-grained microservices, certain business workflows are no longer confined to a single microservice; they span multiple microservices.

To explain with an example, let's assume there is an organisation that does online food delivery; let's call it Healthy Foods Inc. Healthy Foods has adopted the microservices architecture and has the following services:

Figure: Healthy Foods Inc. architecture

With this architecture in place, let's consider a basic workflow - ordering food. Now when a user tries to order food, as part of the order creation workflow, the following steps are involved:

  1. Client (iOS/Android/Web app) makes an HTTP POST /orders to order service.
  2. Order service retrieves the customer information using an HTTP GET /consumers/id to customer/user service.
  3. Order service retrieves the restaurant information using an HTTP GET /restaurants/id to restaurant service.
  4. Order service validates the request using the customer and restaurant information.
  5. Order service creates an order entity and persists it in the database.
  6. Order service sends the order response to the client.

The communication between the services is evident from the above workflow. Synchronous calls to downstream services block the caller, and the workflow succeeds only if every participating service is up. Assuming that each of the three services involved (order, customer and restaurant) has an availability of 99.5% and that they fail independently, the availability of the overall workflow is 0.995 × 0.995 × 0.995 ≈ 98.5%, noticeably lower than that of any single service. The overall availability keeps decreasing, geometrically, as the number of participants in a workflow increases. This problem applies to all forms of synchronous communication, be it REST or gRPC.
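
The availability figure above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes every service in the synchronous chain fails independently and has the same availability; the function name is ours, for illustration only.

```python
def workflow_availability(per_service: float, services: int) -> float:
    """Availability of a synchronous workflow in which every service must be up.

    Assumes independent failures and the same availability for each service.
    """
    return per_service ** services


# Order service plus its two downstream services, each at 99.5%.
three_services = workflow_availability(0.995, 3)
print(f"3 services: {three_services:.1%}")

# The product keeps shrinking as more participants join the workflow.
for n in (1, 2, 3, 5, 10):
    print(f"{n:>2} services -> {workflow_availability(0.995, n):.2%}")
```

Under these assumptions, three services give roughly 98.5%, and each additional synchronous participant multiplies the result by another 0.995.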

For these reasons, eliminating all forms of synchronous communication is advisable. However, this isn't always achievable in practice, so we take a pragmatic approach and minimise synchronous communication rather than eliminate it. Hence the need for an asynchronous communication platform, which is the central theme of this blog.

Another critical aspect of communication between microservices is maintaining data integrity across services. Since the data is partitioned, a workflow that spans multiple services changes state in each of them. For example, if the order is abandoned by the customer while the state of the order in restaurant service is accepted, the system as a whole isn't in a consistent state. To solve these kinds of issues, a workflow is implemented as a series of service-local transactions (a distributed transaction would also do, but it has a performance cost associated). Such a series of service-local transactions is called a Saga.

The co-ordination of a saga over asynchronous communication can be broadly classified as

  1. Orchestration: in this case, there is a centralised entity called the Saga. Its only job is to co-ordinate the transactions across services and ensure that the system ends up in a consistent state. For example, let's assume that there's an OrderCreation saga with the following sequence of steps:

    i.   Call customer service to check if the customer is authorised.
    ii.  Call restaurant service to check if the order is valid.
    iii. Create the order in order service with the created state.
    iv.  Call payments service to charge the customer.
    v.   If the payment is successful, call the restaurant service for order acceptance.

    These steps form the set of operations in the OrderCreation saga. Some of the steps above have a corresponding undo step, called a compensating transaction (or compensating step), which is used to restore a sane state if something goes wrong. For example, if the restaurant is busy and cannot accept the order, the payment has to be refunded and the order has to be marked as cancelled. The OrderCreation saga is also responsible for executing these compensating transactions.
  2. Choreography: contrary to orchestration, there is no centralised entity co-ordinating the service-local transactions. Control is decentralised, and every participating service knows how to handle the events it receives. Processing these events drives the saga to completion. This is analogous to a troupe of dancers performing live: each dancer is autonomous and knows which step to perform on a particular note.
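
To make the orchestration variant concrete, here is a minimal in-process sketch of an OrderCreation saga with compensating steps. It is an illustration only, not how Eragon or any Halodoc service implements sagas: the `Step` and `run_saga` names are hypothetical, and the service calls are replaced by functions that append to a log.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]
    # Optional undo step (compensating transaction) for this action.
    compensation: Optional[Callable[[dict], None]] = None


def run_saga(steps: List[Step], ctx: dict) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action(ctx)
            completed.append(step)
        except Exception as exc:
            ctx["log"].append(f"{step.name} failed: {exc}")
            for done in reversed(completed):
                if done.compensation is not None:
                    done.compensation(ctx)
            return False
    return True


def log(message: str) -> Callable[[dict], None]:
    """Stand-in for a remote service call: just records what would happen."""
    return lambda ctx: ctx["log"].append(message)


def restaurant_accept(ctx: dict) -> None:
    if ctx.get("restaurant_busy"):
        raise RuntimeError("restaurant is busy")
    ctx["log"].append("order accepted")


order_creation_saga = [
    Step("authorise customer", log("customer authorised")),
    Step("validate order", log("order validated")),
    Step("create order", log("order created"), log("order cancelled")),
    Step("charge payment", log("payment charged"), log("payment refunded")),
    Step("accept order", restaurant_accept),
]

ctx = {"restaurant_busy": True, "log": []}
ok = run_saga(order_creation_saga, ctx)
print(ok, ctx["log"])
```

With a busy restaurant, the last step fails, and the orchestrator refunds the payment and then cancels the order, in the reverse order of the original steps. In a choreographed saga there would be no `run_saga`; each service would react to events and emit its own.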

Design considerations for Eragon
Eragon is the asynchronous communication platform at Halodoc, and it is analogous to the nervous system in a human body. In the human body, it is the brain that sends signals to all other parts; in the case of microservices, any part can transmit signals (events). This decentralisation helps make the platform scalable, available and observable.

In addition to these basic features, we also wanted the following characteristics:

  1. Dynamic routing of events: the ability to add or remove event routing (producer -> consumer) at runtime without the need for a deployment.
  2. Emission of events at a particular frequency, for example every 5 minutes.

Entities involved in this platform are:

  1. Event: signifies an occurrence of some kind that happened due to some action. Examples: order created, order cancelled.
  2. Producer: produces events. Example: the order created event is produced by order service.
  3. Consumer: an entity that is interested in certain events. Example: payment service, which needs to know through the order created event that an order has been created.
  4. Channel: a medium where events are stored until they are consumed or processed by the consumers.
  5. Routing table: contains the event's producer-consumer mapping. It is important to note that a particular event can have only one producer but many consumers, i.e. there exists a one-to-many relationship between producers and consumers.
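
The entities above, and the one-producer, many-consumers rule in particular, can be sketched as a small data model. The class and method names below are assumptions made for illustration; they are not Eragon's actual types.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Event:
    event_type: str                 # e.g. "order_created"
    entity_id: str
    data: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Route:
    producer: str                   # exactly one producer per event type
    consumers: List[str] = field(default_factory=list)


class RoutingTable:
    """Maps an event type to its single producer and its many consumers."""

    def __init__(self) -> None:
        self._routes: Dict[str, Route] = {}

    def register_producer(self, event_type: str, producer: str) -> None:
        route = self._routes.get(event_type)
        if route is None:
            self._routes[event_type] = Route(producer)
        elif route.producer != producer:
            raise ValueError(f"{event_type} is already produced by {route.producer}")

    def add_consumer(self, event_type: str, consumer: str) -> None:
        route = self._routes[event_type]
        if consumer not in route.consumers:
            route.consumers.append(consumer)

    def remove_consumer(self, event_type: str, consumer: str) -> None:
        self._routes[event_type].consumers.remove(consumer)

    def consumers_for(self, event_type: str) -> List[str]:
        return list(self._routes[event_type].consumers)


table = RoutingTable()
table.register_producer("order_created", "order-service")
table.add_consumer("order_created", "payment-service")
table.add_consumer("order_created", "notification-service")
print(table.consumers_for("order_created"))
```

Adding or removing a consumer only touches the table, which is what allows routing to change at runtime without touching the producer.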

High level Design

Figure: high-level design with Kafka
Figure: high-level design with SNS


Low level functionalities

  1. Pluggable channels: Eragon supports multiple channels for events, such as Kafka and SNS, which act as the medium through which the events flow. Default - SNS.
  2. Pluggable queues: When Eragon is run in asynchronous mode, the events on the producer side are lined up in a queue. A queue consumer polls this queue and takes care of publishing the events. Examples of queues include IQueue and the JVM synchronised queue. Default - JVM synchronised queue.
  3. Dynamic configuration management: Configuration for Eragon is stored in Amazon DynamoDB and cached in a loading cache (with support for timeouts and manual invalidations). Because routing is read from this configuration at runtime, event routes can be changed without redeploying producers or consumers.
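
A loading cache with timeouts and manual invalidation, as mentioned in the last item, can be sketched as follows. This is a simplified single-threaded illustration: the `LoadingCache` class here is our own, and the loader is a plain dictionary lookup standing in for a DynamoDB read.

```python
import time
from typing import Callable, Dict, Generic, List, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")


class LoadingCache(Generic[K, V]):
    """Loads missing or expired entries through `loader`; not thread-safe."""

    def __init__(self, loader: Callable[[K], V], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[K, Tuple[V, float]] = {}  # key -> (value, expiry)

    def get(self, key: K) -> V:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]
        value = self._loader(key)                     # cache miss or timeout
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key: K) -> None:
        self._entries.pop(key, None)


# Hypothetical configuration; in the post this lives in DynamoDB.
CONFIG = {"order_created": "sns"}
loads: List[str] = []


def load_destination(event_type: str) -> str:
    loads.append(event_type)
    return CONFIG[event_type]


now = [0.0]                                           # fake clock for the demo
cache = LoadingCache(load_destination, ttl_seconds=60, clock=lambda: now[0])
cache.get("order_created")                            # loads from the store
cache.get("order_created")                            # served from the cache
now[0] = 61.0
cache.get("order_created")                            # entry timed out, reloads
CONFIG["order_created"] = "kafka"
cache.invalidate("order_created")                     # manual invalidation
print(cache.get("order_created"), len(loads))
```

The timeout bounds how stale a route can get, while manual invalidation lets a configuration change take effect immediately.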

    Sample event object
{
  "event_type": "order_created",
  "entity_id": "b093b94f-6835-4815-9f40-29b13826bf39",
  "data": {
    "order": {
      "id": 30,
      "created_at": 1543309287000,
      "updated_at": 1576051176902,
      "external_id": "b093b94f-6835-4815-9f40-29b13826bf39",
      "status": "created",
      "attributes": [
        {
          "created_at": 1543309287000,
          "updated_at": 1543313202000,
          "attribute_key": "customer_name",
          "attribute_value": "IronMan",
          "language": "english"
        },
        {
          "created_at": 1543309287000,
          "updated_at": 1543313202000,
          "attribute_key": "utm_source",
          "attribute_value": "customer_app/ios",
          "language": "na"
        }
      ],
      "line_items": [
        {
          "item_id": "b093b94f-6835-4815-9f40-29b13826bf39",
          "name": "pizza",
          "price": 239.00,
          "quantity": 2
        }
      ]
    }
  }
}

Event flow

Figure: event flow
  1. Order service generates an order created event.
  2. The order created event gets queued locally. Upon the next publish trigger, if the routing details for the event are not present in the cache, the routing information is fetched from DynamoDB. This routing information is called the Producer destination; it identifies the event channel (SNS/Kafka) to which the event should be routed.
  3. The routing information gets cached, and the event gets published to the channel.
  4. Each event channel has a trigger (a Lambda in the case of SNS, a Kafka high-level consumer in the case of Kafka).
  5. The trigger fetches the routing information for the consumers of the order created event, which is called the Consumer destination.
  6. Using this Consumer destination information, the Lambda/consumer calls the event handler in each of the event consumers.
  7. If a Lambda is used as the trigger, it invokes an API in the consumer; the details of the API are present in the Consumer destinations table.
  8. In the case of the Kafka high-level consumer, the consumer polls the corresponding topic and runs inside each consumer application that is interested in the events. This event-to-topic mapping configuration is also stored in DynamoDB.
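
The flow above can be simulated end to end in a single process. The sketch below is only an illustration under stated assumptions: `InMemoryChannel` stands in for an SNS topic or Kafka topic, plain dictionaries stand in for the producer and consumer destination tables, and consumer APIs are replaced by Python callables. None of these names come from Eragon itself.

```python
import queue
from typing import Callable, Dict, List

Event = dict
Handler = Callable[[Event], None]


def make_trigger(consumer_destinations: Dict[str, List[Handler]]) -> Handler:
    """Steps 4-6: look up the consumer destinations and call each handler."""
    def trigger(event: Event) -> None:
        for handler in consumer_destinations.get(event["event_type"], []):
            handler(event)
    return trigger


class InMemoryChannel:
    """Stand-in for SNS/Kafka: hands every published event to its trigger."""

    def __init__(self, trigger: Handler) -> None:
        self._trigger = trigger

    def publish(self, event: Event) -> None:
        self._trigger(event)


class Producer:
    def __init__(self, producer_destinations: Dict[str, str],
                 channels: Dict[str, InMemoryChannel]) -> None:
        self._config = producer_destinations   # stand-in for the DynamoDB table
        self._channels = channels
        self._route_cache: Dict[str, str] = {}
        self._pending: "queue.Queue[Event]" = queue.Queue()

    def emit(self, event: Event) -> None:
        self._pending.put(event)               # steps 1-2: queue locally

    def publish_pending(self) -> int:
        """Publish trigger: drain the local queue and route each event."""
        published = 0
        while True:
            try:
                event = self._pending.get_nowait()
            except queue.Empty:
                return published
            event_type = event["event_type"]
            if event_type not in self._route_cache:          # step 2: cache miss
                self._route_cache[event_type] = self._config[event_type]
            self._channels[self._route_cache[event_type]].publish(event)  # step 3
            published += 1


received: List[tuple] = []
trigger = make_trigger({
    "order_created": [
        lambda e: received.append(("payment-service", e["entity_id"])),
        lambda e: received.append(("notification-service", e["entity_id"])),
    ]
})
producer = Producer({"order_created": "sns"}, {"sns": InMemoryChannel(trigger)})
producer.emit({"event_type": "order_created", "entity_id": "order-30"})
count = producer.publish_pending()
print(count, received)
```

In the real platform the queue drain would run in the background, and the channel and trigger would be separate systems; the sketch only shows where the two routing lookups sit.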

Sample Producer Destination Configuration

Figure: producer destination configuration

Sample Consumer Destination Configuration

Figure: consumer destination configuration

Learnings

  1. Ordering guarantees: Since the platform is designed to be asynchronous, ordering is not guaranteed by default. We have specific use cases where ordering is needed. For these use cases, we fall back to Kafka with a custom partitioner that routes events based on the Aggregate ID / Entity ID, so all the events for an entity land on the same partition and reach the consumer in order. For example, assume that we need to send out notifications to customers for events like order created, order approved, out for delivery etc. A customer receiving these notifications out of order results in a bad user experience.
  2. Idempotency guarantees: Event consumers should be idempotent; otherwise, we might end up with unexpected state when an event is consumed more than once. However, this is not always the case in practice. To de-duplicate events, we store them in DynamoDB, with the partition key being the combination of Aggregate ID and Event Type. This way, a duplicate event does not get consumed again. Although deduplication comes with a storage cost, it is necessary in specific cases.
  3. Local event backups: If the producer goes down or is restarted for any reason, we should still be able to publish the events that were in the queue once the application comes back. To achieve this, we write the events to a file. If some events are missed, we can replay the event data from the file.
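
The first two learnings can be illustrated with a short sketch. It makes several assumptions: the hash used for partitioning is SHA-256, chosen here only because it is deterministic (it is not Kafka's own partitioner); the de-duplication store is an in-memory set standing in for the DynamoDB table; and the `#` separator in the key is hypothetical.

```python
import hashlib
from typing import Callable, List, Set


def partition_for(entity_id: str, num_partitions: int) -> int:
    """Map an entity ID to a partition so its events stay on one partition."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


class DedupStore:
    """In-memory stand-in for a table keyed by entity ID and event type."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def first_time(self, event: dict) -> bool:
        key = f'{event["entity_id"]}#{event["event_type"]}'
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def consume(event: dict, store: DedupStore,
            handler: Callable[[dict], None]) -> bool:
    """Call the handler only for events that have not been seen before."""
    if not store.first_time(event):
        return False
    handler(event)
    return True


# Every event of the same order maps to the same partition.
p1 = partition_for("order-30", 8)
p2 = partition_for("order-30", 8)

store = DedupStore()
handled: List[dict] = []
event = {"event_type": "order_created", "entity_id": "order-30"}
first = consume(event, store, handled.append)
second = consume(event, store, handled.append)   # duplicate delivery
print(p1 == p2, first, second, len(handled))
```

Against a real table, the check and the insert would have to be a single atomic operation (for example a conditional write), otherwise two concurrent deliveries of the same event could both pass the check.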

Summary

In this post, we provided some key insights into the need for asynchronous communication when dealing with microservices. We also briefly discussed our homegrown asynchronous communication platform, Eragon: its internals, the guarantees it provides and the lessons learnt while building it to be scalable, highly available and observable.

Join us

We are always looking out for top engineering talent across all roles for our tech team. If challenging problems that drive big impact enthral you, do reach out to us at careers.india@halodoc.com


About Halodoc

Halodoc is the number 1 all-around healthcare application in Indonesia. Our mission is to simplify and bring quality healthcare across Indonesia, from Sabang to Merauke. We connect 20,000+ doctors with patients in need through our tele-consultation service. We partner with 2500+ pharmacies in 100+ cities to bring medicine to your doorstep. We've also partnered with Indonesia's largest lab provider to provide lab home services, and to top it off we have recently launched a premium appointment service that partners with 500+ hospitals and allows patients to book a doctor appointment inside our application. We are extremely fortunate to be trusted by our investors, such as the Bill & Melinda Gates Foundation, Singtel, UOB Ventures, Allianz, Gojek and many more. We recently closed our Series B round and in total have raised USD 100 million for our mission. Our team works tirelessly to make sure that we create the best healthcare solution personalized for all of our patients' needs, and we are continuously on a path to simplify healthcare for Indonesia.