Assessing quality of Consultations @ Halodoc

Machine learning Mar 14, 2020



We at Halodoc provide tele-consultation as a service on our platform, allowing patients to interact with doctors online via chat, audio and video calling services and receive the treatment their ailments require.

Unlike a physical consultation, where the doctor is the primary point of contact and solely responsible for the patient's recovery and well-being (in some cases hospitals also share responsibility for proper treatment of the patient), in an online consultation Halodoc as a platform shares responsibility for the patient's proper treatment along with the doctor.

For these reasons, it is necessary that we maintain a high standard of quality for the consultations that happen on our platform. Currently, we have broken down the quality assessment of a consultation into the following steps:

  • Quantitative: Looking at consultation-level metrics such as length of chat, number of messages exchanged, average response time, characters in the notes given to the patient, etc., and then separating out the obvious defaulters.
  • Qualitative: Analysing the consultation on 4 categories based on the SOAP convention of treating a patient. We modified SOAP (Subjective, Objective, Assessment, Plan) for our use case and analysed consultations on SAPE instead. Since the Objective part (height, weight, pulse, etc.) doesn't fit into the online model of tele-consultations, we replaced it with Etiquette.
  • Quality of treatment: Has the patient’s health concern been resolved?

Once the consultation ends, the user gets an option to rate it; so far we have a 3.06% bad consultation rate (i.e. 96.94% of consultations are rated positively out of the total consultations rated by users).

Initially, when we started monitoring the quality of consultations, our in-house doctors had a dipstick approach in which they went through some of the negatively rated consultations and provided a SAPE score for each of them.
But this approach to quality control doesn't scale well, and we thought we could do better by automating the quality assessment process altogether. Automation came with several advantages:

  • By automating the current manual task of rating consultations, we free up our operations team and reduce costs by not adding more manpower as we scale
  • It enables us to monitor the quality of new doctors on the platform effectively and build solutions for the same by looking at patterns of the older doctors
  • Retrieve any useful data as a by-product of assessing the consultations

What is SAPE?

We mentioned the SAPE score before; below are explanations of each individual component and its importance at the consultation level:

  • Subjective: measures if the doctor has asked follow-up questions or requested more information after the preliminary symptoms that the patient has presented.
    Example: if a patient informs the doctor that they have a fever, follow-up questions would be along the lines of how high the temperature is or for how long they've had it. This category is divided into two subcategories: Main symptoms (high fever) and Additional symptoms (body ache). These give the doctor an in-depth insight into what exactly the patient might be suffering from.
  • Assessment: measures if the doctor has provided an adequate explanation of the ailment to the patient.
    Example: After the doctor determines the illness, do they take the time to explain what the illness was and how the patient ended up contracting it?
    Similar to Subjective, Assessment is divided into: Differential diagnosis (viral fever different from flu) and Possible etiology (viral fever). These inform the patient about exactly what they might be suffering from and how it is different from some other similar disease.
  • Planning: measures if the doctor has provided the next steps for the patient to take. These could vary from taking some rest to a referral to meet another doctor. Again, planning is divided into: Lifestyle modification (take rest, eat light food) and medical recommendation (paracetamol tablets taken twice daily for 3 days).
  • Etiquette: checks if the doctors are empathetic to the patients in terms of their communication and language. This too is divided into: Opening (hello) and closing etiquette (good bye!).

Why look at SAPE in the first place? Because it's a standard followed by doctors across the world, and it gives us a good idea of what an ideal consultation should be like.

When the in house doctors tagged consultations on SAPE, they gave a score of 0 or 1 for each of the sub categories mentioned above. This way, we had numerical scores that we could use to tag a consultation as good or bad. For simpler analysis, we transformed the sub category scores of 0 and 1 into 0 and 0.5 so that at a category level we could have scores of 0, 0.5 and 1.0 when we combined both the sub category scores for a particular category.
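The score transformation above can be sketched as a tiny function: each binary subcategory tag is rescaled to 0 or 0.5, and the two subcategory scores sum into a category score of 0.0, 0.5 or 1.0. The function name is illustrative, not production code:

```python
def category_score(sub_a: int, sub_b: int) -> float:
    """Combine two binary subcategory tags (0 or 1) into one category score."""
    return 0.5 * sub_a + 0.5 * sub_b

# e.g. Subjective = Main symptoms (tagged 1) + Additional symptoms (tagged 0)
print(category_score(1, 0))  # 0.5
print(category_score(1, 1))  # 1.0
```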


While it is easy for a human to look at a consultation and assess it on the above 4 categories, we needed to understand that process and translate it into one that a machine could follow to provide us with similar insights across a variety of consultations.

We came up with the following two approaches to solve the problem:

  1. Quantitative: this approach primarily worked with features that we created by talking to the operations team (in-house doctors who have the domain knowledge) and understanding what we should be focusing on. We came up with features such as:
    - average time taken by the doctor/patient to reply (as the patient might get angry on late replies and leave a bad comment)
    - average length of a message (as short answers might not be clear enough)
    - number of characters in the notes provided by the doctor and so on.
    Using these features and the mapped SAPE scores, we tried to figure out if there were any patterns that we could learn and generalise the approach taken by the doctors.
  2. Qualitative: This approach focused on applying Natural Language Processing to parse and tag a consultation like a human would, under a category. For example, the operations team measured the S score by checking if the patient had provided both the main and additional symptoms (and if the doctor had any more follow-ups).
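The quantitative features from the first approach can be sketched as below. The message schema (sender, ts, text) is a hypothetical stand-in, not Halodoc's actual chat data model:

```python
# Sketch: computing consultation-level quantitative features from a chat log.
from statistics import mean

def consultation_features(messages):
    """messages: list of dicts with 'sender', 'ts' (epoch seconds), 'text'."""
    doctor_msgs = [m for m in messages if m["sender"] == "doctor"]
    # A reply gap is the time between two consecutive messages from
    # different participants.
    reply_gaps = [
        b["ts"] - a["ts"]
        for a, b in zip(messages, messages[1:])
        if a["sender"] != b["sender"]
    ]
    return {
        "num_messages": len(messages),
        "avg_reply_time": mean(reply_gaps) if reply_gaps else 0.0,
        "avg_msg_length": mean(len(m["text"]) for m in messages),
        "doctor_msg_share": len(doctor_msgs) / len(messages),
    }

chat = [
    {"sender": "patient", "ts": 0, "text": "I have a fever"},
    {"sender": "doctor", "ts": 30, "text": "How high is it?"},
    {"sender": "patient", "ts": 50, "text": "39 degrees"},
]
print(consultation_features(chat))
```

These per-consultation feature dictionaries, paired with the SAPE scores, form the training rows for the models described next.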

Quantitative approach: Phase 1

Using the features mentioned above, we trained neural networks and decision trees to figure out patterns in the features and corresponding scores (0.0, 0.5 and 1.0) for each of the 4 categories.

When we started, the consultations were being tagged in Excel sheets, since it was easier to do analysis on them at a consultation as well as doctor level, based on the quantitative features mentioned above.
After cleaning and removing duplicates, we were left with about 7000 data samples, all of them negatively rated consultations.

For the Subjective category, one of the features we thought valuable was the number of questions asked by the doctor/patient. So we created a question classifier using a dictionary approach, which checked if a question word like what, how, etc. or the ? symbol was present in a sentence and tagged it as a question accordingly.
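A minimal version of that dictionary tagger is shown below. The keyword list here is illustrative (and in English; the real dictionary would be in the language of the consultations):

```python
# Toy dictionary-based question tagger: a sentence counts as a question
# if it contains a "?" or any word from a small question-word dictionary.
QUESTION_WORDS = {"what", "how", "why", "when", "where", "which", "who"}

def is_question(sentence: str) -> bool:
    tokens = sentence.lower().rstrip("?!. ").split()
    return "?" in sentence or bool(QUESTION_WORDS & set(tokens))

print(is_question("How long have you had the fever?"))  # True
print(is_question("Please take some rest."))            # False
```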

The approach for the etiquette classifier was different from the other categories. We used the above dictionary approach here as well, checking for etiquette-specific words (hello, good afternoon, take care, good bye!) in sentences; if those words were found at the beginning and end of a consultation, we approved the consultation on the etiquette part. The results in production were mediocre at best, so we dropped this category altogether.

Feature Engineering

We performed feature engineering to understand which features would contribute the most in determining the score for different categories, i.e which features had the most impact, when assessing a consultation as good or bad.

We looked at different factors which might influence the consultation, such as: the gender of the doctor (assumption being that a female doctor would be more polite than a male doctor); the age of the doctor (a young doctor might answer patient queries faster, being more adept with the app than older doctors); the experience of the doctor on the platform; the duration of the consultation; the gender and age of the patient; and the category of the consultation (assumption being that, because of platform-level constraints, the patient might not be able to explain their problems properly in certain categories of consultations).

Our final goal was to find the factors on a consultation and a participant level, which showed a definite pattern for separating good consultations from bad, and use them as features for our models.

Creating models and deploying them on production

Feature engineering did give us some good features and we used them along with the already generated features to train decision trees and neural networks.

The best performing algorithms had the following accuracies on the test set:

  • Subjective: 77%
  • Assessment: 75%
  • Planning: 74%

When we deployed these models to production, the results were underwhelming, with the best performing model having an accuracy of 40%. We came up with a few hypotheses for the gap:


  • Models were trained on downvoted consultations and predictions were made on both downvoted and upvoted consultations which might have caused the poor results.
  • The models seemed to predict class 1.0 more often than the other classes, i.e. they were biased towards class 1.0 (which might have happened because of the data imbalance)
  • We also thought that these results were bad because of the quality and amount of data we had.

To remedy the last point, we created an internal tool to get better structured and cleaner data with fewer null/empty values. By typing the consultation id into the tool, the person tagging the consultation could see the doctor-patient chat (no PII was shown), the consultation notes and all the other information needed to score the consultation on S, A, P and E. We also thought it would be helpful to have them mark the sentences which corresponded to a category (S, A, P, E), since this data could be used for NLP techniques later.

We also came up with constraints so that only relevant consultations were tagged: only tag consultations where each participant (doctor and patient) had a minimum of 5 messages; have at least one consultation per doctor, so that we cover all the doctors in terms of quality; and finally, tag the positively rated consultations as well, since the models needed to learn what a good consultation looked like in order to differentiate it from the bad ones.

Quantitative approach: Phase 2

It was unclear why the results were poor in the first phase, so we set out to find some answers.

  • Imbalanced data: The factor we thought contributed to results biased towards one class (usually 0 or 1) was that the number of samples for a particular class overpowered the number of samples for the other classes, i.e. the dataset was not balanced, with one of the scores being observed more than the others. To tackle this problem, we turned towards sampling the data.
Sampling is a process in which we try to balance the number of data points for each of the classes present. We can do this by creating artificial data for the classes which are lower in number and matching these numbers to the overpowering class. This is known as Oversampling.

There is another approach called Under-sampling, in which we take out data points from the dominant class, thus reducing the number of samples for that class. But this technique comes with a caveat: when we remove those samples, we lose information from the data, and given that we already had little data, we decided not to go with this approach.
  • Question classifier: we decided to revisit the features and figured out that the dictionary approach to tagging sentences was only 38% accurate. Only 38% of sentences were marked as questions correctly, and since we thought these features were important, we decided to upgrade to a better question classifier.
The data collection tool provided an option to mark each chat sentence as true if it was a question (one of the things that we thought would be helpful, and it was). Hence we trained a simple Support vector machine to classify sentences as questions or not.
The results of the question classifier were good with an accuracy of around 80%. So we decided to move ahead with it.
  • By this time, we had accumulated more data since the ops team had been continuously tagging data using the data collection tool (around ~10,000 more consultations were added to the dataset, apart from the initial 7000)
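The oversampling idea above can be shown in miniature. Libraries such as imbalanced-learn provide RandomOverSampler and SMOTE out of the box; this toy version just duplicates minority-class samples at random until every class matches the majority count:

```python
# Sketch of random oversampling: duplicate minority-class samples at random
# until each class has as many samples as the majority class.
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, n in counts.items():
        idxs = [i for i, label in enumerate(y) if label == cls]
        for _ in range(target - n):
            i = rng.choice(idxs)  # resample an existing minority example
            X_out.append(X[i])
            y_out.append(y[i])
    return X_out, y_out

X = [[0.1], [0.2], [0.3], [0.9]]
y = [1.0, 1.0, 1.0, 0.0]
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))  # both classes now have 3 samples
```

SMOTE differs in that it synthesises new minority points by interpolating between neighbours rather than duplicating existing ones.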

After regenerating the features and applying sampling techniques, we got the following results for Subjective category:

  • 59.13% using Random Oversampling with the Gradient Boosting algorithm
  • 60.03% using Random Oversampling with the SVM algorithm
  • 53.68% using SMOTE with the XGBoost algorithm
  • 61.54% using SMOTE with the SVM algorithm
  • 53.12% using SVM-SMOTE with the XGBoost algorithm
  • and a few more techniques

We didn't pursue these for the other categories as we didn't see any improvements for the Subjective category. This was confusing, since we thought the above approaches would yield good results, but they didn't.

Distribution of features over scores

Unable to understand how to proceed, we decided to plot distribution of all the features for the subjective category over the 3 scores that we were trying to predict. These features were normalised to get comparable plots. The aim of these plots was to better understand if we could find a pattern in the data as well as see which features were the most important. Some of the distributions are shown below:

We can see that the distributions for all the scores look similar in shape, only the number of data points in each of the features differ (from the left: 0.0, 0.5, 1.0)

Patient/doctor average response time also revolves around 0-0.2 (after normalising) for all scores

The above plots made us realise that there are no patterns in the data. Our models couldn't succeed because the distribution of our features across the 3 scores was almost identical.  There were no patterns to be found!

This made a lot of sense, and it was now clear that these quantitative features were not going to help us determine the quality score of a consultation. Average response time might be skewed by network latency; the average number of messages might be lower for a consultation, yet the consultation might be very good because the doctor was on point with their answers.

We needed to move away from quantitative features and figure out a way to understand the context of a consultation.

We decided to invest in Natural Language techniques and put an end to the quantitative approach we were taking until now.

Qualitative approach

To start with NLP, we decided to plot the frequency of mono, bi and tri grams of words (n-grams) for all the categories using the chat sentences we had. The idea was that for each of these categories, there will be words which only occur in that category and not others.
For example, Subjective sentences can have more words which imply a question and Assessment category will have more words which conclude to a diagnosis. Using these n-grams, we could see the different words which then could be used to classify sentences into categories.
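A minimal per-category n-gram counter along these lines is sketched below; the example sentences are illustrative, not real consultation data:

```python
# Count uni/bi/tri-grams for a category's sentences; comparing the top
# n-grams across categories surfaces words distinctive to each category.
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token windows, e.g. n=2 gives word pairs."""
    return list(zip(*(tokens[i:] for i in range(n))))

def ngram_counts(sentences, n_max=3):
    counts = Counter()
    for s in sentences:
        tokens = s.lower().split()
        for n in range(1, n_max + 1):
            counts.update(ngrams(tokens, n))
    return counts

subjective = ["how long have you had the fever", "do you have body ache"]
print(ngram_counts(subjective).most_common(3))
```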

Context is key

  • As a proof of concept, we tried to create a classifier to predict an Assessment score of 0.0 or 1.0 (consultations scored 0.5 were dropped) for a consultation by looking at the sentences contained in that consultation. This model was trained on A sentences and non-A sentences (S+P sentences), around 120,000 sentences in total.
  • All the A sentences for a consultation were combined and fed into the classifier, which predicted a score of 0.0 or 1.0 for Assessment. This classifier was able to predict the Assessment scores correctly with a test accuracy of ~70%.
  • We decided to dig deeper into this technique and ended up creating a flow involving individual classifiers for each of the categories and subcategories.
  • The idea was to calculate a score for each of the subcategories (0.0, 0.5) using sentences in the consultation.
  • All the sentences in the consultations would be divided into the 3 categories Subjective, Assessment and Planning using 3 classifiers trained to predict sentences in those categories (a binary classifier which tells you if a sentence falls in a category or not).
  • The sentences tagged by these classifiers for each of the categories were combined and put into the subcategory classifiers. The job of the subcategory classifiers was to predict the score of 0.0 or 0.5 given a string of combined sentences.

The flow

  • Get all the sentences in a consultation
  • Predict the sentences which fall under Subjective category (for example)
  • Combine all the subjective category sentences as a single string
  • Put this combined string into a subcategory classifier (main symptom and additional symptom) to output a score of 0 or 0.5
  • repeat for other categories
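The flow above can be sketched with the classifiers stubbed out as plain callables; in production these were trained text classifiers, not the trivial stand-ins shown here:

```python
# Two-stage scoring sketch: a category classifier selects the relevant
# sentences, then each subcategory classifier scores the combined text
# 0.0 or 0.5; their sum is the category score (0.0, 0.5 or 1.0).

def score_category(sentences, category_clf, subcategory_clfs):
    """category_clf: sentence -> bool (does it belong to the category?)
    subcategory_clfs: callables mapping combined text -> 0.0 or 0.5."""
    relevant = [s for s in sentences if category_clf(s)]
    combined = " ".join(relevant)
    return sum(clf(combined) for clf in subcategory_clfs)

# Trivial stand-ins for illustration only:
is_subjective = lambda s: "?" in s
main_symptom = lambda text: 0.5 if "fever" in text else 0.0
additional_symptom = lambda text: 0.5 if "ache" in text else 0.0

chat = ["Hello doctor", "How high is the fever?", "Any body ache?", "Take rest"]
print(score_category(chat, is_subjective, [main_symptom, additional_symptom]))  # 1.0
```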

The current deployed set up for quality measurement

The output of the subcategory classifiers was highly dependent on the output of the category classifiers since if the latter didn't perform well, the data going into the subcategory classifier would be bad and hence bad results would be found. Bad data in → bad results out.
But since the category classifiers gave an average accuracy of above 85%, we decided to move forward and create the subcategory classifiers which also had a 70% average accuracy on the holdout set, which was not too bad.

These models were then tested in production using ~8K consultations tagged by the operations team. Overall, we got great results from these algorithms, with an average Subjective accuracy of 70% (a +25% improvement over quantitative techniques), Assessment accuracy of ~62% (~45% improvement) and Planning accuracy of ~57% (~40% improvement over the previous algorithms).

From the qualitative analysis, one thing is clear: if we want to solve this problem and create generic models which can predict the SAP scores for any type of consultation, we need to look at the context of the consultation rather than its physical characteristics. We are actively researching how to create better, more accurate classifiers, and we intend to leverage deep learning techniques such as LSTMs, embeddings, etc. for the same.

We are always looking to hire like-minded individuals who have a passion for solving problems like these at scale. If challenging problems that drive big impact enthral you, do reach out to us at

About Halodoc

Halodoc is the number 1 all-around healthcare application in Indonesia. Our mission is to simplify and bring quality healthcare across Indonesia, from Sabang to Merauke.
We connect 20,000+ doctors with patients in need through our Tele-consultation service. We partner with 1500+ pharmacies in 50 cities to bring medicine to your doorstep. We've also partnered with Indonesia's largest lab provider to provide lab home services, and to top it off we have recently launched a premium appointment service that partners with 500+ hospitals that allows patients to book a doctor appointment inside our application.
We are extremely fortunate to be trusted by our investors, such as the Bill & Melinda Gates Foundation, Singtel, UOB Ventures, Allianz, Gojek and many more. We recently closed our Series B round and in total have raised US$100 million for our mission.
Our team works tirelessly to make sure that we create the best healthcare solution personalised to all of our patients' needs, and we are continuously on a path to simplify healthcare for Indonesia.

Pranjal Aswani

Hi! I'm Pranjal, a data engineer at Halodoc, where I solve complex problems at the intersection of healthcare and user personalisation using ML. When not working, I like to read books and exercise.