
Assessing the Quality of Doctor Consultations using ML

Pranjal Aswani
Data Engineer @ Halodoc
Who are we?

[Diagram: Patients, Payers, Providers]

Halodoc upgrades the relationship between Patients, Payers, and Providers.
Halodoc services
Online and offline
But how are online consultations helping people?
Number of Physicians per 10,000 Population (OECD 2018)

UK: 28.1
US: 24.5
South Korea: 22.0
Singapore: 19.5
World average: 13.9
Malaysia: 12.0
Thailand: 3.9
Indonesia: 3.0
Healthcare unavailability because of archipelagic geography
vs
Halodoc online consultations
User’s journey for online consultation and providing feedback
Consultation user journey
Feedback user journey
Why assess consultation feedback?
Consultation feedback

As of December 2019, 50% of the total consultations are rated.

A negative experience with a doctor leads to a 1 ("Bad") app rating.

Customer feedback helps in providing actionable feedback to the doctors.
How do we assess the quality of consultations?
Taking help from humans to analyse consultations
- Manually look at random consultations by different doctors that contain feedback, and point out the mistakes the doctors are making
- Quantitative: look at the metrics a consultation produces (number of messages sent, length, notes, etc.) and flag obvious problems (0 messages, late responses from the doctor, etc.)
- Qualitative: look at how the consultation did on SAPE
- Check whether, in the end, the patient's problem was solved
Consultation metrics based on “SAPE”
● SAPE is a modified version of SOAP, a global standard for assessing consultations
  ○ Subjective: retrieve information from the patient by asking questions
    ■ Main symptom (high fever), additional symptom (body ache)
  ○ Objective: vitals and measurements (temperature, BP)
  ○ Assessment: explain the ailment to the patient and why it happened
    ■ Differential diagnosis (viral fever differs from flu), possible etiology (viral)
  ○ Planning: necessary steps for the patient to get better, plus preventative measures
    ■ Lifestyle modification (rest, light food), recommendation (paracetamol)
  ○ Etiquette: politeness and empathy towards the patient (replacement for O)
    ■ Opening ("hello") and closing ("goodbye!") etiquette
● Measuring O is difficult in an online consultation, so we measure E instead
How can we use Machine Learning for this?
Consultation quality with SAPE

Subjective: the user communicates their problem, and our doctor asks follow-up questions.

Consultation quality with SAPE … contd

Assessment: diagnosis by the doctor.
Planning: medicine recommendations.

Consultation quality with SAPE … contd

Etiquette: doctor etiquette.
Problem statement
● Given an anonymized chat consultation transcript between the doctor & patient, automate SAPE scoring
● SAPE scores take values in {0, 0.5, 1.0}
● Goals
  ○ Actionable feedback to the doctors
  ○ Improve the quality of consultations on the platform
  ○ Auto-summarisation of chat consultations (=> doctor notes)


Tech Challenges
● NLP in Bahasa Indonesia
  ○ Limited NLP resources
  ○ NLP with medical terms
  ○ Translate to English?
● Avoiding bias in the dataset
  ○ Positive and negative feedback consultations
  ○ Consultation category
● Training dataset
  ○ Equally distributed labelled data
  ○ Oversampling or undersampling (a minimal sketch follows this list)
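
The sampling idea can be illustrated with a small Python sketch. This is a minimal sketch assuming the labelled data lives in a pandas DataFrame with an illustrative `score` column; it is not Halodoc's actual pipeline.

```python
# Minimal sketch: balance SAPE score labels by oversampling minority classes.
# The `score` column name and DataFrame layout are illustrative assumptions.
import pandas as pd
from sklearn.utils import resample

def oversample(df: pd.DataFrame, label_col: str = "score") -> pd.DataFrame:
    """Resample every class up to the size of the largest class, then shuffle."""
    largest = df[label_col].value_counts().max()
    balanced = [
        resample(group, replace=True, n_samples=largest, random_state=42)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(balanced).sample(frac=1, random_state=42)

# Usage (hypothetical file):
# df = pd.read_csv("labelled_consultations.csv")
# train_df = oversample(df, label_col="subjective_score")
```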
Evaluated Approaches
- Quantitative (numbers)
- Qualitative (context)
Quantitative, Round 1: using downvoted consultations data

Numeric features based on RED (Responsiveness, Effort, and Diligence) scores created to assess doctors:

- Responsiveness: acceptance time, first-reply time, messages with a response time of more than 1 minute
- Effort: doctor-patient message ratio
- Diligence: notes depth, completion time
Created features based on this domain knowledge (sketched in code below):
- Average response time
- Average length of message
- Number of messages sent
- Duration of consultation
- Chat closed by
- eRx issued or not
- Doctor-patient chat ratio
- Number of questions asked (based on a question classifier)
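
To make the feature list concrete, here is a minimal sketch of how such features could be derived from a chat transcript; the message schema (sender / sent_at / text) is an assumption for illustration, not the actual data model.

```python
# Minimal sketch: derive round-1 style quantitative features from a transcript.
# The message schema (sender / sent_at / text) is an illustrative assumption.
from statistics import mean

def chat_features(messages):
    """messages: list of {'sender': 'doctor'|'patient', 'sent_at': datetime,
    'text': str}, sorted by sent_at."""
    doctor = [m for m in messages if m["sender"] == "doctor"]
    patient = [m for m in messages if m["sender"] == "patient"]
    # Doctor response gaps: time from a patient message to the next doctor message.
    gaps = [
        (b["sent_at"] - a["sent_at"]).total_seconds()
        for a, b in zip(messages, messages[1:])
        if a["sender"] == "patient" and b["sender"] == "doctor"
    ]
    return {
        "num_messages": len(messages),
        "avg_message_length": mean(len(m["text"]) for m in messages),
        "duration_sec": (messages[-1]["sent_at"] - messages[0]["sent_at"]).total_seconds(),
        "avg_doctor_response_sec": mean(gaps) if gaps else None,
        "doctor_patient_ratio": len(doctor) / max(len(patient), 1),
    }
```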


What kind of data did we start with?
● A team of intern doctors manually labelling the data
● A labelled dataset of around 6K consultations
  ○ Data in Google Sheets
  ○ RED scores at a doctor level + SAPE scores at a consultation level
● Bias in the data (only thumbs-down consultations were considered)


Find relations between quantitative features (RED scores) and SAPE scores (tags)

- Collate all the available data, clean it, and remove duplicates
- Each sub-category is given a 0.5 or 0.0 score (available tags)
- Combine sub-category scores to get a score for the category itself (0, 0.5, 1.0)
- Train decision trees and neural networks using the quantitative features
- Tune hyperparameters (model settings that can be adjusted for better output); see the sketch below
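
A minimal sketch of this modelling step, assuming `X` holds the quantitative feature matrix and `scores` the per-category SAPE labels; the grid values are illustrative, and the 0/0.5/1.0 scores are mapped to integer classes for scikit-learn.

```python
# Minimal sketch: decision tree over quantitative features with a grid search.
# X (feature matrix) and scores (0, 0.5, 1.0 labels) are assumed to exist.
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Map SAPE scores to integer class labels for the classifier.
y = [{0.0: 0, 0.5: 1, 1.0: 2}[s] for s in scores]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 5, 20]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```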
Test accuracies [chart]

Production accuracies [chart]

A coin toss would've given us a better accuracy score than this.
Why didn't it work?

- Too little data (around 7K consultations after cleaning, all downvoted)
- The question classifier (a basic dictionary approach) was not very accurate (~40%)
- We had more data for certain scores (1.0) and less for others (0.0, 0.5): an imbalanced dataset

How do you make it better?

- Collect better data
- Feature engineering to find more possible features (such as the relation of a patient's age to the consultation duration)
- Sampling
- Create a better question classifier (needed for the Subjective category in SAPE)
Data Collection

- Clean and better-structured data
  - Input for RED scores
  - Sentence-level tagging
- No PII exposed to the intern doctors, unlike earlier
- Generating reports from the data for doctors
- An average of ~300 consultations tagged per day (compared to 100-150 per day earlier)
Quantitative, Round 2: using better data + sampling + better features

- Sampling (over- and under-, per category) to fix the imbalanced dataset
- Created a better question classifier using a Support Vector Machine + TF-IDF (43% more accurate); a sketch follows this list
- Feature engineering to find more meaningful features
  - Age of patient/doctor
  - Gender of patient/doctor
  - Type of consultation (general, pediatric, OBGYN)
- Re-generate features and fill missing values with averages
- More data
- More enthusiasm!!
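
A minimal sketch of a question classifier in this style, assuming a labelled set of sentences; the Bahasa examples and labels are toy illustrations, and a real training set would need thousands of labelled sentences.

```python
# Minimal sketch: TF-IDF + SVM question classifier. Training data is a toy
# illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

sentences = [
    "Sudah berapa lama demamnya?",  # "How long have you had the fever?"
    "Saya merasa sakit kepala",     # "I have a headache"
    "Apakah ada gejala lain?",      # "Are there any other symptoms?"
    "Obatnya sudah saya minum",     # "I have taken the medicine"
]
labels = [1, 0, 1, 0]  # 1 = question, 0 = not a question

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(sentences, labels)
print(clf.predict(["Apakah demamnya tinggi?"]))  # "Is the fever high?"
```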
Subjective category results on test set [chart]

Previous results on test set [chart]

Rethink the features!

Our models failed because the distribution of our features across the 3 scores was almost identical.

The models couldn't find any patterns because there weren't any!

But why didn't you do this earlier?
→ Because we only had one type of data: downvoted consultations
Qualitative: finding context in a consultation

What do we have?
- Sentences tagged at a category level per consultation
- Scores for each category of a consultation (0, 0.5, 1 across S, A, and P)

Main idea:
- For each category, there will be words and word pairs (n-grams) that occur only in sentences of that category
- Exploit this for each of the categories
Pilot: Assessment classifier

- Predict the score for Assessment (0 or 1) based on the available chat sentences
- Total sentences: ~40K
- Take all the A sentences of a consultation and combine them into one document
- Generate a TF-IDF vector for each consultation's combined sentences
- Train an SVM to predict the value of A for that consultation as 0 or 1
- ~70% accuracy on the test set (a minimal sketch follows)
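
A minimal sketch of the pilot under assumed data structures: each consultation maps to its Assessment sentences and a 0/1 score.

```python
# Minimal sketch of the pilot Assessment classifier. `consultations` is an
# assumed structure: {consultation_id: (list_of_A_sentences, score_0_or_1)}.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Combine each consultation's A sentences into one document.
docs = [" ".join(sents) for sents, _ in consultations.values()]
y = [score for _, score in consultations.values()]

pipeline = make_pipeline(TfidfVectorizer(), LinearSVC())
print(cross_val_score(pipeline, docs, y, cv=5).mean())  # mean fold accuracy
```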
Category classifiers
- Create individual category classifiers to classify sentences as S, A, or P sentences
- Feed the classified sentences for each category to the sub-category classifiers
- Train sub-category classifiers (e.g. main symptom and additional symptom for the Subjective category) to predict scores of 0 or 0.5 using the sentences for that category

A sketch of this two-stage design follows.
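
This is a minimal sketch of the routing logic; the classifier objects are assumed to be already-trained scikit-learn text pipelines (e.g. TF-IDF + SVM), and the 0.0 fallback for empty categories is an assumption.

```python
# Minimal sketch: route sentences through a category classifier, then score
# each category with its own sub-category classifier.
def score_consultation(sentences, category_clf, subcategory_clfs):
    routed = {"S": [], "A": [], "P": []}
    for sentence, category in zip(sentences, category_clf.predict(sentences)):
        routed[category].append(sentence)
    scores = {}
    for category, cat_sentences in routed.items():
        if cat_sentences:
            doc = " ".join(cat_sentences)  # one document per category
            scores[category] = subcategory_clfs[category].predict([doc])[0]
        else:
            scores[category] = 0.0  # assumed fallback: no evidence found
    return scores
```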
Results?

Category classifier test accuracy [chart]

Sub-category classifier test accuracy [chart]

The final models were chosen after experimenting with a dozen different kinds of algorithms.
Prod results

Subjective: accuracy of 70% (+25% improvement over the quantitative techniques)

Assessment: accuracy of ~62% (~45% improvement in accuracy)

Planning: accuracy of ~57% (~40% improvement over the previous algorithms)


Learnings
● Quantity and Quality of the dataset

● Avoid bias in dataset

● Metrics to measure the impact

● Setting expectations with the business stakeholders

● Working with uncertainty


Next steps
- Get more data
- Tag patient-level sentences to get the full context of the consultation
- Create better models using word2vec (a possible sketch follows)
- Repeat
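
A possible starting point for the word2vec step with gensim; all parameters are illustrative assumptions, and `tokenized_sentences` is assumed to be a list of token lists built from the chat corpus.

```python
# Minimal sketch of the word2vec next step (gensim 4.x API). Parameters and
# variable names are illustrative assumptions, not a settled design.
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=tokenized_sentences,  # assumed: [["sudah", "berapa", "lama", ...], ...]
    vector_size=100,                # embedding dimension
    window=5,
    min_count=2,
    workers=4,
)
model.save("halodoc_chat_w2v.model")
print(model.wv.most_similar("demam"))  # "demam" = fever
```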
QUESTIONS?

Excited? Join us!
careers.india@halodoc.com

www.halodoc.com
blogs.halodoc.io
● https://www.linkedin.com/in/pranjalaswani/
● https://www.linkedin.com/in/rdurgam/

THANK YOU
