VI Sem Project - Instant Access To Healthcare Using AI - Voice Enabled Chat Bot - Group No - 3

Project Report
on
Instant Access to Healthcare using AI - Voice Enabled Chat Bot
Submitted to
Shri Ramdeobaba College of Engineering & Management, Nagpur
(An Autonomous Institute Affiliated to Rashtrasant Tukdoji Maharaj Nagpur University)
for partial fulfillment of the degree in
Bachelor of Engineering
(Information Technology)
Sixth Semester
by
SIMRAN SINGH (23)
PARTHSARTHI PAHUJA (54)
YASH GUPTA (70)
Under the Guidance of
Dr. D.S. Adane
Department of Information Technology
Shri Ramdeobaba College of Engineering & Management,
Nagpur-13
2020-21
CERTIFICATE
This is to certify that the Project Report on
Instant Access to Healthcare using AI - Voice Enabled Chat Bot

is a bonafide work and it is submitted to
Shri Ramdeobaba College of Engineering & Management, Nagpur
(An Autonomous Institute Affiliated To Rashtrasant Tukdoji Maharaj Nagpur University)
by
Simran Singh, Parthsarthi Pahuja, Yash Gupta
For partial fulfillment of the degree in
Bachelor of Engineering in Information Technology,
Sixth Semester
during the academic year 2020-21
under the guidance of
Dr. D.S. Adane

Professor, Department of Information Technology, RCOEM, Nagpur
Dr. D. S. Adane
Head, Department of Information Technology
RCOEM, Nagpur
Dr. R. S. Pande
Principal
RCOEM, Nagpur
ACKNOWLEDGEMENTS
It is our proud privilege to present a project report on “Instant Access to Healthcare
using AI - Voice Enabled Chat Bot”. We take this opportunity to express our
deep sense of gratitude and wholehearted thanks to our guide and head of the department,
Dr. D.S. Adane, Department of Information Technology, Shri Ramdeobaba College of
Engineering and Management, Nagpur, for his valuable guidance, inspiration and
encouragement, which led to the successful completion of our project.
A special word of thanks goes to the entire Department of Information Technology, RCOEM,
Nagpur, for their encouragement and cooperation, which helped us accomplish our work on time.
Finally, we would like to thank and express sincere gratitude towards our Principal,
Dr. R.S. Pande, for being our source of inspiration throughout this project. We would also
like to thank each and every member involved in the completion of this project.
Name of Projectees
Simran Singh (23)
Parthsarthi Pahuja (54)
Yash Gupta (70)
CONTENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION TO CHATBOT
1.2 ARTIFICIAL INTELLIGENCE IN MEDICINE
1.3 FUTURE SCENARIO FOR INDIA
CHAPTER 2
OVERVIEW OF HEALTHBOT
2.1 CHATBOTS IN HEALTHCARE INDUSTRY
2.2 USE CASES IN HEALTHCARE
2.3 CHALLENGES AND LIMITATIONS
CHAPTER 3
LITERATURE REVIEW
3.1 SURVEY OF EXISTING MODELS
CHAPTER 4
AIM, OBJECTIVES & METHODOLOGY
4.1 PROBLEM STATEMENT
4.2 PROPOSED SOLUTION
4.3 METHODOLOGY
CHAPTER 5
NATURAL LANGUAGE PROCESSING
5.1 INTRODUCTION TO NLP
5.2 NLP TECHNIQUES
5.3 IMPLEMENTATION
CHAPTER 6
MACHINE LEARNING
6.1 INTRODUCTION TO ML
6.2 RESEARCH ON ML ALGORITHMS
CHAPTER 7
DATABASE
7.1 DATA IN HEALTHCARE
7.2 DATABASE DEVELOPMENT
CHAPTER 8
CONCLUSION AND FUTURE WORK
REFERENCES
ANNEXURE I
ABSTRACT
With the recent growth in individuals' interest in health, life care, and disease, medical
institution services have been moving from a treatment focus to prevention and health
management. The medical industry is creating more services for health- and life-promotion
programs. This change represents a paradigm shift in medical services, driven by longer life
expectancy, aging, lifestyle changes, and rising incomes, and consequently the idea of the
smart health service has emerged as a major issue. However, as the quantity of information
grows and the complexity of medical information intensifies, the limitations of previous
approaches are becoming increasingly problematic. With recent trends in technology, AI
chatbots have managed to pave their way into the healthcare domain. Although healthcare was
not the first sector in which experiments with chatbots were carried out, since the beginning
of 2018 we have seen the emergence of and experimentation with many different use cases in
this field. A chatbot is an intelligent conversation platform that interacts with users via a
chat interface, and since its use can be facilitated by linkages with the major social network
service messengers, general users can easily access and receive various health services. The
framework consists of three modules: Natural Language Processing, Machine Learning and
Database. The work then focuses on two Machine Learning algorithms, Random Forest and KNN,
which are supervised learning algorithms that take user input and provide a diagnosis based on
the information stored in the knowledge base of the system. The project is currently in the
development phase, with the algorithms being tested on ten diseases, and future plans are
stated.
LIST OF FIGURES
Sr. No. Description
Figure 1.1 Example of conversational bot
Figure 1.2 Use cases of bots in AI
Figure 2.2.1 Chatbot checking symptoms
Figure 2.2.2 Suggesting healthcare service
Figure 2.2.3 Chatbot suggesting Medication
Figure 2.2.4 Booking appointment using Chatbot
Figure 4.3.1 Chatbot Architecture
Figure 5.1 NLP working
Figure 5.3.1 Speech recognition code
Figure 5.3.2 Text Pre-processing code
Figure 5.3.3 Output of NLP Methods
Figure 6.3.1 Execution of Random Forest
Figure 6.3.2 Sample input to the code
Figure 6.3.3 Output of the following code
Figure 6.3.4 Execution of K-Nearest Neighbor
Figure 6.3.5 Sample input to the code
Figure 6.3.6 Output of the following code
Figure 7.3.1 Code for Web Scraping
Figure 7.3.2 Code for Exporting Scraped Data to CSV File
Figure 7.3.3 Snapshot of Cleaned Training.csv File
Figure 7.3.4 Snapshot of Cleaned Testing.csv File
LIST OF TABLES
Sr. No. Description
Table 6.2.1 The difference between supervised learning and unsupervised learning
Table 6.2.2 Summary of the reviewed ML algorithms
CHAPTER-1
INTRODUCTION
1. INTRODUCTION
This chapter introduces chatbots: what exactly a chatbot is, the history and types of
chatbots, their usage, and their application in the field of healthcare. It then covers the
use of Artificial Intelligence in the medical sector, followed by the future scope of
AI-based chatbots in the healthcare sector.
1.1 INTRODUCTION TO CHATBOT

A chatbot is a computer program that simulates and processes human conversation (either
written or spoken), allowing humans to interact with digital devices as if they were
communicating with a real person. Chatbots can be as simple as rudimentary programs that
answer a simple query with a single-line response, or as sophisticated as digital assistants that
learn and evolve to deliver increasing levels of personalization as they gather and process
information.
1. What is a chatbot?
Several million people enter keywords every day in search engines such as Google and then
have to choose from a list of results, usually in the form of web pages in which it is again
necessary to search for specific information.
A chatbot is a software robot that can reproduce natural language and interact with an individual
through automated conversations. Chatbots allow you to receive a unique answer or a service. In
the literature, chatbots and conversational agents are sometimes distinguished by their level of
natural language understanding: the former rely on keyword or rule engines, while the latter are
based on machine learning. We use the term chatbot in its generic sense in this report. The
operating model of a chatbot is always the same, whatever its scope, theme and level:
• Users formulate their queries in natural language via a voice or text interface.
• The chatbot receives the request and its engine interprets it.
• The chatbot provides a unique and qualified answer to the user’s query.
The answer may be generic (i.e. the same for everyone), contextualized (adapted to the context,
for example a given time and place) or customized (adapted to the user, for example by
providing them with their bank balance).
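The three-step operating model described above can be illustrated with a minimal sketch of a keyword-based (rule engine) chatbot. This is only an illustration and not the implementation used in this project: the rule table `FAQ_RULES`, its answers, and the function names `interpret` and `answer` are hypothetical.

```python
import re

# Hypothetical rule table: keyword -> predefined (generic) answer.
FAQ_RULES = {
    "prescription": "A prescription is issued by a doctor after a consultation.",
    "appointment": "Appointments can be booked through the clinic.",
}
FALLBACK = "Sorry, I did not understand. Could you rephrase your question?"


def interpret(query: str) -> set:
    """Step 2: the engine reduces the free-text query to a set of lowercase words."""
    return set(re.findall(r"[a-z]+", query.lower()))


def answer(query: str) -> str:
    """Step 3: return one answer for the query, or a fallback if no rule matches."""
    words = interpret(query)
    for keyword, reply in FAQ_RULES.items():
        if keyword in words:
            return reply
    return FALLBACK


# Step 1: the user formulates a query in natural language.
print(answer("How can I get a prescription?"))
```

Because every user receives the same reply for the same keyword, this sketch corresponds to a generic answer; a contextualized or customized answer would need additional inputs such as location or user account data.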
There are three types of chatbot:
Assistants: Provide the user with a predefined answer, as on a "Frequently Asked Questions"
page.
Concierges: Provide a contextualized response and facilitate a service for the user, for example
by explaining the steps of an action to be taken.
Advisors: Integrate customized answers to complex requests with automated processes to
perform certain actions.
Figure 1.1: Example of conversational bot
2. History of Chatbots
Chatbots are in the spotlight today, but the first chatbot, ELIZA, emerged as early as 1964.
Since then, several chatbots have been built, through artificial intelligence research in computer
science, to try to understand and reproduce the human ability to conduct a conversation. Other
noteworthy chatbots include Jabberwacky in 1982 and A.L.I.C.E. in 1995. Since 2010, the web
giants have been launching smart assistants for smartphones and PCs to improve the user
experience. The best known is Siri, which Apple introduced on the iPhone in 2011.
Then there was Google Now in 2012, followed by Cortana at Microsoft and Alexa at Amazon in
2014. Since 2016, chatbot solutions have been multiplying, particularly on Facebook Messenger,
thanks to the simplification of chatbot technologies and implementation tools that anyone can use.
1.2 ARTIFICIAL INTELLIGENCE IN MEDICINE
Artificial intelligence (AI) and machine learning solutions are transforming the way healthcare
is being delivered. Health organizations have accumulated vast data sets in the form of health
records and images, population data, claims data and clinical trial data. AI technologies are well
suited to analyze this data and uncover patterns and insights that humans could not find on their
own. Through AI, healthcare organizations can use algorithms to help them make better
business and clinical decisions and improve the quality of the experiences they provide.
1. What is Artificial Intelligence?
“Artificial Intelligence is neither a new technology nor a machine”. Artificial intelligence can be
understood as outcome-directed thinking: the rapid analysis of live data to achieve an expected
goal. Outcome-directed thinking breaks away from the confines of the rule-directed approach.
The generalized practice of AI can be broken down into a straightforward process. First, a
numerical representation is established for the target or outcome. Specific data associated with
the target is then gathered, and conditions and behaviors are investigated to increase the
likelihood of achieving the expected target. Multiple aspects can determine the outcome, so the
weight of each aspect's effect is computed. “AI uses the relative weighting of each aspect to
create a prediction (evaluation) formula” (Yano, K. 2017). Lastly, the formula devised from the
weighted aspects is applied to business decisions. AI can be classified into four groups:
“systems that think like humans, systems that act like humans, systems that think rationally and
systems that act rationally”. AI is also generally categorized as strong or weak: strong AI is the
production of human-like intelligent systems, while weak AI is the integration of intelligent
algorithms embedded within a system. “Machine learning, deep learning, natural language
processing and neural networks are often summarized under the term of AI”.
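As a rough illustration of the weighting step described above, the sketch below combines several aspects into one prediction score. It assumes the simplest possible form of a prediction formula, a weighted sum; the source does not specify the form. The aspect names, weights and the function `prediction_score` are hypothetical and carry no medical meaning.

```python
def prediction_score(aspects: dict, weights: dict) -> float:
    """Weighted sum of aspect values (assumed form of the prediction formula)."""
    return sum(weights[name] * value for name, value in aspects.items())


# Hypothetical relative weights of three aspects (they sum to 1).
weights = {"aspect_a": 0.5, "aspect_b": 0.3, "aspect_c": 0.2}

# Observed values for one case: 1.0 = aspect present, 0.0 = absent.
observed = {"aspect_a": 1.0, "aspect_b": 0.0, "aspect_c": 1.0}

score = prediction_score(observed, weights)
print(f"prediction score: {score:.2f}")
```

In practice the weights would be estimated from the gathered data rather than written by hand.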
2. Artificial intelligence in medicine

The application of AI in medicine has two main branches: a virtual branch and a physical branch.
Virtual branch –
The virtual component is represented by Machine Learning (which includes Deep Learning):
mathematical algorithms that improve through experience. There are three types of machine
learning algorithms:
1. Unsupervised (ability to find patterns)
2. Supervised (classification and prediction algorithms based on previous examples)
3. Reinforcement learning (use of sequences of rewards and punishments to form a
strategy for operation in a specific problem space)
Physical branch –
It includes physical objects, medical devices, and sophisticated robots for delivery of care
(carebots) or for surgery.
Figure 1.2: Use cases of bots in AI
3. Applications of Artificial intelligence in Healthcare
• AI can assist physicians in making better clinical decisions with high accuracy, and can
replace human judgement in certain functional areas of healthcare (e.g., radiology).
Updating a knowledge base is also far less tedious than the learning required of human beings.
• Chatbots are available 24x7, so patients have access to healthcare facilities whenever they
want.
• Early diagnosis - AI-based chatbots are capable of diagnosing some fatal diseases such as
breast cancer, heart disease and diabetes at an early stage.
• AI chatbots can predict the outcomes of a disease as well as its treatment.
• Treatment of some rare diseases like Parkinson’s disease, a disorder of the brain that
results in stiffness, shaking and trouble carrying out simple tasks like balancing,
coordinating and walking.
• Reduced diagnostic, therapeutic and human errors.
• Increased patient safety and large cost savings associated with the use of AI.
• An AI system extracts useful information from a large patient population, which ultimately
makes the system more intelligent; no human can retain as much knowledge as is stored in
an AI-based system.
1.3 FUTURE SCENARIO FOR INDIA
• The increased use of healthcare chatbots is expected to lead to collaboration between
medical and technical institutions in the near future.
• Stop working in silos – Working in silos means people, teams or companies working
towards the same objective, often in close vicinity, but not sharing information. This leads
to wasted time and cost, not to mention missed opportunities. Developing chatbots in the
healthcare sector can help break down these human silos.
• Government funding can be used to make more intelligent and result-oriented chatbots.
• Current status of medical records
▪ Incommunicable silos of information, wasted for the health system and for knowledge
acquisition. Laboratories and clinics need to collaborate to accelerate the
implementation of electronic health records.
▪ Data need to be captured in real time, and institutions should promote their
transformation into intelligible processes.
▪ New scientific and clinical findings should be shared through open source, and
aggregated data must be displayed for open access by physicians and scientists and made
automatically available as point-of-care information.
• Integration and interoperability concerns, including ethical, legal and logistical ones, are
enormous.
✓ Simplification, readability and clinical utility of data sets.
✓ Each result must be questioned for its clinical applicability.
✓ Aim of increasing their clinical value and decreasing health costs.
• Electronic medical or health records
✓ Are essential tools for personalized medicine.
✓ Early detection and targeted prevention, again.
CHAPTER-2
OVERVIEW OF
HEALTHBOT
2. OVERVIEW OF HEALTHBOT
This chapter introduces healthcare chatbots and the importance of a chatbot as an alternative
to a medical professional. The subsequent sections present snapshots of some use cases of an
existing healthcare chatbot, followed by the limitations of chatbots in the medical sector.
2.1 CHATBOTS IN HEALTHCARE INDUSTRY

Some of the prominent uses of chatbots include personal counselling, medical test booking,
doctor feedback, home healthcare services, medicine ordering, and doctor or hospital
appointment scheduling.
1. Healthcare Chatbot
Although healthcare was not the first sector in which experiments with chatbots have been
carried out, since the beginning of 2018 we have seen the emergence of and experimentation
with many different use cases in this field. The chatbots thus try to handle several needs, such as
personalized medical follow-up, communication and transmission of test results, dissemination
of information, or even advice to patients or preliminary diagnosis. It is in this context and based
on the project initiated by Sanofi, in partnership with Orange Healthcare and Kap Code, that we
are exploring in this white paper some practical cases of healthcare chatbots and the specificities
of the healthcare sector. The white paper also includes our proposals for evaluating user
perception of these new digital tools.
2. Proposing Chatbot as an Alternative System

The use of chatbots has spread from consumer customer service to matters of life and death.
Chatbots are entering the healthcare industry and can help solve many of its problems. A chatbot
is a computer program designed to carry on a dialogue with people, particularly on the Internet.
It assists individuals via text messages within websites, applications or instant messaging, and
enables businesses to attract, keep and satisfy clients. Such a bot is an automated system for
communicating with users. Some chatbots can answer questions such as the following: “How
long is someone infectious after a viral infection?” “How can I get a prescription?” “How can I
find out my blood type (blood group)?” Clinics that build a chatbot for their sites thereby lower
the number of repetitive calls that their specialists have to answer. This, in turn, enables hospital
employees to concentrate on more significant tasks, which leads to better healthcare service
quality. The proposed system will not only provide personal assistance to patients but will also
let users keep their previous medical records on the platform for future use. The platform will
provide a conversational experience to patients, as if a doctor were treating them online.
2.2 USE CASES IN HEALTHCARE
This section explains the working of an existing healthcare chatbot. Working of the chatbot is as
follows:
1. Checking Symptoms
Plugging a collection of symptoms into a search engine can yield unclear or unnecessarily
alarming results. Chatbots can ask clarifying questions and factor in personal details before
offering advice. They can also identify when a person might need urgent care and pass along chat
transcripts to providers so that patients don't have to repeat themselves.
Figure 2.2.1 Chatbot checking symptoms
2. Finding health services
Finding health services that are close by and in your care network can be difficult. Chatbots can
personalize their responses based on account information and use location data to find the
nearest relevant services.
Figure 2.2.2 Suggesting healthcare service
3. Medication Guidance
Chatbots aren't replacements for pharmacists but they can be handy for sharing basic drug
information and reminding patients when to take their medication. Chatbots can interact over
web, social, SMS, and even through your mobile app so your customers will always see the
reminder.
Figure 2.2.3 Chatbot suggesting Medication
4. Book an appointment
Getting time with your practitioner is typically done through a phone call. But with demand for
digital options increasing, a chatbot that can book appointments might be just what the doctor
ordered. They can hook into your existing scheduling tools or, if you already have online
appointment booking, host that service inside the chat window.
Figure 2.2.4 Booking appointment using Chatbot
2.3 CHALLENGES AND LIMITATIONS
1. Obstacle for AI chatbot in the Future

One of the main hurdles for AI would be its adoption. Healthcare professionals would have to
be educated about the need for AI. They should also be made comfortable working in an
environment where AI is present. Many doctors would not be open to information provided
by a machine, and they would need to be educated to accept AI. Compliance and FDA
regulations can be another major problem. Currently, with AI being only partially understood,
how much importance should be given to AI is also a question on the minds of FDA
personnel.
2. Difficulties in healthcare AI adoption

The industry is receptive to new ways to improve diagnostics, patient care, and financial
efficiencies. However, AI healthcare companies contend with some significant challenges
with regard to widespread AI adoption in healthcare.
• Case study conundrum
• Black box issue
• Stakeholder complexities
• Current trends
3. Other challenges and limitation

Other challenges and limitations include the near impossibility of giving a machine human
intelligence, time constraints, the need for adequate knowledge representation, the need for very
specific keywords, technological limitations of AI, medical limitations, ethical challenges, the
need for better regulations, misconceptions and overhyping, and human rejection.
4. Data safety and privacy and risk
The Ministry of Health and Family Welfare is working on sector-specific legislation, tentatively
called the Healthcare Data Privacy and Security Act. In 2016, the hacking of a Mumbai-based
diagnostic laboratory's database led to the leaking of medical records (including HIV reports of
over 35,000 patients). Hackers can exploit AI solutions to collect private and sensitive
information such as electronic health records.
5. Common vulnerabilities addressed in chatbot
• Man-in-the-middle
• Chat log stored on user device
• Encryption of messages in transit
• Encryption of data at rest
• Use of external NLP services
• Logging and access rights
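To make the man-in-the-middle concern above more concrete, the following minimal sketch shows how a receiver could detect that a chat message was altered in transit by checking an HMAC tag. This covers message integrity only and is not encryption; confidentiality in transit would normally rely on TLS. The shared key, the sample messages and the function names are hypothetical, and the sketch is not part of the project's implementation.

```python
import hashlib
import hmac


def sign_message(key: bytes, message: str) -> str:
    """Compute an HMAC-SHA256 tag for a chat message using a shared secret key."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_message(key: bytes, message: str, tag: str) -> bool:
    """Check the tag with a constant-time comparison to avoid timing leaks."""
    expected = sign_message(key, message)
    return hmac.compare_digest(expected, tag)


# Hypothetical shared key; a real system would load it from secure storage.
shared_key = b"demo-key-not-for-production"

tag = sign_message(shared_key, "I have a fever")
print(verify_message(shared_key, "I have a fever", tag))     # unchanged message
print(verify_message(shared_key, "I have no fever", tag))    # altered message
```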
CHAPTER-3
LITERATURE REVIEW
3. LITERATURE REVIEW
A chatbot in healthcare is a system which helps users learn about their disease, suggests
treatment related to the disease, or gives information about nearby healthcare centers in a
cost-effective and efficient manner. Most researchers have used techniques such as NLP and
ML to predict the disease, but differences arise in the choice of machine learning algorithms
and in some novel functionalities. The research work draws on verified journals and research
papers that are SCI or Scopus indexed. The research showed that there are various techniques
to build, train and deploy a chatbot; some of the analyses are listed below.
3.1 SURVEY OF EXISTING MODELS

While surveying around 10 articles, we learnt about the various technologies used in
healthcare chatbots. We came across the different algorithms and methodologies used by
researchers, and based on their use cases and the challenges and difficulties they faced while
developing their systems, we planned our project so as not to repeat the mistakes some
researchers made in the early stages of development. From the already developed projects we
also learnt how various Machine Learning algorithms such as Random Forest, KNN, Decision
Tree and SVM are used, and which NLP techniques are used in speech-to-text and
text-to-speech conversion.
[1] This paper offers a solution based on a microservices architecture for chronic patient
support and eHealth functionalities; a virtual assistant was developed based on the most
common diseases. Novel functionality such as speech recognition was still to be added to the
project.
[2] In this paper the researchers analyzed topics such as understanding the use of chatbots in
healthcare, AI hesitancy and motivations for using healthcare chatbots, and they also raised
issues regarding the accuracy and security of chatbots. The drawback of the paper was that the
researchers did not focus on any particular population and only explored general views on
healthcare chatbots.
[3] The developers were mainly concerned about the unavailability of doctors and healthcare
services during COVID-19, so they developed an AI-based chatbot that provides medical
consultation to the end user. The bot consisted of two major modules: one extracts information
from the user through voice signals, and the other provides a medicinal remedy by extracting
information from the user query through tokenization. One of the issues with this model was
data authenticity, as the sources of data were not specified. Including Deep Learning concepts
might increase the accuracy and efficiency of the model.
[4] This project aims at providing basic consultation to a user before consulting a doctor.
The chatbot identifies the symptoms and categorizes them as major or minor; if a symptom is
major, the chatbot suggests that the user consult a doctor. NLP and a decision tree algorithm
were used by the developers to provide the diagnosis.
[5] The chatbot was based on supervised learning, using NLP and a Decision Tree algorithm.
It provided a diagnosis based on the symptoms entered by the user. It can also connect the user
to a doctor, and if the doctor is unavailable, the chatbot provides a preliminary consultation.
The disadvantage of this model is that it worked with only a limited number of diseases, and
its accuracy is low for uncommon diseases.
[6] In this paper the researchers proposed a virtual assistant that can measure infection
severity and connect the patient to a doctor if the situation becomes serious. The chatbot can
also check whether the user is suffering from COVID-19. If so, it tells the user to consult a
doctor; if not, it provides basic safety measures the user should follow in order to stay safe.
[7] The paper proposed a model in which the chatbot asks the user for symptoms and gives a
diagnosis based on its analysis. The Java language, NLP and ML algorithms were used. The
main drawback of the system was that the developers did not compare the accuracy of various
ML algorithms; they simply finalized the first algorithm they checked.
[8] This project focused on the physical fitness of the user. It asks the user to enter their
height and weight, calculates the user's BMI from them, and identifies whether the user is
underweight or overweight. The chatbot can also provide a diet plan to the user. It uses NLP
and mainly focuses on morphology. The drawback of this system is that the input from the
user is not in sequential order, which may lead to incorrect response collection.
[9] This system was designed in Python and suggests a medical diagnosis using a direct
question-and-answer approach. The developers extracted data from different standard websites
to build their knowledge base. The entire project was deployed on the Telegram app. The
drawback of the system was that it was not protected against false positives, i.e. cases of
falsely suggesting a disease.
[10] A medical chatbot that provides diagnosis and remedies based on the symptoms
provided to the system. The system is able to measure the seriousness of the diagnosis
and, if needed, connect the user to a doctor available online. The limitation of the project
is its accuracy of only 56.6%, which is quite low.
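The BMI check described for the fitness chatbot in [8] can be sketched as follows. The surveyed paper's actual code is not available to us, so this is only an illustration: it uses the standard definition of BMI (weight in kilograms divided by the square of height in metres) and the commonly used WHO cut-offs of 18.5 and 25. The function names are hypothetical.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0 or weight_kg <= 0:
        raise ValueError("height and weight must be positive")
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Classify a BMI value using the common WHO cut-offs (18.5 and 25)."""
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "normal"
    return "overweight"


value = bmi(70, 1.75)
print(f"BMI {value:.1f}: {bmi_category(value)}")
```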
CHAPTER – 4
AIM, OBJECTIVES &

METHODOLOGY
4. AIM, OBJECTIVES & METHODOLOGY
This chapter is divided into three sections. The first section defines the problem statement
we have come up with to tackle the medical challenges faced by people. The second section
briefly presents the proposed solution for the given problem statement. The third section
explains our proposed methodology in detail.
4.1 PROBLEM STATEMENT
People in rural areas, especially in India, face many challenges such as expensive medical
care, lack of infrastructure and the absence of doctors. They have to travel long distances to
get medical assistance. There are many more such challenges which put human lives at risk.
To overcome this, we came up with a problem statement stated as “Instant Access
to Healthcare using AI - Voice Enabled Chat Bot”.
4.2 PROPOSED SOLUTION
For the given problem statement, we propose an “AI - Healthcare Chatbot” which will
provide an instant solution.
• The chatbot will provide a diagnosis to the user based on the symptoms they provide.
• The chatbot will assist users in emergency situations. For example, if the user’s symptoms
lead to a diagnosis of severe chest pain or heart attack, the chatbot will suggest seeking
medical attention right away.
• The chatbot will also offer solutions for non-severe medical issues, for example
suggesting gargling when the diagnosis is a common cold.
• The chatbot will also provide details of the medication to be taken for the diagnosed issue.
• In a place like India, where many people are more comfortable with Hindi, the chatbot
will support interaction in Hindi. This will make the chatbot easier to use.
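The emergency and non-severe cases listed above can be sketched as a simple rule applied after a diagnosis has been predicted. This is a minimal illustration rather than the final project code: the sets and dictionaries below, the reply texts and the function name `respond` are hypothetical placeholders, with only the chest pain, heart attack and common cold examples taken from the list above.

```python
# Hypothetical set of diagnoses treated as emergencies.
EMERGENCY_DIAGNOSES = {"severe chest pain", "heart attack"}

# Hypothetical home-care suggestions for non-severe diagnoses.
HOME_CARE = {"common cold": "Try gargling."}


def respond(diagnosis: str) -> str:
    """Choose the chatbot's reply for a predicted diagnosis."""
    key = diagnosis.strip().lower()
    if key in EMERGENCY_DIAGNOSES:
        return "Please seek medical attention right away."
    if key in HOME_CARE:
        return HOME_CARE[key]
    return "Please consult a doctor for further advice."


print(respond("Heart Attack"))
print(respond("common cold"))
```

Checking the emergency set first ensures that an urgent case is never answered with a home-care suggestion.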
4.3 METHODOLOGY
In this section, we will see the complete project structure, i.e. the chatbot architecture in
detail. We will discuss the working and importance of each phase and components of the
architecture.
1. Chatbot Architecture
Figure 4.3.1: Chatbot Architecture
Figure 4.3.1 shows the complete architecture of the chatbot, designed as per the proposed
solution discussed in the previous section. The components of the architecture are:
• Users
• Messaging Platform
• Voice Agent (Microphone)
• Speech Recognition (Speech to text)
• Natural Language Processing (NLP)
• Machine Learning (ML)
• Database
2. Phases and their Working
The working of the designed architecture can be divided into three phases:
• Interaction with user
This phase deals with the users, messaging platform and speech recognition components of
the chatbot. It focuses on the conversation with the user. Using the messaging platform (the
GUI of the chatbot), the user can interact with the chatbot either by voice message or by
typing a text message. For voice input, the chatbot converts the voice message into text for
further processing. Text input is passed directly to the NLP component of the architecture.
• Processing the Query
This phase deals with the NLP component of the chatbot, which performs text
preprocessing. The input goes through various NLP techniques such as tokenization,
stemming and removal of stopwords to clean the input data. The output of this phase is the
extracted keywords, i.e. the symptoms, which are passed to the next phase.
• Predicting the output
This phase deals with the ML and database components of the chatbot. The extracted
symptoms are fed to the machine learning algorithms, which predict the corresponding
disease. The ML model is built using the training and testing datasets so that it can provide
accurate results.
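The query-processing phase described above (tokenization, stopword removal, stemming and symptom extraction) can be sketched with the standard library alone. This is a simplified illustration and not the project's implementation: the stopword list and symptom vocabulary are small hypothetical samples, and the `stem` function is a crude suffix stripper standing in for a real stemmer such as the Porter stemmer. The input is assumed to be text already, i.e. any voice message has been converted beforehand.

```python
import re

# Small illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"i", "have", "a", "an", "the", "and", "am", "my", "is", "of", "with"}

# Hypothetical symptom vocabulary, stored in stemmed form.
KNOWN_SYMPTOMS = {"fever", "cough", "headach", "vomit"}


def tokenize(text: str) -> list:
    """Split the input into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem(word: str) -> str:
    """Crude suffix stripping (assumption: stands in for a proper stemmer)."""
    for suffix in ("ing", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_symptoms(text: str) -> list:
    """Tokenize, drop stopwords, stem, and keep only known symptom stems."""
    tokens = [t for t in tokenize(text) if t not in STOPWORDS]
    stems = [stem(t) for t in tokens]
    return [s for s in stems if s in KNOWN_SYMPTOMS]


print(extract_symptoms("I have a headache and I am coughing with fever"))
```

The list returned by `extract_symptoms` corresponds to the extracted keywords that are handed to the prediction phase.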
3. Modules
The project is divided into three modules:
• Natural Language Processing (NLP)
• Machine Learning (ML)
• Database (Datasets)
Each module is discussed in detail in the coming chapters.
CHAPTER – 5
NATURAL LANGUAGE
PROCESSING
5. NATURAL LANGUAGE PROCESSING
In this chapter, we study Natural Language Processing and how it works. We then look at
some NLP techniques which are used for text pre-processing. Finally, our project
implementation related to NLP is explained with the help of some code snippets.
5.1 INTRODUCTION TO NLP
In this section, we will study what Natural Language Processing actually is, how it is useful
in different areas, and how it works. We have used NLP in our project to extract data using
its text pre-processing abilities.
1. What is NLP?
Natural Language Processing (NLP) is a field of Artificial Intelligence (AI) that makes
human language intelligible to machines. NLP combines the power of linguistics and
computer science to study the rules and structure of language, and create intelligent systems
(run on machine learning and NLP algorithms) capable of understanding, analyzing, and
extracting meaning from text and speech.
2. What is NLP used for?
NLP is used to understand the structure and meaning of human language by analyzing
different aspects like syntax, semantics, pragmatics, and morphology. Then, computer science
transforms this linguistic knowledge into rule-based, machine learning algorithms that can
solve specific problems and perform desired tasks.
NLP methods are extremely valuable for sentiment analysis, which recognizes the sentiment
expressed in online posts and comments. Business firms use NLP methods to learn customers’
opinions about their products and services from online reviews.
NLP also makes automatic summarization more efficient. Automatic summarization is important
not just for summing up the significance of documents and data but also for understanding
their emotional implications, for example when gathering information from social media.
3. How does NLP work?
Figure 5.1: NLP working
NLP tools pre-process the input data and convert it into something that a machine can
understand. The outcomes are then fed to machine learning algorithms, which train machines to
make associations between a particular input and its corresponding output. Figure 5.1 gives a
step-wise overview of how NLP works.
In our project, NLP is used to understand the user’s input and extract its key features, i.e.
symptoms, so that they can be fed to machine learning algorithms to predict the corresponding
disease.
5.2 NLP TECHNIQUES
In this section, we will see the various NLP techniques which can help in preprocessing the
input text. We discuss only those techniques which we use in our project to extract the
symptoms from the user’s input text.
1. Tokenization
Tokenization is an essential task in natural language processing used to break up a string of
words into semantically useful units called tokens. Sentence tokenization splits sentences
within a text, and word tokenization splits words within a sentence. Generally, word tokens
are separated by blank spaces and sentence tokens by full stops.
An example of how word tokenization simplifies text:
Sentence: “I have a fever”
After word tokenization: ‘I’, ‘have’, ‘a’, ‘fever’
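A minimal sketch of both kinds of tokenization, using only Python's standard re module rather than an NLP library. The function names are our own, and the splitting rules (whitespace after a full stop, question mark or exclamation mark for sentences; runs of letters, digits and apostrophes for words) are simplifying assumptions that will not handle abbreviations or other edge cases.

```python
import re


def sentence_tokenize(text):
    """Split text into sentences at whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_tokenize(sentence):
    """Return runs of letters, digits and apostrophes as word tokens."""
    return re.findall(r"[A-Za-z0-9']+", sentence)


print(word_tokenize("I have a fever"))
```

The final line reproduces the example above: the sentence is split into the four tokens I, have, a and fever.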
2. Lemmatization & Stemming
Both techniques aim to reduce the different forms of a word to a common base form.
Stemming usually refers to a crude heuristic process that chops off the ends of words in the
hope of achieving this goal correctly most of the time, and often includes the removal of
derivational affixes.
Lemmatization usually refers to doing things properly with the use of a vocabulary and
morphological analysis of words, normally aiming to remove inflectional endings only and to
return the base or dictionary form of a word, which is known as the lemma.
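The contrast between the two can be illustrated with a toy example. The suffix list and the small lemma dictionary below are hypothetical and far simpler than a real stemmer or lemmatizer; they only show that suffix chopping can produce non-words, while a vocabulary lookup returns dictionary forms.

```python
# Illustrative suffix list; a real stemmer applies many more rules.
SUFFIXES = ("ing", "ed", "es", "s")

# Tiny hypothetical vocabulary mapping inflected forms to their lemma.
LEMMA_TABLE = {"feet": "foot", "was": "be", "coughing": "cough", "aches": "ache"}


def crude_stem(word):
    """Chop the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Look the word up in the vocabulary; unknown words are returned unchanged."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)
```

With these definitions, crude_stem("aches") yields the non-word "ach", whereas lemmatize("aches") yields "ache"; irregular forms such as "feet" are only handled by the dictionary lookup.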
3. Stop word Removal
Removing stop words is an essential step in NLP text processing. It involves filtering out
high-frequency words that add little or no semantic value to a sentence, for example, which,
to, at, for, is, etc. You can even customize lists of stop words to include words that you want
to ignore.
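Stop word removal amounts to filtering tokens against a list. The sketch below assumes a small hand-written stop word list (real lists are much longer) and shows how the list can be extended with custom words, as mentioned above; the names are our own.

```python
# Small illustrative stop word list; library-provided lists are far longer.
STOP_WORDS = {"i", "a", "an", "the", "have", "which", "to", "at", "for", "is", "and", "my"}


def remove_stop_words(tokens, extra_stop_words=()):
    """Drop tokens that appear in the default or the caller-supplied stop list."""
    blocked = STOP_WORDS | {w.lower() for w in extra_stop_words}
    return [t for t in tokens if t.lower() not in blocked]


print(remove_stop_words(["I", "have", "a", "fever"]))
```

For the tokenized sentence from the previous example, only the token fever survives, which is the symptom keyword we want to keep.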
4. Bag of Words & TF-IDF
A bag-of-words model is a way of extracting features from text for use in modeling, such as
with machine learning algorithms. It represents each document by the number of times each
vocabulary word occurs in it, ignoring word order.
TF-IDF stands for “Term Frequency - Inverse Document Frequency”. It is a technique to
quantify a word in documents: each word is given a weight which signifies its importance in
the document and in the corpus.
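Both representations can be computed by hand for a tiny corpus. The sketch below uses the plain textbook definition, weight = (count / document length) × log(N / document frequency), and assumes that documents are non-empty lists of tokens. Libraries often use smoothed variants of this formula, so their numbers may differ from these.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return the sorted vocabulary and one count vector per document."""
    vocab = sorted({word for doc in docs for word in doc})
    counts = [Counter(doc) for doc in docs]
    return vocab, [[c[word] for word in vocab] for c in counts]


def tf_idf(docs):
    """Unsmoothed TF-IDF: term frequency times log(N / document frequency)."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    weights = []
    for doc, row in zip(docs, counts):
        weights.append(
            [(row[j] / len(doc)) * math.log(n_docs / doc_freq[j]) for j in range(len(vocab))]
        )
    return vocab, weights


docs = [["fever", "cough"], ["fever", "rash"]]
vocab, counts = bag_of_words(docs)
_, weights = tf_idf(docs)
```

In this two-document corpus the word fever occurs in every document, so its TF-IDF weight is zero, while cough and rash, which each occur in only one document, receive positive weights.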
5.3 IMPLEMENTATION
In this section, we will see the implementation of all the NLP methods discussed above,
together with the speech recognition part for voice input.
For speech recognition, we have implemented Python code that takes voice input from the
user’s microphone and converts it into the corresponding text. The code snippet for speech
recognition is shown in figure 5.3.1:
Figure 5.3.1: Speech recognition code
For text pre-processing, we have used various NLP techniques like tokenization, stemming,
lemmatization and removal of stop words. The code snippet is shown in figure 5.3.2:
Figure 5.3.2: Text Pre-processing code
To identify the word importance in the user’s input, we have implemented two more NLP
methods, Bag of Words and TF-IDF. Using these methods, we can get a numerical value
which tells the importance of each word present in the corpus.
Figure 5.3.3: Output of NLP Methods
We have tested these methods on 2 statements. The snippet of the output of these methods is
shown in figure 5.3.3.
CHAPTER – 6
MACHINE LEARNING
6. MACHINE LEARNING
This chapter introduces the basic concepts of machine learning, as well as different
algorithms mainly used for disease prediction. The subsequent sections contain screenshots
of the code as well as the output of the algorithms selected for the project.
6.1 INTRODUCTION TO ML
In this section, we will learn about machine learning concepts and the history of
Machine Learning in the healthcare domain.
1. What is Machine Learning?

Machine learning is an application of artificial intelligence (AI) that provides systems the
ability to automatically learn and improve from experience without being explicitly
programmed. Machine learning focuses on the development of computer programs that can
access data and use it to learn for themselves.
The process of learning begins with observations or data, such as examples, direct experience,
or instruction, in order to look for patterns in data and make better decisions in the future based
on the examples that we provide. The primary aim is to allow computers to learn automatically
without human intervention or assistance and adjust their actions accordingly.
With the classic algorithms of machine learning, text is treated as a sequence of keywords,
whereas an approach based on semantic analysis mimics the human ability to understand the
meaning of a text. Machine learning algorithms are often categorized as supervised or
unsupervised.
• Supervised machine learning algorithms apply what has been learned in the past to new
data, using labeled examples to predict future events. Starting from the analysis of a known
training dataset, the learning algorithm produces an inferred function to make predictions
about the output values. After sufficient training, the system is able to provide targets for
any new input. The learning algorithm can also compare its output with the correct, intended
output and find errors in order to modify the model accordingly.
• In contrast, unsupervised machine learning algorithms are used when the information used
to train is neither classified nor labeled. Unsupervised learning studies how systems can
infer a function to describe a hidden structure from unlabeled data. The system doesn’t
figure out the right output, but it explores the data and can draw inferences from datasets
to describe hidden structures.
• Semi-supervised machine learning algorithms fall somewhere in between supervised and
unsupervised learning, since they use both labeled and unlabeled data for training, typically
a small amount of labeled data and a large amount of unlabeled data. Systems that use this
method are able to considerably improve learning accuracy. Usually, semi-supervised learning
is chosen when acquiring labeled data requires skilled and relevant resources, whereas
acquiring unlabeled data generally doesn’t require additional resources.
• Reinforcement machine learning is a learning method in which an agent interacts with its
environment by producing actions and discovering errors or rewards. Trial-and-error search
and delayed reward are the most relevant characteristics of reinforcement learning. This
method allows machines and software agents to automatically determine the ideal behavior
within a specific context in order to maximize their performance. Simple reward feedback is
required for the agent to learn which action is best; this is known as the reinforcement
signal.
2. History of Machine Learning in Healthcare

Research in the 1960s and 1970s produced the first problem-solving program, or expert system,
known as Dendral. While it was designed for applications in organic chemistry, it provided
the basis for a subsequent system, MYCIN, considered one of the most significant early uses
of artificial intelligence in medicine. MYCIN and other systems such as INTERNIST-1 and
CASNET did not achieve routine use by practitioners, however.
The 1980s and 1990s brought the proliferation of the microcomputer and new levels of
network connectivity. During this time, there was recognition by researchers and developers
that AI systems in healthcare must be designed to accommodate the absence of perfect data
and build on the expertise of physicians. Approaches involving fuzzy set theory, Bayesian
networks, and artificial neural networks, have been applied to intelligent computing systems in
healthcare.
Medical and technological advancements over this half-century period that have enabled the
growth of healthcare-related applications of AI include:
• Improvements in computing power, resulting in faster data collection and data processing
• Widespread implementation of electronic health record systems
• Improvements in natural language processing and computer vision, enabling machines
to replicate human perceptual processes
• Enhanced precision of robot-assisted surgery
• Improvements in deep learning techniques and data logs in rare diseases
6.2 RESEARCH ON ML ALGORITHMS
Machine learning can be introduced as a scientific discipline that focuses on how computers
learn from data and continuously improve themselves. It is mainly based on probability and
statistics, but it can be more powerful than standard statistical methodologies when it comes
to decision making. The pieces of information extracted from the dataset and given to the
algorithm are called features. The accuracy of the predictions made by the model depends on
the quality of the features provided to the algorithm. It is the duty of a machine learning
developer to find the subset of features that best fits the purpose and so increases the
accuracy of the model. This is not an easy task; repeated experiments are needed to identify
this feature subset. When applying a machine learning algorithm, there are basically three
steps to follow: training, testing, and validation. Training is important because the accuracy
of the results depends on the training dataset. The performance of the algorithm is measured
using the test dataset. During testing, it is also important to keep both the bias and the
variance low; a good machine learning algorithm must optimize the bias-variance trade-off.
The final evaluation of the algorithm is done on the validation dataset in the validation
period. As a start, it helps to have an idea of the various approaches taken in machine
learning, along with several algorithms that are used extensively for clustering and
classification purposes in machine learning.
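The three datasets mentioned above are usually obtained by splitting one shuffled dataset. The sketch below is a generic illustration with assumed proportions (70% training, 15% testing, the remainder for validation) and a fixed seed for repeatability; the function name and the proportions are our own choices, not values used in the project.

```python
import random


def three_way_split(rows, train=0.7, test=0.15, seed=0):
    """Shuffle a copy of the rows and cut it into training, testing and validation parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train)
    n_test = int(len(rows) * test)
    return rows[:n_train], rows[n_train:n_train + n_test], rows[n_train + n_test:]


train_rows, test_rows, validation_rows = three_way_split(range(100))
```

Every row ends up in exactly one of the three parts, so no example used for training is also used to measure performance.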
• Supervised Learning
In supervised learning, a training set is provided together with the appropriate targets.
Classification and regression are the two categories found in supervised learning. In
classification, the trained system allocates inputs into classes. In regression, the outputs
are continuous rather than discrete. The root-mean-squared error is used to evaluate
regression predictions, while accuracy is used to evaluate classification predictions.
Supervised learning has the goal of predicting a known output based on a common dataset.
Tasks performed by supervised learning can most of the time be performed by a trained person
as well. Supervised learning focuses on classification, which involves choosing among
subgroups to best describe a new instance of data, and prediction, which involves estimating
an unknown parameter. It is often used to estimate and model risk and to find relationships
which are not readily visible to humans. Below are a few supervised learning algorithms which
are widely used in the field of computational biology and biomedicine.
K-Nearest Neighbour (KNN)
KNN is a popular supervised classification algorithm which is used in many fields such as
pattern recognition, intrusion detection, and so on. It is simple and easy to understand.
Although its accuracy can be high, it is computationally expensive at prediction time and has
a high memory requirement, because the entire training dataset must be stored and searched
for every query. A prediction for a new instance is obtained by first finding the most
similar training instances and then summarizing their output variable. For regression, this
can be the mean value, and for classification, the mode. A distance measure is used to
determine which instances are similar; Euclidean distance is the most popular choice. The
training dataset should consist of vectors in a multidimensional feature space, each with a
class label.
Support Vector Machine (SVM)
SVM is a supervised machine learning algorithm which is used mainly for classification
problems but also for regression. In this algorithm, the data items are first plotted as
points in an n-dimensional space, with each feature value being a particular coordinate.
Then, it identifies the hyperplane that separates the datapoints into two classes, chosen so
that the marginal distance between the decision hyperplane and the instances closest to the
boundary is maximized [5]. What brings SVM ahead of other algorithms is that it has kernel
(basis) functions that can map points to other dimensions by using nonlinear relationships.
As it divides the datapoints into two classes, SVM is also known as a non-probabilistic
binary classifier. SVM is often more accurate than many other algorithms, but it is best
suited for problems with small datasets, because training becomes more complex and time
consuming as the dataset grows. It also tends to perform poorly on noisy data. To make the
classification more efficient, SVM uses only a subset of the training points (the support
vectors). SVM is capable of solving both linear and nonlinear problems; a nonlinear SVM is
generally preferred when the data are not linearly separable.
Decision Trees (DTs)
DT is a supervised algorithm which has a tree-like model where decisions, possible
consequences, and their outcomes are being considered. Each node carries a question, and each
branch represents an outcome. The leaf nodes are class labels. When a leaf node is being
reached by a sample data, the label of the corresponding node will be assigned to the sample.
This approach is suited when the problem is simple and when the dataset is small. Even though
the algorithm is easy to understand, it has certain issues such as the overfitting problem and
biased outcomes when working with imbalanced datasets. But DT is capable of mapping both
linear and nonlinear relationships.
Classification and Regression Trees (CARTs)
CART is a predictive model in which the output value is predicted based on the existing
values in the constructed tree. The CART model is represented as a binary tree in which each
internal node represents a single input variable and a split point on that variable. Leaf
nodes contain an output value which is used to make predictions.
Logistic Regression (LR)
LR is a popular mathematical modeling procedure which is used for epidemiologic datasets in
the area of machine learning. It models the probability of a class by passing a linear
combination of the inputs through the logistic function, learns the coefficients of this
model from the training data, and finally makes predictions using the fitted model. The model
is a generalized linear model and has two parts, namely the linear part and the link
function. The linear part is responsible for carrying out the calculations of the
classification model, and the link function is responsible for delivering the output of the
calculation. LR is a supervised machine learning algorithm which needs a hypothesis and a
cost function; the coefficients are learned by optimizing the cost function.
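The pieces named above can be written down in a few lines. The sketch assumes the standard formulation: the linear part is a weighted sum plus a bias, the logistic (sigmoid) function turns it into a probability, and log loss is used as the cost function. The function names are ours, and the coefficients are supplied by hand here rather than learned.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Hypothesis: the linear part passed through the logistic function."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def log_loss(y_true, p):
    """Cost of predicting probability p when the true label is y_true (0 or 1)."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))
```

Training consists of choosing the weights and bias that minimize the average log loss over the training set, for example by gradient descent, which is not shown here.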
Random Forest Algorithm (RFA)
RFA is a popular machine learning technique which is capable of both regression and
classification. It is a supervised ensemble learning algorithm: a group of decision trees is
created, each grown by recursive splitting, and the bagging method is used for training. RFA
is relatively insensitive to noise and can be used for imbalanced datasets. Overfitting is
also less of a problem in RFA than in a single decision tree.
Naive Bayes (NB)
NB is a classification algorithm which is used for binary and multiclass problems. The NB
classifiers are a collection of classification algorithms that are based on Bayes’ theorem.
They all share a common principle: every feature is assumed to be independent of every other
feature, given the class. This is a bit similar to SVM, but the process takes advantage of
statistical methods. When there is a new input, a probability is calculated for each class
given that input, and the input is labeled with the class which has the highest probability.
• Unsupervised Learning
When a developer does not have a clear understanding of the data that are involved with the
system, it is not possible to label the data and provide them as the training dataset. In
these cases, the machine learning algorithms themselves can be used to detect similarities
and differences between the data objects. This is the unsupervised approach of machine
learning. In this method, existing patterns are identified and the data are clustered
according to the identified patterns. Therefore, in unsupervised learning, the system makes
decisions without being trained on labeled data that could be used for predictions.
Unsupervised learning is an attempt to find naturally occurring patterns or groups within
data. The challenging part is to find out whether the recognized patterns or groups are
useful in some way. This ability to reveal structure that nobody labeled in advance is why
unsupervised learning plays a major role in precision medicine. As a simple example, when
grouping individuals according to their genetics, environment, and medical history, certain
relationships among them which were not visible before might be identified by unsupervised
machine learning algorithms. K-means, mean shift, affinity propagation, density-based spatial
clustering of applications with noise (DBSCAN), Gaussian mixture modelling, Markov random
fields, iterative self-organizing data (ISODATA), and fuzzy C-means systems are a few examples
of unsupervised algorithms.
Clustering is an approach in unsupervised learning which divides inputs into clusters. The
clusters are not known in advance; they are formed on the basis of resemblance. Clustering
approaches are distinguished by the features that they carry. They can be partitioning
(k-means), hierarchical, grid-based, density-based, or model-based, and they can be further
divided by the data types they handle: numerical, discrete, and mixed. Inheritance
relationships between clustering algorithms within an approach show common features and the
improvements that they make on each other. Speed, minimal parameters, robustness to noise and
outliers, redundancy handling, and object order independence are the features desired of a
clustering algorithm that is to be used within a biomedical application. Clustering
algorithms are used when datasets are too large and complex for manual analysis. Therefore,
they must be fast and they must not be affected by redundant sequences.
Table 6.2.1: The difference between supervised learning and unsupervised learning
Learning Class | Data Type | Usage Type | Output Accuracy / Performance | Affected by Missing Data | Scalable | Cost
Supervised | Labeled | Classification, Regression | High | Yes | Yes, but we need to label large volumes of data automatically. | Expensive
Unsupervised | Unlabeled | Clustering, Transformations | Low | No | Yes, but we need to verify the accuracy of the predicted output. | Inexpensive
Table 6.2.2: Summary of the reviewed ML algorithms.
Algorithm Name | Learning Type | Used for | Positives | Negatives
K-Nearest Neighbor (K-NN) | Supervised | Classification, Regression | Nonparametric approach. Intuitive to understand. Easy to implement. Does not require explicit training. Can be easily adapted to changes simply by updating its set of labeled observations. | Takes a long time to calculate the similarity between the datasets. The performance is degraded because of imbalanced datasets. The performance is sensitive to the choice of hyperparameter (K value). The information might be lost, so we need to use homogeneous features.
Naïve Bayes (NB) | Supervised | Probabilistic classification | Scanning of data by looking at each feature individually. Collecting simple per-class statistics from each feature helps with increasing the accuracy of the assumptions. | Requires only a small amount of training data. Determines only the variances of the variables for each class.
Decision Trees (DTs) | Supervised | Prediction, Classification | Easy to implement. Can handle categorical and continuous attributes. Requires little to no data preprocessing. | Sensitive to the imbalanced dataset and noise in the training dataset. Expensive, and needs more memory. Must select the depth of the node carefully to avoid variance and bias.
Random Forest | Supervised | Classification, Regression | Lower correlations across the decision trees. Improves the DT's performance. | Does not work well on high-dimensional, sparse data.
Support Vector Machine (SVM) | Supervised | Binary classification, Nonlinear classification | More effective in high-dimensional space. Using the kernel trick is the real strength of SVM. | Selecting the best hyperplane and kernel trick is not easy.
6.3 IMPLEMENTATION
After going through several research papers, we decided to try our data on two
algorithms: Random Forest and K-Nearest Neighbor.
• Random Forest
As stated earlier, Random Forest is a classifier that, instead of relying on one decision
tree, takes the prediction from each tree and gives the final output based on the majority
vote of those predictions. A greater number of trees in the forest generally leads to higher
accuracy and reduces the risk of overfitting. Overfitting refers to the scenario where a
machine learning model cannot generalize to unseen data. It occurs when a function
corresponds too closely to the training dataset and fails to fit additional data, which may
affect the accuracy of predictions on future observations. The forest is constructed as
follows:
Step-1: Choose the number N of decision trees that we want to build.
Step-2: Select K random data points from the training set.
Step-3: Build the decision tree associated with the selected data points.
Step-4: Repeat Step-2 and Step-3 until N trees have been built.
Step-5: For a new data point, find the prediction of each decision tree and assign the point
to the category that wins the majority vote.
Another great quality of the random forest algorithm is that it is very easy to measure the
relative importance of each feature for the prediction. Sklearn provides a tool for this that
measures a feature's importance by looking at how much the tree nodes that use that feature
reduce impurity across all trees in the forest. It computes this score automatically for each
feature after training and scales the results so that the importances sum to one.
In the code shown in figure 6.3.1, we fit the Random Forest algorithm to the training set. To
fit it, we have imported the RandomForestClassifier class from the sklearn.ensemble library.
The classifier object takes the parameter n_estimators, the required number of trees in the
Random Forest. The default value was 10 in older scikit-learn releases (it is 100 since
version 0.22); we have used 100. In general, a higher number of trees increases the
performance and makes the predictions more stable, but it also slows down the computation.
Once the model is fitted to the training set, we can predict the test result. For prediction,
we have created a new prediction vector y_pred.
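The idea behind the algorithm (bootstrap samples, a random subset of features at each split, and a majority vote) can be sketched from scratch in plain Python. This is not the code of figure 6.3.1, which uses sklearn's RandomForestClassifier. It is a simplified illustration that assumes binary (0/1) symptom features and string class labels, and the toy data, function names and depth limit are all hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def majority(labels):
    """Most common label in the list."""
    return Counter(labels).most_common(1)[0][0]


def build_tree(rows, labels, n_features, rng, depth=0, max_depth=5):
    """Grow a tree on 0/1 features; leaves are labels, inner nodes are tuples."""
    if len(set(labels)) == 1 or depth == max_depth:
        return majority(labels)
    best = None
    for f in rng.sample(range(len(rows[0])), n_features):  # random feature subset
        left = [i for i, row in enumerate(rows) if row[f] == 0]
        right = [i for i, row in enumerate(rows) if row[f] != 0]
        if not left or not right:
            continue
        impurity = sum(
            len(side) * gini([labels[i] for i in side]) for side in (left, right)
        ) / len(rows)
        if best is None or impurity < best[0]:
            best = (impurity, f, left, right)
    if best is None:  # no candidate feature separates the rows
        return majority(labels)
    _, f, left, right = best
    children = [
        build_tree([rows[i] for i in side], [labels[i] for i in side],
                   n_features, rng, depth + 1, max_depth)
        for side in (left, right)
    ]
    return (f, children[0], children[1])


def predict_tree(tree, row):
    """Follow the splits until a leaf (a label) is reached."""
    while isinstance(tree, tuple):
        f, left, right = tree
        tree = left if row[f] == 0 else right
    return tree


def fit_forest(rows, labels, n_estimators=100, seed=0):
    """Train n_estimators trees, each on its own bootstrap sample (bagging)."""
    rng = random.Random(seed)
    n_features = max(1, int(len(rows[0]) ** 0.5))
    forest = []
    for _ in range(n_estimators):
        sample = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in sample],
                                 [labels[i] for i in sample], n_features, rng))
    return forest


def predict_forest(forest, row):
    """Final output: the majority vote over all trees."""
    return majority([predict_tree(tree, row) for tree in forest])


# Toy data: columns are fever, cough, rash, itching (hypothetical).
rows = [[1, 1, 0, 0]] * 4 + [[0, 0, 1, 1]] * 4
labels = ["flu"] * 4 + ["allergy"] * 4
forest = fit_forest(rows, labels, n_estimators=25, seed=1)
```

Here n_estimators plays the same role as the sklearn parameter of that name: more trees make the vote more stable at the cost of more computation.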
Figure 6.3.1: Execution of Random Forest
Figure 6.3.2: Sample input to the code
Figure 6.3.3: Output of the following code
• K-Nearest Neighbor (K-NN)
K-Nearest Neighbor is one of the simplest Machine Learning algorithms based on the
Supervised Learning technique. The K-NN algorithm assumes that similar cases belong to the
same category, and puts a new case into the category whose available cases it most
resembles. The algorithm stores all the available data and classifies a new data point based
on similarity. This means that when new data appears, it can easily be classified into a
well-suited category using the K-NN algorithm. K-NN can be used for Regression as well as for
Classification, but it is mostly used for Classification problems. K-NN is a non-parametric
algorithm, which means it does not make any assumption about the underlying data.
The working of K-NN can be explained by the following algorithm:
Step-1: Select the number K of neighbors.
Step-2: Calculate the Euclidean distance from the new data point to each point in the training data.
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these K neighbors, count the number of data points in each category.
Step-5: Assign the new data point to the category with the largest number of neighbors.
Step-6: Our model is ready.
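The steps above can be sketched in a few lines of plain Python. This is a from-scratch illustration rather than the project code of figure 6.3.4; the function names and the toy points are hypothetical, and every training point is compared with the query, which is the simple but expensive brute-force approach.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=5):
    """Classify query by majority vote among its k nearest training points."""
    # Step 2: distance from the query to every training point.
    distances = sorted(
        (euclidean(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Steps 3-5: take the k nearest, count categories, pick the largest.
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]


points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["A", "A", "A", "B", "B", "B"]
```

With this toy data, a query near the origin is assigned to category A and a query near (5, 5) to category B.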
How to select the value of K in the K-NN Algorithm?
There is no particular way to determine the best value for "K", so we need to try some values
and pick the one that performs best. A commonly used starting value for K is 5. A very low
value such as K=1 or K=2 can be noisy and makes the model sensitive to outliers. Larger
values of K smooth out this noise, but if K is too large, the neighborhood may include points
from other categories and blur the boundaries between them.
Figure 6.3.4: Execution of K-Nearest Neighbor
Figure 6.3.5: Sample input to the code
Figure 6.3.6: Output of the following code
CHAPTER-7
DATABASE
7. DATABASE
There are many changes taking place in the healthcare sector. Healthcare databases are an
important part of running its operations. A database is any record that a practitioner
maintains, in paper form or on a computer, whether the practitioner is a sole practitioner or
a corporate body. With technological innovations, medical facilities are moving towards
providing their services online.
7.1 DATA IN HEALTHCARE
The Healthcare system generates data that requires delicate handling. A patient’s life may
depend on this information, so it is important for the healthcare provider to be able to
access it in the shortest time possible and to ensure that it is correct to the best of their
knowledge.
Healthcare data is crucial and difficult to manage and handle for the following reasons –
1. Efficient management of data is important, since a lot of data has to be stored for every
single patient and there are many patients suffering from various diseases, so the database
must also be updated at regular intervals.
2. Data manipulation is a tedious task, as a healthcare database is huge and needs to be
updated every now and then.
3. Since the data is huge, it should be organized, maintained and managed in such a way
that it can be fetched or extracted in the shortest possible time and is available to the
user whenever needed.
4. Since the data relates to patients’ lives, there is no scope for mistakes in it.
5. Data security is also important, since the data is sensitive.
7.2 DATABASE DEVELOPMENT
Database development is the most important step, since the functioning of the chatbot depends
completely on data: without data, the Machine Learning algorithms, the NLP component and even
the basic functions of the chatbot will not work.
Data is required at each and every step of the chatbot. Various types of datasets have to be
created; some of them are listed below –
• Training Dataset – Data used to train the machine learning algorithm
• Testing Dataset – Completely new data used to check the accuracy of the machine learning
algorithm on inputs it has not seen before
• Question Answer Dataset – Required for basic interaction with the user
• Dialogue Datasets
Of the datasets listed above, the most important are the Training and Testing datasets,
because these are used to train and evaluate the chatbot.
To develop the datasets, the Web Scraping technique is used to extract data from various
sources available on the internet.
Web Scraping, or Web harvesting, is a technique used for extracting data from websites. Web
scraping software accesses the World Wide Web directly using the Hypertext Transfer Protocol
or through a web browser.
Web Scraping can be done in Python using the BeautifulSoup and Pandas libraries. The scraped
data can be stored as CSV, XML or JSON, as per the user’s needs.
After the data has been scraped from various sources, it is combined; this step is called
data integration.
After data integration comes the data cleaning step. Data from the internet is often not in
the required format, and it may contain unwanted characters or text, or repeated records, so
it must be cleaned and properly formatted before it is used to train the algorithms.
Once the training dataset has been created using Python, the testing dataset is also created.
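Data integration and the removal of repeated records can be sketched as follows. The example assumes that each source yields (disease, symptom) pairs; this record layout, the normalization rules and the function name are hypothetical and are not taken from the project datasets.

```python
def integrate(*sources):
    """Merge several lists of (disease, symptom) pairs, dropping blanks and repeats."""
    merged = []
    seen = set()
    for source in sources:
        for disease, symptom in source:
            # Normalize case and collapse stray whitespace before comparing.
            record = (" ".join(disease.split()).lower(), " ".join(symptom.split()).lower())
            if all(record) and record not in seen:
                seen.add(record)
                merged.append(record)
    return merged


site_a = [("Flu ", "Fever"), ("Flu", "cough")]
site_b = [("flu", "fever"), ("", "rash")]
combined = integrate(site_a, site_b)
```

The record ("flu", "fever") appears in both sources but is kept only once, and the record with a missing disease name is discarded.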
7.3 IMPLEMENTATION
To develop the Training dataset, we performed web scraping on some websites and extracted the
medical data from them. This was done in Python using the BeautifulSoup and Pandas libraries.
In the web scraping code, the class name of the data was first looked up in the inspect
section of the web page and passed as an attribute in the Python code, together with the URL
of the page from which the data is to be extracted. The contents of the table were then read
from the website through the read_html function of Pandas. If the scraped data is not present
in tabular form on the website, it can be converted into tabular form using a DataFrame.
Finally, the scraped data is exported to a file using the to_excel method (to_excel writes an
Excel workbook; the Pandas method that writes CSV directly is to_csv).
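The flow described above (locate a table in the page, read its rows, export them as CSV) is sketched below. To keep the example runnable offline and without third-party packages, it swaps in Python's standard html.parser and csv modules for BeautifulSoup and Pandas and parses an inline HTML string instead of a live URL. The sample table, its class name and the function names are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Parse the table rows out of an HTML string and return them as CSV text."""
    parser = TableParser()
    parser.feed(html)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


SAMPLE_HTML = """
<table class="symptoms">
  <tr><th>Disease</th><th>Symptom</th></tr>
  <tr><td>Flu</td><td>Fever</td></tr>
  <tr><td>Allergy</td><td>Rash</td></tr>
</table>
"""
```

With Pandas, the parsing step corresponds to read_html and the export step to to_csv; the stdlib version only makes the intermediate row structure explicit.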
Figure 7.3.1 Code for Web Scraping
Figure 7.3.2 Code for Exporting Scraped Data to CSV File
After the data was scraped, it was cleaned and formatted according to our needs using Excel
formulas and the Find and Replace option.
Excel formula to remove numbers from alphanumeric data:

=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(B3,1,""),2,""),3,""),4,""),5,""),6,""),7,""),8,""),9,""),0,"")
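For comparison, the same digit-stripping step can be written in Python with a single regular expression. This is only a sketch of an equivalent operation; the function name and the sample values are hypothetical.

```python
import re


def remove_digits(value):
    """Delete every digit 0-9 from a string, like the nested SUBSTITUTE formula."""
    return re.sub(r"[0-9]", "", value)


# Hypothetical alphanumeric cells such as those found in scraped data.
cells = ["fever12", "3headache", "chills"]
cleaned_cells = [remove_digits(cell) for cell in cells]
print(cleaned_cells)
```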
Figure 7.3.3 Snapshot of Cleaned Training.csv File
Figure 7.3.4 Snapshot of Cleaned Testing.csv
CHAPTER-8
CONCLUSION
AND
FUTURE WORK
8. CONCLUSION AND FUTURE WORK
CONCLUSION
The proposed system is designed to understand the user's query and, based on the
symptoms the user reports, give a proper diagnosis in an efficient and cost-effective
way. The main aim of the model is to provide healthcare services 24x7 to people living in
rural areas, who often lack the infrastructure, facilities and
money needed to access healthcare services.
The chatbot is expected to provide assistance in emergency situations and suggest
solutions for non-severe medical issues until the user can see or consult a
doctor.
The system is currently under development and has been tested with two
machine learning algorithms. We aim to develop a chatbot that is easy to use and
can fulfil the requirements of the patient.
FUTURE WORK
• At present we have worked with two machine learning algorithms, Random
Forest and KNN. The remaining algorithms need to be tested so that we can
select the machine learning algorithm that works best with our
dataset and provides accurate results.
• The dataset currently covers only 10 diseases and their related symptoms. In
future we will add more diseases so that the system can predict diseases
for a wider range of symptoms.
• Work on the NLP module, followed by integration and deployment of the
modules.
REFERENCES
[1] S. Roca, J. Sancho, J. Garcia and A. Alesanco, “Microservice chatbot
architecture for chronic patient support,” Journal of Biomedical Informatics, vol.
102, Oct., pp. 1–28, 2019
[2] T. Nadarzynski, O. Miles, A. Cowie and D. Ridge, “Acceptability of artificial
intelligence (AI)-led chatbot services in healthcare: A mixed-methods study,”
Digital Health, vol. 5, no. 7, Aug., pp. 331–341, 2019
[3] D. Balasubramaniam, C. Kanmanipappa, S. Velmuruganand and M. Saravanan,
“Design and Development of Smart Healthcare Chatbot Application Using AI –
ML,” Journal of Natural Remedies, vol. 21, Nov., pp. 13–19, 2020
[4] B. Kidwani and Nadesh RK, “Design and Development of Diagnostic Chabot for
supporting Primary Health Care Systems,” Procedia Computer Science, vol. 167,
pp. 75–84, 2020
[5] S.A. Kumar, C.V. Krishna, P.N. Reddy, B.RK. Reddy and I.J. Jacob, “Self-
Diagnosing Health Care Chatbot using Machine Learning,” International Journal
of Advanced Science and Technology, vol. 29, no. 5, pp. 7323–9330, 2020
[6] G. Battineni, N. Chintalapudi and F. Amenta, “AI Chatbot Design during an
Epidemic like the Novel Coronavirus,” MDPI Healthcare, pg. 8-154, 2020
[7] K Jayashree, Monika K A, Preetha R and Piraisoodan S P, “The Smart HealthCare
Prediction using Chatbot,” International Journal of Recent Technology and
Engineering (IJRTE), ISSN: 2277-3878, vol. 9, no. 2, July 2020
[8] M.S Bennet Praba, Sagari Sen, Chailshi Chauhan and Divya Singh, “Ai Healthcare
Interactive Talking Agent using Nlp,” International Journal of Innovative
Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, vol. 9,
no. 1, November 2019
[9] Nicholas A. I. Omoregbe, Israel O. Ndaman, Sanjay Misra, Olusola O.
Abayomi-Alli and Robertas Damasevicius, “Text Messaging-Based Medical Diagnosis
Using Natural Language Processing and Fuzzy Logic,” Hindawi Journal of
Healthcare Engineering, Article ID 8839524, September 2020
[10] K. Rarhi, A. Mishra and K. Mandal, “Automated Medical Chatbot,” SSRN
Electronic Journal, vol. 21, Jan 2017
ANNEXURE I
SURVEY ON HEALTHCARE CHATBOT TECHNIQUES
¹Dr. D.S. Adane, ²Simran Singh, ³Parthsarthi Pahuja, ⁴Yash Gupta
¹Head, Department of Information Technology, Shri Ramdeobaba College of Engineering and
Management, Nagpur, 440013.
²,³,⁴Students of Department of Information Technology, Shri Ramdeobaba College
of Engineering and Management, Nagpur, 440013.
Email – ¹adaneds@rknec.edu, ²singhs_3@rknec.edu, ³pahujapd@rknec.edu, ⁴guptayp@rknec.edu
*************************
Abstract – With the growing interest of individuals in health, life care, and disease, medical
institution services have been moving from treatment towards prevention and health management.
A paradigm shift in clinical services is under way because of extended life expectancy, aging,
lifestyle changes, and rising incomes, and consequently the idea of the smart health service has
emerged as a major issue. However, as the quantity of information grows and the complexity of
clinical information intensifies, earlier approaches face an increasing number of problems. With
the incoming trends in technology, AI chatbots have managed to pave their way into the healthcare
domain. Machine Learning (ML) supports healthcare professionals in a range of ways: it aids in the
analysis of hundreds of distinct data points and the prediction of outcomes, and it provides fast
risk assessments and precise resource allocation. A chatbot can be linked with the major social
network messengers, so general users can easily access and receive various health services. Many
studies address this problem with some kind of chatbot or health assistant. The focus of this
research is to gain a better understanding of chatbots that might assist individuals in receiving
the same appropriate treatment that a doctor would provide. The main AI approaches at the forefront
are machine learning and natural language processing.

Keywords – Healthcare, Chatbot, NLP, ML

INTRODUCTION

The world is witnessing an unprecedented and generation-defining pandemic that has affected
millions of people's health, finances, and vital aspects of their daily lives. AI-based chatbots
have been a significant help in flattening the curve of this pandemic. Governments and hospitals
have been inundated with queries and cases pouring in, from people infected with the virus to
others who are just concerned and taking extra precautions. To relieve the system of some of the
workload [3], chatbots can be used to assess patients, address queries, book consultations and
appointments, and handle administrative tasks such as collecting customer insurance information,
processing invoices and renewing prescriptions. Chatbots can also aid in the early diagnosis of
common diseases such as breast cancer, diabetes, coronary artery disease, and tumors, which can
help patients control these disorders and decrease the likelihood of their being fatal.

A chatbot is a software robot that can imitate natural language and engage with humans via
automated dialogues. Even prior to the onset of the pandemic, the conventional doctor-patient
relationship appears to have stayed unchanged. Chatbots are intended to be accessible at all times
of the day and have been of significant assistance in relieving the isolation of individuals
suffering from conditions such as depression and lethargy.

Traditionally, healthcare and telemedicine were centred on acquiring facts and statistics from
patients and moving them into healthcare contexts, not the other way around. Chatbots are the
right tool to make this turnaround feasible, being aligned with chronic patients' needs and
interacting with them in the same manner that patients communicate with their friends and
relatives [1]. NLP and Machine Learning are the core technologies that power chatbots. When a user
asks a chatbot a question, a set of complicated algorithms evaluates the received input,
interprets what the user is asking, and then selects the appropriate answer. Chatbots must rely on
the algorithms' ability to discern the intricacy of both text and spoken speech. The incorporation
of data is another significant factor in the development of a chatbot.

Machine Learning relies heavily on data. Without data, no model can be trained, and all of the
existing research and technology would be in vain. Large corporations are investing billions of
dollars to gather as much accurate data as possible. A few years back, Facebook, one of the
largest technology companies, paid about $19 billion to acquire WhatsApp and its user data.
Facebook capitalizes on user data to improve efficiency and generate revenue, which makes data
extremely valuable. Although the chatbot offers numerous advantages, it also has potential
drawbacks and limitations that make development more challenging. Security remains an issue of
concern when it comes to handling patient information. The adoption of AI systems in healthcare is
anticipated to be hindered by chatbots' lack of empathy and by concerns about accuracy,
reliability, and confidentiality [2].

Aside from technological constraints, the language used by the bot to communicate with the user is
also a constraint. Multilingual chatbots can help address problems in this area [15]. In the
following sections, we discuss various techniques used for building a chatbot and then present the
methods of this study. In the subsequent sections, we present the results, discussion and
conclusion of the study.

LITERATURE REVIEW

Chatbot architecture is the backbone of the chatbot. The type of architecture required depends on
factors such as the use case, the domain and the chatbot type. S. Roca et al. [1] proposed a
microservice-based architecture to unleash the full potential of chatbots, grounded on three
pillars: scalability by means of microservices, standard data sharing models
through HL7 FHIR [16] and standard conversation modeling using AIML. The architecture has two core
microservices: the Proxy, which translates the message received from the user into the internal
standard format, and the API gateway, which decides which chatbot microservice is best suited to
receive the user message. The messaging platform used is Signal, which provides end-to-end
encryption. For data storage, backups configured to an external encrypted hard disk are used. A
generic chatbot was developed along with an extension for a specific disease, psoriasis, providing
a photographic record as the main feature for skin area tracking. The objective is to provide
comfort and convenience when using medical services, since the tool can be used anywhere. This, in
turn, can reduce costs in the healthcare sector while improving the quality of life of patients
and reducing face-to-face medical consultations.

T. Nadarzynski et al. [2] provided one of the first studies exploring the acceptability of AI-led
chatbot systems for healthcare from the perspective of the general public with no pre-existing
medical conditions. It is anticipated that by 2024 every patient in England will have digital
access to primary care consultations, with a reduced need for face-to-face outpatient visits.
Semi-structured interviews and an online survey advertised via social media were conducted with
participants above the age of 18. The quantitative data were analysed using binary regressions.
The study shows that although chatbots were seen as a convenient and anonymous tool for minor
health issues that may carry a level of stigma, the lack of empathy and concerns about accuracy,
trustworthiness and privacy are likely to compromise the adoption of AI systems in healthcare.
Governmental endorsement, mass media campaigns or the personal views of social media role models
are likely to influence potential adopters and help overcome potential barriers to engagement.

D. Balasubramaniam et al. [3] proposed a clinical chatbot utilizing AI that can analyze the
infection and give preliminary precautions and medication for the ailment before the user consults
a specialist. The user sends the message by voice using the Google API. To comprehend user
queries, NLP concepts such as tokenization, TF-IDF, and n-grams are employed. A query presented to
the bot that is not understood or not present in the database is handled by a third-party expert
system. The user's voice is captured through the microphone with enhanced acoustic extraction. The
Raspberry-Pi then receives the processed voice query and sends it on to the chatbot application.
Without involving a human specialist, the chatbot provides clinical advice based on the user's
enquiries. When it comes to handling patient information, the concern of protection and security
is highlighted. The use of Deep Learning techniques might also strengthen the chatbot's
efficiency.

To move away from the conventional doctor-patient interaction and make this procedure more
efficient, B. Kidwani et al. [4] devised a system that uses a top-down approach and then generates
a diagnosis using the Decision Tree algorithm. The system covered 50 diseases and 150 symptoms,
with the chatbot receiving input in the form of a voice message. According to the results, the
system's response time ranged from 10 ms to 20 ms, depending on the number of symptoms given by
the user, and the system's accuracy was 75%.

S.A. Kumar et al. [5] presented an AI-powered medical chatbot that can identify diseases and
provide basic information about them before the user consults a doctor. The framework is built on
three primary components: user validation and symptom extraction; mapping extracted symptoms to
the training datasets; and specifying the disease and recommending a doctor. User validation is
performed using the user data stored in the SQLite database, and symptoms are retrieved using NLP
tokenization techniques; only the significant keywords
relevant to the user's health are retrieved from the provided text input. After the symptoms are
extracted from the user query, they are mapped to the symptoms present in the training dataset, a
list of symptoms is shortlisted, and the disease is predicted using the Decision Tree algorithm
and classified as a major or minor disease, which is also saved in the dataset. If the predicted
disease is a major one, the system advises the user to see a doctor right away; if it is a minor
one, the system provides the user with an appropriate diagnosis.

Upon analyzing the severity of the COVID-19 pandemic and the current availability of doctors and
health services, G. Battineni et al. [6] proposed an artificial intelligence-based chatbot that
can measure the severity of coronavirus infections. The functionality of the chatbot is defined in
two ways: request analysis or return. AIML is used to manage user requests and identify message
pattern responses. AIML uses stimulus-response mechanisms and allows for simple dialogue modeling.
To achieve the intended response, natural language processing (NLP) is used. The severity is
calculated by the chatbot using data from a predetermined questionnaire. A predefined tag pattern
in AIML is used to identify the user's response to these queries. After the responses are
collected, a threshold value (H) is calculated, which determines the severity of infection. If
H ≥ 1, the chatbot flags the infection as severe and connects the user to a doctor immediately.
This chatbot was compared with four existing chatbots, WHOnCOV-19 Bot, Fight COVID Messenger Bot,
SAJIDA Corona Bot, and Aarogya Setu, on a variety of parameters. The proposed chatbot achieved an
average score of 4.8, the highest of the five; the other bots scored 4, 3.7, 3.55 and 4.05
respectively.

K Jayashree et al. [7] stated that we spend lots of money on every minor health issue, which
impacts the efficiency of medical healthcare. Considering this, the authors proposed a smart
healthcare prediction chatbot that helps people to predict their health issues early, at home,
before they visit a doctor or hospital for minor health problems. The chatbot takes the user's
details and stores them in the database. The user can then start the conversation with the bot.
The chatbot clarifies the user's symptoms with a series of questions. The symptoms are confirmed
and the disease is categorized as minor or major. If it is a major disease, the user is given a
doctor's details for further treatment. Users can also book appointments with doctors or chat with
them by text to ask questions and state concerns regarding their health conditions. The system was
built using AIML components such as pattern matching and category classification. To shortlist the
disease, the system performs tasks such as eliminating stop words and punctuation, tokenizing
words, assigning scores to the diseases, disqualifying diseases and finally sorting the list.
Using this list, the final disease is predicted as output.

M.S Bennet Praba et al. [8] proposed a healthcare interactive talking agent that provides
information related to being underweight or overweight. The system was built using Language
Understanding Intelligence Service and NLP, with a focus on morphology (morphemes). The user
provides their height and weight as input to the system. The system calculates the Body Mass Index
(BMI) value and determines whether the user is underweight or overweight. The system then provides
the diet plan and exercises corresponding to the result.

Nicholas A. I. Omoregbe et al. [9] proposed a text-messaging based medical diagnosis system that
uses a direct question-and-answer technique to suggest the medical
diagnosis. The system was deployed on the Telegram platform as a bot to which the user sends
symptoms by text message. The authors used NLP techniques for text processing, including noise
removal, tokenization, parsing, text matching, tagging, and feature selection and extraction, and
applied fuzzy rules to the knowledge base. ML was used to provide category prediction using a
fuzzy SVM classifier. The whole system was built successfully and could suggest a diagnosis as
required.

K. Rarhi et al. [10] presented an AIML-based healthcare chatbot capable of delivering diagnoses
and solutions to the end user depending on the user's symptoms. The algorithm collects keywords by
removing stop words, tokenizing, and assigning scores, culminating in a disease shortlist. The
algorithm shortlists the top three symptoms and compares them to the user's symptoms, then
determines the severity of the disease based on the symptoms. If the problem is too serious for
the chatbot to resolve, it directly connects the user with a doctor and provides the doctor with
the user's chat history. The chatbot was assessed using General Word Percentage (GWP) analysis
combined with a terminology detection test, which showed the average percentage of times the
chatbot detects medical terminology as the amount of non-medical terminology increases. The
chatbot completed terminology detection with 56.6 percent accuracy when, on average, 73.6 percent
of the terminology in the test messages was unrelated.

Accurate analysis of medical data benefits early disease detection, patient care, and community
services. However, analysis accuracy is reduced when the medical data is incomplete. To overcome
this, M. Chen et al. [11] proposed a new convolutional neural network (CNN)-based multimodal
disease risk prediction algorithm using structured and unstructured data. They handled a variety
of data challenges: utilizing a latent component model to rebuild missing data, using statistical
knowledge to determine the most common chronic diseases, handling structured data, and consulting
hospital specialists to extract essential features. Three conventional machine learning
algorithms, Naïve Bayesian (NB), K-Nearest Neighbor (KNN), and Decision Tree (DT), are used. For
unstructured text data, features are selected automatically using a CNN algorithm. A novel
CNN-based multimodal disease risk prediction (CNN-MDRP) algorithm is used for structured and
unstructured data together. The Stochastic Gradient Descent algorithm is used to fill in missing
data. The receiver operating characteristic (ROC) curve, area under the curve (AUC), precision and
recall are used to evaluate the classifiers. For structured data, Naïve Bayesian gave the highest
accuracy, whereas the CNN-based algorithm gave the best results overall.

The field of research continues to expand, owing to the increasing relevance and efficacy of
supervised machine learning algorithms for predictive disease modeling. Little research was found
comparing different supervised learning algorithms for disease prediction. As a result, S. Uddin
et al. [12] sought to explore fundamental trends across various types of supervised machine
learning algorithms, their performance accuracies, and the diseases under examination. A total of
seven algorithms were taken into account: Logistic Regression, Support Vector Machine, Decision
Tree, Random Forest, Naïve Bayes, K-Nearest Neighbor and Artificial Neural Network. The results
show that the Support Vector Machine algorithm is applied most frequently, followed by the Naïve
Bayes algorithm. However, the Random Forest algorithm showed comparatively superior accuracy,
ranking highest in 53% of the studies in which it was applied, followed by SVM, which topped 41%
of the studies in which it was considered.

P.S Kohli et al. [13] highlighted that early detection of disease can control or reduce the
chance of diseases becoming fatal or dangerous for patients in the future. The paper presents the
use of various classification algorithms to predict diseases. The proposed system evaluates
several machine learning algorithms, namely Logistic Regression, Decision Tree, Random Forest,
Support Vector Machine and Adaptive Boosting, on three major diseases, Breast Cancer, Heart Attack
and Diabetes, and identifies the most accurate algorithm for each disease. For Breast Cancer
detection, the most accurate results were produced by the Adaptive Boosting algorithm, with an
accuracy of 98.57%, evaluated using the p-value test and using characteristics such as clump
thickness, uniformity of cell size, bare nuclei, and bland chromatin. The dataset for diabetes
detection was taken from the Pima Indians Diabetes Dataset, and the most accurate algorithm for
diabetes identification was SVM (Linear), with an accuracy of 85.71%. The Pima Indians Diabetes
dataset had 768 instances and 8 characteristics; backward data modeling eliminated three of them,
leaving just five: the number of times pregnant, glucose concentration detected in an oral glucose
tolerance test (glucose level), BMI, diabetes pedigree function, and age. The dataset for heart
disease detection originally contained 13 attributes, but after backward modeling only 11
significant attributes remained, including chest pain, blood pressure, blood sugar level,
electrocardiograph result, maximum heart rate, exercise-induced angina, old peak, slope, and
number of vessels coloured. In this case the most accurate algorithm was Logistic Regression, with
an accuracy of 87.10%.

Manyu Dhyani et al. [14] proposed an open-domain intelligent chatbot based on deep learning with a
Bidirectional RNN and attention model. The chatbot was developed using a deep learning NMT model
in TensorFlow, with an architecture built from a BRNN and an attention mechanism. Bidirectional
Recurrent Neural Networks (BRNN) containing attention layers are used so that input sentences with
a large number of tokens (sentences with more than 20–40 words) can receive a more appropriate
reply. The dataset used to train the models is taken from Reddit. The model is developed to
perform English-to-English translation. The main purpose of this work is to improve the perplexity
and learning rate of the model and to find the BLEU score for translation within the same
language.

M. Rahman et al. [15] presented a text-based chatbot that delivers diagnoses in Bangla. They
initially gathered data from a variety of sources and then used the Google API to translate the
English terms in the database into Bangla. The training dataset had 4920 samples and 41 diseases,
whereas the testing dataset had 41 samples and 41 diseases. Another file links each disease's name
with the specialty and kind of doctors that treat it. Disease Classification commands and General
Commands are the two basic kinds of commands available to the chatbot. The Bangla text was
vectorized using TF-IDF, and the similarity between texts was calculated using the cosine
similarity measure. The system responds to a simple help request by sending an emergency SMS and
making a real-time phone call to the specified mobile number. This functionality, however, depends
on a number of factors, including device access authorization, network, and balance. To test the
accuracy of the system, various machine learning algorithms were tested on the dataset, and the
accuracy of SVM was found to be the highest among Decision Tree, Random Forest, Multinomial NB,
Adaptive Boosting and KNN.

In this paper, we have shortlisted 15 articles after extensive inspection. The selected articles
cover the major domains of AI, including NLP and ML. In Table 1, we summarize all the research
papers, including the benefits and drawbacks of each one.
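Several of the reviewed systems, for example [3] and [15], rely on TF-IDF vectorization, and [15] pairs it with the cosine similarity measure to match a user's text against stored symptom descriptions. The following is a minimal sketch of that idea in plain Python. It is not taken from any of the reviewed papers: the whitespace tokenizer, the smoothed IDF formula (one common variant) and the English sample texts are assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Very simple tokenizer: lower-case and split on whitespace."""
    return text.lower().split()


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)  # smoothed IDF
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical symptom descriptions and user query.
knowledge = ["fever headache chills", "cough sore throat", "itching skin rash"]
query = "fever and chills"

*doc_vectors, query_vector = tfidf_vectors(knowledge + [query])
scores = [cosine_similarity(query_vector, vec) for vec in doc_vectors]
best_match = max(range(len(scores)), key=scores.__getitem__)
print(knowledge[best_match])
```

The query is vectorized together with the stored texts so that all vectors share the same document-frequency statistics; a production system would more likely fit the weights once on the knowledge base and reuse them for each query.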
Table 1: Summary of Literature Review

| Title | Publication | Year of Publishing | Methodology | Pros | Cons |
|---|---|---|---|---|---|
| Microservice chatbot architecture for chronic patient support [1] | Journal of Biomedical Informatics | 2019 | Messaging platform Signal, Docker Platform, Kibana | Overcomes challenges of modularity, standardization and security. | Complex working and designing of proposed systems. |
| Acceptability of artificial intelligence (AI)-led chatbot services in healthcare: A mixed-methods study [2] | Digital Health | 2019 | NLP, Regression Techniques, Use of Quantitative Data. | Identified potential factors associated with delay in acceptability or refusal. | Answers were collected from relatively experienced users of digital technologies. |
| Design and Development of Smart Healthcare Chatbot Application Using AI – ML [3] | Journal of Natural Remedies | 2020 | NLP concepts like Tokenization, TF-IDF, n-gram, Google API and Raspberry-Pi. | Innovative use of Raspberry-Pi and Google API. | Data security and authenticity was a concern as the sources of information which the chatbot uses were not mentioned. |
| Design and Development of Diagnostic Chabot for supporting Primary Health Care Systems [4] | Procedia Computer Science | 2020 | Natural Language ToolKit, Decision Tree Algorithm | Presence of features like Speech Recognition. | Low accuracy of the proposed system; other ML algorithms might provide better accuracy. |
| Self-Diagnosing Health Care Chatbot using Machine Learning [5] | International Journal of Advanced Science and Technology | 2020 | NLP Tokenization techniques, Decision Tree Algorithm, SQLite, Bootstrap, JavaScript | Chatbot can detect and categorize disease as major and minor. | Chatbot designed for limited databases, and data authenticity was a major issue. |
| AI Chatbot Design during an Epidemic like the Novel Coronavirus [6] | MDPI Healthcare | 2020 | NLP along with AIML, Python and Watson | Text to Speech feature; in case of emergency the chatbot is capable of sending patient location and health conditions to nearby doctors. | Works only for predefined questionnaires. |
| The Smart HealthCare Prediction using Chatbot [7] | International Journal of Recent Technology and Engineering | 2020 | JAVA programming and AIML. | Liberty for people to chat with a doctor anytime anywhere. | Provides minimal description about symptoms. |
| AI Healthcare Interactive Talking Agent using NLP [8] | International Journal of Innovative Technology and Exploring Engineering | 2019 | Language Understanding Intelligence Service (LUIS), NLP and Morphology | Provides a diet plan according to BMI and requirements of the user. | Gives wrong output if inputs are not given in sequential order. |
| Text Messaging-Based Medical Diagnosis Using Natural Language Processing and Fuzzy Logic [9] | Hindawi Journal of Healthcare Engineering | 2020 | NLP, Machine Learning, Twilio, Wordnet, YAGO and Telegram for deployment. | Can suggest diagnosis using a direct approach. | Not secure against false positive cases. |
| Automated Medical Chatbot [10] | SSRN Electronic Journal | 2017 | JAVA based AIML. | Capable of providing remedies. | Low accuracy of the system. |
| Disease Prediction by Machine Learning Over Big Data From Healthcare Communities [11] | IEEE Access | 2017 | CNN based multimodal disease risk prediction algorithm. | Chatbot is capable of filling the missing details accurately. | Complex architecture with hardly any acknowledgement of underlying novelty. |
| Comparing different supervised machine learning algorithms for disease prediction [12] | BMC Medical Informatics and Decision Making | 2019 | Python software | PRISMA guidelines were followed while selecting the articles. | Broader level of classification was considered rather than sub-classification or hyperparameters. |
| Application of Machine Learning in Disease Prediction [13] | IEEE-Xplore | 2018 | Logistic Regression, Decision Tree, SVM, Random Forest, Adaptive Boosting, Pima Indians Diabetes Database, Wisconsin Breast Cancer Dataset, Heart Disease Dataset | Chatbot helps in early detection of chronic and fatal diseases like Breast Cancer, Heart attack and Diabetes. | Involves time-consuming steps like data munging which can be automated. |
| An intelligent Chatbot using deep learning with Bidirectional RNN and attention model [14] | Science Direct Materials Today: Proceedings | 2020 | Tensorflow, Neural Machine Translation, BRNN, Reddit database | Open Domain Chatbot which can be subjected to any specific domain if needed. | High hardware configuration required for the functioning of the system. |
| Disha: An Implementation of Machine Learning Based Bangla Healthcare Chatbot [15] | IEEE-Xplore | 2019 | Named Entity Recognition, ML, Google API. | Developed in local language, which is Bangla. | No option for other languages like English. |
METHODOLOGY

Extensive research efforts were made to identify articles meeting our criteria of AI chatbots
focusing on NLP and ML. A comprehensive search strategy was followed to find all related articles.
The titles used for searching were:

● Healthcare chatbot
● Machine learning in healthcare
● Disease prediction using AI/ML
● AI in healthcare

The aforementioned titles were used to filter the records on major online academic search engines
for science fields, including Google Scholar, Science Direct, PubMed, IEEE Xplore, and Scopus.
They were selected because they provide research papers with a high degree of accuracy and
consistency. They are publication search engines that collectively incorporate millions of
citations and comprise more than 500 million documents, life science journals and online books.
Figure 1 depicts a comparison of the composition of the 45 screened records deriving from the
initial selection of titles.

The subsequent step is the manual inspection of all retrieved articles. For this, we defined four
decision criteria. As shown in Figure 1, they were yes/no binary classifiers: articles published
after 2017 were chosen by looking at the acceptance date of each paper; the abstract of each paper
was read and the papers most relevant to our field were selected; and articles with an impact
factor of less than 0.5 or with fewer than one citation were rejected. Choosing papers from recent
years provides the most up-to-date information in the field of study, while the impact factor of
an academic journal reflects the yearly average number of citations of articles published in the
last two years in that journal and is used to evaluate the merit of articles. The final dataset
contained 15 articles, each of which covered topics of AI chatbots in healthcare. All implemented
variants are discussed in the following section.
Figure 1: Criteria for selecting research papers
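The four yes/no screening criteria can be expressed as a simple filter. The sketch below is illustrative only: the Article fields, the sample records and the function name are hypothetical. The year cut-off is treated as inclusive here because Table 1 lists papers published in 2017; that reading is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    year: int
    relevant_abstract: bool  # outcome of manually reading the abstract
    impact_factor: float
    citations: int


def passes_screening(article, min_year=2017, min_impact=0.5, min_citations=1):
    """Apply the four binary criteria; an article must satisfy all of them."""
    checks = (
        article.year >= min_year,            # recent publication (assumed inclusive)
        article.relevant_abstract,           # abstract relevant to the field
        article.impact_factor >= min_impact,  # journal impact factor threshold
        article.citations >= min_citations,   # cited at least once
    )
    return all(checks)


# Hypothetical candidate records, one failing each kind of check.
candidates = [
    Article("Hypothetical chatbot survey", 2020, True, 2.1, 14),
    Article("Hypothetical older study", 2014, True, 3.0, 90),
    Article("Hypothetical off-topic paper", 2019, False, 1.2, 5),
    Article("Hypothetical uncited paper", 2021, True, 0.8, 0),
]

selected = [a.title for a in candidates if passes_screening(a)]
print(selected)
```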

RESULT AND DISCUSSION

Developing a chatbot can be onerous work, particularly with an incorrect knowledge base. The 15 selected articles were studied in depth in terms of the methods used, the performance measures, and the diseases they targeted. One of the fundamental building blocks of a chatbot is its architecture. The microservices architecture proposed by S. Roca et al. [1] achieves modularity, which permits new functionalities to be delivered dynamically and independently in the architecture; standardization, which guarantees interoperability between different systems; and security, which takes into account overall concerns such as database configuration, known service vulnerabilities, and so forth.

Data plays a crucial role in determining the accuracy of a machine learning algorithm. M. Chen et al. [11] highlighted the role of demographics in clinical data. Different regions exhibit their own unique characteristics, primarily because of the diverse climate and living habits in each region, which may weaken the prediction of disease outbreaks. Risk prediction accuracy is determined by the diversity of features in the data, i.e., the better the feature description of the disease, the greater the accuracy. Thousands of gigabytes of data are used in machine learning. Although the algorithms are designed to handle vast volumes of data, patients' clinical data is sensitive and requires additional attention.

Machine learning, unlike traditional programming, is expected to fail on a certain number of samples or features, and therefore assessing the accuracy of the algorithm becomes critical. There are several ways of determining algorithm accuracy, some of which were suggested by S. Uddin et al. [12]. The confusion matrix and the receiver operating characteristic (ROC) curve have traditionally been used to evaluate a classifier's diagnostic abilities. To construct a ROC curve, the true positive rate is plotted against the false positive rate at various threshold values. The area under the ROC curve (AUC) is another popular metric for determining a classifier's predictive ability. A greater AUC value denotes a better classifier, and vice versa. In medical data, we pay more attention to recall than to accuracy. The higher the recall rate, the lower the probability that a patient who is at risk of disease is predicted to have no disease risk. To determine the best classifier and improve the accuracy of the model, 10-fold cross-validation is applied to the training set, and data from the test set is not used in the training phase. Nicholas A. I. Omoregbe et al. [9] suggested another performance metric, the system usability scale (SUS). The SUS score can assess usability in terms of effectiveness, efficiency, and overall usability.

Healthcare chatbots effectively use machine learning to forecast diseases, schedule appointments, and provide clinical advice. However, a significant number of individuals still have the same impression every time they have a conversation with a bot: "It doesn't comprehend what I'm saying." This is where Natural Language Processing (NLP) comes into play. NLP can pave the way for an easier-to-use interface to features and services in the right context and for the right applications. More crucially, an NLP-based chatbot can give end users on the other side of the screen the impression that they are having a conversation rather than navigating a restricted set of options and menus to reach their desired outcome. Tokenization (the process of segmenting text into words, phrases, or sentences) is one of the NLP strategies described in [3,5,8]. Stemming is the process of reducing related words to a single stem. Stop words are terms that occur frequently but are unlikely to be useful for learning. Named entity recognition (NER) identifies significant elements in a text, such as people's names and locations.

Apart from making the chatbot dialogue human-like, the language used by the chatbot must also be addressed. According to a report by The Washington Post, only about 7.5% of the world's population considers English their native language. This is one of the most compelling arguments for making the chatbot multilingual, particularly in India. India is a multi-religious country whose people speak a variety of languages such as Marathi, Rajasthani, Bengali, Gujarati, and English, among others. Building a multilingual chatbot enables it to reach a larger audience. A multilingual chatbot provides support for multiple languages and can interact with customers in each of them. M. Rahman et al. [15] developed a Bangla healthcare chatbot aiming to offer help in Bengal's native language. Google API can be used to translate the English terms in the database into the respective target language.
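
The evaluation measures discussed in this section (confusion matrix, recall, ROC curve, and AUC) can be illustrated with a short sketch. It is written in plain Python under stated assumptions: the labels and scores are made-up toy values (1 means "at risk"), the function names are illustrative, and the AUC is computed with the trapezoidal rule over the ROC points. It is not the evaluation code of any of the surveyed papers.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 = at risk."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def recall(tp, fn):
    """Share of truly at-risk patients that the classifier flags."""
    return tp / (tp + fn) if tp + fn else 0.0


def roc_points(y_true, scores):
    """(FPR, TPR) pairs, one per distinct score threshold, starting at (0, 0)."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        y_pred = [1 if s >= thr else 0 for s in scores]
        tp, fp, _, _ = confusion_counts(y_true, y_pred)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data (assumed for illustration): true risk labels and model scores.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.6, 0.1]

# Confusion matrix and recall at a 0.5 decision threshold.
y_pred = [1 if s >= 0.5 else 0 for s in scores]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print("tp, fp, tn, fn:", tp, fp, tn, fn)
print("recall:", recall(tp, fn))
print("AUC:", auc(roc_points(y_true, scores)))
```

In this toy example one at-risk patient falls below the threshold and is counted as a false negative, which is exactly the case that recall penalizes. Lowering the threshold trades more false positives for fewer missed patients, and the ROC curve summarizes that trade-off over all thresholds.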
CONCLUSION

Through deep analysis, this survey paper attempted to study and develop a better understanding of healthcare chatbots. A chatbot's primary constituents are NLP for user dialogue, a database, and ML for disease prediction. The architecture of a chatbot is a frequently overlooked component. The microservices architecture [1] provides a flexible solution for personalized monitoring services, addressing the challenges of personalization and data storage in various health scenarios. It is scalable and fosters data sharing and reusability along with standard conversational modeling. It further reduces expenses in the healthcare sector while improving patients' quality of life and lowering the number of face-to-face medical consultations. The architecture mentioned can be implemented in future modules. The NLP aspect of the project ought to focus on assuring users of the human dimensions of AI, aiming to increase the adoption rate of these services. The mechanism of action and clinical effectiveness of the chatbot as an intervention can be explained to all users to eliminate AI hesitancy [3,5,8]. Government and media advertisements can play a crucial role in engaging users, spreading awareness, gaining trust, and eliminating AI hesitancy [2].

A health chatbot provides the anticipated disease and diagnosis for the symptoms presented. In subsequent generations, the chatbot's symptom detection and analysis ability will be greatly enhanced by providing symptom descriptions. Machine learning can be introduced as a scientific discipline that focuses on how computers learn from data and continuously improve themselves. There are many types of machine learning algorithms; the ones most frequently mentioned in the papers were Logistic Regression, Decision Tree, Random Forest, Support Vector Machine, and Adaptive Boosting. In order to make a final decision, these algorithms must be evaluated. A variety of evaluation methods are used to pick the optimum algorithm and dataset combination for disease prediction. Different evaluation methods can be implemented, namely accuracy, precision, recall, the receiver operating characteristic (ROC) curve, the area under the curve (AUC), and the F1-measure [4]. The use of both structured and unstructured data can help determine whether a patient is in a high-risk category for a disease with complex symptoms. By leveraging voice commands, Google API can play a pivotal role in establishing the chatbot as a virtual specialist that rural communities will be able to benefit from. When handling patient data, security and data protection are critical considerations that must be addressed [3]. In conclusion, a chatbot using NLP requires an adequate machine learning algorithm as well as a large database focused on demographics along with descriptive symptoms and other features. The chatbot can be made more user-friendly by using Google API for voice commands, and government and media assistance can help in spreading awareness.

REFERENCES

[1] S. Roca, J. Sancho, J. Garcia and A. Alesanco, "Microservice chatbot architecture for chronic patient support," Journal of Biomedical Informatics, vol. 102, Oct., pp. 1–28, 2019.

[2] T. Nadarzynski, O. Miles, A. Cowie and D. Ridge, "Acceptability of artificial intelligence (AI)-led chatbot services in healthcare: A mixed-methods study," Digital Health, vol. 5, no. 7, Aug., pp. 331–341, 2019.

[3] D. Balasubramaniam, C. Kanmanipappa, S. Velmurugan and M. Saravanan, "Design and Development of Smart Healthcare Chatbot Application Using AI – ML," Journal of Natural Remedies, vol. 21, Nov., pp. 13–19, 2020.

[4] B. Kidwani and Nadesh RK, "Design and Development of Diagnostic Chabot for supporting Primary Health Care Systems," Procedia Computer Science, vol. 167, pp. 75–84, 2020.

[5] S.A. Kumar, C.V. Krishna, P.N. Reddy, B.RK. Reddy and I.J. Jacob, "Self-Diagnosing Health Care Chatbot using Machine Learning," International Journal of Advanced Science and Technology, vol. 29, no. 5, pp. 7323–9330, 2020.

[6] G. Battineni, N. Chintalapudi and F. Amenta, "AI Chatbot Design during an Epidemic like the Novel Coronavirus," MDPI Healthcare, pg. 8-154, 2020.

[7] K Jayashree, Monika K A, Preetha R and Piraisoodan S P, "The Smart HealthCare Prediction using Chatbot," International Journal of Recent Technology and Engineering (IJRTE), ISSN: 2277-3878, Volume-9 Issue-2, July 2020.

[8] M.S Bennet Praba, Sagari Sen, Chailshi Chauhan and Divya Singh, "Ai Healthcare Interactive Talking Agent using Nlp," International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume-9 Issue-1, November 2019.

[9] Nicholas A. I. Omoregbe, Israel O. Ndaman, Sanjay Misra, Olusola O. Abayomi-Alli and Robertas Damasevicius, "Text Messaging-Based Medical Diagnosis Using Natural Language Processing and Fuzzy Logic," Hindawi Journal of Healthcare Engineering, Article ID 8839524, September 2020.

[10] K. Rarhi, A. Mishra and K. Mandal, "Automated Medical Chatbot," SSRN Electronic Journal, vol. 21, Jan. 2017.

[11] M. Chen, Y. Hao, K. Hwang, L. Wang and Lin Wang, "Disease Prediction by Machine Learning Over Big Data From Healthcare Communities," IEEE Access, vol. 5, Apr., pp. 8869–8878, 2017.

[12] S. Uddin, A. Khan, M. Hossain and M. Moni, "Comparing different supervised machine learning algorithms for disease prediction," BMC Medical Informatics and Decision Making, vol. 19, Dec., pp. 1004–1020, 2019.

[13] P.S Kohli and S. Arora, "Application of Machine Learning in Disease Prediction," IEEE Xplore, 2018.

[14] Manyu Dhyani and Rajiv Kumar, "An intelligent Chatbot using deep learning with Bidirectional RNN and attention model," Science Direct Materials Today: Proceedings, Volume 34, Part 3, pp. 817-824, June 2020.

[15] M. Rahman, R. Amin, N. Khan and N. Hossain, "Disha: An Implementation of Machine Learning Based Bangla Healthcare Chatbot," IEEE Xplore, December 2019.

[16] HL7.org, "Welcome to FHIR," http://hl7.org/fhir, Nov. 1, 2019. [Online]. Available: http://hl7.org/fhir/index.html [Accessed May 31, 2021]
