HUMAN DISEASE PREDICTION

A PROJECT REPORT

Submitted by
GOKUL SAI VEGIREDDY (RA2011050010020)
JAVVADI ROHITH VENKATA KRISHNA (RA2011050010038)

Under the Guidance of
B. PRABHU KAVIN
(Assistant Professor, Department of Data Science and Business Systems)

In partial fulfillment of the Requirements for the Degree of
BACHELOR OF TECHNOLOGY in COMPUTER SCIENCE ENGINEERING

DEPARTMENT OF DATA SCIENCE AND BUSINESS SYSTEMS
FACULTY OF ENGINEERING AND TECHNOLOGY
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
KATTANKULATHUR - 603203
NOVEMBER 2023

BONAFIDE CERTIFICATE

Certified that this project report titled "Human Disease Prediction" is the bonafide work of Gokul Sai Vegireddy (RA2011050010020) and Javvadi Rohith Venkata Krishna (RA2011050010038), who carried out the project work under my supervision. Certified further, that to the best of my knowledge the work reported herein does not form part of any other thesis or dissertation on the basis of which a degree or award was conferred on an earlier occasion for this or any other candidate.

Dr. B. Prabhu Kavin          Dr. Lakshmi
Assistant Professor          Head of the Department
Dept. of DSBS                Dept. of DSBS

SRM Institute of Science & Technology
Own Work Declaration Form

Degree / Course     : Bachelor of Technology, Computer Science Engineering
Student Name        : Gokul Sai Vegireddy, Javvadi Rohith Venkata Krishna
Registration Number : RA2011050010020, RA2011050010038
Title of Work       : Human Disease Prediction

We hereby certify that this assessment complies with the University's Rules and Regulations relating to academic misconduct and plagiarism, as listed in the University website, Regulations, and the Education Committee guidelines. We confirm that all the work contained in this assessment is my / our own except where indicated, and that we have met the following conditions:

- Clearly referenced / listed all sources as appropriate
- Referenced and put in inverted commas all quoted text (from books, web, etc.)
- Given the sources of all pictures, data, etc.
that are not my own.
- Not made any use of the report(s) or essay(s) of any other student(s), either past or present
- Acknowledged in appropriate places any help that I have received from others (e.g. fellow students, technicians, statisticians, external sources)
- Complied with any other plagiarism criteria specified in the Course handbook / University website

I understand that any false claim for this work will be penalized in accordance with the University policies and regulations. I am aware of and understand the University's policy on academic misconduct and plagiarism, and certify that this assessment is my / our own work, except where indicated by referencing, and that I have followed the good academic practices noted above.

ACKNOWLEDGEMENT

We express our humble gratitude to Dr. C. Muthamizhchelvan, Vice-Chancellor, SRM Institute of Science and Technology, for the facilities extended for the project work and his continued support. We extend our sincere thanks to Dr. T. N. Gopal, Dean-CET, SRM Institute of Science and Technology, for his invaluable support.

We wish to thank Dr. Revathi Venkataraman, Professor & Chairperson, School of Computing, SRM Institute of Science and Technology, for her support throughout the project work. We are incredibly grateful to our Head of the Department, Professor, Department of Data Science and Business Systems, SRM Institute of Science and Technology, for her suggestions and encouragement at all stages of the project work.

We want to convey our thanks to our program coordinators, Professor, Department of Data Science and Business Systems, SRM Institute of Science and Technology, for their inputs during the project reviews and support. We register our immeasurable thanks to our Faculty Advisor, Associate Professor, Department of Data Science and Business Systems, SRM Institute of Science and Technology, for leading and helping us to complete our course.
Our inexpressible respect and thanks to my guide, , Assistant Professor, Department of Data Science and Business Systems, SRM Institute of Science and Technology, for providing me with an opportunity to pursue my project under his/her/their mentorship. He/She/They provided me with the freedom and support to explore the research topics of my interest. Her/His/Their passion for solving problems and making a difference in the world has always been inspiring.

We thank the Data Science and Business Systems staff and students, SRM Institute of Science and Technology, for their help during our project. Finally, we would like to thank our parents, family members, and friends for their unconditional love, constant support, and encouragement.

GOKUL SAI VEGIREDDY [RA2011050010020], JAVVADI ROHITH VENKATA KRISHNA [RA2011050010038]

Abstract

The integration of machine learning (ML) techniques in healthcare has opened new avenues for disease prediction and diagnosis. This study presents an innovative approach to human disease prediction employing four well-established ML algorithms: k-Nearest Neighbors (k-NN), Random Forest, Naive Bayes, and Decision Tree. The focal point of this model is its reliance on symptom-based classification, aiming to provide a robust and early predictive tool for a diverse range of medical conditions. In the initial stages, a comprehensive dataset is collected, encompassing patient records that include detailed symptom information along with confirmed disease labels. The dataset undergoes meticulous preprocessing to address missing values, normalize features, and encode categorical variables, ensuring the quality of the data for subsequent model training and evaluation. The selection of relevant features plays a critical role in the model's accuracy and interpretability. Specific symptoms are carefully chosen to serve as features in the ML models.
These models are then developed and trained using the k-NN, Random Forest, Naive Bayes, and Decision Tree algorithms. During the training phase, the models learn intricate patterns associating symptoms with specific diseases. In the results analysis, the performance of each algorithm is assessed using key metrics such as accuracy, precision, recall, and F1 score. This comparative evaluation provides insights into the strengths and weaknesses of each algorithm in the context of disease prediction.

INDEX

S.No.  Topic
I      Acknowledgement
II     Abstract
III    List of Abbreviations
1      Introduction
2      Literature Survey
3      Objective
4      Innovation Component
5      System Design
6      Work Done and Implementation
7      Algorithms
8      Results and Analysis
9      Source Code
10     Conclusion
11     References

List of Abbreviations
1. AI: Artificial Intelligence
2. KNN: k-Nearest Neighbors
3. RF: Random Forest
4. DT: Decision Tree
5. NB: Naive Bayes
6. ML: Machine Learning
7. R&D: Research and Development

Chapter 1 Introduction

The Covid pandemic has forced a large majority of individuals to stay inside for their own safety. Frontline workers and hospital staff are relatively more prone to getting the disease as a result of the nature of their job. Consequently, visiting hospitals even for regular diagnosis can be risky. Due to this, a patient who feels he/she is having certain symptoms cannot get checked by medical professionals. We aim to bridge this gap with our innovative idea, which brings to patients a diagnostic tool, right onto their devices. This tool runs some of the most powerful machine learning algorithms and aggregates their results to provide an optimum disease prediction for a person having some set of symptoms. Users can navigate through the very intuitive website that we have built to select the symptoms they feel they might be vulnerable to.
The application then runs their specific inputs through the four algorithms to produce the optimal disease prediction. The simplified user interface and an easy approach to finding out the disease make the project applicable to a large number of use cases, beyond just patients. Doctors and hospital staff can run the numbers through our application to find out if their prediction is accurate. We have used 4 different algorithms for this purpose. We have also designed an interactive interface to facilitate interaction with the system, and attempted to show and visualize the result of our study with this project. This tool essentially helps patients and a number of other kinds of users to verify or predict various life-threatening diseases in their earlier stages, so that the right measures can be taken by the patient at the correct time, thus helping the ones at risk stay risk-free at their homes.

Chapter 2 Literature Survey

> 2.1. Efficient heart disease prediction system using optimization technique
Authors: Chaitanya Suvara, Abhishek Sali, Sakina Salmani
Year and reference: IEEE 2017 ICCMC
Concept / Theoretical model / Framework: Computers in healthcare enterprise hospitals are used to collect massive amounts of data about patients and their illnesses. The latent patterns and connections within the data are regularly ignored. It is a daunting task to identify cardiovascular illnesses in patients, and there are only a few medical doctors who can reliably predict such illnesses. With the help of data mining and optimization approaches, this research paper focuses on developing a prediction algorithm.
Methodology used / Implementation: It uses the technique of Particle Swarm Optimization (PSO), which is an inherently distributed algorithm where the solution to a problem arises from the interactions between several simple particles called individual agents.
The data source used for experimental research is widely used and considered to be a de facto standard for the reliability rating of heart disease prediction. Also used is a slightly changed PSO variant with a constriction element, called Constricted PSO. The results obtained show that Particle Swarm data mining algorithms are competitive and can be successfully applied to the prediction of heart disease, not only against other evolutionary techniques, but also against industry-standard algorithms.
Dataset details / Analysis: The data set is taken from the Data Mining Repository of the University of California, Irvine (UCI) (Newman et al., 1998). The dataset has 14 attributes in total: age, sex, chest pain type, resting blood pressure, serum cholesterol in mg/dl, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, ST depression, slope of the peak exercise ST segment, number of major vessels, thal, and diagnosis of heart disease.
Limitations / Future Research: Limiting the velocity limits the distance traveled by the particle. Better performance can be obtained by pre-processing and post-processing the data. As future scope, techniques like Principal Component Analysis will be used to pre-process data and reduce variance. Reinforcement Learning could also be used so that the system keeps improving as it is used.
Relevant Finding: In the current study the authors draw on Particle Swarm Optimization strategies to predict the severity of heart disease (0-4). PSO is inspired by the intelligent behaviour of beings as part of an experience-sharing community, as opposed to an isolated individual's reactive response to the environment. It may be concluded that the best method is to apply particle swarm optimization with the constriction factor approach while limiting the velocity, thus limiting the distance traveled by the particle.
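To make the constriction-factor idea discussed above concrete, the following is a minimal sketch of one constricted-PSO velocity-and-position update with velocity clamping. It is an illustration of the general technique, not the paper's implementation; the coefficient values and all names are assumptions.

```python
import random

def constricted_pso_step(positions, velocities, pbest, gbest,
                         chi=0.7298, c1=2.05, c2=2.05, vmax=4.0):
    """One constricted-PSO update (illustrative, not the paper's code).

    positions, velocities, pbest: lists of equal-length float lists;
    gbest: the best position found so far by the whole swarm.
    """
    for i, (x, v) in enumerate(zip(positions, velocities)):
        for d in range(len(x)):
            r1, r2 = random.random(), random.random()
            # the constriction factor chi damps the velocity, keeping the swarm stable
            v[d] = chi * (v[d]
                          + c1 * r1 * (pbest[i][d] - x[d])
                          + c2 * r2 * (gbest[d] - x[d]))
            # velocity clamping, as discussed in the paper's limitations:
            # it bounds how far a particle can travel in one step
            v[d] = max(-vmax, min(vmax, v[d]))
            x[d] += v[d]
    return positions, velocities
```

Because both the constriction factor and the clamp bound the velocity, each coordinate can move at most `vmax` per step, which is exactly the "limiting the distance traveled by the particle" trade-off the authors mention.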
> 2.2. Algorithm selection for classification problems
Authors: Nitin Pise, Parag Kulkarni
Year and reference: SAI Computing Conference 2016, July 13-15, 2016
Concept / Theoretical model / Framework: There is an increasing number of algorithms and practices that may be used for the very same application. With the explosion of available learning algorithms, a technique for assisting the user in selecting the most suitable algorithm, or combination of algorithms, to solve a problem is becoming increasingly important. In this paper the authors use meta-learning to relate the overall performance of machine learning algorithms on different datasets. The paper concludes by proposing a system that can learn dynamically according to the given data.
Methodology used / Implementation:
- Dataset collection
- Meta-features are extracted for training from DCT
- Learning algorithm with performance measure such as accuracy
- Generate knowledge base
- Meta-feature extraction of a new dataset using DCT
- Find K-similar datasets from the knowledge base
- Ranking of algorithms
- Recommendation of algorithm for prediction
Dataset details / Analysis: Thirty-eight benchmark datasets obtained from the University of California Irvine Machine Learning Repository were used in the experiments. The dataset characteristics are related to the type of problem. For a classification task, the number of classes, the entropy of the classes, and the percentage of the mode category of the class can be used as useful indicators.
Limitations / Future Research: Investigate the proposed method further and test it extensively on other datasets. Meta-learning helps improve results over the basic algorithms. Using meta-characteristics on the Adult dataset to determine an appropriate algorithm, almost 85% correct classification is achieved for the LogitBoost algorithm.
Relevant Finding: Experiments were performed on 38 real-world datasets; 9 classifiers from different categories and 3 data characterization strategies (meta-features) were used for experimentation. The experimental work indicates that for 90% of the datasets, the predicted and real accuracies closely match. Hence the algorithm selection or recommendation is accurate for those datasets.
> 2.3. Effective Heart Disease Prediction Using Hybrid Machine Learning Techniques
Authors: Mohan, Senthilkumar, Chandrasegar Thirumalai, and Gautam Srivastava
Year and reference: IEEE Access journal, June 19, 2019
Concept / Theoretical model / Framework: One of the most important reasons for loss of life in the world nowadays is heart disease. In the field of medical data analysis, cardiovascular disease prediction is an essential problem. Machine learning (ML) has been proven to be efficient in assisting with choices and predictions based on the massive quantity of data generated by the healthcare industry. The prediction model is introduced with different combinations of features and several known classification techniques. The authors achieve an improved overall performance with an accuracy of 88.7% through the prediction model for heart disease with Hybrid Random Forest with Linear Model (HRFLM).
Methodology used / Implementation: A computational approach with three association rule mining methods, namely apriori, predictive, and Tertius, is used to find the factors of heart disease on the UCI data. HRFLM makes use of an ANN with back propagation along with 13 clinical features as the input. Machine learning techniques were used in this work to process raw data and provide a new and novel discernment towards heart disease. There are four databases (i.e. Cleveland, Hungary, Switzerland, and the VA Long Beach). The Cleveland database was selected for this research because it is a commonly used database for ML researchers with comprehensive and complete records.
The dataset contains 303 records. Although the Cleveland dataset has 76 attributes, the data set provided in the repository furnishes information for a subset of only 14 attributes.
Limitations / Future Research: Further extension of this study is highly desirable to direct the investigations to real-world datasets instead of just theoretical approaches and simulations. The proposed hybrid HRFLM approach combines the characteristics of Random Forest (RF) and Linear Method (LM).
> 2.4. Prediction of Heart Disease Using Machine Learning
Authors: Aditi Gavhane, Gouthami Kokkula, Isha Pandya, Prof. Kailas Devadkar (PhD)
Year and reference: ICECA 2018, IEEE Conference Record #42487
Concept / Theoretical model / Framework: With the rampant rise in the rates of heart stroke at adolescent ages, the authors want to put a mechanism in place to identify the symptoms of a heart stroke at an early stage and thereby avoid it. It is impractical for a common person to often undergo costly tests such as the ECG, so for predicting the risk of a heart attack there needs to be a method that is handy and at the same time accurate. [4] In this paper, consequently, they propose to create an application that, given fundamental signs and symptoms such as age, sex, pulse rate, etc., may predict the vulnerability to heart disease. The neural network algorithm proved to be the most precise and effective algorithm in their proposed method.
Methodology used / Implementation: In the proposed system the authors used the neural network algorithm multi-layer perceptron (MLP) to train and test the dataset. The multi-layer perceptron is a supervised neural network algorithm in which there is one layer for the input, a second for the output, and one or more hidden layers between these two.
Dataset details / Analysis: The dataset used is the Cleveland dataset from the UCI library.
The dataset contains as many as 76 parameters describing the complete health status of the heart. These parameters are obtained by expensive clinical tests like ECG, CT scan, etc. Out of these, the traditional heart disease prediction system uses 13 major parameters.
Limitations / Future Research: Algorithms can be modified to achieve more accuracy and reliability. Big Data technology like Hadoop can be used to store huge chunks of data of all the users worldwide, and to manage the data or reports of the user, technologies like Cloud Computing can be made use of. In the future, similar prediction systems can be built for various other chronic or fatal diseases like cancer, diabetes, etc., with the help of recent technologies like machine learning, fuzzy logic, image processing, and many others.
Relevant Finding: The output of the system delivers a prediction result for whether the person has a heart ailment, in terms of Yes or No. The system gives an idea about the heart status leading to CAD beforehand. If the person is liable to have heart disease then the result obtained will be Yes, and vice versa. In case of a positive output, the person needs to consult a heart specialist for further diagnosis.
> 2.5. Boosted Voting Scheme on Classification
Authors: Chien-Hsing Chen, Chung-Chian Hsu
Year and Reference: IEEE, 2008
Concept / Theoretical model / Framework: A new ensemble scheme is proposed which focuses on deriving the relationship between machine learning algorithms and variant data distributions. The framework can form an expressive hypothesis combination permitting a set of learning algorithms with respect to the data distributions, in preference to the majority voting scheme normally employed for enhancing prediction stability, or the weak learning algorithm needed in bagging/boosting/random-forest algorithms.
Methodology used / Implementation: Combined a set of learning algorithms with variant data distributions, instead of only a majority voting scheme, which was commonly employed for improving prediction stability, or a weak classification algorithm used in boosting algorithms. The weighting of the majority voting in the boosted voting scheme, allowing a set of learning algorithms to be integrated into a winner vote, is very interesting.
Dataset details / Analysis: Datasets are selected from UCI benchmarks like Glass, Hepatitis, Iris, Statlog, Zoo. The authors randomly divided the examples into a two-thirds training dataset and a one-third testing dataset for the classification task in each experiment.
Limitations / Future Research / Gaps identified: In this paper the authors did not discuss increasing the value of T. The determination of the parameter T is a critical issue, and it is worth exploring how to derive the optimal value of T in practical applications. They plan to study the optimal parameter T in the future. Further study is based on the weight assignment for the participating learning algorithms.
Relevant Finding: Results indicate that with an appropriate parameter T, the boosted voting scheme regularly outperforms the majority voting scheme. Studies in the literature have reported that a majority voting scheme is an effective scheme for integrating the votes from several classifiers. Importantly, the experiments presented in this paper show that the proposed boosted voting scheme is valuable in the classification task and gains a better overall performance than a majority voting scheme.
> 2.6. A Quick Review of Machine Learning Algorithms
Authors: Susmita Ray
Year and Reference: 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (Com-IT-Con)
Concept /
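The weighted-voting idea behind the boosted voting scheme in Section 2.5 can be illustrated with a small sketch. This is a generic weighted majority vote, not the paper's algorithm; the weights here stand in for whatever per-classifier scores (e.g. validation accuracy) a real scheme would learn.

```python
from collections import defaultdict

def weighted_vote(votes, weights):
    """Combine classifier votes, each carrying a weight.

    votes:   list of predicted labels, one per classifier
    weights: list of floats of the same length (illustrative scores)
    Returns the label with the highest total weight.
    """
    totals = defaultdict(float)
    for label, weight in zip(votes, weights):
        totals[label] += weight
    return max(totals, key=totals.get)
```

For example, `weighted_vote(["flu", "cold", "flu"], [0.9, 0.95, 0.8])` returns `"flu"`: even though the second classifier is individually the strongest, the two "flu" votes together outweigh it (1.7 vs 0.95), which is the sense in which weighted voting differs from picking a single winner.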
Theoretical model / Framework: In this paper the author gives a short overview of the machine learning algorithms that are most frequently used and are therefore the most famous ones. The author highlights the merits and demerits of these algorithms from an application perspective, to aid informed decision-making when choosing the appropriate learning algorithm to fulfil the specific requirements of the application.
Methodology used / Implementation: In this paper an attempt was made to review the most frequently used machine learning algorithms for solving classification, regression, and clustering problems. The advantages and disadvantages of these algorithms are discussed, along with comparisons of different algorithms (wherever possible) in terms of performance, learning rate, etc. Along with that, examples of practical applications of these algorithms are discussed. Types of machine learning techniques, namely supervised learning, unsupervised learning, and semi-supervised learning, are discussed. It is expected that this will give readers the insight to take an informed decision in identifying the available options of machine learning algorithms and then selecting the appropriate one in the specific problem-solving context.
Relevant Finding: In our project, we plan to use the algorithms present in this paper, and these algorithms will be described in great detail in our project. Machine learning is primarily a field of artificial intelligence that has gained considerable interest in the digital arena as a key component of digitalization solutions. [6] The author gives a brief analysis of the various machine learning algorithms that are most commonly used.
From the implementation point of view, the author aims to highlight the strengths and demerits of machine learning algorithms to assist in informed decision-making when selecting the best learning algorithm to satisfy the particular requirements of the application.

Chapter 3 Objective

The objective of this project is to create a tool to help people diagnose diseases like Diabetes, Jaundice, Dengue, Typhoid, Alcoholic hepatitis, Malaria, Chicken pox, Tuberculosis, Pneumonia, Common Cold, etc. The user should be able to input the symptoms they are suffering from, and the tool should then predict the disease by running the most powerful machine learning algorithms. Every disease should be predicted using 4 algorithms: Decision Tree, Random Forest, Naive Bayes, and K-Nearest Neighbours. A unique feature we aim to include in our project is the use of majority voting to produce the final prediction. Even with majority voting, five different cases may arise while aggregating the final result of the prediction. We aim to develop our system in such a way that it produces ideal results in every case. Machine learning can review large volumes of data and can discover specific patterns and trends that would not be computable by humans. This enhanced computational power of machine learning helps companies around the globe carry out concurrent processes. In this Covid pandemic era, it becomes increasingly hard to visit a hospital, with many risks involved. People having symptoms of certain diseases are especially in doubt regarding finding the best method to get a quick and reliable analysis of their symptoms. While online doctors are available, they have associated risk factors as well, such as fake doctors or less involvement from the doctor, which might lead to a wrong analysis of a patient. In such a scenario, a simple online tool which can predict the possible underlying condition on the basis of a few symptoms is the need of the hour.
A system which would not rely on a single method of classification, but on multiple classifiers, to find the correct underlying disease with high accuracy. We aim to create such a solution for the general public, with a wide variety of use cases; this would help uplift people by providing extensive remote diagnosis for them.

Chapter 4 Innovation Component

Other existing disease prediction projects predict diseases based on a single algorithm, in which we may or may not get accurate results all the time, because no single machine learning algorithm works best for every problem. There are many factors at play, such as the size, structure, and number of records in the dataset. We aim to overcome this drawback with our innovative idea, which brings to patients a diagnostic tool that runs 4 different powerful machine learning algorithms and aggregates the optimal result by majority voting over the results of all the algorithms. Our tool runs their specific inputs through these four algorithms to produce a result that is more optimal than that of other existing approaches. Every disease is predicted using 4 algorithms: Decision Tree, Random Forest, Naive Bayes, and K-Nearest Neighbour. A hold-out "test set" of data is used to evaluate performance and select the ideal result. While aggregating the final result, the system may encounter 5 different cases, which are as follows:
Case-1: All 4 predictions (from the 4 different algorithms) may be the same, say 'disease-a'. In this case the system will display 'disease-a' as the Final Prediction when the 'Final Prediction' button is clicked.
Case-2: Any 3 of the 4 predictions may be the same (say 'disease-a') and differ from the remaining 4th prediction. In this case the system will display 'disease-a' as the Final Prediction when the 'Final Prediction' button is clicked.
Case-3: Any 2 of the 4 predictions may be the same (say 'disease-a') and differ from the remaining 2 predictions, which also differ from each other. In this case the system will display 'disease-a' as the Final Prediction when the 'Final Prediction' button is clicked.
Case-4: Any 2 of the 4 predictions may be the same (say 'disease-a') and differ from the remaining 2 predictions, which are themselves the same (say 'disease-b'). In this case the system will select the disease that is not predicted by the algorithm with the lowest accuracy among the four, and display it as the Final Prediction when the 'Final Prediction' button is clicked.
Case-5: All 4 predictions may be completely different from each other. In this case the system will display all the predicted diseases except the one predicted by the algorithm with the lowest accuracy as the Final Prediction when the 'Final Prediction' button is clicked.

Chapter 5 System Design

[System architecture diagram: not recoverable from the source.]

Chapter 6 Work Done And Implementation

6.1. Methodology:
Step-1: The user will enter his/her name.
Step-2: The user has the flexibility of choosing between 2 to 5 symptoms for diagnosis.
Step-3: If the number of symptoms selected is >= 2 and the length of the name is >= 1, then the user can select any of the four algorithms for diagnosis.
Step-4: If the user-selected algorithm is Decision Tree, then DecisionTreeClassifier() in the sklearn library is used to train the model and predict the disease on the testing dataset according to the symptoms entered by the user. If the user-selected algorithm is Random Forest, then RandomForestClassifier is imported from the sklearn.ensemble library to train the model and predict the disease on the testing dataset according to the symptoms entered by the user. n_estimators in RandomForestClassifier is set to 100, which means the algorithm builds 100 trees before taking the maximum vote.
If the user-selected algorithm is Naive Bayes, then GaussianNB() in the sklearn.naive_bayes library is used to train the model and predict the disease on the testing dataset according to the symptoms entered by the user. If the user-selected algorithm is K-Nearest Neighbors, then KNeighborsClassifier() in the sklearn.neighbors library is used to train the model and predict the disease on the testing dataset according to the symptoms entered by the user. The number of neighbors is set to 5, and the power parameter for the Minkowski metric is set to 2, which corresponds to the Euclidean distance.
Step-5: After selecting all algorithms for diagnosis, the user can select the final prediction.
Step-6: In the final prediction, 5 different cases may arise, and in all the cases the system will produce the ideal result based on majority voting and the accuracy of each algorithm.
Step-7: The results are then stored safely in the sqlite3 database.
Based on the symptoms, nearly 40 diseases like Diabetes, Jaundice, Dengue, Typhoid, Alcoholic hepatitis, Malaria, Chicken pox, Tuberculosis, Pneumonia, Common Cold, etc., are covered in this model.
6.2. Datasets and References:
6.2.1. Dataset used: The dataset for this project was collected from a study of Columbia University performed at New York Presbyterian Hospital during 2004. The link to the dataset is given below:
[link not recovered from the source]
6.2.2. Comparison with reference project: The referenced project runs on a single disease prediction algorithm, in which we may or may not get accurate results all the time, whereas we aim to overcome this drawback with our innovative idea, which brings to patients a diagnostic tool that runs four different powerful machine learning algorithms and aggregates the optimal result. Our tool runs their specific inputs through these four algorithms to produce the disease name based on symptoms, which is more optimal than other existing approaches.
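The training and aggregation flow of Steps 4-6 can be sketched as below. The parameters match those described in the steps (n_estimators=100, n_neighbors=5, p=2); the tiny inline dataset, column names, and the simple Counter-based vote are illustrative assumptions, not the project's actual training.csv or GUI code.

```python
from collections import Counter

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Illustrative stand-in for training.csv: each row is a binary symptom
# vector, and 'prognosis' is the disease label.
df = pd.DataFrame({
    "fever":     [1, 1, 0, 0, 1, 0],
    "headache":  [1, 0, 1, 0, 1, 1],
    "rash":      [0, 0, 0, 1, 0, 1],
    "prognosis": ["Malaria", "Typhoid", "Common Cold",
                  "Chicken pox", "Malaria", "Chicken pox"],
})
X, y = df.drop(columns="prognosis"), df["prognosis"]

# The four models with the parameters described in Step-4.
models = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Naive Bayes":   GaussianNB(),
    "KNN":           KNeighborsClassifier(n_neighbors=5, p=2),
}
for model in models.values():
    model.fit(X, y)

# A hypothetical user's selected symptoms as a single-row frame.
patient = pd.DataFrame([{"fever": 1, "headache": 1, "rash": 0}])
votes = [model.predict(patient)[0] for model in models.values()]

# Plain majority vote over the four predictions (Cases 1-3 of Chapter 4).
final = Counter(votes).most_common(1)[0][0]
```

Note that `Counter.most_common` alone does not implement Cases 4 and 5, where ties are broken by dropping the algorithm with the lowest accuracy; that step would additionally need each model's held-out accuracy score.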
Every disease is predicted using four algorithms: Naive Bayes, Decision Tree, Random Forest, and K-Nearest Neighbors.
6.3. Tools used:
> The Tkinter library is used, which is a standard GUI library of Python that provides many fast and easy ways to create a graphical user interface.
> SQLite database
> The Pandas library is used for data analysis.
> The Sklearn library is used for implementing machine learning and visualization algorithms.
> Jupyter Notebook is used for creating and executing live code, equations, visualizations, statistical modeling, and machine learning.
6.4. Screenshots and Demo:
6.4.1. Training Dataset
> Importing Libraries:

In [ ]: # importing libraries
        from mpl_toolkits.mplot3d import Axes3D
        from sklearn.preprocessing import StandardScaler
        import matplotlib.pyplot as plt
        from tkinter import *
        import numpy as np
        import pandas as pd
        import os

These are the various imported libraries used in our project. 'mpl_toolkits.mplot3d' is used for generating 3D plots. 'sklearn.preprocessing' provides several common utility functions, and learning algorithms benefit from standardization of the data set. 'Tkinter' is used to build a graphical user interface in Python. 'numpy' is used for complex mathematical operations and multi-dimensional array objects. 'pandas' is used for data analysis. The 'os' module is used for creating and removing a directory (folder), fetching its contents, and changing and identifying the current directory.
> Storing the list of symptoms in L1:
L1 is the list made for storing the various symptoms which are generally seen in people for various diseases.
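The role of L1 can be shown with a short sketch of how the user's selected symptoms become the binary feature vector the classifiers consume. The variable names l1/l2 follow the report; the symptom strings and the selection are illustrative assumptions (the project's symptom list is much longer).

```python
# l1: master list of symptoms (illustrative subset of the project's list)
l1 = ["itching", "skin_rash", "continuous_sneezing", "shivering",
      "chills", "joint_pain", "stomach_pain", "vomiting"]

# l2: one slot per symptom, initialised to 0
l2 = [0] * len(l1)

# symptoms a hypothetical user picked in the GUI
selected = ["chills", "vomiting"]

# set the slot to 1 for every symptom the user selected
for k, symptom in enumerate(l1):
    if symptom in selected:
        l2[k] = 1

# l2 is now the single-row input vector passed to model.predict([l2])
```

Because every model is trained on columns in the order of l1, the positional encoding here must match the column order of training.csv for predictions to be meaningful.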
> Storing the list of diseases in the list 'disease':

disease = ['Fungal infection', 'Allergy', 'GERD', 'Chronic cholestasis', 'Drug Reaction',
           'Peptic ulcer diseae', 'AIDS', 'Diabetes', 'Gastroenteritis', 'Bronchial Asthma',
           'Hypertension', 'Migraine', 'Cervical spondylosis', 'Paralysis (brain hemorrhage)',
           'Jaundice', 'Malaria', 'Chicken pox', 'Dengue', 'Typhoid', 'hepatitis A',
           'Hepatitis B', 'Hepatitis C', 'Hepatitis D', 'Hepatitis E', 'Alcoholic hepatitis',
           'Tuberculosis', 'Common Cold', 'Pneumonia', 'Dimorphic hemmorhoids(piles)',
           'Heart attack', 'Varicose veins', 'Hypothyroidism', 'Hyperthyroidism',
           'Hypoglycemia', 'Osteoarthristis', 'Arthritis',
           '(vertigo) Paroymsal Positional Vertigo', 'Acne', 'Urinary tract infection',
           'Psoriasis', 'Impetigo']

The list named 'disease' is created for storing the various types of diseases.

> Creating a vacant list L2:

l2 = []
for i in range(0, len(l1)):
    l2.append(0)
print(l2)

The list L2 is created and appended with zeros, one for each symptom in list L1.

> Reading the dataset (training.csv file):

df = pd.read_csv("training.csv")
df.replace({'prognosis': {'Fungal infection': 0, 'Allergy': 1, 'GERD': 2, ...}}, inplace=True)

The dataset is stored in a CSV document named 'training.csv', which contains diseases and symptoms. 'training.csv' is utilized to train the model. The 'read_csv()' function is used to load the data into the dataframe named 'df'. Values in the prognosis column of the imported file, which contains disease names, are replaced with numbers from 0 to n-1 using the inbuilt 'replace()' function in 'pandas', where n is the number of different diseases present in the .csv record.
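The label-encoding step above can be sketched with a small illustrative DataFrame (the disease names and mapping are shortened here; the real code maps all the diseases in training.csv):

```python
import pandas as pd

# Tiny stand-in for training.csv: symptom columns plus a prognosis column
df = pd.DataFrame({
    "itching": [1, 0, 0],
    "chills": [0, 1, 1],
    "prognosis": ["Fungal infection", "Malaria", "Typhoid"],
})

# Replace each disease name with an integer label 0..n-1
mapping = {"Fungal infection": 0, "Malaria": 1, "Typhoid": 2}
df.replace({"prognosis": mapping}, inplace=True)

print(df["prognosis"].tolist())  # → [0, 1, 2]
```

After this step the prognosis column is numeric, which is what the sklearn classifiers expect as the target vector y.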
> Checking the created data frame (output):

The 'head()' function is used for checking the created dataframe 'df'; by default it prints the first five rows of the data frame.

[Screenshot: output of df.head(), showing the first five rows of 'df' with symptom columns such as itching, skin_rash, continuous_sneezing, shivering, chills, joint_pain, stomach_pain, acidity, ulcers_on_tongue.]

> Distribution graphs (histogram/bar graph) of column data:

[Screenshot: definition of the plotPerColumnDistribution() function.]

This is the code for plotting the distribution graph of the columns in the 'training.csv' file. For display purposes, we picked the columns which contain 2 to 50 unique values; the number of unique values in each column is found with the 'nunique()' function. We used the shape attribute to find the dimensions of data frame 'df1'. We used the 'subplot(m, n, p)' function, which displays graphs in an m x n grid and creates axes in the position specified by p, and the 'plot.bar()' function, which prints bar graphs.

> Output for the distribution graph of the columns of the training.csv file:

plotPerColumnDistribution(df, 10, 5)

[Screenshot: definition of the plotScatterMatrix() function for scatter and density plots.]

To keep only numerical columns in the density plots we used 'df1.select_dtypes(include=[np.number])', and we used the 'dropna()' function to remove the columns and rows which would make the data frame singular. To keep only the columns with more than one unique value we used the 'nunique()' function. To get the list of column names in the data frame we used 'list(df)'.
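A condensed sketch of the per-column distribution plotting described above. The column-selection thresholds (2 to 50 unique values) follow the text; the function name, figure sizing and toy data are assumptions for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import pandas as pd

def plot_per_column_distribution(df, n_graph_shown, n_graph_per_row):
    # Keep only columns with 2..49 unique values, as described in the text
    nunique = df.nunique()
    df1 = df[[col for col in df if 1 < nunique[col] < 50]]
    n_rows, n_cols = df1.shape
    plt.figure(figsize=(10, 4))
    for i, col in enumerate(df1.columns[:n_graph_shown]):
        # Place each bar graph into an m x n grid of subplots
        plt.subplot((n_cols - 1) // n_graph_per_row + 1, n_graph_per_row, i + 1)
        df1[col].value_counts().plot.bar()
        plt.title(col)
    plt.tight_layout()
    return df1.shape

df = pd.DataFrame({"a": [0, 1, 0, 1], "b": [1, 1, 1, 1], "prognosis": [0, 1, 2, 3]})
print(plot_per_column_distribution(df, 10, 5))  # column "b" is dropped (1 unique value)
```

Constant columns like "b" carry no distributional information, which is why the 2-to-50 filter discards them before plotting.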
If the number of unique elements in a column is greater than 10, the number of columns is reduced for the matrix inversion used by the kernel density plots.

> Output of scatter and density plots:

plotScatterMatrix(df, 20, 10)

> Storing symptoms in 'X' and creating a contiguous flattened array of diseases:

X = df[l1]
y = df[["prognosis"]]
np.ravel(y)
print(X)

We put the symptoms in X and the prognosis/diseases in y for training the model. The 'ravel()' function is used to change a multi-dimensional array into a contiguous flattened array.

CHAPTER 7
ALGORITHMS

7.1. Decision Tree Function:

root = Tk()
pred1 = StringVar()

'root = Tk()' starts the 'tkinter' library to build the GUI. 'StringVar()' is used to store the default value in pred1, which is an empty string ("").

> Invalid Name:

def DecisionTree():
    if len(NameEn.get()) == 0:
        pred1.set(" ")
        comp = messagebox.askokcancel("System", "Kindly Fill the Name")
        if comp:
            root.mainloop()

> If the user tries to run the system without entering the name, then the system will prompt the following message:

[Screenshot: "System" message box with the text "Kindly Fill the Name".]

> Invalid No. of Symptoms:

    elif (Symptom1.get() == "Select Here") or (Symptom2.get() == "Select Here"):
        pred1.set(" ")
        sym = messagebox.askokcancel("System", "Kindly Fill atleast first two Symptoms")
        if sym:
            root.mainloop()
    else:
        from sklearn import tree

> After filling in the name, the user has to select five symptoms, of which the first two are compulsory. If the user doesn't select at least the first two symptoms, then the system will prompt the following message:

[Screenshot: "System" message box with the text "Kindly Fill atleast first two Symptoms".]

> 'DecisionTreeClassifier()' in the sklearn library is used to train the model and predict the disease on the testing dataset according to the symptoms entered by the user. The final disease for the decision tree is stored in a variable named 'pred1'. Using 'sklearn.metrics', the accuracy of predicting the disease is printed using 'accuracy_score', and the confusion matrix is created using the 'confusion_matrix' function.
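The train/evaluate step for the decision tree can be sketched as below. The data here is a toy stand-in (an XOR-style pattern) for the real X, y, X_test and y_test; the split via train_test_split is an assumption about how the test set is produced:

```python
from sklearn import tree
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 10   # toy symptom vectors
y = [0, 1, 1, 0] * 10                       # toy encoded disease labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

clf3 = tree.DecisionTreeClassifier()
clf3 = clf3.fit(X_train, y_train)
y_pred = clf3.predict(X_test)

print(accuracy_score(y_test, y_pred))                   # fraction of correct predictions
print(accuracy_score(y_test, y_pred, normalize=False))  # raw count of correct predictions
print(confusion_matrix(y_test, y_pred))
```

Note the two accuracy_score calls mirror the report's code: with normalize=False the function returns the number of correctly classified samples rather than a fraction.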
    else:
        from sklearn import tree
        clf3 = tree.DecisionTreeClassifier()
        clf3 = clf3.fit(X, y)
        from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
        y_pred = clf3.predict(X_test)
        print("Decision Tree")
        print("Accuracy")
        print(accuracy_score(y_test, y_pred))
        print(accuracy_score(y_test, y_pred, normalize=False))
        print("Confusion matrix")
        conf_matrix = confusion_matrix(y_test, y_pred)
        print(conf_matrix)

> Creating the database using sqlite3:
A database named 'database.db' is created, if it does not exist, using the 'sqlite3' module. For storing the results of the decision tree algorithm, a 'DecisionTree' table is created (if it does not exist) in database.db, with columns for the name of the user, Symptom1, Symptom2, Symptom3, Symptom4, Symptom5 and the disease, using the 'CREATE' statement in SQLite. Values are inserted into the 'DecisionTree' table using the 'INSERT' statement in SQLite.

[Screenshot: sqlite3 code creating the 'DecisionTree' table and inserting the user's name, symptoms and predicted disease.]

> Printing the scatter plot of input symptoms:
The symptoms are taken as inputs using the 'scatterinp' function, and the scatter plot of the predicted disease vs. its symptoms is printed using the 'scatterplt' function.

scatterinp(Symptom1.get(), Symptom2.get(), Symptom3.get(), Symptom4.get(), Symptom5.get())
scatterplt(pred1.get())

7.2. Random Forest Function:

pred2 = StringVar()

'pred2' is used to store the disease predicted using the random forest algorithm.

> Invalid Name:

    if len(NameEn.get()) == 0:
        pred2.set(" ")
        comp = messagebox.askokcancel("System", "Kindly Fill the Name")
        if comp:
            root.mainloop()

> If the user tries to run the system without entering the name, then the system will prompt the following message:

[Screenshot: "System" message box with the text "Kindly Fill the Name".]

> Invalid No. of Symptoms:

    elif (Symptom1.get() == "Select Here") or (Symptom2.get() == "Select Here"):
        pred2.set(" ")
        sym = messagebox.askokcancel("System", "Kindly Fill atleast first two Symptoms")
        if sym:
            root.mainloop()

> After filling in the name, the user has to select five symptoms, of which the first two are compulsory. If the user doesn't select at least the first two symptoms, then the system will prompt the following message:

[Screenshot: "System" message box with the text "Kindly Fill atleast first two Symptoms".]

> Using the 'sklearn.ensemble' library, 'RandomForestClassifier' is imported. 'n_estimators' in 'RandomForestClassifier' is set to 100, which means that the algorithm builds 100 trees before taking the majority vote. Using 'sklearn.metrics', the accuracy of predicting the disease is printed using 'accuracy_score' and the confusion matrix is created using the 'confusion_matrix' function.

    else:
        from sklearn.ensemble import RandomForestClassifier
        clf4 = RandomForestClassifier(n_estimators=100)
        clf4 = clf4.fit(X, np.ravel(y))
        from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
        y_pred = clf4.predict(X_test)
        print("Random Forest")
        print("Accuracy")
        print(accuracy_score(y_test, y_pred))
        print(accuracy_score(y_test, y_pred, normalize=False))
        print("Confusion matrix")
        conf_matrix = confusion_matrix(y_test, y_pred)
        print(conf_matrix)

> The 5 symptoms entered by the user are stored in the 'psymptoms' array. Each symptom is iterated over in a 'for' loop in the range 0 to the length of L1 (the list of symptoms). If a symptom matches L1 at index k, then L2[k] is set to 1, where L2 is the vacant list whose length equals the length of the L1 array.

psymptoms = [Symptom1.get(), Symptom2.get(), Symptom3.get(), Symptom4.get(), Symptom5.get()]
for k in range(0, len(l1)):
    for z in psymptoms:
        if z == l1[k]:
            l2[k] = 1

7.3. Naive Bayes Function:

pred3 = StringVar()

'pred3' is used to store the disease predicted using the Naive Bayes algorithm.
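The Naive Bayes step can be sketched in isolation as below; the arrays are toy stand-ins for the real symptom matrix and encoded labels:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy stand-ins for the real symptom matrix and encoded disease labels
X = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0]] * 5)
y = np.array([0, 1, 0, 1] * 5)

gnb = GaussianNB()
gnb = gnb.fit(X, np.ravel(y))

# Predict the encoded disease label for one symptom vector
print(gnb.predict([[1, 0, 1]])[0])
```

GaussianNB fits one Gaussian per feature per class, so a symptom vector close to the class-0 training rows maps back to label 0.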
> 'GaussianNB()' in the 'sklearn.naive_bayes' library is used to train the model and predict the disease on the testing dataset according to the symptoms entered by the user. Using 'sklearn.metrics', the accuracy of predicting the disease is printed using 'accuracy_score' and the confusion matrix is created using the 'confusion_matrix' function.

    else:
        from sklearn.naive_bayes import GaussianNB
        gnb = GaussianNB()
        gnb = gnb.fit(X, np.ravel(y))
        from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
        y_pred = gnb.predict(X_test)
        print("Naive Bayes")
        print("Accuracy")
        print(accuracy_score(y_test, y_pred))
        print(accuracy_score(y_test, y_pred, normalize=False))
        print("Confusion matrix")
        conf_matrix = confusion_matrix(y_test, y_pred)
        print(conf_matrix)

> The 5 symptoms entered by the user are stored in 'psymptoms'. Each symptom is iterated over in a 'for' loop in the range 0 to the length of L1 (the list of symptoms). If a symptom matches L1 at index k, then L2[k] is set to 1, where L2 is the vacant list whose length equals the length of the L1 array.

psymptoms = [Symptom1.get(), Symptom2.get(), Symptom3.get(), Symptom4.get(), Symptom5.get()]
for k in range(0, len(l1)):
    for z in psymptoms:
        if z == l1[k]:
            l2[k] = 1

> The list L2 is passed to the 'predict' function, which predicts the label of the data values on the basis of the Naive Bayes model. The output of the predict function is stored in 'predicted'.
'predicted' is then compared, in a 'for' loop over the range 0 to the length of the list 'disease' (which contains the list of diseases), to check whether it corresponds to an entry in the list 'disease'. If it is found in the list 'disease' at index a, then the value of 'pred3' is set to 'disease[a]'; otherwise 'pred3' is set to 'Not Found'.

inputtest = [l2]
predict = gnb.predict(inputtest)
predicted = predict[0]
h = 'no'
for a in range(0, len(disease)):
    if predicted == a:
        h = 'yes'
        break
if h == 'yes':
    pred3.set(" ")
    pred3.set(disease[a])
else:
    pred3.set(" ")
    pred3.set("Not Found")

> Creating the database using sqlite3:
The same database, database.db, that is used for the decision tree and random forest algorithms is used here. For storing the results of the Naive Bayes algorithm, a 'NaiveBayes' table is created (if it does not exist) in database.db, with columns for the name of the user, Symptom1, Symptom2, Symptom3, Symptom4, Symptom5 and the disease, using the 'CREATE' statement in SQLite. Values are inserted into the 'NaiveBayes' table using the 'INSERT' statement in SQLite.

[Screenshot: sqlite3 code creating the 'NaiveBayes' table and inserting the results.]

7.4. K-Nearest Neighbour Function:

pred4 = StringVar()

'pred4' is used to store the disease predicted using the k-Nearest Neighbour algorithm.

> Invalid Name:

def KNN():
    if len(NameEn.get()) == 0:
        pred4.set(" ")
        comp = messagebox.askokcancel("System", "Kindly Fill the Name")
        if comp:
            root.mainloop()

> Invalid No. of Symptoms:

    elif (Symptom1.get() == "Select Here") or (Symptom2.get() == "Select Here"):
        pred4.set(" ")
        sym = messagebox.askokcancel("System", "Kindly Fill atleast first two Symptoms")
        if sym:
            root.mainloop()

> 'KNeighborsClassifier()' in the 'sklearn.neighbors' library is used to train the model and predict the disease on the testing dataset according to the symptoms entered by the user. The number of neighbours is set to 5, and the power parameter for the Minkowski metric is set to 2, which means Euclidean distance. The final disease for k-nearest neighbours is stored in a variable named 'pred4'.
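The SQLite persistence described above can be sketched as follows. The column names follow the report's description, but the exact schema, the use of ':memory:' (the report uses 'database.db'), and the sample row ("Alice", "Malaria") are assumptions for illustration:

```python
import sqlite3

# ':memory:' keeps this sketch self-contained; the project connects to 'database.db'
conn = sqlite3.connect(":memory:")
c = conn.cursor()

# One table per algorithm; columns assumed from the report's description
c.execute("""CREATE TABLE IF NOT EXISTS NaiveBayes(
    Name TEXT, Symptom1 TEXT, Symptom2 TEXT, Symptom3 TEXT,
    Symptom4 TEXT, Symptom5 TEXT, Disease TEXT)""")

# Store one diagnosis result (hypothetical user and prediction)
c.execute("INSERT INTO NaiveBayes VALUES (?, ?, ?, ?, ?, ?, ?)",
          ("Alice", "chills", "vomiting", "", "", "", "Malaria"))
conn.commit()

print(c.execute("SELECT Disease FROM NaiveBayes").fetchone()[0])  # → Malaria
```

Using parameterized '?' placeholders, as here, avoids SQL-injection issues that string concatenation of the GUI inputs would introduce.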
Using 'sklearn.metrics', the accuracy of predicting the disease is printed using 'accuracy_score' and the confusion matrix is created using the 'confusion_matrix' function.

    else:
        from sklearn.neighbors import KNeighborsClassifier
        knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
        knn = knn.fit(X, np.ravel(y))
        from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
        y_pred = knn.predict(X_test)
        print("KNN")
        print("Accuracy")
        print(accuracy_score(y_test, y_pred))
        print(accuracy_score(y_test, y_pred, normalize=False))
        print("Confusion matrix")
        conf_matrix = confusion_matrix(y_test, y_pred)
        print(conf_matrix)

> The 5 symptoms entered by the user are stored in the 'psymptoms' array. Each symptom is iterated over in a 'for' loop in the range 0 to the length of L1 (the list of symptoms). If a symptom matches L1 at index k, then L2[k] is set to 1, where L2 is the vacant list whose length equals the length of the L1 array.

psymptoms = [Symptom1.get(), Symptom2.get(), Symptom3.get(), Symptom4.get(), Symptom5.get()]
for k in range(0, len(l1)):
    for z in psymptoms:
        if z == l1[k]:
            l2[k] = 1

> Creating the database using sqlite3:
The same database, database.db, that is used for the decision tree, random forest and Naive Bayes algorithms is used here. For storing the results of the k-nearest neighbour algorithm, a 'KNearestNeighbour' table is created (if it does not exist) in database.db, with columns for the name of the user, Symptom1, Symptom2, Symptom3, Symptom4, Symptom5 and the disease, using the 'CREATE' statement in SQLite. Values are inserted into the 'KNearestNeighbour' table using the 'INSERT' statement in SQLite.

> Printing the scatter plot of input symptoms:
The symptoms are taken as inputs using the 'scatterinp' function, and the scatter plot of the predicted disease vs. its symptoms is printed using the 'scatterplt' function.

scatterplt(pred4.get())

7.5. Final Prediction Function:

[Screenshot: start of the final-prediction function, which checks that pred1, pred2, pred3 and pred4 are all non-empty before proceeding.]

> If the user tries to find the final prediction without completing the disease diagnosis with all four algorithms, then the system will prompt the following message:

[Screenshot: "System" message box asking the user to complete the diagnosis with all the 4 algorithms.]

> When the 'Final Prediction' button is clicked, the system may encounter 5 different cases, which are as follows:

[Screenshot: voting code that appends the four predictions to tempList, counts the occurrences of each prediction in tempList2, and compares the minimum and maximum counts.]

Case-1: All 4 predictions (the 4 different algorithm results) may be the same (say 'disease-a'). In this case the system displays 'disease-a' as the final prediction when the 'Final Prediction' button is clicked.
Case-2: Any 3 of the 4 predictions may be the same (say 'disease-a') and differ from the remaining 4th prediction. In this case the system displays 'disease-a' as the final prediction when the 'Final Prediction' button is clicked.
Case-3: Any 2 of the 4 predictions may be the same (say 'disease-a') and differ from the remaining 2 predictions, which also differ from each other. In this case the system displays 'disease-a' as the final prediction when the 'Final Prediction' button is clicked.
Case-4: Any 2 of the 4 predictions may be the same (say 'disease-a') and differ from the remaining 2 predictions, which are themselves the same (say 'disease-b'). In this case the system selects one disease among disease-a and disease-b (it selects the disease which is not predicted by the KNN algorithm, since the accuracy of KNN is nearly 92%, which is lower than that of all the remaining algorithms [nearly 95%]) and displays it as the final prediction when the 'Final Prediction' button is clicked.
Case-5: All 4 predictions may be completely different from each other.
In this case the system displays all 4 diseases except the one predicted by the KNN algorithm (since the accuracy of KNN is nearly 92%, which is lower than that of all the remaining algorithms [nearly 95%]) as the final prediction when the 'Final Prediction' button is clicked.

7.6. Building the Graphical User Interface:

> The graphical user interface is built using the 'tkinter' library in Python. 'root' is used to start the GUI. It is configured with the background colour set to 'Ivory'. The GUI title is set to "Human Disease Prediction Using Machine Learning" using the 'title()' function in 'tkinter'. The 'resizable()' function is used to fix the size of the GUI.

root.configure(background='Ivory')
root.title('Human Disease Prediction Using Machine Learning')
root.resizable(0, 0)

> Here, 5 variables are defined: Symptom1, Symptom2, Symptom3, Symptom4 and Symptom5, of type 'StringVar()'; these variables are initialised to "Select Here" using the 'set()' function in 'tkinter'.

Symptom1 = StringVar()
Symptom1.set("Select Here")
Symptom2 = StringVar()
Symptom2.set("Select Here")
Symptom3 = StringVar()
Symptom3.set("Select Here")
Symptom4 = StringVar()
Symptom4.set("Select Here")
Symptom5 = StringVar()
Symptom5.set("Select Here")
Name = StringVar()

> This is how the above variables look in the GUI:

[Screenshot: five "Select Here" dropdowns.]

> The function 'Reset()' is defined to reset the GUI inputs given by the user. It is called when the user clicks the "Reset Inputs" button in the GUI.

[Screenshot: the Reset() function, which sets the five symptom variables back to "Select Here", clears the name entry and the prediction variables, and destroys the previous plot window if one exists.]

> The function 'Exit()' is defined to quit the GUI. It is called when the user clicks the "Exit System" button in the GUI.

[Screenshot: "Exit System" button in the GUI.]
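The GUI wiring described above can be condensed into one sketch. The dropdown options, widget layout and function name are assumptions; only the window configuration, the five "Select Here" StringVars and the Reset/Exit buttons follow the report. Nothing is displayed until build_gui().mainloop() is called (which requires a display):

```python
import tkinter as tk

def build_gui():
    """Minimal GUI skeleton: five symptom dropdowns plus Reset and Exit buttons."""
    root = tk.Tk()
    root.configure(background="Ivory")
    root.title("Human Disease Prediction Using Machine Learning")
    root.resizable(0, 0)  # fix the window size

    options = ["Select Here", "itching", "chills", "vomiting"]  # illustrative symptoms
    symptom_vars = [tk.StringVar(root, value="Select Here") for _ in range(5)]
    for var in symptom_vars:
        tk.OptionMenu(root, var, *options).pack()

    def reset():
        # Restore every dropdown to its default value
        for var in symptom_vars:
            var.set("Select Here")

    tk.Button(root, text="Reset Inputs", command=reset).pack()
    tk.Button(root, text="Exit System", command=root.destroy).pack()
    return root

# build_gui().mainloop() would open the window in an interactive session
```

Keeping the construction inside a function, rather than at module level as in the report, makes the GUI easier to test and reuse.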
