
Liver Disease Prediction

Using ML

A Project Report under the esteemed guidance of


Sri. Dinesh Kumar Hirawat
(Project Guide)

Submitted
By

Allangi Suresh
(21173-CM-004)

Submitted to

GOVERNMENT POLYTECHNIC REBAKA, ANAKAPALLE


(Affiliated to the State Board of Technical Education and accredited
by the National Board of Accreditation) in partial fulfilment of the
requirements for the award of the
Diploma in Computer Engineering
2021–2024
GOVERNMENT POLYTECHNIC REBAKA,
ANAKAPALLE

DIPLOMA IN COMPUTER ENGINEERING


2021–2024

BONAFIDE CERTIFICATE

This is to certify that the project work entitled “Liver Cancer Prediction
Using Machine Learning” is the bonafide record of work done by Mr./Ms.
________________________________________ bearing Pin No.
_______________ of the final year, along with batch mates, submitted
in partial fulfilment of the requirements for the award of the Diploma in
Computer Engineering to the State Board of Technical Education and
Training. The results embodied in this project report have not been
submitted to any other Board, University, or Institute for the award of a
diploma.

PROJECT GUIDE HEAD OF THE DEPARTMENT

EXTERNAL EXAMINER
Table of Contents

1. Abstract…………………………..……………………………………………1
2. Acknowledgement………..………………………………………………2
3. Introduction………..…………………………………………………….3-7
3.1. Background and Motivation
3.2. Objectives
3.3. Overview of the Project
4. Literature Review……..……………………………………………....8-9
4.1. Introduction
4.2. Early Approaches
4.3. Machine Learning Techniques
4.4. Feature Selection and Importance
4.5. Data Imbalance and Bias
4.6. Performance Evaluation
4.7. Challenges and Future Directions
4.8. Conclusion
5. Dataset Description…………………………………………………10-11
5.1. Demographic Information
5.2. Laboratory Test Results
5.3. Histopathological Data
5.4. Risk Factors
6. Methodology………….………………………………………………12-13
6.1. Data Collection and Acquisition
6.2. Data Preprocessing
6.3. Model Development
6.4. Model Evaluation
6.5. Model Validation
6.6. Clinical Translation
6.7. Continuous Improvement and Research
7. Languages, Technologies, and Machine Learning Tools
Used…14-15
7.1. Python
7.2. Flask
7.3. Scikit-learn
7.4. NumPy and Pandas
7.5. Matplotlib and Seaborn
7.6. Pickle
7.7. Machine Learning Algorithms
8. Requirement Analysis………………………….………..………16-18
8.1. Objective
8.2. Functional Requirements
8.3. Non-Functional Requirements
8.4. Modules Used
8.5. Installation Instructions
9. Experimental Setup………………………….……………..…….19-20
9.1. Data Splitting
9.2. Model Training
9.3. Hyperparameter Tuning
10. Implementation..…………………………………………...21-30
10.1. Data Visualization
10.2. Data Pre-Processing
10.3. Algorithm
10.4. Model Training
10.5. Model Selection
10.6. Model Testing
10.7. Model Evaluation
11. Project Code……………………….…………………….……31-47
11.1. Backend of the Project
11.2. Frontend of the Project
11.3. Connection to the frontend and backend
12. References……………………………………………………………48
1. ABSTRACT

Liver cancer is a significant health concern worldwide with limited
treatment options, emphasizing the importance of early detection for
improved patient outcomes. In this study, we developed predictive
models using machine learning algorithms to identify individuals at risk
of liver cancer based on clinical and demographic features. The dataset
encompassed diverse attributes such as age, gender, liver function
tests, imaging findings, and medical history. We employed logistic
regression, support vector machine, decision tree, random forest, and
naive Bayes algorithms for model development and evaluation.
Among the algorithms tested, the Random Forest Classifier
demonstrated the highest accuracy of 76% in predicting liver cancer
risk. Random Forest excelled in capturing complex data relationships
and handling nonlinearity, contributing to its superior performance. The
model exhibited promising sensitivity and specificity, indicating its
potential as a valuable tool for early detection and risk assessment of
liver cancer.
Future steps include refining the predictive model by
incorporating additional data sources, such as genetic markers and
environmental factors, to enhance its accuracy and robustness.
Furthermore, efforts will focus on external validation of the model
using independent datasets and real-world clinical validation to assess
its clinical utility and feasibility for integration into healthcare practice.
In conclusion, the development of predictive models for liver
cancer using machine learning techniques holds promise for improving
early detection and risk assessment, thereby facilitating timely
interventions and improving patient outcomes. The Random Forest
Classifier emerged as a promising algorithm for liver cancer prediction,
with notable accuracy and potential for clinical translation. This
research contributes to the advancement of liver cancer diagnosis and
management, with implications for personalized medicine and public
health interventions.

Page | 1
2. ACKNOWLEDGMENTS

We would like to express our sincere gratitude to Mr. Girish Reddy, our
mentor and guide, whose invaluable support and expertise have been
instrumental in the successful completion of this project on liver cancer
prediction. Mr. Girish Reddy's guidance, encouragement, and insightful
feedback have inspired us throughout the project journey, helping us
navigate challenges and achieve our goals effectively.

We would also like to extend our appreciation to our colleagues and peers
for their collaboration, encouragement, and constructive discussions, which
have enriched our understanding and contributed to the project's progress.

Furthermore, we are grateful to the researchers, healthcare
professionals, and institutions whose work and contributions in the field of
liver cancer diagnosis and prediction have provided the foundation for our
project. Their dedication to advancing medical science and improving
patient outcomes serves as a constant source of inspiration.

Last but not least, we would like to thank our families and friends for their
unwavering support, patience, and encouragement throughout this
endeavour. Their understanding and encouragement have been invaluable
in sustaining our motivation and drive to pursue excellence.

Thank you to everyone who has played a part in this project. Your support
and collaboration have been indispensable, and we are deeply grateful for
the opportunity to work together towards a common goal.

3. INTRODUCTION

3.1. Background and Motivation:


Liver disease represents a significant public health challenge
worldwide, affecting millions of individuals and imposing a substantial
burden on healthcare systems. The liver plays a crucial role in various
physiological processes, including metabolism, detoxification, and nutrient
storage. Therefore, dysfunction or damage to the liver can have severe
consequences for overall health and well-being.

Several factors contribute to the prevalence and complexity of liver
disease, including:

1. Multiple Etiologies: Liver disease can arise from a diverse array of
causes, including viral infections (e.g., hepatitis B and C), excessive alcohol
consumption, non-alcoholic fatty liver disease (NAFLD), autoimmune
conditions, genetic disorders, and metabolic syndromes.

2. Silent Progression: In many cases, liver disease progresses silently,
with patients experiencing few or no symptoms until the condition reaches
advanced stages. As a result, diagnosis may occur late, leading to missed
opportunities for early intervention and treatment.

3. Limited Screening Tools: Traditional screening methods for liver
disease, such as liver function tests and imaging studies, may lack
sensitivity and specificity for early detection. Additionally, these tests may
be costly, invasive, or inaccessible to certain populations, further hindering
timely diagnosis.

4. Treatment Challenges: Treatment options for liver disease vary
depending on the underlying cause and severity of the condition. While
some liver diseases may respond well to lifestyle modifications,
medications, or surgical interventions, others may progress rapidly,
necessitating liver transplantation as the only viable treatment option.

5. Public Health Impact: Liver disease poses a significant economic and
social burden, resulting in increased healthcare expenditures, reduced
productivity, and diminished quality of life for affected individuals and
their families. Moreover, disparities in healthcare access and outcomes
exacerbate the impact of liver disease on vulnerable populations,
underscoring the need for equitable and effective diagnostic strategies.

Against this backdrop, the motivation for the liver disease prediction
project is clear:

- Early Detection: The project aims to develop accurate and reliable
predictive models that can identify individuals at risk of liver disease at an
early stage, allowing for timely intervention and treatment.

- Improved Patient Outcomes: By facilitating early diagnosis and
intervention, the project seeks to improve patient outcomes, reduce
disease progression, and minimize the risk of complications associated
with advanced liver disease.

- Resource Optimization: Early detection of liver disease can lead to
more efficient allocation of healthcare resources, including targeted
screening, preventive interventions, and personalized treatment plans,
thereby reducing healthcare costs and improving resource utilization.

- Advancing Medical Science: Through the integration of advanced
machine learning techniques, clinical expertise, and interdisciplinary
collaboration, the project contributes to the advancement of medical
science and technology in the field of liver disease diagnosis and
management.

- Addressing Health Disparities: By developing innovative
diagnostic tools that are accessible, affordable, and culturally sensitive, the
project aims to address health disparities and improve healthcare equity
for individuals affected by liver disease, particularly those from
underserved communities.

3.2. Objectives:
The objectives of the liver disease prediction project are multifaceted
and aim to address key challenges in the diagnosis and management of
liver disease. Here are the primary objectives:

1. Develop Accurate Predictive Models: The foremost objective is to
develop robust machine learning models capable of accurately predicting
the risk of liver disease based on clinical and demographic data. These
models should effectively identify individuals at risk of developing liver
disease, enabling early intervention and treatment.

2. Incorporate Diverse Data Sources: The project aims to incorporate
diverse data sources, including patient medical history, laboratory test
results, imaging studies, lifestyle factors, and genetic information, to
enhance the predictive capabilities of the models. By leveraging multiple
data modalities, the models can capture the complex interplay of factors
contributing to liver disease risk.

3. Optimize Model Performance: Another objective is to optimize the
performance of the predictive models by employing state-of-the-art
machine learning techniques, feature engineering strategies, and model
tuning approaches. This involves iterative refinement of the models to
improve accuracy, sensitivity, specificity, and other performance metrics.

4. Address Data Imbalance and Bias: The project seeks to address
challenges related to data imbalance and bias in the training datasets,
particularly in the context of liver disease prediction. This involves
employing techniques such as oversampling, undersampling, and bias
correction to mitigate the impact of skewed class distributions and
demographic disparities on model performance.

5. Enhance Model Interpretability: In addition to predictive accuracy, the
project aims to enhance the interpretability of the predictive models,
allowing healthcare professionals to understand the factors driving the
predictions. This involves conducting feature importance analysis,
visualization techniques, and model explainability methods to elucidate the
underlying mechanisms of liver disease risk.

6. Validate Models on Independent Datasets: To ensure the
generalizability and robustness of the predictive models, the project aims
to validate them on independent datasets from diverse populations and
healthcare settings. This involves conducting cross-validation, external
validation, and prospective validation studies to assess model performance
in real-world clinical settings.

7. Translate Research Findings into Clinical Practice: Ultimately, the
project aims to translate research findings into actionable insights that can
be integrated into clinical practice. This involves collaborating with
healthcare providers, policymakers, and other stakeholders to implement
the predictive models as part of routine clinical care, thereby improving the
early detection and management of liver disease.

8. Contribute to Medical Research and Knowledge: The project aims to
contribute to the advancement of medical research and knowledge in the
field of liver disease diagnosis and management. This involves
disseminating research findings through publications, presentations, and
open-access repositories, as well as fostering collaboration with other
research groups and institutions.

3.3. Overview of the Project:


The liver disease prediction project is a comprehensive initiative
aimed at leveraging machine learning techniques to develop accurate and
reliable predictive models for early detection and risk assessment of liver
disease. The project encompasses several key components, including data
collection, preprocessing, model development, validation, and translation
into clinical practice. Here is an overview of the project:
1. Data Collection: The project begins with the acquisition of diverse
datasets containing clinical and demographic information from individuals
at risk of liver disease. These datasets may include patient electronic health
records, laboratory test results, medical imaging studies, lifestyle factors,
and genetic information. Data collection may involve collaboration with
healthcare institutions, research organizations, and public health agencies.
2. Data Preprocessing: Once the datasets are collected, they undergo
preprocessing to ensure data quality, consistency, and compatibility for
model training. This involves tasks such as data cleaning, missing value
imputation, feature encoding, and normalization to prepare the data for
analysis.
3. Feature Selection and Engineering: Next, feature selection and
engineering techniques are applied to identify the most relevant predictors
of liver disease and enhance the predictive power of the models. This may
involve statistical analysis, domain knowledge integration, and
dimensionality reduction methods to extract informative features from the
data.
4. Model Development: With the preprocessed data and engineered
features, various machine learning algorithms are employed to develop
predictive models for liver disease. These algorithms may include logistic
regression, decision trees, random forests, support vector machines, neural
networks, and ensemble methods. Multiple models are trained and
evaluated to identify the most effective approaches for predicting liver
disease risk.
5. Model Evaluation: The performance of the developed models is
evaluated using appropriate metrics such as accuracy, sensitivity,
specificity, precision, recall, and area under the receiver operating
characteristic curve (AUC-ROC). Model evaluation involves cross-
validation, external validation on independent datasets, and comparison
against baseline models to assess predictive performance and
generalizability.
6. Validation and Clinical Translation: Validated models are further
assessed in real-world clinical settings to evaluate their effectiveness in
identifying individuals at risk of liver disease. This involves collaboration
with healthcare providers and stakeholders to integrate the predictive
models into clinical practice, develop decision support tools, and evaluate
their impact on patient outcomes and healthcare delivery.
7. Continuous Improvement and Research: The liver disease prediction
project is an iterative process, with ongoing efforts to refine and improve
the predictive models based on feedback from clinical implementation and
new research findings. Continuous collaboration with medical
professionals, researchers, and data scientists ensures the project's
relevance, accuracy, and effectiveness in addressing the evolving challenges
of liver disease diagnosis and management.

In summary, the liver disease prediction project is a multidisciplinary
endeavor that combines expertise in medicine, data science, and healthcare
delivery to develop innovative solutions for early detection and risk
assessment of liver disease. By harnessing the power of machine learning
and predictive analytics, the project aims to improve patient outcomes,
reduce healthcare costs, and advance scientific understanding of liver
disease pathology and risk factors.

4. LITERATURE REVIEW

4.1. Introduction:
Liver disease is a global health burden affecting millions of individuals
worldwide. Early detection and accurate prediction of liver disease play a
crucial role in effective treatment and management. In recent years, machine
learning techniques have gained attention for their potential in predicting liver
disease based on clinical and demographic data. This literature survey aims to
review the state-of-the-art research in liver disease prediction using machine
learning methods.

4.2. Early Approaches:


Early studies focused on traditional statistical methods for liver disease
prediction, such as logistic regression and decision trees. These methods
demonstrated moderate success but lacked the ability to handle complex data
relationships and nonlinearities.
4.3. Machine Learning Techniques:
Recent research has explored the use of various machine learning
algorithms for liver disease prediction, including:
- Support Vector Machines (SVM): SVMs have been widely applied for
binary classification tasks in liver disease prediction. They are effective in
handling high-dimensional data and can capture nonlinear decision boundaries.
- Random Forests: Ensemble methods like random forests have shown
promise in liver disease prediction by combining multiple decision trees to
improve predictive accuracy and robustness.
- Neural Networks: Deep learning techniques, particularly neural networks,
have gained popularity for their ability to automatically extract features from
raw data and learn complex patterns. They have been applied to liver disease
prediction tasks with promising results.
- Gradient Boosting: Gradient boosting algorithms, such as XGBoost and
LightGBM, have been increasingly used for liver disease prediction due to their
efficiency and high predictive performance.

4.4. Feature Selection and Importance:
Feature selection techniques have been employed to identify the most
relevant predictors of liver disease. Methods such as recursive feature
elimination and feature importance analysis help prioritize informative features
and improve model interpretability.
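As a concrete illustration of the two techniques named above, the sketch below ranks features with a random forest's importance scores and applies recursive feature elimination; the synthetic data is a stand-in for the liver dataset's clinical features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in: 6 features, only 3 of them informative
X, y = make_classification(n_samples=200, n_features=6,
                           n_informative=3, random_state=0)

# Feature importance analysis via a fitted random forest
forest = RandomForestClassifier(random_state=0).fit(X, y)
ranked = sorted(enumerate(forest.feature_importances_),
                key=lambda t: t[1], reverse=True)
print("most important feature index:", ranked[0][0])

# Recursive feature elimination, keeping the top 3 predictors
rfe = RFE(RandomForestClassifier(random_state=0), n_features_to_select=3)
rfe.fit(X, y)
print("selected feature mask:", rfe.support_.tolist())
```

On a real liver dataset, the surviving columns would then be inspected with domain knowledge before being fed to the final model.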
4.5. Data Imbalance and Bias:
Addressing data imbalance and bias is crucial in liver disease prediction,
as datasets often exhibit skewed class distributions and demographic disparities.
Techniques such as oversampling, undersampling, and bias correction
algorithms help mitigate these challenges and improve model generalization.
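One simple way to apply the oversampling idea described above is random oversampling of the minority class with scikit-learn's `resample` utility; the toy 90/10 split below is a stand-in for a real imbalanced liver dataset:

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced dataset: 90 negative cases, 10 positive
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature": rng.normal(size=100),
    "target": [0] * 90 + [1] * 10,
})

majority = df[df["target"] == 0]
minority = df[df["target"] == 1]

# Randomly oversample the minority class (with replacement)
# until both classes are the same size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])

print(sorted(balanced["target"].value_counts().to_dict().items()))
# [(0, 90), (1, 90)]
```

Oversampling should be applied to the training split only, never before the train/test split, or the evaluation becomes optimistic.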
4.6. Performance Evaluation:
Performance evaluation metrics such as accuracy, sensitivity, specificity,
precision, recall, and area under the ROC curve (AUC-ROC) are commonly
used to assess the predictive performance of liver disease models. Cross-
validation and external validation on independent datasets are essential for
validating model robustness and generalization.
4.7. Challenges and Future Directions:
Despite significant advancements, several challenges remain in liver
disease prediction, including data heterogeneity, model interpretability, and
clinical applicability. Future research directions may involve integrating
multimodal data sources, enhancing model explainability, and conducting real-
world validation studies to translate research findings into clinical practice.
4.8. Conclusion:
In conclusion, machine learning techniques offer promising avenues for
liver disease prediction, providing valuable insights for early diagnosis and
personalized healthcare interventions. Continued research efforts in this field
are essential to develop reliable and interpretable models for improving patient
outcomes and reducing the global burden of liver disease.

This literature survey provides a comprehensive overview of the current
landscape of liver disease prediction research using machine learning
techniques, highlighting key methodologies, challenges, and future directions.

5. Dataset Description

The liver disease prediction project relies on a comprehensive dataset
containing clinical and demographic information from individuals at risk of
liver disease. The dataset is crucial for training and evaluating predictive
models to accurately identify individuals who may be predisposed to liver
disease. Here is a description of the key attributes included in the dataset:

5.1. Demographic Information:


- Age: The age of the patient at the time of data collection.
- Gender: The gender of the patient (male/female).
- BMI (Body Mass Index): A measure of body fat based on height and
weight.
5.2. Laboratory Test Results:
- Total Bilirubin: Serum level of total bilirubin, a marker of liver function.
- Direct Bilirubin: Serum level of direct bilirubin, a component of total
bilirubin indicative of liver dysfunction.
- Alkaline Phosphatase: Serum level of alkaline phosphatase, an enzyme
associated with liver and bone health.
- Alanine Aminotransferase (ALT): Serum level of ALT, an enzyme
found primarily in the liver that may indicate liver damage or
inflammation.
- Aspartate Aminotransferase (AST): Serum level of AST, an enzyme
found in various tissues, including the liver, heart, and muscles. Elevated
AST levels may indicate liver damage or other health conditions.
- Total Proteins: Serum level of total proteins, including albumin and
globulin, which are important for liver function and overall health.
5.3. Histopathological Data:
- Liver Biopsy Results: Histopathological findings from liver biopsy
samples, including inflammation, fibrosis, cirrhosis, and other pathological
features indicative of liver disease severity.

5.4. Risk Factors:
- Alcohol Consumption: Self-reported alcohol consumption habits,
including frequency and quantity of alcohol intake.
- Smoking Status: Self-reported smoking status (current smoker, former
smoker, non-smoker).

The dataset may also include additional variables and auxiliary
information relevant to liver disease prediction. Data preprocessing
techniques, such as handling missing values, outlier detection, and feature
scaling, are applied to prepare the dataset for model training and
evaluation. It is essential to ensure data privacy and confidentiality by
anonymizing or de-identifying sensitive patient information in compliance
with ethical and regulatory guidelines.
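As an illustration of a first look at such a dataset, the sketch below builds a tiny in-memory sample (the column names and the 1/2 label coding are assumptions, not the exact schema) and runs the structural checks typically done before preprocessing:

```python
import numpy as np
import pandas as pd

# Tiny illustrative sample mirroring the attributes described above
df = pd.DataFrame({
    "Age": [45, 62, 33, 58],
    "Gender": ["Male", "Female", "Male", "Male"],
    "Total_Bilirubin": [0.7, 10.9, 1.0, np.nan],
    "Alkaline_Phosphotase": [187, 699, 210, 182],
    "Dataset": [1, 1, 2, 1],  # assumed coding: 1 = disease, 2 = no disease
})

print(df.shape)                          # (4, 5)
print(int(df.isnull().sum().sum()))      # 1 missing value to handle
print(df["Gender"].value_counts().to_dict())
```

These quick checks (shape, missing values, class balance of categorical fields) guide the preprocessing choices described in the methodology.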

6. Methodology

The liver disease prediction project follows a systematic methodology that
encompasses data preprocessing, model development, evaluation, validation,
and clinical translation. Here is an overview of the key steps involved in the
methodology:
6.1. Data Collection and Acquisition:
- Gather diverse datasets containing clinical and demographic information
from individuals at risk of liver disease. Collaborate with healthcare institutions,
research organizations, and public health agencies to obtain access to relevant
data sources.
- Ensure compliance with ethical and regulatory guidelines for data privacy,
confidentiality, and informed consent.
6.2. Data Preprocessing:
- Cleanse the raw data by addressing missing values, outliers, and
inconsistencies.
- Perform feature engineering to extract informative features, including
transformation, scaling, and encoding categorical variables.
- Split the dataset into training, validation, and test sets to facilitate model
development and evaluation.
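The preprocessing steps above can be sketched as follows; the column names and values are illustrative, and the scaler is fitted on the training split only to avoid data leakage:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative frame; column names are assumptions
df = pd.DataFrame({
    "Age": [45, 62, 33, 58, 41, 70, 29, 55],
    "Gender": ["Male", "Female", "Male", "Male",
               "Female", "Male", "Female", "Male"],
    "Total_Bilirubin": [0.7, 10.9, 1.0, np.nan, 0.9, 7.3, 0.6, 1.1],
    "target": [0, 1, 0, 1, 0, 1, 0, 1],
})

# 1. Impute missing numeric values with the column mean
mean_tb = df["Total_Bilirubin"].mean()
df["Total_Bilirubin"] = df["Total_Bilirubin"].fillna(mean_tb)

# 2. Encode the categorical Gender column as 0/1
df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})

# 3. Split features/labels, then train/test (stratified on the label)
X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# 4. Scale features; fit on the training set only to avoid leakage
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.shape, X_test_s.shape)  # (6, 3) (2, 3)
```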
6.3. Model Development:
- Select appropriate machine learning algorithms based on the nature of the
predictive task (e.g., binary classification).
- Train multiple models using the training dataset, employing techniques such
as logistic regression, decision trees, random forests, support vector machines,
neural networks, and ensemble methods.
- Tune hyperparameters to optimize model performance, using techniques
such as grid search, random search, or Bayesian optimization.
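A minimal sketch of the training and tuning workflow described above, using synthetic stand-in data and two of the candidate algorithms:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data stands in for the preprocessed liver dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Train several candidate models, as the methodology describes
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", round(model.score(X_test, y_test), 3))

# Hyperparameter tuning for the random forest via grid search
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3, scoring="accuracy")
grid.fit(X_train, y_train)
print("best parameters:", grid.best_params_)
```

Random search or Bayesian optimization would replace `GridSearchCV` when the parameter space is too large to enumerate.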
6.4. Model Evaluation:
- Assess the performance of the trained models using evaluation metrics such
as accuracy and area under the receiver operating characteristic curve.

- Conduct cross-validation to estimate the generalization performance of the
models and identify potential sources of overfitting or underfitting.
- Compare the performance of different models and select the best-performing
model(s) for further evaluation.
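The metrics listed above can be computed with scikit-learn as sketched below; sensitivity and specificity are derived from the confusion matrix, and cross-validation estimates generalization (synthetic data stands in for the liver dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             roc_auc_score)
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

clf = RandomForestClassifier(random_state=1).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Accuracy and AUC-ROC on the held-out test set
print("accuracy:", round(accuracy_score(y_test, y_pred), 3))
print("auc-roc:", round(
    roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]), 3))

# Sensitivity (recall on positives) and specificity from the
# confusion matrix counts
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("sensitivity:", round(tp / (tp + fn), 3))
print("specificity:", round(tn / (tn + fp), 3))

# 5-fold cross-validation to estimate generalization performance
scores = cross_val_score(clf, X, y, cv=5)
print("cv mean accuracy:", round(scores.mean(), 3))
```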
6.5. Model Validation:
- Validate the selected model(s) on independent datasets or through external
validation studies to assess their robustness and generalizability.
- Collaborate with healthcare providers and stakeholders to evaluate the
clinical utility and real-world effectiveness of the predictive models in
identifying individuals at risk of liver disease.
6.6. Clinical Translation:
- Integrate the validated predictive models into clinical practice by developing
decision support tools, electronic health record (EHR) systems, or mobile
applications.
- Provide training and education to healthcare professionals on the use of the
predictive models for early detection and risk assessment of liver disease.
- Monitor the impact of the predictive models on patient outcomes, healthcare
delivery, and resource utilization, and iterate on the models based on feedback
and performance metrics.
6.7. Continuous Improvement and Research:
- Continuously monitor and update the predictive models based on new data,
research findings, and emerging technologies.
- Foster collaboration with medical professionals, researchers, and data
scientists to advance scientific understanding of liver disease pathology, risk
factors, and predictive modeling techniques.
- Disseminate research findings through publications, presentations, and
knowledge sharing platforms to contribute to the broader scientific community
and promote further research in the field.

By following this methodology, the liver disease prediction project aims to
develop accurate, reliable, and clinically relevant predictive models for early
detection and risk assessment of liver disease, ultimately improving patient
outcomes and advancing medical science in the field.

7. Languages, Technologies, and Machine Learning
Tools Used
7.1. Python:
Python served as the foundational programming language for the
project, offering versatility, extensive libraries, and ease of use for various
tasks including data preprocessing, model training, and web development.

7.2. Flask:
Flask, a lightweight web framework for Python, was instrumental in
constructing the project's website. Flask facilitated URL routing, HTTP
request handling, and template rendering, enabling seamless integration of
machine learning functionalities into the web application.
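A minimal sketch of how such a Flask route might look; the stub classifier and the `/predict` endpoint name are assumptions standing in for the project's actual trained model and routes:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stub standing in for the project's trained classifier; the real app
# would load a fitted scikit-learn model (e.g. from a pickle file)
def predict_stub(features):
    return int(sum(features) > 0)  # placeholder decision rule

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body such as {"features": [age, bilirubin, ...]}
    features = request.get_json()["features"]
    return jsonify({"prediction": predict_stub(features)})

# app.run(debug=True)  # uncomment to serve the app locally
```

In the real application the template-rendered HTML form would post its fields to a route like this, and the response would be rendered back to the user.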

7.3. Scikit-learn:
Scikit-learn, a leading machine learning library for Python, played a
pivotal role in training the predictive model for Liver cancer classification.
It provided a wide array of algorithms, including Support Vector Machines
(SVM), Logistic Regression (LR), Decision Trees (DT), and Random Forests
(RF), allowing for comprehensive exploration and selection of the most
suitable model for the task at hand.

7.4. NumPy and Pandas:


NumPy and Pandas were indispensable libraries utilized for data
manipulation and preprocessing. NumPy facilitated efficient numerical
operations and handling of multi-dimensional arrays, while Pandas
provided powerful data structures and functions for data analysis and
manipulation, ensuring the Liver cancer dataset was prepared
appropriately for model training.

7.5. Matplotlib and Seaborn:


Matplotlib and Seaborn, prominent Python libraries for data
visualization, enabled the creation of insightful visualizations to enhance
understanding of the Liver cancer dataset and model performance. These
libraries offered a diverse range of plotting functions and styles to
effectively represent data and results, aiding in interpretation and decision-
making.

7.6. Pickle:
Pickle, a Python module for object serialization, was utilized to save
the trained machine learning model to a file. This facilitated persistent
storage of the model, allowing for efficient loading within the web
application. Pickle ensured seamless integration of the trained model with
the Flask framework, enabling real-time predictions on user input.
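The save-and-reload cycle described above can be sketched as follows (file path and model are illustrative):

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train a small stand-in model on synthetic data
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Serialize the trained model to disk (path is illustrative)
path = os.path.join(tempfile.gettempdir(), "liver_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later (e.g. at web-app startup) the model is loaded back
with open(path, "rb") as f:
    loaded = pickle.load(f)

# The reloaded model makes identical predictions
assert (loaded.predict(X) == model.predict(X)).all()
```

Because pickle files execute code on load, a deployed app should only unpickle model files it created itself.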

7.7. Machine Learning Algorithms:


Various machine learning algorithms implemented through
Scikit-learn, such as SVM, LR, DT, RF, and NB (Naive Bayes), were
instrumental in building the
predictive model for Liver cancer classification. These algorithms
underwent rigorous training and evaluation processes, ensuring the
selection of an optimal model with high accuracy and robust performance.

By leveraging these languages, technologies, and machine learning
tools, the project successfully developed a user-friendly web platform for
Liver cancer prediction. This platform provides users with an intuitive
interface to input data, obtain accurate predictions, and gain valuable
insights into Liver cancer diagnosis, ultimately contributing to early
detection and improved patient outcomes.

8. Requirement Analysis

8.1. Objective:

The primary objective of the Liver cancer prediction system is to
develop a robust machine learning model capable of accurately classifying
patients as being at risk of liver cancer or not, based on relevant clinical,
demographic, and laboratory features. The system should provide a
user-friendly interface for inputting data, obtaining predictions, and
visualizing results, thereby aiding in early detection and clinical
decision-making.

8.2. Functional Requirements:

1. Input Interface: The system should include a user-friendly interface for
users to input relevant data features, such as age, total bilirubin, alkaline
phosphatase, and the other attributes described in the dataset section.

2. Prediction Module: The system should incorporate a machine learning
module capable of accurately predicting the likelihood of liver cancer
based on the input data features.

3. Visualization: The system should provide visualizations of the
prediction results, including but not limited to, classification results,
feature importance, and performance metrics.

4. Model Persistence: The system should allow for the trained machine
learning model to be persisted and loaded efficiently for real-time
predictions.

5. Integration with Web Framework: The system should be integrated
with a web framework (e.g., Flask) to deploy the predictive model as a web
application accessible via a web browser.

8.3. Non-Functional Requirements:

1. Accuracy: The predictive model should achieve high accuracy in
classifying patients as at risk of liver cancer or not, ensuring reliable
predictions.

2. Usability: The system should be intuitive and easy to use, catering to
users with varying levels of technical expertise.

3. Scalability: The system should be scalable to accommodate an
increasing volume of data and users, ensuring seamless performance.

4. Security: The system should adhere to security best practices to
safeguard sensitive patient data and ensure user privacy.

5. Compatibility: The system should be compatible with various operating


systems and web browsers, ensuring broad accessibility.

8.4. Modules Used:

1. Flask: For building the web application interface.


2. Scikit-learn: For implementing the machine learning model.
3. NumPy and Pandas: For data manipulation and preprocessing.
4. Matplotlib and Seaborn: For data visualization.
5. Pickle: For model persistence.

8.5. Installation Instructions:

To install Python 3.12.1:


1. Visit the official Python website (https://www.python.org/downloads/).
2. Download the Python 3.12.1 installer suitable for your operating system
(Windows, macOS, or Linux).
3. Run the installer and follow the on-screen instructions to complete the
installation process.
4. Ensure that the "Add Python 3.12.1 to PATH" option is selected during
installation to make Python accessible from the command line.

To install pip for Windows 11:


1. Open a web browser and navigate to the official Python website
(https://www.python.org/downloads/).
2. Download the Python installer suitable for your Windows 11
architecture (32-bit or 64-bit).
3. Run the installer and ensure that the "Add Python to PATH" option is
selected during installation.
4. Open the Command Prompt by searching for "cmd" in the Windows
search bar.
5. Enter the following command to verify that Python and pip are installed
properly:
```
python --version
pip --version
```
6. If Python and pip are installed correctly, you can use pip to install
additional packages as needed for your project. For example:
```
pip install flask scikit-learn numpy pandas matplotlib seaborn
```
This command installs the required modules for the Liver cancer
prediction project, including Flask for web development and Scikit-learn,
NumPy, Pandas, Matplotlib, and Seaborn for machine learning and data
analysis functionalities.

By following these installation instructions, you can set up the required


environment and modules for developing the Liver cancer prediction
system on Windows 11.

9. Experimental Setup

9.1. Data Splitting:

Data splitting is a critical step in machine learning model


development to ensure proper evaluation of model performance. In this
project, the dataset of patient records containing liver function test results
will be divided into two subsets: a training set and a testing set.

1. Training Set: The majority of the dataset (e.g., 70% or 80%) will be
allocated to the training set. This portion of the data will be used to train
the machine learning models.

2. Testing Set: The remaining portion of the dataset (e.g., 30% or 20%)
will be reserved for the testing set. This independent subset of the data will
be used to evaluate the trained models' performance and assess their
generalization ability on unseen data.
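The split described above can be sketched with scikit-learn's `train_test_split`; the synthetic arrays and the 70/30 class ratio below are illustrative stand-ins for the liver dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 3 features, imbalanced labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 70 + [2] * 30)

# 80/20 split; stratify=y keeps the class ratio the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)  # -> (80, 3) (20, 3)
```

Stratification matters here because the liver dataset is imbalanced; without it, a random split can leave the testing set with a distorted class ratio.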

9.2. Model Training:

After data splitting, the machine learning models will be trained


using the training set. The following steps will be performed during the
model training phase:

1. Initialization: Initialize the selected machine learning algorithms (e.g.,


Support Vector Machines, Logistic Regression, Decision Trees, etc.) with
default hyperparameters.

2. Training: Fit the initialized models to the training data, allowing them to
learn patterns and relationships between input features and target labels
(liver disease present or absent).

3. Evaluation: Assess the performance of the trained models on the


training set using appropriate evaluation metrics (e.g., accuracy, precision,
recall, F1-score, etc.) to gauge their initial performance.

9.3. Hyperparameter Tuning:

Hyperparameter tuning is essential for optimizing the performance


of machine learning models and fine-tuning their settings to achieve the
best results. In this project, hyperparameter tuning will be performed using
techniques such as grid search or random search to explore the
hyperparameter space and identify the optimal combination of
hyperparameters for each model.

1. Grid Search: Define a grid of hyperparameter values for each model,


specifying the range of values to explore. Perform an exhaustive search
over the grid, training and evaluating the model with each combination of
hyperparameters. Select the combination that yields the highest
performance on the validation set.

2. Random Search: Randomly sample hyperparameter values from


predefined distributions for each model. Train and evaluate the model with
randomly selected combinations of hyperparameters. Iterate this process
for a specified number of iterations or until convergence to identify the
best-performing combination.

By following these experimental setup steps, we can systematically


split the data, train the machine learning models, and optimize their
hyperparameters to develop robust predictive models for Liver cancer
classification. This approach ensures reliable model evaluation and
selection, ultimately contributing to the project's success in accurately
predicting Liver cancer instances.
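A minimal grid-search sketch of the procedure above; the synthetic dataset and the small parameter grid are illustrative, not the project's actual settings:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Exhaustive search: every combination is trained and scored with 5-fold CV.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

`RandomizedSearchCV` offers the same interface but samples a fixed number of combinations rather than trying them all, which is cheaper on large grids.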

10. IMPLEMENTATION

10.1. Data Visualization:

Data visualization plays a crucial role in understanding patterns,


trends, and relationships within the Liver cancer dataset, as well as in
evaluating the performance of machine learning models. Here are some key
visualizations that can be generated:

1. Histograms and Density Plots: Visualizing the distribution of
individual features (e.g., age, total bilirubin, albumin, etc.) using histograms
or density plots can provide insights into their underlying characteristics
and help identify potential differences between diseased and healthy
samples.

2. Scatter Plots: Pairwise scatter plots of different feature combinations
can reveal relationships and correlations between features. Scatter plots
colored by class labels (disease or no disease) can help visualize separability
between classes and identify informative feature combinations.

3. Box Plots: Box plots can be used to visualize the distribution of a feature
across different classes (diseased vs. healthy). Box plots provide
information about the median, quartiles, and outliers, allowing for
comparisons between classes.

4. Correlation Heatmap: A correlation heatmap can illustrate the pairwise


correlations between features in the dataset. High correlations may
indicate redundant or collinear features, which can affect the performance
of machine learning models.

5. Confusion Matrix: A confusion matrix visualizes the performance of a


classification model by displaying the counts of true positives, false
positives, true negatives, and false negatives. It provides a detailed
breakdown of the model's predictions and can be used to calculate various
performance metrics such as accuracy, precision, recall, and F1-score.

6. Feature Importance Plot: Feature importance plots, such as bar plots


or violin plots, can visualize the importance of different features in
predicting Liver cancer outcomes. Feature importance scores can be
calculated using techniques like permutation importance or model-specific
methods (e.g., decision tree feature importance).

7. Model Comparison Plot: Visualizing the performance of multiple


machine learning models using bar plots or box plots can facilitate model
comparison. This allows researchers to assess the relative strengths and
weaknesses of different algorithms and select the most suitable model for
Liver cancer prediction.

By leveraging these data visualization techniques, researchers can


gain valuable insights into the Liver cancer dataset, evaluate the
performance of machine learning models, and communicate findings
effectively to stakeholders. Effective data visualization enhances
understanding, aids decision-making, and contributes to the success of
Liver cancer prediction efforts.

10.2. Data Pre-Processing:

Data preprocessing is a crucial step in machine learning model


development, particularly for tasks like Liver cancer prediction. It involves
cleaning, transforming, and preparing the raw dataset to make it suitable
for training machine learning models. Here are some key steps involved in
data preprocessing for Liver cancer prediction:

1. Data Cleaning:
- Handling Missing Values: Identify and handle missing values in the
dataset. Options include imputation (e.g., replacing missing values with the
mean, median, or mode), deletion of rows or columns with missing values,
or using algorithms that can handle missing values directly.
- Handling Outliers: Detect and address outliers in the dataset.
Outliers can skew statistical analyses and model predictions. Techniques
such as Z-score normalization or winsorization can be used to handle
outliers.

2. Data Transformation:
- Feature Scaling: Scale the features to a similar range to ensure that
no single feature dominates the others. Common techniques include Min-
Max scaling and Standardization (Z-score normalization).
- Encoding Categorical Variables: Convert categorical variables into
numerical representations that can be used by machine learning
algorithms. This can be done using techniques such as one-hot encoding or
label encoding.
- Feature Engineering: Create new features or transform existing
features to capture additional information that may improve model
performance. This could involve deriving new features from existing ones
or applying mathematical transformations.

3. Feature Selection:
- Select the most relevant features that are informative for predicting
Liver cancer outcomes. Feature selection techniques such as univariate
feature selection, recursive feature elimination, or feature importance from
tree-based models can be used to identify the most important features.
- Dimensionality Reduction: Reduce the dimensionality of the dataset
by removing irrelevant or redundant features. Techniques such as Principal
Component Analysis (PCA) or Singular Value Decomposition (SVD) can be
used for dimensionality reduction.

4. Data Splitting:
- Split the preprocessed dataset into training and testing sets. The training
set is used to train the machine learning model, while the testing set is used
to evaluate its performance. Typically, a random split (e.g., 80% training,
20% testing) is used, ensuring that the distribution of classes is preserved
in both sets.

5. Normalization:
- Normalize the data to ensure that all features have a similar scale. This is
particularly important for distance-based algorithms like K-Nearest
Neighbors (KNN) or Support Vector Machines (SVM).

6. Handling Imbalanced Data (Optional):


- If the dataset is imbalanced (i.e., one class is significantly more prevalent
than the other), techniques such as oversampling, undersampling, or
generating synthetic samples (e.g., using SMOTE) can be applied to balance
the dataset.
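Steps 1-3 above can be sketched in pandas on a toy frame; the column names follow the liver dataset, but the values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [65, 62, 20, 84],
    "Total_Bilirubin": [0.7, 10.9, 1.1, None],  # one missing value
    "Gender": ["Female", "Male", "Male", "Female"],
})

# 1. Data cleaning: impute the missing value with the column median.
df["Total_Bilirubin"] = df["Total_Bilirubin"].fillna(df["Total_Bilirubin"].median())

# 2. Encoding: map the categorical Gender column to 0/1.
df["Gender"] = df["Gender"].map({"Female": 0, "Male": 1})

# 3. Scaling: Min-Max scale Age into [0, 1].
df["Age"] = (df["Age"] - df["Age"].min()) / (df["Age"].max() - df["Age"].min())

print(df)
```

This mirrors what the project notebook does with `fillna` and `map`; in a production pipeline the same steps are usually wrapped in a scikit-learn `Pipeline` so they are applied identically at prediction time.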

By performing these preprocessing steps, the raw dataset is


transformed into a clean, standardized, and informative format suitable for
training machine learning models for Liver cancer prediction. Effective data
preprocessing is essential for achieving accurate and reliable predictions,
ultimately contributing to improved healthcare outcomes.
10.3. Algorithm:

1. Support Vector Machines (SVM): SVM is a powerful supervised


learning algorithm used for classification tasks. It works by finding the
optimal hyperplane that best separates the classes in the feature space.
SVM can handle both linear and non-linear classification problems and is
known for its effectiveness in high-dimensional spaces.

2. Logistic Regression (LR): Despite its name, logistic regression is a


linear model used for binary classification tasks. It estimates the
probability that a given sample belongs to a particular class based on its
input features. Logistic regression is simple, interpretable, and well-suited
for problems with linear decision boundaries.

3. Decision Trees (DT): Decision trees are non-linear models that


partition the feature space into regions based on the values of the input
features. They make decisions by recursively splitting the data into subsets,
with each split optimizing a chosen criterion (e.g., Gini impurity or
information gain). Decision trees are easy to interpret and can capture
complex relationships in the data.

4. Random Forests (RF): Random forests are an ensemble learning


technique that combines multiple decision trees to improve predictive
performance and reduce overfitting. Each tree in the forest is trained on a
random subset of the data and a random subset of the features. Random
forests are robust, scalable, and effective for handling high-dimensional
data.

5. Naive Bayes (NB): Naive Bayes is a probabilistic classifier based on


Bayes' theorem with the assumption of independence between features.
Despite its simple assumption, Naive Bayes often performs well in practice,
especially with high-dimensional data like that found in Liver cancer
datasets. It's computationally efficient and can handle large datasets with
ease. Naive Bayes classifiers are particularly useful when interpretability
and speed are important considerations.

6. K-Nearest Neighbors (KNN): K-Nearest Neighbors is a non-parametric


algorithm used for both classification and regression tasks. In classification,
KNN predicts the class of a data point by a majority vote of its k nearest
neighbors in the feature space. KNN is straightforward to understand and
implement, making it a popular choice for beginners and as a baseline
model. It's also a lazy learner, meaning it doesn't require a training phase,
which can be advantageous for certain applications.

These are just a few examples of machine learning algorithms that


can be used for Liver cancer prediction. The choice of algorithm depends
on factors such as the complexity of the problem, the nature of the data,
computational resources, and the desired interpretability of the model. In
practice, it is often beneficial to experiment with multiple algorithms and
compare their performance to determine the most suitable approach for a
given dataset.
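The comparison suggested above can be sketched by fitting several of these classifiers on one split and collecting their accuracies; synthetic data stands in for the liver dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=1),
    "NB": GaussianNB(),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                                    # train
    scores[name] = accuracy_score(y_test, model.predict(X_test))   # evaluate
print(scores)
```

Because every model sees the same split, the resulting dictionary of accuracies is directly comparable, which is exactly what the project's bar chart of algorithm scores visualizes.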

10.4. Model training:


A training dataset is the data used to fit an ML model. It consists of
sample output data and the corresponding sets of input data that have an
influence on the output. During training, the input data is run through the
algorithm and the processed output is correlated against the sample
output; the result of this correlation is used to adjust the model.

This iterative process is called "model fitting". The quality of the
training and validation datasets is critical to the precision of the model.

Model training in machine learning is the process of feeding an ML
algorithm with data to help it identify and learn good values for all
attributes involved. The most common types of machine learning are
supervised and unsupervised learning.

Supervised learning is possible when the training data contains both


the input and output values. Each set of data that has the inputs and the
expected output is called a supervisory signal. The training is done based
on the deviation of the processed result from the documented result when
the inputs are fed into the model.

Unsupervised learning involves determining patterns in the data.


Additional data is then used to fit patterns or clusters. This is also an
iterative process that improves the accuracy based on the correlation to the
expected patterns or clusters. There is no reference output dataset in this
method.

10.5. Model Selection:

Model selection is a critical step in machine learning where the best-


performing model is chosen from a set of candidate models based on their
performance on a validation dataset. In the context of Liver cancer
prediction, selecting the most suitable model ensures accurate and reliable
predictions, contributing to improved healthcare outcomes. Here's a
detailed overview of the model selection process:

1. Evaluation Metrics:
- Before selecting a model, it's essential to define evaluation metrics
that reflect the desired performance criteria. Common metrics for binary
classification tasks like Liver cancer prediction include accuracy, precision,
recall, F1-score, and area under the ROC curve (AUC-ROC). These metrics
provide insights into different aspects of the model's performance, such as
its ability to correctly classify diseased and healthy instances, minimize
false positives, and balance precision and recall.

2. Cross-Validation (Optional):
- Cross-validation can be employed to assess the generalization
performance of different models and mitigate overfitting. In k-fold cross-
validation, the training dataset is divided into k subsets (folds), and the
model is trained and evaluated k times, each time using a different fold as
the validation set and the remaining folds as the training set. The average
performance across folds provides a more robust estimate of the model's
performance and helps identify models that generalize well to unseen data.
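The k-fold procedure described above is a one-liner with scikit-learn's `cross_val_score`; the synthetic data is an illustrative stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=250, n_features=6, random_state=0)

# 5-fold CV: each fold serves once as the validation set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(len(scores), scores.mean())
```

Reporting the mean (and standard deviation) of the five fold scores gives a more robust estimate than a single train/test split.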

3. Model Performance Comparison:


- Train multiple candidate models using the same training dataset
and evaluate their performance on the validation dataset or through cross-
validation. Compare the models' performance metrics, considering both
overall performance and trade-offs between different metrics.
Visualizations such as ROC curves, precision-recall curves, and confusion
matrices can aid in comparing the models' performance visually and
identifying strengths and weaknesses.

4. Consideration of Model Complexity:


- Evaluate the complexity of the candidate models and consider
trade-offs between model complexity and performance. While complex
models may achieve high accuracy on the training dataset, they may suffer
from overfitting and perform poorly on unseen data. Simpler models, on
the other hand, may generalize better but may not capture complex
patterns in the data as effectively. Strike a balance between model
complexity and performance based on the specific requirements of the
Liver cancer prediction task.

5. Interpretability:
- Consider the interpretability of the models, especially in healthcare
settings where interpretability and transparency are crucial for gaining
trust from healthcare professionals and patients. Simple models like
logistic regression or decision trees are often more interpretable than
complex models like neural networks or ensemble methods. Choose a
model that provides a good balance between performance and
interpretability, depending on the stakeholders' needs.

6. Domain Knowledge and Expertise:


- Incorporate domain knowledge and expertise into the model
selection process, leveraging insights from medical professionals and
researchers familiar with Liver cancer diagnosis and treatment. Consider
factors such as the relevance of features, clinical interpretability of
predictions, and alignment with established medical guidelines when
selecting the final model.

7. Final Model Selection:


- Based on the evaluation results, select the best-performing model
that meets the predefined performance criteria, considering factors such as
accuracy, interpretability, generalization performance, and domain-specific
requirements. Document the rationale behind the model selection process,
including the evaluation metrics, cross-validation results, and
considerations of model complexity and interpretability.

By following a systematic approach to model selection, researchers


and practitioners can identify the most suitable model for Liver cancer
prediction, ensuring accurate and reliable predictions that can positively
impact patient outcomes and healthcare decision-making.

10.6. Model Testing:

Model testing is a crucial phase in machine learning where the


performance of a trained model is assessed on unseen data to evaluate its
effectiveness in making predictions. In the context of liver disease
prediction, model testing ensures that the developed predictive model can
generalize well to new instances and accurately classify patients as having
or not having liver disease. Here's a detailed overview of the model testing
process:

• Prediction:

Use the loaded model to make predictions on the features of the
testing dataset. Input the feature values of each instance into the model and
obtain the predicted class labels (disease or no disease) as output. Ensure that
the predictions are made in a consistent manner, adhering to the model's
prediction logic.

• Performance Evaluation:

Evaluate the performance of the model on the testing dataset using a
variety of evaluation metrics. Common metrics for binary classification tasks
like liver disease prediction include accuracy, precision, recall, and F1-score.
Calculate these metrics using the predicted labels and the ground truth labels
from the testing dataset.

• Confusion Matrix Analysis:

Analyse the confusion matrix generated from the model predictions. A


confusion matrix provides a detailed breakdown of the model's predictions,
including true positives, true negatives, false positives, and false negatives. Use
the information from the confusion matrix to compute additional metrics such
as specificity, false positive rate, and false negative rate.
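The additional metrics mentioned above follow directly from the four confusion-matrix counts. The counts below are hypothetical, chosen only to show the arithmetic:

```python
# Hypothetical confusion-matrix counts.
tn, fp, fn, tp = 92, 11, 26, 14

sensitivity = tp / (tp + fn)          # recall of the positive class
specificity = tn / (tn + fp)          # recall of the negative class
false_positive_rate = fp / (fp + tn)  # = 1 - specificity
false_negative_rate = fn / (fn + tp)  # = 1 - sensitivity

print(round(sensitivity, 3), round(specificity, 3))  # -> 0.35 0.893
```

A pattern like this, where specificity is high but sensitivity is low, indicates the model rarely raises false alarms but misses many true cases, which is a common symptom of class imbalance.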

10.7. Model Evaluating:


Evaluating a machine learning model for liver disease
prediction involves several steps to ensure its effectiveness and reliability.
Here's a structured approach:

1. Data Preprocessing:

- Data Cleaning: Handle missing values, outliers, and any inconsistencies in


the dataset.

- Feature Scaling: Scale numerical features if necessary to ensure all features


contribute equally to the model.

- Feature Encoding: Encode categorical variables into numerical format if


needed.

- Feature Selection: Identify and select relevant features that contribute most to
the prediction task.

2. Model Selection:

- Algorithms: Choose suitable algorithms for classification (e.g., Support


Vector Machine, Random Forest, Decision Tree, Logistic Regression, Naive
Bayes).

- Hyperparameter Tuning: Tune hyperparameters for each algorithm using
techniques like grid search or random search to optimize model performance.

3. Evaluation Metrics:

- Accuracy: Overall correctness of the model.

- Confusion Matrix: Visualize the performance of the model in terms of true


positives, true negatives, false positives, and false negatives.

4. Cross-Validation:

- Use techniques like k-fold cross-validation to assess model performance on


different subsets of the data and reduce overfitting.

5. Model Evaluation:

- Train-Test Split: Divide the dataset into training and testing sets to evaluate
the model's generalization ability.

- Validation Set: Optionally, use a separate validation set for hyperparameter


tuning.

- Model Performance: Evaluate each model using the chosen evaluation


metrics.

6. Interpretation:

- Understand the importance of features in the model's predictions.

- Analyze any patterns or insights gained from the model's behavior.

7. Model Deployment:

- Save the trained model to a file (e.g., using pickle or joblib) for future use.

- Integrate the model into a production environment for real-world predictions.

- Implement necessary data preprocessing steps in the deployment pipeline.

8. Monitoring and Maintenance:

- Monitor the model's performance over time in the production environment.

- Retrain the model periodically with updated data if necessary.

- Handle any concept drift or changes in data distribution that may affect model
performance.

By following these steps, you can effectively evaluate the liver disease
prediction model and ensure its accuracy and reliability for real-world
applications.

11. Project Code

11.1. Backend of the project:


-Languages: Python

-Platform: Jupyter

11.1.1. Source Code:


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df=pd.read_csv('liver.csv')
df.head()

df.shape

(583, 11)

df.columns
Index(['Age', 'Gender', 'Total_Bilirubin', 'Direct_Bilirubin',
'Alkaline_Phosphotase', 'Alamine_Aminotransferase',
'Aspartate_Aminotransferase', 'Total_Protiens', 'Albumin',
'Albumin_and_Globulin_Ratio', 'Dataset'],
dtype='object')

df.describe()

df.duplicated().sum()
df.drop_duplicates(inplace=True)
df.duplicated().sum()
df.isna().sum()
Age 0
Gender 0
Total_Bilirubin 0
Direct_Bilirubin 0
Alkaline_Phosphotase 0
Alamine_Aminotransferase 0
Aspartate_Aminotransferase 0
Total_Protiens 0
Albumin 0
Albumin_and_Globulin_Ratio 4
Dataset 0
dtype: int64

df[df['Albumin_and_Globulin_Ratio'].isna()]

df['Albumin_and_Globulin_Ratio'].fillna(df['Albumin_and_Globulin_Ratio'].median(), inplace=True)
df.isna().sum()

Age 0
Gender 0
Total_Bilirubin 0
Direct_Bilirubin 0
Alkaline_Phosphotase 0
Alamine_Aminotransferase 0
Aspartate_Aminotransferase 0
Total_Protiens 0
Albumin 0

Albumin_and_Globulin_Ratio 0
Dataset 0
dtype: int64

df.head()

df['Gender'].value_counts()

Gender
Male 430
Female 140
Name: count, dtype: int64

df['Gender']=df['Gender'].map({'Female':0,'Male':1})
df.head()

sns.countplot(x='Gender', data = df, hue = 'Dataset')


plt.savefig("pie.png")

sns.countplot(x='Dataset', data=df)
plt.savefig("pie.png")

plt.figure(figsize = (20,20))
sns.heatmap(df.corr(), annot = True)
plt.savefig("pie.png")

df['Dataset'].value_counts()

Dataset
1 406
2 164
Name: count, dtype: int64

X=df.drop(['Dataset'],axis=1)
X.head()

y=df['Dataset']
y.sample(5)

14 1
347 1
468 1
295 1
243 1
Name: Dataset, dtype: int64

Data Preprocessing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.25, random_state=42)
## standardize the dataset
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
# use transform (not fit_transform) on the test set so it is scaled
# with the statistics learned from the training set
X_test=scaler.transform(X_test)
X_train.shape

(427, 10)

Model Creation
Linear Regression
accuracy_lst=list()
from sklearn.linear_model import LinearRegression
model_ln = LinearRegression()
model_ln.fit(X_train, y_train)
pred_ln = model_ln.predict(X_test)
# for a regressor, .score() returns R^2, not classification accuracy;
# linear regression is shown here only for comparison
accuracy = model_ln.score(X_test, y_test)
accuracy

0.07033305757604935

Logistic Regression
from sklearn.linear_model import LogisticRegression
model_lr = LogisticRegression(random_state = 51, C=1,
penalty='l1', solver='liblinear')
model_lr.fit(X_train,y_train)
pred_lr=model_lr.predict(X_test)
model_lr.score(X_test,y_test)
from sklearn.metrics import accuracy_score
accuracy=accuracy_score(y_test,pred_lr)
accuracy

0.6153846153846154

accuracy_lst.append(accuracy*100)

Decision Tree
from sklearn.tree import DecisionTreeClassifier
model_dtr = DecisionTreeClassifier()
model_dtr.fit(X_train,y_train)
pred_dtr=model_dtr.predict(X_test)
model_dtr.score(X_test,y_test)

0.6153846153846154

from sklearn.metrics import accuracy_score


accuracy=accuracy_score(y_test,pred_dtr)
accuracy

0.6153846153846154

accuracy_lst.append(accuracy*100)

SVM

from sklearn.svm import SVC
model_svc=SVC(kernel='rbf',random_state=0)
model_svc.fit(X_train,y_train)
pred_svc=model_svc.predict(X_test)
model_svc.score(X_test,y_test)

0.7272727272727273

from sklearn.metrics import accuracy_score


accuracy=accuracy_score(y_test, pred_svc)
accuracy

0.7272727272727273

accuracy_lst.append(accuracy*100)

Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier


model_rfc=RandomForestClassifier()
model_rfc.fit(X_train,y_train)
pred_rfc=model_rfc.predict(X_test)
model_rfc.score(X_test,y_test)

0.7412587412587412

from sklearn.metrics import accuracy_score


accuracy=accuracy_score(y_test, pred_rfc)
accuracy

0.7412587412587412

accuracy_lst.append(accuracy*100)

KNeighbours Classifier

from sklearn.neighbors import KNeighborsClassifier


model_knn=KNeighborsClassifier(n_neighbors=21)
model_knn.fit(X_train,y_train)
pred_knn=model_knn.predict(X_test)
model_knn.score(X_test,y_test)

0.6713286713286714

from sklearn.metrics import accuracy_score


accuracy=accuracy_score(y_test, pred_knn)
accuracy
0.6713286713286714

accuracy_lst.append(accuracy*100)

Naive Bayes

from sklearn.naive_bayes import GaussianNB


model_nv=GaussianNB()
model_nv.fit(X_train,y_train)
pred_nv=model_nv.predict(X_test)
model_nv.score(X_test,y_test)

0.5454545454545454

from sklearn.metrics import accuracy_score


accuracy=accuracy_score(y_test, pred_nv)
accuracy

0.5454545454545454

accuracy_lst.append(accuracy*100)

Confusion Matrix

from sklearn.metrics import confusion_matrix


conf_mat=confusion_matrix(y_test,pred_rfc)
print(conf_mat)

[[92 11]
[26 14]]

Accuracy Graph

accuracy_lst

[74.82517482517483,

61.53846153846154,
72.72727272727273,
74.12587412587412,
67.13286713286713,
54.54545454545454]

algorithms=['LR', 'DT', 'SVM', 'RF', 'KNN', 'NB']
clor=['red', 'blue', 'purple', 'orange', 'brown', 'pink']
plt.bar(algorithms, accuracy_lst, color=clor)
plt.title('Performance Comparison')
plt.xlabel("Machine Learning Algorithm")
plt.ylabel("Accuracy score")
plt.savefig("pie.png")  # save before show(), which clears the figure
plt.show()

Pickling The Model File For Deployment

import pickle
# persist the trained random forest; app.py loads this same file
with open('liver.pkl','wb') as f:
    pickle.dump(model_rfc,f)
# reload and sanity-check the persisted model
with open('liver.pkl','rb') as f:
    model = pickle.load(f)
pred=model.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy=accuracy_score(y_test,pred)
accuracy

0.7412587412587412

11.2. Frontend of the Project:
-Languages: HTML, CSS

-Platform: VS Code

11.2.1. Source Code:

Liver.html

<!DOCTYPE html>
<html>
<head>
<title>Liver Disease Prediction</title>
<style>
@import
url('https://fonts.googleapis.com/css2?family=Poppins:wght
@200&family=Ubuntu:wght@300&display=swap');
*{
padding: 0;
margin: 0;
box-sizing: border-box;
font-family: 'Poppins',sans-serif;
outline: none;
user-select: none;
}
body{
padding: 0 50px;
background-color: rgb(0, 255, 170);
}
.header{
display: flex;
justify-content: center;
align-items: center;
margin: 0 auto;
padding: 40px 0;
}
.header h1{
font-family: 'Ubuntu', sans-serif;
letter-spacing: 4px;
font-size: 50px;
font-weight: 700;

}
.row{
display: flex;
align-items: center;
justify-content: space-between;
width: 100%;
padding: 10px 0;
margin-bottom: 20px;
}
input{
border: none;
background-color: white;
border: none;
color: #000;
width: 100%;
margin: 0 10px;
padding: 10px 10px;
font-size: 15px;
font-weight: 700;
box-shadow: -8px -8px 15px rgba(57, 56,
56, 0.236),5px 5px 15px rgba(17, 17, 17, 0.489);
border-radius: 6px;
outline: none;
}
input::placeholder{
color: #000;
}
.footer{
display: flex;
align-items: center;
justify-content: center;
margin: 0 auto;
}
.button{
font-size: 22px;
color: #fff;
background: #000;
width: 250px;
height: 60px;
cursor: pointer;
border-radius: 6px;

box-shadow: -8px -8px 15px rgba(57, 56,
56, 0.236),5px 5px 15px rgba(17, 17, 17, 0.489);
outline: none;
border: none;
display: grid;
place-content: center;
}
.loader{
pointer-events: none;
width: 30px;
height: 30px;
border-radius: 50%;
border: 3px solid transparent;
border-top-color: #fff;
animation: an1 1s ease infinite;
        }
        /* rotation keyframes used by .loader (referenced as an1 above) */
        @keyframes an1{
            to{
                transform: rotate(360deg);
            }
        }
    </style>
</head>
<body>
<div class="header">
<h1>Liver Disease Prediction</h1>
</div>
<div class="body">
<form action="{{ url_for('predict') }}"
method="post">
<div class="row">
<input type="text" name="Age"
placeholder="Age" required="required">
<input type="text" name="Gender"
placeholder="Gender" required="required">
</div>
<div class="row">
<input type="text"
name="Total_Bilirubin" placeholder="Total Bilirubin"
required="required">
<input type="text"
name="Direct_Bilirubin" placeholder="Direct Bilirubin"
required="required">
<input type="text"
name="Alkaline_Phosphotase" placeholder="Alkaline
Phosphotase" required="required">
</div>

<div class="row">
<input type="text"
name="Alamine_Aminotransferase" placeholder="Alamine
Aminotransferase" required="required">
<input type="text"
name="Aspartate_Aminotransferase" placeholder="Aspartate
Aminotransferase" required="required">
<input type="text"
name="Total_Protiens" placeholder="Total Protiens"
required="required">
</div>
<div class="row">
<input type="text" name="Albumin"
placeholder="Albumin" required="required">
<input type="text"
name="Albumin_and_Globulin_Ratio" placeholder="Albumin and
Globulin Ratio" required="required">
</div>
<div class="footer">
<button class="button">Submit</button>
</div>
</form>
</div>
</body>
</html>

Predict.html

<!DOCTYPE html>
<html>
<head>
<title>results</title>
<style>
*{
padding: 0;
margin: 0;
box-sizing: border-box;
font-family: 'Poppins',sans-serif;
outline: none;
user-select: none;
}
body{
background-color: rgb(0, 255, 170);
height: 95vh;
display: grid;
place-content: center;
}
.predict{
display: flex;
align-items: center;
justify-content: center;
margin: 0 auto;
font-size: 30px;
}
</style>
</head>
<body>
<div class="predict">
{% if output == 1 %}
<p>Positive liver cancer</p>
{% else %}
<p>Negative liver cancer</p>
{% endif %}
</div>
</body>
</html>

11.3. Connection to the frontend and backend:

App.py (Flask program connecting the frontend and backend)

from flask import Flask, render_template, request
import numpy as np
import pandas as pd
import pickle

app = Flask(__name__)
model = pickle.load(open('liver.pkl', 'rb'))

@app.route("/")
def home():
    return render_template("liver.html")

@app.route("/predict", methods=['POST', 'GET'])
def predict():
    # Read the form fields in the order they appear in liver.html
    input_features = [float(x) for x in request.form.values()]

    feature_names = ['Age', 'Gender', 'Total_Bilirubin', 'Direct_Bilirubin',
                     'Alkaline_Phosphotase', 'Alamine_Aminotransferase',
                     'Aspartate_Aminotransferase', 'Total_Protiens',
                     'Albumin', 'Albumin_and_Globulin_Ratio']

    # Build a single-row DataFrame with named columns for the model
    df = pd.DataFrame([np.array(input_features)], columns=feature_names)

    # Take the first (only) prediction so the template compares a scalar
    output = model.predict(df)[0]
    return render_template("predict.html", output=output)

if __name__ == '__main__':
    app.run(debug=True)
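The program above loads `liver.pkl`, which must be trained and saved beforehand. The report does not show that training step, so the following is a minimal, hypothetical sketch of how such a pickle could be produced, using a scikit-learn RandomForestClassifier on synthetic stand-in data (not the real Indian Liver Patient Dataset) with the same ten feature columns as the form:

```python
# Hypothetical sketch of producing liver.pkl; the data here is synthetic.
import pickle
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ['Age', 'Gender', 'Total_Bilirubin', 'Direct_Bilirubin',
                 'Alkaline_Phosphotase', 'Alamine_Aminotransferase',
                 'Aspartate_Aminotransferase', 'Total_Protiens',
                 'Albumin', 'Albumin_and_Globulin_Ratio']

rng = np.random.default_rng(0)
X = rng.normal(size=(200, len(feature_names)))  # synthetic feature matrix
y = rng.integers(1, 3, size=200)                # labels: 1 = disease, 2 = no disease

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Serialize the trained model so App.py can load it at startup
with open('liver.pkl', 'wb') as f:
    pickle.dump(model, f)

# The Flask app can then reload it exactly as shown above
with open('liver.pkl', 'rb') as f:
    reloaded = pickle.load(f)
print(reloaded.predict(X[:1]))
```

In a real run the synthetic `X` and `y` would be replaced by the preprocessed dataset described in the Data Description and Methodology sections.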

Output-1:

Output-2:
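The request/response round trip shown in the screenshots above can also be exercised programmatically with Flask's built-in test client. The sketch below is self-contained: it uses a hypothetical `StubModel` with a toy decision rule and an inline template string in place of `liver.pkl` and `predict.html`, so it is an illustration of the wiring rather than the project's actual model:

```python
from flask import Flask, request, render_template_string

app = Flask(__name__)

class StubModel:
    """Hypothetical stand-in for the pickled model."""
    def predict(self, rows):
        # Toy rule: predict 1 (positive) when Total_Bilirubin exceeds 1.2
        return [1 if float(r["Total_Bilirubin"]) > 1.2 else 2 for r in rows]

model = StubModel()

@app.route("/predict", methods=["POST"])
def predict():
    result = model.predict([request.form])[0]
    # Inline stand-in for predict.html's if/else block
    return render_template_string(
        "{% if output == 1 %}Positive liver cancer"
        "{% else %}Negative liver cancer{% endif %}",
        output=result)

# Simulate the browser submitting the liver.html form
client = app.test_client()
resp = client.post("/predict", data={
    "Age": "45", "Gender": "1", "Total_Bilirubin": "2.3",
    "Direct_Bilirubin": "1.1", "Alkaline_Phosphotase": "210",
    "Alamine_Aminotransferase": "60", "Aspartate_Aminotransferase": "80",
    "Total_Protiens": "6.5", "Albumin": "3.1",
    "Albumin_and_Globulin_Ratio": "0.9"})
print(resp.data.decode())
```

With `Total_Bilirubin` set to 2.3, the stub rule returns 1 and the rendered response reads "Positive liver cancer", matching the positive path of predict.html.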

12. References

1. Datasets:
   - The UCI Machine Learning Repository hosts several datasets related to liver disease prediction, such as the "Liver Disorders Dataset" and the "Indian Liver Patient Dataset".
   - Kaggle is another platform where datasets related to liver disease prediction can be found.
2. Research Papers:
   - Abdar, M. et al. (2019). "Prediction of liver disease using machine learning algorithms".
   - Shenoy, P. et al. (2018). "Prediction of liver disease using ensemble classification".
   - Srivastava, S. et al. (2019). "Predictive modeling for diagnosis of liver disorder using machine learning techniques".
3. Books:
   - Müller, A. C. and Guido, S. "Introduction to Machine Learning with Python".
   - Géron, A. "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow".
4. Tutorials and Courses:
   - Coursera and Udemy offer courses on machine learning and data science that include projects and case studies on medical prediction tasks.
   - YouTube channels such as sentdex and Data School provide tutorials on implementing machine learning algorithms in Python for medical prediction tasks.
5. GitHub Repositories:
   - GitHub hosts repositories on liver disease prediction and healthcare analytics with code implementations, datasets, and project ideas.
6. Online Communities:
   - Communities such as Stack Overflow, Reddit (e.g., r/MachineLearning), and LinkedIn groups on data science and machine learning are useful for asking questions, sharing insights, and learning from others' experience.
