
Classification of tumor cells for the prediction of

breast cancer using machine learning

Guided by: Mrs. M. Mahiba, M.Tech (AP/CSE)

Submitted by: J. K. Benzersen, R. Rangaswamy, V. Sundaram
Objectives

● To predict breast cancer from biopsy data

● To provide an extra layer of verification for breast cancer predictions

● To create an API that predicts breast cancer in real time

● To automatically retrain the model every time new data is fed to the API
Abstract


Breast cancer is a cancer that forms in the cells of the breasts.


It is a very common disease, and it can occur in both men and women.


The project classifies data obtained from Fine Needle Aspirate (FNA) biopsies using
logistic regression to predict whether a tumor is malignant or benign.


A REST API is developed and deployed in the cloud. It uses the saved model to
predict breast cancer from the biopsy data sent by the client.
Existing system


Supervised machine learning models such as decision trees, random forests, SVMs, or
regression algorithms are used to create a model that predicts whether a tumor is
malignant or benign.


Most studies use the SVM (support vector machine) learning algorithm to create the
model because of its high prediction accuracy.


Disadvantages of Existing Method


In production, the patient is required to undergo some form of physical activity to
generate the required input for the model.


The system relies heavily on the patient's medical parameters, which are not stable.


Unstable inputs to the model can lead to false positives or false negatives.
Proposed System
● Use a machine learning classifier to predict breast cancer from the FNA
biopsy report

● Deploy a REST API in the cloud that retrains the model in real time every time
new data is provided

● Deploy a client-facing web app to display the result of the classification of the
report
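
The retrain-on-new-data behaviour described above can be sketched as follows. This is only a minimal illustration, not the deployed service: the class name `PredictionService`, the method names, and the stand-in trainer `train_threshold` are all hypothetical, the HTTP layer is omitted, and the data is made up. The stand-in trainer is deliberately not logistic regression; it also assumes malignant samples have the larger feature value.

```python
from statistics import fmean


def train_threshold(rows, labels):
    """Stand-in trainer (NOT logistic regression): use the midpoint between
    the class means of the first feature as the decision cutoff."""
    benign = fmean([r[0] for r, y in zip(rows, labels) if y == 0])
    malignant = fmean([r[0] for r, y in zip(rows, labels) if y == 1])
    cutoff = (benign + malignant) / 2

    def model(features):
        return 1 if features[0] >= cutoff else 0

    return model


class PredictionService:
    """Hypothetical core of the API: predict, and refit when data arrives."""

    def __init__(self, trainer, rows, labels):
        self.trainer = trainer
        self.rows = [list(r) for r in rows]
        self.labels = list(labels)
        self.model = trainer(self.rows, self.labels)

    def predict(self, features):
        """Classify one report (what a 'predict' route would call)."""
        return "Malignant" if self.model(features) == 1 else "Benign"

    def add_record(self, features, label):
        """Store a newly labelled report and retrain on all stored data."""
        self.rows.append(list(features))
        self.labels.append(label)
        self.model = self.trainer(self.rows, self.labels)


# Made-up single-feature data: 0 = benign, 1 = malignant.
service = PredictionService(
    train_threshold, [[10.0], [12.0], [18.0], [20.0]], [0, 0, 1, 1]
)
before = service.predict([14.5])   # cutoff is 15.0 at this point
service.add_record([13.0], 1)      # new labelled report moves the cutoff to 14.0
after = service.predict([14.5])
```

Passing the trainer in as a parameter keeps the retraining logic independent of the classifier, so a logistic regression fit could be substituted without changing the service.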
Advantages of Proposed System
● False negatives and false positives can be reduced on a large
scale, as the system can act as an external reference.

● Due to the system's self-learning nature, its results are expected to become
more reliable as it is provided with more data over time.

● The system can also be used as a quick way to analyze the result of an FNA
report without much clinical knowledge.
Block Diagram

Training Data → Learning Algorithm → Computing Model

New Data → Model → Prediction


Preprocessing

• Data preprocessing is a data mining technique that transforms raw data into
an understandable format.

• Data preprocessing is the most important phase of a machine learning project,
especially in computational biology. If the data contains much irrelevant or redundant
information, or is noisy and unreliable, then knowledge discovery during the
training phase is more difficult.
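
One common preprocessing step is to standardize each feature so that features on very different scales become comparable. The slides do not say which transformations were actually applied, so the sketch below is an assumption for illustration only; the function name `standardize` and the sample values are made up.

```python
import statistics


def standardize(rows):
    """Rescale each column to zero mean and unit (population) std deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has std 0; fall back to 1.0 so it maps to 0.0.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    scaled = [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]
    return scaled, means, stds


# Made-up values for two features (a radius and an area) on three samples.
raw = [[10.0, 300.0], [14.0, 600.0], [18.0, 900.0]]
scaled, means, stds = standardize(raw)
```

The returned `means` and `stds` would need to be reused for any new report so that it is scaled the same way as the training data.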
Feature Selection
The features considered for the training data are:

a. radius (mean of distances from center to points on the perimeter)
b. texture (standard deviation of gray-scale values)
c. perimeter
d. area
e. smoothness (local variation in radius lengths)
f. compactness (perimeter^2 / area – 1.0)
g. concavity (severity of concave portions of the contour)
h. concave points (number of concave portions of the contour)
i. symmetry
j. fractal dimension (“coastline approximation” – 1)

The mean, standard error, and “worst” or largest (mean of the three largest
values) of these features were computed for each image, resulting in 30
features.
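
As a worked illustration of how one measured quantity turns into three features, the sketch below computes the mean, standard error, and "worst" value for a single quantity over the nuclei in one image. The function names and measurements are made up, and the standard error formula used here (sample standard deviation divided by the square root of the sample count) is an assumption about how the data set was built.

```python
import math
import statistics


def summarize(values):
    """Return the mean, standard error and 'worst' value of one quantity."""
    mean = statistics.fmean(values)
    # Assumed definition: sample standard deviation / sqrt(n).
    std_error = statistics.stdev(values) / math.sqrt(len(values))
    # 'Worst' is the mean of the three largest values.
    worst = statistics.fmean(sorted(values)[-3:])
    return {"mean": mean, "se": std_error, "worst": worst}


def compactness(perimeter, area):
    """Compactness as defined in the feature list: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0


# Made-up radius measurements for five nuclei in one image.
radius = [10.0, 12.0, 14.0, 16.0, 18.0]
stats = summarize(radius)
```

Applying `summarize` to each of the ten measured quantities gives the 3 x 10 = 30 features per image mentioned above.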
Algorithm
● Since the data set contains more observations than features, a
regression model is selected.

● The regression algorithm used for developing the model is logistic regression.

● This algorithm is selected for its efficiency in fitting a regression curve
to binary outcome data.
Logistic Regression
● The logistic model is used to model the probability of a certain class or event,
such as pass/fail, win/lose, alive/dead, or healthy/sick.
● Since the result of an FNA report can be either malignant or benign,
logistic regression is well suited to the problem.
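● In its standard form, the logistic regression model is given by

$$p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}$$

where $x_1, \dots, x_n$ are the input features and $\beta_0, \dots, \beta_n$ are the fitted coefficients.

Fig: Logistic regression curve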
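
A minimal sketch of fitting such a model is shown below, using plain batch gradient descent on the log-loss. It is meant only to show the mechanics: the function names, the learning rate, the epoch count, and the toy data are assumptions, and the project's actual training code is not shown in the slides.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, x):
    """Estimated probability that sample x belongs to class 1 (malignant)."""
    z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
    return sigmoid(z)


def fit_logistic(rows, labels, epochs=2000, lr=0.1):
    """Fit [intercept, coefficients...] by batch gradient descent on log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = predict_proba(weights, x) - y
            grad[0] += error
            for j, value in enumerate(x):
                grad[j + 1] += error * value
        # Step against the average gradient.
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


# Made-up, already standardized single-feature data (1 = malignant, 0 = benign).
X = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]
y = [0, 0, 0, 1, 1, 1]
weights = fit_logistic(X, y)
p_high = predict_proba(weights, [1.5])
p_low = predict_proba(weights, [-1.5])
```

A probability of 0.5 or more would typically be reported as malignant and anything below as benign, although the threshold can be moved to trade false positives against false negatives.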
Performance of the model

Test vs Training

The plot is centered on 0, and the training and test data overlap. This is a good
indication that the model has done a good job of classifying the data. The spike in the
center is the training data, and the rectangular spike below it is the test data. It is
also safe to say that the model has performed with an accuracy of above 99 percent.
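
Comparing accuracy on the training data with accuracy on held-out test data is a simple numeric counterpart to the plot. The sketch below shows the idea with made-up data and a hypothetical fixed-threshold classifier standing in for the trained model; the split fraction and the seed are assumptions.

```python
import random


def split(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle with a fixed seed, then hold out a fraction as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [rows[i] for i in train_idx], [labels[i] for i in train_idx],
        [rows[i] for i in test_idx], [labels[i] for i in test_idx],
    )


def accuracy(predict, rows, labels):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum(1 for x, y in zip(rows, labels) if predict(x) == y)
    return hits / len(labels)


def predict(x):
    """Hypothetical stand-in for the trained model: fixed threshold at 5."""
    return 1 if x[0] >= 5 else 0


# Made-up single-feature data (1 = malignant, 0 = benign).
rows = [[v] for v in (1, 2, 3, 4, 6, 7, 8, 9)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

train_x, train_y, test_x, test_y = split(rows, labels)
train_acc = accuracy(predict, train_x, train_y)
test_acc = accuracy(predict, test_x, test_y)
```

A test accuracy far below the training accuracy would suggest overfitting, while similar values are consistent with the overlap described above.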
Results


A classification report is used to measure the quality of predictions from
a classification algorithm.


The confusion matrix indicates an accuracy of 100%; however, the model is not
100% accurate: there were a few misclassifications during the prediction of
real-time data.


Fig: Confusion matrix
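
The quantities in a confusion matrix and a classification report can be computed directly from the true and predicted labels. The sketch below uses made-up labels, not the project's results; the function names and the "M"/"B" encoding are assumptions.

```python
def confusion_matrix(actual, predicted, positive="M"):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


def report(counts):
    """Accuracy, precision, recall and F1 for the positive class."""
    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / (tp + fp + fn + tn),
            "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: "M" = malignant, "B" = benign.
actual = ["M", "M", "M", "B", "B", "B", "B", "B"]
predicted = ["M", "M", "B", "B", "B", "B", "B", "M"]
counts = confusion_matrix(actual, predicted)
metrics = report(counts)
```

In this setting a false negative (a malignant tumor reported as benign) is the costlier error, so recall on the malignant class is worth reading alongside overall accuracy.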
CONCLUSION


This work focuses on creating a better model for classifying tumor cells
to predict breast cancer.


The model promises an accuracy of above 98%.


False negatives and false positives can be reduced on a large scale.
FUTURE ENHANCEMENT


The project is limited in its capability, so a more in-depth study has to be conducted in
order to source more data for detecting patterns in biopsy data over time.


In the future, an image-processing approach will be taken to process the raw digital image of
the FNA and detect cancer without decoding the image.


Thank you
