Machine Learning Predicts Breast Cancer Subtypes

Using Machine Learning Techniques for Prediction of
Breast Cancer Subtypes

Machine Learning
• We learn from observations, examples, images,
signals, perception, data, and experience
• Computers can learn from these as well
• Machine learning: A field of artificial intelligence (AI)
that provides systems the ability to automatically
learn and improve from experience
Breast Cancer
• In the United States, it is estimated that about 1 in 8
women will develop breast cancer over the course of
their lifetime.
• In 2011, an estimated 508,000 women died of
breast cancer.
• About 2,470 new cases of breast cancer were
expected to be diagnosed in men in 2017.
Breast Cancer Genes
• Many genes could be strongly correlated with a
particular type of cancer
• Only a small subset of genes dominates the outcomes
• The subtypes are distinct from one another, each
having its own genetic signature
Breast Cancer Subtypes
• Basal-like (Basal)
• Luminal A (LumA)
• Luminal B (LumB)
• HER2-enriched (HER2)
• Normal-like (Normal)
The Goals of this study
• To increase the overall prediction accuracy (many of
the genes are irrelevant or redundant; using all
13,582 genes, accuracy = 77.84%)
• To identify the cancer subtype of a new, unknown
instance by
– Selecting the minimum subset of genes
– Yielding the highest classification accuracy
Dataset
• The dataset consists of 13,582 genes (features)
• The dataset contains 158 instances
• Labeled into five classes:
Basal, Her2, LumA, LumB, and Normal.
Machine Learning Tool Used
• Weka: The Waikato Environment for Knowledge
Analysis
• Java Implementations
• clustering, classification, regression, feature selection,
and visualization
The Proposed Model
• {D1, D2, D3, D4, D5} to {D1, D2}.
• {D1, D2, D3, D4, D5} to {D2, D3, D4, D5}
Feature Selection & Classification
The Tree-based Model
Features Plot
The Evaluation Results of the First Node (Basal Subtype)
Second Node (Her2 Subtype)
Third Node (Normal Subtype)
Fourth Node (LumA Subtype)
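The node-by-node evaluation above suggests a cascade in which each node separates one subtype from the remaining ones. A minimal sketch of that control flow follows, assuming the node order Basal, Her2, Normal, LumA, with LumB as the class left over after the fourth node. The gene names (gene_a to gene_d), thresholds, and per-node rules are hypothetical placeholders, not the genes or classifiers selected in the study.

```python
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, float]            # gene name -> expression value
NodeRule = Callable[[Sample], bool]  # True means "this sample is the node's subtype"

# Hypothetical one-vs-rest rules, one per tree node (placeholders only).
NODES: List[Tuple[str, NodeRule]] = [
    ("Basal",  lambda s: s["gene_a"] > 2.0),
    ("Her2",   lambda s: s["gene_b"] > 1.5),
    ("Normal", lambda s: s["gene_c"] < 0.5),
    ("LumA",   lambda s: s["gene_d"] > 1.0),
]
FALLBACK = "LumB"  # assumption: whatever remains after the fourth node


def classify(sample: Sample) -> str:
    """Walk the nodes in order; the first rule that fires names the subtype."""
    for subtype, rule in NODES:
        if rule(sample):
            return subtype
    return FALLBACK


example = {"gene_a": 0.3, "gene_b": 2.1, "gene_c": 1.0, "gene_d": 0.2}
print(classify(example))
```

In a real model, each lambda would be replaced by a trained binary classifier restricted to the genes selected for that node.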
Results:
• The model identified a total of 23 genes from 13,582
genes
• AGR2, TFF3
• This result could reduce time and cost, because only
a few genes need to be processed and analyzed.
The Comparison with the Study by Rezaeian et al.
Deep Neural Networks
• A deep neural network (DNN) is an artificial neural
network (ANN) with multiple layers between the input
and output layers
• Deep neural networks for classification
https://en.wikipedia.org/wiki/Deep_learning#Deep_neural_networks
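To make "multiple layers between the input and output layers" concrete, here is a minimal plain-Python sketch of a forward pass through two hidden layers ending in a single sigmoid output for binary classification. The layer sizes, weights, and input values are made up for illustration and have no connection to the study's network.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]

# Made-up parameters: 3 inputs -> 4 hidden units -> 2 hidden units -> 1 output.
W1: Matrix = [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2], [-0.3, 0.8, 0.1], [0.1, 0.1, 0.1]]
B1: Vector = [0.0, 0.1, -0.1, 0.0]
W2: Matrix = [[0.3, -0.5, 0.2, 0.7], [-0.4, 0.6, 0.1, 0.2]]
B2: Vector = [0.05, -0.05]
W3: Matrix = [[1.2, -0.7]]
B3: Vector = [0.1]


def dense(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Fully connected layer: each row of `weights` feeds one output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(v: Vector) -> Vector:
    """Hidden-layer activation: negative values become zero."""
    return [max(0.0, a) for a in v]


def sigmoid(a: float) -> float:
    """Squash a score into (0, 1) so it can be read as a class probability."""
    return 1.0 / (1.0 + math.exp(-a))


def forward(x: Vector) -> float:
    """Input -> hidden layer 1 -> hidden layer 2 -> output probability."""
    h1 = relu(dense(x, W1, B1))
    h2 = relu(dense(h1, W2, B2))
    return sigmoid(dense(h2, W3, B3)[0])


probability = forward([1.0, 0.5, -1.5])
print(probability)
```

Training adjusts the weights and biases so that the output probability matches the known labels; libraries such as Keras handle that step.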
Deep Neural Networks
Machine Learning Steps Using DNN
• Data collection
• Data cleaning – in Pandas
• Data preparation – in NumPy and scikit-learn
• DNN extracts information from data – using Keras
• GPUs make the process faster
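The preparation step can be sketched in plain Python. The slides do not say which transformations were applied, so both steps here are assumptions: standardizing each gene column (a common choice before training a neural network) and relabeling the samples as one subtype versus the rest, which matches the binary classifiers reported later. The data values are a toy stand-in, and in practice pandas, NumPy, and scikit-learn would do this work.

```python
from statistics import mean, pstdev
from typing import List


def standardize(rows: List[List[float]]) -> List[List[float]]:
    """Scale each column (gene) to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # constant columns: avoid dividing by 0
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def one_vs_rest(labels: List[str], positive: str) -> List[int]:
    """Binary target: 1 for the chosen subtype, 0 for every other subtype."""
    return [1 if label == positive else 0 for label in labels]


# Toy stand-in for the expression matrix (4 samples x 2 genes) and its labels.
X = [[5.1, 0.2], [4.9, 0.4], [6.3, 0.1], [5.8, 0.3]]
y = ["Basal", "LumA", "Her2", "Basal"]

X_scaled = standardize(X)
y_basal = one_vs_rest(y, "Basal")
print(y_basal)
```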
Classifiers Building
• The most widely used library for developing deep
learning models is TensorFlow
• We used Keras, a high-level API built on
TensorFlow, to implement our neural network
• The classifiers were built on Google Colab
DNN Binary Classification
Results:
Subtype   Confusion Matrix        Number of Instances   Accuracy
Basal     [[ 7,  1], [ 0, 40]]    48                    97.91%
Her2      [[ 8,  0], [ 4, 36]]    48                    83.33%
Normal    [[ 0,  0], [ 1, 47]]    48                    97.91%
LumA      [[17,  1], [ 8, 22]]    48                    81.25%
Number of instances: 158

Split ratio: 0.3
Average accuracy: 85.41%.
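Accuracy follows directly from a confusion matrix: the correctly classified instances sit on the diagonal, and accuracy is their share of all instances. A short sketch using the Basal and LumA matrices reported for the binary classifiers is shown here; the helper name accuracy is only illustrative.

```python
from typing import List


def accuracy(matrix: List[List[int]]) -> float:
    """Fraction of instances on the diagonal, i.e. correct predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


basal = [[7, 1], [0, 40]]    # 47 of 48 test instances correct
luma = [[17, 1], [8, 22]]    # 39 of 48 test instances correct

for name, m in (("Basal", basal), ("LumA", luma)):
    print(f"{name}: {accuracy(m):.4f} over {sum(map(sum, m))} instances")
```

For Basal this gives 47/48, which the slide reports truncated as 97.91%; for LumA it gives 39/48 = 81.25%.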
DNN Multi-class Classification
Training vs Test Accuracy
Problems Resolved
• x_train, x_test, y_train, y_test =
train_test_split(X, Y, test_size = 0.1,
random_state = None)
• model.fit(x_train, y_train,
validation_data = (x_test, y_test),
batch_size = 10, epochs = 100,
verbose=1)
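The calls above hold out a fraction of the 158 instances for evaluation. The following plain-Python sketch shows how the hold-out sizes come about, assuming the test count is rounded up, as scikit-learn's train_test_split does for a fractional test_size. The helper holdout_split and its fixed seed are hypothetical; the slide uses random_state = None, which gives a different shuffle on every run.

```python
import math
import random
from typing import List, Tuple


def holdout_split(n_samples: int, test_size: float, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle sample indices and hold out ceil(test_size * n_samples) for testing."""
    n_test = math.ceil(test_size * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # seeded only to make the sketch repeatable
    return indices[n_test:], indices[:n_test]  # (train indices, test indices)


for ratio in (0.3, 0.1):
    train_idx, test_idx = holdout_split(158, ratio)
    print(ratio, len(train_idx), len(test_idx))
```

With a split ratio of 0.3 this yields 48 test instances, matching the instance count reported for the binary classifiers; with test_size = 0.1 it yields 16.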
Conclusions
• Machine learning can be used to determine the
smallest set of genes needed to identify patient
groups with specific subtypes of cancer, which makes
subtype-specific treatment possible.
• SVM is more effective for classifying a dataset
with many features and few instances
• DNN classifiers: accuracy of around 90% (99.25%,
77.84%)
Future Work
• As future work, we will validate these results to determine whether
the predicted gene array accurately denotes each subtype.
• DNN
Thank You!
