
Using Machine Learning Techniques for Prediction of Breast Cancer Subtypes


Machine Learning
• We learn from observations, examples, images,
signals, perception, data, and experience
• Computers can learn from these as well
• Machine learning: A field of artificial intelligence (AI)
that gives systems the ability to automatically
learn and improve from experience
Breast Cancer
• In the United States, it is estimated that about 1 in 8
women will develop breast cancer over the course of
their lifetime.
• An estimated 508,000 women died of breast cancer
in 2011.
• About 2,470 new cases of breast cancer were
expected to be diagnosed in men in 2017.
Breast Cancer Genes
• Many genes may be strongly correlated with a
particular type of cancer
• Only a small subset of genes dominates the outcomes
• The subtypes are distinct from one another, each
with its own genetic signature
Breast Cancer Subtypes

• Basal-like (Basal)
• Luminal A (LumA)
• Luminal B (LumB)
• HER2-enriched (HER2)
• Normal-like (Normal)
The Goals of this study
• To increase the overall prediction accuracy (many of
the genes are irrelevant or redundant; using
all 13,582 genes, accuracy = 77.84%)
• This work is expected to identify the cancer subtype
of a new, unknown instance by:
– Selecting the minimum subset of genes
– Yielding the highest classification accuracy
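A minimal sketch of the gene-selection goal above, under stated assumptions. The slides do not name the selection method at this point, so the signal-to-noise style score used here (difference of class means over the sum of the per-class standard deviations), the helper names `score_gene` and `top_k_genes`, and the toy expression values are all hypothetical and only illustrate the idea of keeping a small, informative subset of genes.

```python
from statistics import mean, pstdev


def score_gene(values, labels, positive):
    """Hypothetical filter score: how well one gene separates `positive` from the rest."""
    pos = [v for v, y in zip(values, labels) if y == positive]
    neg = [v for v, y in zip(values, labels) if y != positive]
    spread = pstdev(pos) + pstdev(neg)
    # Small constant avoids division by zero for genes with no variation.
    return abs(mean(pos) - mean(neg)) / (spread + 1e-9)


def top_k_genes(expression, labels, positive, k):
    """Return the names of the k highest-scoring genes."""
    ranked = sorted(
        expression,
        key=lambda gene: score_gene(expression[gene], labels, positive),
        reverse=True,
    )
    return ranked[:k]


# Toy data (made up for illustration): 3 genes measured on 6 samples.
labels = ["Basal", "Basal", "Basal", "Other", "Other", "Other"]
expression = {
    "gene_a": [9.1, 8.7, 9.4, 2.0, 2.3, 1.8],  # separates the two groups well
    "gene_b": [5.0, 5.2, 4.9, 5.1, 5.0, 4.8],  # nearly uninformative
    "gene_c": [3.0, 7.0, 1.0, 6.0, 2.0, 4.0],  # noisy
}

selected = top_k_genes(expression, labels, "Basal", k=1)
print(selected)
```

In practice the subset size would be chosen by checking classification accuracy as genes are added or removed, which is the trade-off the two sub-goals describe.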
Dataset
• The dataset consists of 13,582 genes (features)
• The dataset contains 158 instances
• Labeled into five classes:
Basal, Her2, LumA, LumB, and Normal.
Machine Learning Tool Used
• Weka: The Waikato Environment for Knowledge
Analysis
• Java Implementations
• clustering, classification, regression, feature selection,
and visualization
The Proposed Model
• {D1, D2, D3, D4, D5} to {D1, D2}.
• {D1, D2, D3, D4, D5} to {D2, D3, D4, D5}
Feature Selection & Classification
The Tree-based Model
Features Plot
The Evaluation Results of the First Node (Basal Subtype)
The Evaluation Results of the Second Node (Her2 Subtype)
The Evaluation Results of the Third Node (Normal Subtype)
The Evaluation Results of the Fourth Node (LumA Subtype)
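The node order above (Basal, Her2, Normal, LumA) suggests a cascade of binary decisions, one subtype per node. The sketch below illustrates that reading only. It assumes that an instance rejected by all four nodes is labeled LumB, and the per-node classifiers are hypothetical threshold rules on made-up gene names, not the classifiers trained in the study.

```python
# Node order taken from the evaluation slides; LumB as the final leaf is an assumption.
NODE_ORDER = ["Basal", "Her2", "Normal", "LumA"]
FALLBACK = "LumB"


def predict_subtype(sample, node_classifiers):
    """Walk the nodes in order; the first binary classifier that fires decides the subtype."""
    for subtype in NODE_ORDER:
        if node_classifiers[subtype](sample):
            return subtype
    return FALLBACK


# Hypothetical stand-in classifiers: each one thresholds a single made-up gene.
node_classifiers = {
    "Basal": lambda s: s["gene_1"] > 5.0,
    "Her2": lambda s: s["gene_2"] > 5.0,
    "Normal": lambda s: s["gene_3"] > 5.0,
    "LumA": lambda s: s["gene_4"] > 5.0,
}

sample = {"gene_1": 1.0, "gene_2": 9.0, "gene_3": 0.5, "gene_4": 7.0}
print(predict_subtype(sample, node_classifiers))
```

Each node only has to separate one subtype from the instances that remain, so it can use its own small set of selected genes.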
Results:
• The model identified a total of 23 genes from 13,582
genes
• AGR2, TFF3
• This result could reduce time and cost, because only
a few genes need to be processed and analyzed.
Comparison with the Study by Rezaeian et al.
Deep Neural Networks
• A deep neural network (DNN) is an artificial neural
network (ANN) with multiple layers between the input
and output layers
• Deep neural networks for classification

https://en.wikipedia.org/wiki/Deep_learning#Deep_neural_networks
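To make the definition above concrete, here is a minimal forward pass through a network with two hidden layers, written with plain Python lists. The layer sizes, the ReLU and softmax choices, and the random (untrained) weights are assumptions made for illustration; the slides do not specify the architecture used in the study.

```python
import math
import random


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b) using plain lists."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def relu(z):
    return max(0.0, z)


def identity(z):
    return z


def softmax(scores):
    """Turn raw output scores into class probabilities."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(n_in, n_out, rng):
    """Untrained weights, for illustration only."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
sizes = [4, 8, 8, 5]  # input, two hidden layers, five output classes (illustrative)
layers = [random_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def forward(x):
    """Pass x through every layer; hidden layers use ReLU, the output uses softmax."""
    for i, (weights, biases) in enumerate(layers):
        is_last = i == len(layers) - 1
        x = dense(x, weights, biases, identity if is_last else relu)
    return softmax(x)


probs = forward([0.5, -1.2, 0.3, 2.0])
print(probs)
```

The five outputs can be read as one probability per subtype; training would adjust the weights so that the largest probability lands on the correct class.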
Deep Neural Networks
Machine Learning Steps Using DNN
• Data collection
• Data cleaning with pandas
• Data preparation with NumPy and scikit-learn
• The DNN extracts information from the data (built with Keras)
• GPUs make the process faster
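The cleaning and preparation steps above might look roughly like the sketch below. It runs on a small synthetic table rather than the study's data, and the column names, the injected missing value, and the choice of `dropna`, `LabelEncoder`, and `StandardScaler` are assumptions; the slides only name the libraries. The Keras model itself is left out to keep the sketch light.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Synthetic stand-in for the expression table: 20 samples x 4 genes
# (the real dataset has 158 instances and 13,582 genes).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(20, 4)), columns=["g1", "g2", "g3", "g4"])
df["subtype"] = ["Basal", "Her2", "LumA", "LumB", "Normal"] * 4
df.loc[3, "g2"] = np.nan  # inject one missing value so there is something to clean

# Data cleaning in pandas: drop samples with missing measurements.
df = df.dropna().reset_index(drop=True)

# Data preparation in NumPy / scikit-learn: encode labels, split, then scale.
X = df.drop(columns="subtype").to_numpy()
y = LabelEncoder().fit_transform(df["subtype"])
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scaler = StandardScaler().fit(x_train)  # fit on the training portion only
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

print(x_train.shape, x_test.shape)
```

Fitting the scaler on the training portion only keeps information from the test samples out of the preprocessing.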
Classifiers Building
• The most widely used library for developing deep
learning models is TensorFlow
• We used Keras, a high-level API built on
TensorFlow, to implement our neural network
• The classifiers were built in Google Colab
DNN Binary Classification
Results:

Subtype   Confusion Matrix         Number of Instances   Accuracy
Basal     [[ 7,  1], [ 0, 40]]     48                    97.91%
Her2      [[ 8,  0], [ 4, 36]]     48                    83.33%
Normal    [[ 0,  0], [ 1, 47]]     48                    97.91%
LumA      [[17,  1], [ 8, 22]]     48                    81.25%

Number of instances: 158
Split ratio: 0.3
Average accuracy: 85.41%
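As a worked check of how an accuracy figure follows from a confusion matrix, the sketch below divides the diagonal sum by the total count for the Basal and LumA matrices reported above. It also shows where the 48 test instances come from, assuming the fractional test size is rounded up, as scikit-learn's `train_test_split` does. The helper name `accuracy` is illustrative.

```python
import math


def accuracy(confusion):
    """Fraction of correct predictions: diagonal sum over total count."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


basal = [[7, 1], [0, 40]]
luma = [[17, 1], [8, 22]]

basal_acc = accuracy(basal)  # 47 of 48 correct
luma_acc = accuracy(luma)    # 39 of 48 correct

# 158 instances with a 0.3 test split, rounded up to a whole number of instances.
n_test = math.ceil(0.3 * 158)

print(f"Basal: {basal_acc:.4f}, LumA: {luma_acc:.4f}, test instances: {n_test}")
```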
DNN Multi-class Classification
Training vs Test Accuracy
Problems Resolved
• x_train, x_test, y_train, y_test =
train_test_split(X, Y, test_size=0.1,
random_state=None)
• model.fit(x_train, y_train,
validation_data=(x_test, y_test),
batch_size=10, epochs=100,
verbose=1)
Conclusions
• Machine learning can be used to determine the
smallest number of genes needed to identify patient
groups with specific cancer subtypes, which makes
subtype-specific treatment possible.
• SVM is more effective for classifying datasets
with many features and few instances
• DNN classifiers: accuracy of around 90% (99.25%,
77.84%)
Future Work
• As future work, we will validate these results to determine whether the
predicted gene set accurately characterizes each subtype.
• DNN
Thank You !
