Ahmed Ezzat
Supervisors:
Prof. Magda B. Fayek
Assoc. Prof. Mona Farouk
Cairo University
Ahmed.e.mohamed@eng1.cu.edu.eg
GDD
To develop a machine learning algorithm to diagnose diseases by
examining bio-medical features.
Slide-Detect
To develop a deep learning algorithm to diagnose lung infiltration by
examining chest X-ray scans.
Motivation 1
Diabetes, cervical cancer, and lung infiltration are leading causes of
death.
Motivation 2
Unlike physicians, Computer Aided Diagnosis (CAD) systems can process
large numbers of cases efficiently.
Motivation 3
The number of cases is increasing rapidly, especially in developing
countries.
Question 1
What are the most important features for diagnosing diabetes and
cervical cancer?
Question 2
How are the most important features of diabetes and cervical cancer
distributed in the hyper-space?
Question 3
How can CAD diagnostic accuracy be increased for diabetes, cervical
cancer, and lung infiltration?
Diabetes Datasets
Name: Diabetes 130-US Hospitals for Years 1999-2008
Source: UCI Machine Learning Repository
Records: 100,000
Fields: 55 attributes
Dataset
Name: ChestXray-NIHCC
Source: NIH Clinical Center
Records: 112,120
Fields: numerical, categorical, and image attributes
GDD Pre-processing
Removing features that are obviously irrelevant to the classification
process, such as patient ID, hospital name, room number, etc.
Converting categorical features into numeric features (Diabetes only)
Removing features with more than 50% missing values (Diabetes only)
Removing records with most of their fields missing (Diabetes only)
Filling the remaining missing values with the mean value (Diabetes only)
Removing features with low variance
Normalizing the attributes (features)
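The pre-processing steps above can be sketched with pandas; this is a minimal illustration, not the exact pipeline, and the identifier column names are placeholders:

```python
# Sketch of the GDD pre-processing steps on a pandas DataFrame `df`.
# Column names in `id_cols` are hypothetical examples of irrelevant features.
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame, id_cols=("patient_id", "room_number")) -> pd.DataFrame:
    # 1. Drop obviously irrelevant identifier columns.
    df = df.drop(columns=[c for c in id_cols if c in df.columns])
    # 2. Encode categorical (string) features as integer codes; keep NaN as NaN.
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype("category").cat.codes.replace(-1, np.nan)
    # 3. Drop features with more than 50% missing values.
    df = df.loc[:, df.isna().mean() <= 0.5]
    # 4. Drop records where most fields are missing.
    df = df[df.isna().mean(axis=1) < 0.5]
    # 5. Fill the remaining gaps with the column mean.
    df = df.fillna(df.mean())
    # 6. Drop near-constant features, then min-max normalize the rest.
    df = df.loc[:, df.var() > 1e-8]
    return (df - df.min()) / (df.max() - df.min())
```

The low-variance filter runs before normalization so the min-max division never hits a zero range.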
Training procedure
Separate the dataset into two subsets: one containing only the
positive class and the other only the negative class
Split each subset into training and testing sets (70% and 30%,
respectively)
Cluster each training subset with the k-means clustering algorithm into
k clusters, where k ∈ [1 : 20]
(k can differ between the positive and negative subsets)
Save the obtained centers
Test the performance of the selected combination of centers on the
test set
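A minimal sketch of this training loop using scikit-learn's KMeans; `X_pos` and `X_neg` are assumed per-class feature matrices, and the nearest-center scorer stands in for the scoring frame described later:

```python
# Cluster each class separately with k-means, then exhaustively score every
# (k_pos, k_neg) combination of saved centers on the held-out 30% splits.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

def nearest_center_accuracy(c_pos, c_neg, pos_te, neg_te):
    centers = np.vstack([c_pos, c_neg])
    labels = np.array([1] * len(c_pos) + [0] * len(c_neg))
    def predict(X):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        return labels[d.argmin(axis=1)]  # class of the nearest center
    hits = (predict(pos_te) == 1).sum() + (predict(neg_te) == 0).sum()
    return hits / (len(pos_te) + len(neg_te))

def train_centers(X_pos, X_neg, k_max=20, seed=0):
    # 70/30 split per class, as in the procedure above.
    pos_tr, pos_te = train_test_split(X_pos, test_size=0.3, random_state=seed)
    neg_tr, neg_te = train_test_split(X_neg, test_size=0.3, random_state=seed)
    # Cluster each training subset for every k in [1, k_max] and save the centers.
    cands_p = [KMeans(n_clusters=k, n_init=10, random_state=seed).fit(pos_tr).cluster_centers_
               for k in range(1, k_max + 1)]
    cands_n = [KMeans(n_clusters=k, n_init=10, random_state=seed).fit(neg_tr).cluster_centers_
               for k in range(1, k_max + 1)]
    # Return the best-scoring combination: (accuracy, pos_centers, neg_centers).
    return max(((nearest_center_accuracy(cp, cn, pos_te, neg_te), cp, cn)
                for cp in cands_p for cn in cands_n), key=lambda t: t[0])
```

With k_max = 20 this evaluates 400 center combinations, which is the exhaustive search revisited in the Future Work section.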
Testing procedure
For every point in the test set, find the nearest center to the test case
Calculate the classification score as illustrated in the next frame
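The nearest-center decision for a single test case can be written in a few lines of NumPy; the center arrays are assumed to come from the training step, and the names are placeholders:

```python
# Label a test point by the class of its closest saved k-means center.
import numpy as np

def classify(x, pos_centers, neg_centers):
    d_pos = np.linalg.norm(pos_centers - x, axis=1).min()  # distance to nearest positive center
    d_neg = np.linalg.norm(neg_centers - x, axis=1).min()  # distance to nearest negative center
    return "positive" if d_pos < d_neg else "negative"
```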
Slide-Detect Pre-processing
Separate the dataset classes into sample and control subsets
Normalize the images in each subset
Apply a series of rotation, translation, rescaling, flipping, and zoom
operations to both subsets
Save the resulting images to their corresponding subsets
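A NumPy-only sketch of the normalize-then-augment pass; a production pipeline would typically use a dedicated augmentation library, and the specific angles and shift sizes here are illustrative choices, not the deck's parameters. `img` is assumed to be a square 2-D grayscale array with side lengths divisible by 4:

```python
# Normalize one image to [0, 1], then emit rotated, translated, flipped,
# and zoomed variants for the augmented subset.
import numpy as np

def augment(img):
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)   # normalize to [0, 1]
    h, w = img.shape
    crop = img[h // 4: 3 * h // 4, w // 4: 3 * w // 4]         # central crop
    return [
        img,                             # original (normalized)
        np.rot90(img),                   # rotation
        np.roll(img, w // 8, axis=1),    # horizontal translation
        np.fliplr(img),                  # horizontal flip
        np.kron(crop, np.ones((2, 2))),  # 2x zoom of the central crop
    ]
```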
Figure 9: Score progress on the cervical cancer dataset as the number of
positive- and negative-class clusters changes
Figure 11: Age distribution among infiltration patients in the ChestXray-NIHCC dataset
Conclusion
This work proposes two algorithms for computer-aided diagnosis of
diabetes, cervical cancer, and lung infiltration
Both algorithms outperformed the state of the art on the same
datasets, achieving accuracies of 0.999 and 0.958 for GDD and 0.9333 for
Slide-Detect
Future Work
The GDD algorithm performs an exhaustive search for the optimum
number of clusters, which is computationally expensive. A binary
search algorithm may achieve the same results in logarithmic time
The Slide-Detect algorithm can be extended to support 3D scans
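The logarithmic-search idea can be sketched as follows. One concrete realization is a ternary-style search over k, which needs only O(log k_max) score evaluations instead of k_max, but it rests on the unverified assumption that the validation score is unimodal in k; `score` is a hypothetical callable mapping k to validation accuracy:

```python
# Ternary-style search for the k maximizing a unimodal score function,
# as a possible replacement for the exhaustive scan over k in [1, 20].
def search_k(score, lo=1, hi=20):
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if score(m1) < score(m2):
            lo = m1 + 1   # the maximum cannot lie at or below m1
        else:
            hi = m2       # the maximum cannot lie above m2
    # At most three candidates remain; check them directly.
    return max(range(lo, hi + 1), key=score)
```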