
Artificial Intelligence

Group Project:
Breast Cancer Classification
Group4 – CityU7D
Tạ Thị Phương Anh
Nguyễn Đức Anh
Đoàn Lê Thiện Hảo
Hà Văn Nguyên
Nguyễn Việt Tùng
Table of contents

1. Introduction section
2. Dataset Description section
3. Methodology & Algorithm section
4. Evaluation section

1. Introduction section

We use the dataset to evaluate how well each model performs and then select the best one.
This is a binary classification task: each sample is a breast cancer diagnosis, and the two
classes are benign and malignant.
2. Dataset Description section

The dataset has 569 rows and 33 columns. Attribute number 2 is categorical; the rest are
numerical.
1) ID number
2) Diagnosis (M = malignant, B = benign)
Attributes 3 to 32:
Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three largest values) of these
features were computed for each image, resulting in 30 features. For instance, field 3 is Mean
Radius, field 13 is Radius SE, field 23 is Worst Radius.
The first attribute, ID, is numeric and would otherwise be
treated as a feature, so we remove it using the
drop command. Without the ID column, the
dataset still works normally and the models
are not disturbed by the ID number. For
columns whose value ranges differ greatly
from the rest, we separate them out and
transform them using the standard
deviation method (standardization).
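A minimal sketch of these two preprocessing steps, using only the Python standard library. It assumes that "the standard deviation method" means z-score standardization (subtract the mean, divide by the standard deviation); the slides do not name the exact transform. The helper names `drop_column` and `standardize`, the column labels, and the toy numbers are all made up for illustration.

```python
from statistics import mean, pstdev


def drop_column(rows, name):
    """Return copies of the rows without the given column."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Toy rows with made-up numbers, only to show the two steps.
rows = [
    {"id": 101, "radius_mean": 10.0, "area_mean": 300.0},
    {"id": 102, "radius_mean": 14.0, "area_mean": 600.0},
    {"id": 103, "radius_mean": 18.0, "area_mean": 900.0},
]

rows = drop_column(rows, "id")  # the ID is not a diagnostic feature
scaled_area = standardize([row["area_mean"] for row in rows])
```

After scaling, a wide-ranging column such as area sits on the same scale as the other features, which matters for models like logistic regression that are sensitive to feature magnitude.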

This dataset has a label for each data sample, so this is a supervised learning problem. There
are no missing values in this dataset, and there is no "noise" data either.
In the United States, breast cancer is the
second leading cause of cancer death in women,
after lung cancer, but this rate is showing
signs of decreasing.
3. Methodology & Algorithm section
Logistic Regression, Decision Tree, Random Forest, and XGBoost were applied.

The main idea

Logistic Regression: we use it to get an output that can be transformed to return
a probability value.
Decision Tree: use a decision tree algorithm to predict the output class of each
sample in the dataset.
Random Forest: build multiple decision trees at random and then combine their
predictions.
We chose these algorithms because they are supervised learning
algorithms with high accuracy, and they share some similarities.
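Two of the ideas above can be sketched briefly. The first function shows the sigmoid transform that logistic regression uses to turn a real-valued score into a probability. The second shows one common way to combine the outputs of several trees, a majority vote. The slides only say the trees are combined, so the voting rule is an assumption, as are the function names and the 0.5 threshold. Class labels follow the slides: 0 is benign, 1 is malignant.

```python
import math
from collections import Counter


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids overflow for large negative scores.
    e = math.exp(z)
    return e / (1.0 + e)


def predict_label(z, threshold=0.5):
    """Return 1 (malignant) if the probability reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0


def majority_vote(tree_predictions):
    """Combine per-tree class labels by picking the most common one."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

For example, `sigmoid(0.0)` is exactly 0.5, and `majority_vote([1, 0, 1])` returns 1 because two of the three hypothetical trees predict malignant.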
4. Evaluation section

Accuracy is used when the True Positives and True Negatives are the most
important outcomes.
Based on the selected evaluation metric, our best model scored highest,
with an accuracy of 97.36%.
Logistic Regression

2 samples are wrongly predicted: actual is 1 (Malignant) ==> prediction is 0 (Benign)
1 sample is wrongly predicted: actual is 0 (Benign) ==> prediction is 1 (Malignant)
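The error counts above can be turned into an accuracy value with a short calculation. The slides do not state the size of the test set, so the sketch below assumes 114 samples (a 20% hold-out of the 569 rows). That figure and the function name `accuracy_from_errors` are assumptions for illustration only.

```python
def accuracy_from_errors(n_samples, fp, fn):
    """Accuracy = correct predictions / all predictions.

    fp: actual 0 (benign) predicted as 1 (malignant)
    fn: actual 1 (malignant) predicted as 0 (benign)
    """
    correct = n_samples - fp - fn
    return correct / n_samples


# ASSUMPTION: test-set size is not given in the slides; 114 is a guess
# based on a 20% split of 569 rows.
N_TEST = 114

# Logistic Regression errors as reported: 2 false negatives, 1 false positive.
lr_accuracy = accuracy_from_errors(N_TEST, fp=1, fn=2)
print(f"{lr_accuracy:.4f}")  # 0.9737
```

Under that assumed test size, 111 of 114 predictions are correct, which is close to the reported 97.36%. Note that accuracy weighs both kinds of error equally, although in this task a missed malignant case (predicted benign) is usually the more costly mistake.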
Decision Tree

4 samples are wrongly predicted: actual is 1 (Malignant) ==> prediction is 0 (Benign)
4 samples are wrongly predicted: actual is 0 (Benign) ==> prediction is 1 (Malignant)
Random Forest

4 samples are wrongly predicted: actual is 1 (Malignant) ==> prediction is 0 (Benign)
XGBoost

3 samples are wrongly predicted: actual is 1 (Malignant) ==> prediction is 0 (Benign)
Thanks!
Any questions?
