
Cancer cell classification using Scikit-learn

This tutorial uses the Breast Cancer Wisconsin (Diagnostic) dataset. The dataset
contains measurements of breast cancer tumors along with a classification label
for each one: malignant or benign.

pip install scikit-learn


#Importing the necessary module and dataset.
# importing the Python module
import sklearn
# importing the dataset
from sklearn.datasets import load_breast_cancer

#Loading the dataset to a variable


data = load_breast_cancer()

#Organizing the data and looking at it.


label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']

# looking at the data


print(label_names)
# each tumor sample is labelled as either 'malignant' or 'benign'.

print(labels)
# each label is encoded as a binary value: 0 represents malignant
# tumors and 1 represents benign tumors.

print(feature_names)
# the 30 features (attributes) recorded for each tumor sample. The numerical
# values of these features are used to train the model and to predict whether
# a tumor is malignant or benign.

print(features)
# the numerical values of the 30 attributes for all 569 tumor samples,
# stored as an array with one row per sample.
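
Printing the raw arrays produces a lot of output, so a compact summary can be easier to read. The following is a minimal sketch, assuming NumPy is available (it is installed alongside scikit-learn); the names n_samples, n_features and class_counts are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
features = data["data"]
labels = data["target"]
label_names = data["target_names"]

# One row per tumor sample, one column per feature.
n_samples, n_features = features.shape

# Count how many samples carry each label (index 0 = malignant, 1 = benign).
class_counts = np.bincount(labels)

print(f"{n_samples} samples, {n_features} features")
for name, count in zip(label_names, class_counts):
    print(f"{name}: {count}")
```

The counts show that the two classes are not equally represented, which is worth keeping in mind when reading the accuracy figure later on.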

#Organizing the data into Sets.


# Split the data into two sets: a training set and a test set. The training set
# is used to fit the model, and the trained model is then used to make
# predictions on the unseen test set.

# importing the function


from sklearn.model_selection import train_test_split

# splitting the data


train, test, train_labels, test_labels = train_test_split(features, labels,
                                                          test_size=0.33,
                                                          random_state=42)

# The train_test_split() function randomly splits the data according to the
# test_size parameter. Here 33% of the original data is held out as test data
# (test) and the remaining data (train) is the training data. The matching
# labels are returned as train_labels and test_labels. Setting random_state
# makes the split reproducible.
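
To see what the split looks like in practice, the sizes of the resulting arrays can be checked. This is a small sketch that repeats the same call so it runs on its own; the variable test_fraction is an illustrative name, not part of the original code.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data["data"], data["target"], test_size=0.33, random_state=42
)

# Each split keeps all 30 feature columns; only the rows are divided.
print("train:", train.shape, "test:", test.shape)

# The realized test fraction is close to, but not exactly, 0.33
# because the number of samples must be a whole number.
test_fraction = len(test) / len(data["data"])
print(f"test fraction: {test_fraction:.3f}")
```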

#Building the Model.


This model uses the Naive Bayes algorithm, which usually performs well in
binary classification tasks. First, import the GaussianNB class and create a
classifier with GaussianNB(). Then train the model by fitting it to the
training data using the fit() method.

# importing the module of the machine learning model


from sklearn.naive_bayes import GaussianNB

# initializing the classifier


gnb = GaussianNB()

# training the classifier


model = gnb.fit(train, train_labels)

# making the predictions


predictions = gnb.predict(test)

# printing the predictions


print(predictions)

# predict() returns an array of 0s and 1s: the predicted tumor class
# (0 = malignant, 1 = benign) for each sample in the test set.
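
The numeric predictions can be turned back into readable class names by indexing the array of target names. The sketch below rebuilds the classifier so it runs on its own; predicted_names is a name introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data["data"], data["target"], test_size=0.33, random_state=42
)

gnb = GaussianNB()
gnb.fit(train, train_labels)
predictions = gnb.predict(test)

# target_names is a NumPy array, so indexing it with the integer
# predictions maps each 0/1 to 'malignant'/'benign'.
predicted_names = data["target_names"][predictions]
print(predicted_names[:5])
```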
# importing the accuracy measuring function
from sklearn.metrics import accuracy_score

# evaluating the accuracy


print(accuracy_score(test_labels, predictions))

On this test split, the machine learning classifier based on the Naive Bayes
algorithm is 94.15% accurate in predicting whether a tumor is malignant or benign.
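
Accuracy is the fraction of test samples whose predicted label matches the true label. As a rough sketch of what accuracy_score computes, the snippet below derives the same number by hand and also prints a confusion matrix, which is an addition beyond the original walkthrough and shows how the errors are distributed across the two classes. The names manual_accuracy and cm are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data["data"], data["target"], test_size=0.33, random_state=42
)
predictions = GaussianNB().fit(train, train_labels).predict(test)

# Library result and the equivalent manual calculation.
accuracy = accuracy_score(test_labels, predictions)
manual_accuracy = np.mean(predictions == test_labels)
print(f"accuracy: {accuracy:.4f} (manual: {manual_accuracy:.4f})")

# Rows are true classes, columns are predicted classes
# (index 0 = malignant, 1 = benign).
cm = confusion_matrix(test_labels, predictions)
print(cm)
```

The diagonal of the matrix holds the correct predictions, so its sum divided by the total number of test samples equals the accuracy.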

References
https://www.geeksforgeeks.org/ml-cancer-cell-classification-using-scikit-learn/
