
Summer Internship

Project Report
on

Classification of Microcalcifications
Using CNN
Under the guidance of
Dr Satish Kumar Singh
Associate Professor, Department of Information Technology,
Indian Institute of Information Technology (IIIT-Allahabad)

Submitted by

D. Abhishek Reddy (16103109)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING,


INSTITUTE OF TECHNOLOGY,
GURU GHASIDAS VISHWAVIDYALAYA, BILASPUR,
CHHATTISGARH
INSTITUTE PROFILE

(A university established under Sec. 3 of the UGC Act, 1956, vide Notification
No. F.9-4/99-U.3 dated 04.08.2000 of the Govt. of India)

Indian Institute of Information Technology, Allahabad (IIIT-Allahabad) is a
public university located in Jhalwa, Allahabad, in the state of Uttar Pradesh
in northern India. It is one of the nineteen Indian Institutes of
Information Technology listed by the Ministry of Human Resource
Development.

The institution was founded in 1999, and in the following year received
university status and the right to award its own degrees. In 2014 the IIIT Act
was passed, under which IIITA and four other Institutes of Information
Technology funded by the Ministry of Human Resource Development were
classed as Institutes of National Importance.

The first director of the institute was M.D. Tiwari, from 1999 to 2013. G.C.
Nandi served as director in-charge for the first four months of 2014, until
Somenath Biswas took over the directorship and served from May 2014 to July
2016. After another stint by Nandi as director that lasted until May 2017,
P. Nagabhushan was appointed director.

IIIT-Allahabad was ranked 119th among BRICS nations by the QS World University
Rankings of 2019. Among engineering colleges in India, IIIT-Allahabad was ranked
10th by India Today in 2019 and 12th by Outlook India in 2017. It was ranked 82nd
among engineering colleges by the National Institutional Ranking Framework
(NIRF) in 2019.
CERTIFICATE
ACKNOWLEDGEMENT

I have put a great deal of effort into this project. However, it would not have
been possible without the kind support and help of many individuals and
organizations. I would like to extend my sincere thanks to all of them.

I am highly indebted to Dr Satish Kumar Singh for his guidance and constant
supervision, for providing the necessary information regarding the project,
and for his support in completing it.

I would like to express my special gratitude and thanks to Research Scholar
Miss Suvidha Tripathi for giving me so much of her valuable time. Without her
help, the results and outcome of this work would not have been achieved.

I would like to express my gratitude towards my parents and the members of the
Computer Vision and Biometrics Lab, IIIT Allahabad, for their kind co-operation
and encouragement, which helped me complete this project.

I would also like to express my special gratitude and thanks to the other
research scholars for giving me their attention and time.

My thanks and appreciation also go to my colleagues who helped develop the
project and to the people who willingly helped me with their abilities.
TABLE OF CONTENTS:

Abstract

Chapter 1. Introduction

Chapter 2. Technology and Architecture Used

Chapter 3. Design

Chapter 4. Results obtained

Chapter 5. Program description

Chapter 6. Program listing

Chapter 7. Conclusion

Chapter 8. Future Work

Bibliography
ABSTRACT

This project deals with the classification of breast microcalcifications in multi-view
mammograms into benign and malignant categories. Detecting them is important because many
women are affected by breast cancer, and detection at an early stage leads to better outcomes.
It is therefore better to look for the signs early and take appropriate precautions than to
wait until the disease has developed. A pretrained model is taken and fine-tuned on the
training set. The dataset used is the Digital Database for Screening Mammography (DDSM).
After the model has been trained and tested, an input image is given to it and it predicts
whether the image belongs to the benign or the malignant category. The labels could also be
made noisy by randomly shuffling the true labels with some probability, and the model trained
on them to give better results, but this is not included in this work.
1. INTRODUCTION:

BREAST CALCIFICATIONS:
• Breast calcifications are small calcium deposits that develop in a woman's breast
tissue. They are very common and are usually benign (noncancerous).

• There are two types of calcifications: macrocalcifications, which are almost always
noncancerous, and microcalcifications, which are usually benign but can sometimes be a
sign of cancer.

• This report focuses on microcalcifications, because they can indicate cancer.

BREAST MICROCALCIFICATIONS:

• Microcalcifications are small calcium deposits that look like white spots on
a mammogram.

• Microcalcifications are usually not a result of cancer. But if they appear in certain
patterns and are clustered together, they may be a sign of precancerous cells or early
breast cancer.

Fig: Micro calcifications in a mammogram

Why is it important to detect microcalcifications?

• The appearance of microcalcifications is widely used to detect breast cancer at an
early stage, which can lead to better outcomes.

• Nearly 50% of non-palpable cancers in the breast are detected only by the presence of
microcalcifications on a mammogram.

2. TECHNOLOGY USED:

• The driver program for microcalcification detection is written in Python 3.6.7.
• The images used are from the DDSM dataset, which is available on Kaggle.
• Keras is used as the open-source neural network library for the deep learning functions.
• Training and testing on the dataset were done on Google Colaboratory.

ARCHITECTURE USED:

The pretrained model is modified to suit our dataset: its last layers are removed and a
few new layers are added, so that it can be fine-tuned on the images that were
preprocessed earlier.

The layers which are added are as follows:

x=mobile.get_layer('conv_pw_13_relu').output

x=Flatten()(x)

predictions=Dense(2, activation = 'softmax')(x)

The total parameters of the model are:

Total params: 3,394,754


Trainable params: 3,372,866
Non-trainable params: 21,888
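
The totals above can be cross-checked with a little arithmetic. The sketch below is only a consistency check, not output from the project: the base parameter count of MobileNet without its top layers (3,228,864) and the 9 x 9 x 1024 shape of the conv_pw_13_relu output for 299 x 299 inputs are assumptions, and the variable names are made up for illustration.

```python
# Assumption: Keras MobileNet (width multiplier 1.0) without its top layers.
BASE_PARAMS = 3_228_864
# Assumption: shape of the conv_pw_13_relu output for a 299 x 299 input.
FEATURE_MAP = (9, 9, 1024)
NUM_CLASSES = 2
# Non-trainable count as reported in the model summary above.
NON_TRAINABLE = 21_888


def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


height, width, channels = FEATURE_MAP
flattened = height * width * channels            # size after Flatten()
head_params = dense_params(flattened, NUM_CLASSES)
total_params = BASE_PARAMS + head_params
trainable_params = total_params - NON_TRAINABLE

print("flattened features:", flattened)
print("dense head params:", head_params)
print("total params:", total_params)
print("trainable params:", trainable_params)
```

Under these assumptions the sum reproduces the reported total of 3,394,754 parameters, which suggests that the new Dense layer accounts for roughly 166 thousand of them.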

3. DESIGN:

(a) Preprocessing the dataset for training the model:

• The dataset used is the Digital Database for Screening Mammography (DDSM).
• It is available on Kaggle in TFRecord format.
• First, we extracted the images from the TFRecord files and distributed them into
training, validation and testing sets.
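
The distribution step in the last item above can be sketched as follows. This is a minimal illustration rather than the procedure used in the project: the function name split_files, the 80/10/10 fractions, the seed and the example file names are all assumptions.

```python
import random


def split_files(filenames, train_frac=0.8, valid_frac=0.1, seed=0):
    """Shuffle file names and split them into train, validation and test lists."""
    names = list(filenames)
    random.Random(seed).shuffle(names)   # reproducible shuffle
    n_train = int(len(names) * train_frac)
    n_valid = int(len(names) * valid_frac)
    train = names[:n_train]
    valid = names[n_train:n_train + n_valid]
    test = names[n_train + n_valid:]     # whatever is left over
    return train, valid, test


# Hypothetical file names, for illustration only.
example_files = ["%d.jpg" % i for i in range(100)]
train, valid, test = split_files(example_files)
print(len(train), len(valid), len(test))
```

Keeping the three lists disjoint matters, because a test image that also appears in the training list would make the measured accuracy look better than it is.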

Fig: Extracted images from the dataset

(b) Creating a CNN model for training:

We used the Keras MobileNet model, which is available pretrained on ImageNet, for image
classification. Because the pretrained MobileNet classifies images into 1000 classes, we
replaced its final layers with a two-class output layer.

(c) Training the model on the preprocessed dataset:

The model was trained on 20000 images. Validation was done on 2000 images and testing was
done on 20000 images. The weights were saved in a separate file to be used later for
predicting labels.

(d) Predicting the labels:

Another program is written for prediction. It loads the weights that were saved during
training and uses them for prediction.

4. RESULTS OBTAINED:

AFTER TRAINING ON DATA SET THE RESULTS OBTAINED ARE:

The overall accuracy of the model after the final epoch is 96.8%.

Training took a total of 5943 seconds (about 1.65 hours).

CONFUSION MATRIX AFTER PREDICTION:

Out of 20000 inputs, 19364 predictions are correct.
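
The reported figures are consistent with each other, as the short calculation below shows. It uses only numbers stated in this report (19364 correct out of 20000, 5943 seconds, and the 10 epochs set in Training.py); the variable names are illustrative.

```python
correct = 19364          # correct predictions reported above
total = 20000            # number of test inputs
training_seconds = 5943  # reported training time
epochs = 10              # number of epochs used in Training.py

accuracy = correct / total
misclassified = total - correct
training_hours = training_seconds / 3600
seconds_per_epoch = training_seconds / epochs

print("accuracy: %.2f%%" % (accuracy * 100))        # 96.82%
print("misclassified:", misclassified)              # 636
print("training time: %.2f h" % training_hours)     # 1.65 h
print("time per epoch: %.1f s" % seconds_per_epoch) # 594.3 s
```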

OTHER METRICS:
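
Training.py computes precision, recall and F1-score with average='weighted'. As a rough sketch of what that averaging means, the function below derives the same kind of metrics from a confusion matrix in plain Python. The function name weighted_metrics and the example counts are hypothetical; they are not the results of this project.

```python
def weighted_metrics(cm):
    """Support-weighted precision, recall and F1 from a confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    precision = recall = f1 = 0.0
    for k in range(n_classes):
        tp = cm[k][k]
        support = sum(cm[k])                              # true samples of class k
        predicted = sum(cm[i][k] for i in range(n_classes))
        p_k = tp / predicted if predicted else 0.0
        r_k = tp / support if support else 0.0
        f_k = 2 * p_k * r_k / (p_k + r_k) if (p_k + r_k) else 0.0
        weight = support / total                          # class share of the data
        precision += weight * p_k
        recall += weight * r_k
        f1 += weight * f_k
    accuracy = sum(cm[k][k] for k in range(n_classes)) / total
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Hypothetical counts for illustration only (not the project's confusion matrix).
toy_cm = [[8, 2],
          [1, 9]]
metrics = weighted_metrics(toy_cm)
for name, value in metrics.items():
    print(name, round(value, 4))
```

With support weighting, the weighted recall always equals the overall accuracy, so in the report those two numbers should match.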

5. PROGRAM DESCRIPTION:
• Extract.py: This program extracts the images from the Digital Database for
Screening Mammography (DDSM) dataset and saves them to local storage.
• Training.py: This program trains the model on the given dataset. It was run on
Google Colab. It uses the Adam optimizer for training and also performs validation.
It then tests the model on the dataset, draws a confusion matrix and prints other
metrics.

6. PROGRAM LISTING:
a) Extract.py:
# Copied from the Kaggle website (uses the TensorFlow 1.x queue-based input API)
import tensorflow as tf
import matplotlib.pyplot as plt
# %matplotlib inline   (IPython magic; valid only inside a notebook)

def read_and_decode_single_example(filenames):
    # Hold the file names in a FIFO queue
    filename_queue = tf.train.string_input_producer(filenames, num_epochs=1)

    reader = tf.TFRecordReader()  # Define a TFRecordReader

    _, serialized_example = reader.read(filename_queue)  # Returns the next record

    features = tf.parse_single_example(
        serialized_example,
        features={
            'label_normal': tf.FixedLenFeature([], tf.int64),
            'image': tf.FixedLenFeature([], tf.string)
        })

    # Now return the converted data
    label = features['label_normal']
    image = tf.decode_raw(features['image'], tf.uint8)
    image = tf.reshape(image, [299, 299, 1])

    return label, image

label, image = read_and_decode_single_example([
    "C:/Users/Abhishek Reddy/Downloads/ddsm-mammography/training10_2/training10_2.tfrecords",
    "C:/Users/Abhishek Reddy/Downloads/ddsm-mammography/training10_4/training10_4.tfrecords"])
images_batch, labels_batch = tf.train.shuffle_batch(
    [image, label], batch_size=16, capacity=2000, min_after_dequeue=10)
global_step = tf.Variable(0, trainable=False)
# For saving the images in separate folders for training and validation
import cv2
from io import StringIO
import numpy as np
from PIL import Image
import matplotlib

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())
    #path='F:/Extracted Images'
    coord = tf.train.Coordinator()  # Used to manage the threads
    # The queue runners must be started so that the input queues get filled
    threads = tf.train.start_queue_runners(coord=coord)
    p = 91000
    label = []
    for j in range(2000):
        la_b, im_b = sess.run([labels_batch, images_batch])

        for i in range(10):
            #plt.imshow(im_b[i].reshape([299,299]))
            filename1 = "F:/Extracted Images/training set/0/%d.jpg" % (p+1)
            filename2 = "F:/Extracted Images/training set/1/%d.jpg" % (p+1)
            #plt.title("Label: " + str(la_b[i]))
            if str(la_b[i]) == '0':
                # Class 0 images are skipped in this run
                #cv2.imwrite(filename1,im_b[i])
                continue
            else:
                cv2.imwrite(filename2, im_b[i])
                label.append(la_b[i])
            #plt.show()
            p = p + 1
    #print(label)
    coord.request_stop()  # Stop the threads

    # Wait for threads to stop
    coord.join(threads)
    # The session is closed automatically when the with-block ends


b) Training.py:
# For mounting the drive:
from google.colab import drive
drive.mount('/content/drive')
# Import libraries and MobileNet weights
import numpy as np
import keras
from keras import backend as K
from keras.models import Sequential
from keras.models import Model
from keras.layers import Activation, Input, Dropout
from keras.layers.core import Dense, Flatten
from keras.optimizers import Adam
from keras.metrics import categorical_crossentropy
from keras.preprocessing.image import ImageDataGenerator
from keras.layers.normalization import BatchNormalization
from keras.layers.convolutional import *
from keras.models import load_model
from matplotlib import pyplot
import matplotlib.pyplot as plt  # plt is used by plot_confusion_matrix
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
from sklearn.metrics import confusion_matrix
import itertools

input1 = Input(shape=(299, 299, 3))
mobile = keras.applications.mobilenet.MobileNet(
    weights='imagenet', include_top=False,
    input_tensor=input1, input_shape=None, pooling=None, classes=2)

# Viewing the summary of the pretrained base model:
mobile.summary()

# Adding a few layers at the end of the model:
x = mobile.get_layer('conv_pw_13_relu').output
x = Flatten()(x)
predictions = Dense(2, activation='softmax')(x)  # Number of classes to be classified is 2
model = Model(inputs=input1, outputs=predictions)  # Integrating the whole model
model.summary()  # Viewing the model

# Loading training and validation batches:
# Contains 10000 images each of classes 0 and 1, so a total of 20000 images
train_path = r'/content/drive/My Drive/training set'
# Contains 4870 images of class 0 and 1030 of class 1
valid_path = r'/content/drive/My Drive/train1'
# Preprocessing the images for training (dividing the images into training and
# validation batches)
train_batches = ImageDataGenerator().flow_from_directory(
    train_path, target_size=(299, 299), classes=['0', '1'], batch_size=25)
valid_batches = ImageDataGenerator().flow_from_directory(
    valid_path, target_size=(299, 299), classes=['0', '1'], batch_size=8)

# Compiling the model and adding parameters for training:
model.compile(Adam(lr=0.0001), loss='binary_crossentropy', metrics=['accuracy'])

# Training the model for 10 epochs
H = model.fit_generator(train_batches, steps_per_epoch=1000,
                        validation_data=valid_batches, validation_steps=800,
                        epochs=10, verbose=1)

# Saving the model
model.save(r'/content/drive/My Drive/mobilenet1.h5')

# Loading the saved model (the same file that was written above)
loaded_model = load_model(r'/content/drive/My Drive/mobilenet1.h5')

# Plotting the results of the training
pyplot.subplot(211)
pyplot.title('Loss')
pyplot.plot(H.history['loss'], label='train')
pyplot.plot(H.history['val_loss'], label='validation')
pyplot.legend()
pyplot.subplot(212)
pyplot.title('Accuracy')
pyplot.plot(H.history['acc'], label='train')
pyplot.plot(H.history['val_acc'], label='validation')
pyplot.legend()
pyplot.show()

# Testing the model
# Note: this path points at the same folder as train_path above
test_path = r'/content/drive/My Drive/training set'
test_batches = ImageDataGenerator().flow_from_directory(
    test_path, target_size=(299, 299), classes=['0', '1'], batch_size=50, shuffle=False)
# 400 steps of 50 images cover all 20000 images
predictions = loaded_model.predict_generator(test_batches, steps=400, verbose=1)

# Plotting the confusion matrix (copied from GitHub)
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Plotting the confusion matrix
test_labels = test_batches.classes
cm = confusion_matrix(test_labels, predictions.argmax(axis=1))
test_batches.class_indices
cm_plot_labels = ['0', '1']
plot_confusion_matrix(cm, cm_plot_labels, title='confusion matrix')


# Showing the metrics
print('precision', precision_score(test_labels, predictions.argmax(axis=1), average='weighted'))
print('recall', recall_score(test_labels, predictions.argmax(axis=1), average='weighted'))
print('accuracy', accuracy_score(test_labels, predictions.argmax(axis=1)))
print('f1-score', f1_score(test_labels, predictions.argmax(axis=1), average='weighted'))

FUTURE WORK:
• Creating noisy labels: Noisy labels are formed by randomly shuffling the original
labels with some probability. According to the referenced paper, accuracy can be
improved by training with such labels.

• The size of the dataset can be increased to improve accuracy.

• The number of epochs can be increased, provided the model does not overfit the
training data.

• A graphical user interface can be built for testing, so that the user does not have to
interact with the code.

• Layers can be added to or removed from the model and the dataset tested again to see
whether the results improve.

• Other pretrained models can be fine-tuned and tested on the dataset for
better accuracy.
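
The noisy-label idea in the first item above could be sketched as follows. This is one possible reading, not the method of the referenced paper: for a two-class problem, "shuffling" a label is taken to mean flipping it to the other class. The function name add_label_noise, the flip probability and the example labels are assumptions.

```python
import random


def add_label_noise(labels, flip_prob, seed=None):
    """Return a copy of binary labels where each one is flipped with flip_prob."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < flip_prob:
            noisy.append(1 - y)   # flip 0 -> 1 or 1 -> 0
        else:
            noisy.append(y)       # keep the true label
    return noisy


# Hypothetical labels, for illustration only.
clean = [0, 1, 1, 0, 1, 0, 0, 1]
noisy = add_label_noise(clean, flip_prob=0.2, seed=42)
changed = sum(1 for a, b in zip(clean, noisy) if a != b)
print("labels changed:", changed, "of", len(clean))
```

The noise would be applied to the training labels only; validation and test labels stay clean so that the measured accuracy remains meaningful.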

BIBLIOGRAPHY:

1. CNN - https://en.wikipedia.org/wiki/Convolutional_neural_network
2. OpenCV - https://opencv.org
3. Python - https://www.python.org
4. TensorFlow - https://github.com/tensorflow
5. Keras - https://github.com/keras-team, Documentation - https://keras.io
6. CNN - cs231n.stanford.edu
7. CNN - https://youtu.be/vT1JzLH4G4 (Stanford University lectures)
8. Referenced paper: "Training a Neural Network Based on Unreliable Human Annotation of
Medical Images" by Yair Dgani, Hayit Greenspan, Jacob Goldberger
9. Dataset: DDSM from www.kaggle.com
