
BREAST CANCER PREDICTION VIA MACHINE LEARNING

Abstract

Breast cancer is a leading cause of death among women. According to cancer reports, its
incidence has been rising steadily worldwide in recent years, and it remains one of the most
dreaded diseases for women. Although the medical field holds enormous amounts of data,
specific tools and techniques are needed to handle that data, and classification is one of the
main techniques used. This system predicts the likelihood of breast cancer using
classification techniques and reports the chance of occurrence as a percentage. A real
dataset is used in order to obtain an accurate prediction. The data are processed in Python
using six machine learning algorithms: Naïve Bayes, Decision Tree, Support Vector Machine
(SVM), K-Nearest Neighbours (KNN), Random Forest, and Logistic Regression. The aim of
the system is to show which algorithms are best suited to prediction tasks in the medical
field. The results are compared in terms of the accuracy rate, efficiency, and effectiveness of
each algorithm.

DEPT OF CSE 2022-2023 PAGE:1



Chapter 1
INTRODUCTION
Machine learning is the study of algorithms and statistical models that computer systems use
to perform specific tasks without explicit instructions. It is a sub-field of artificial
intelligence concerned with constructing algorithms that can make accurate predictions about
future outcomes. Machine learning algorithms build a mathematical model from sample data,
known as the "training set", in order to make predictions without being explicitly programmed
to perform the task. Classification rules are particularly useful for medical problems and
have been applied widely in the medical field.

Machine learning is the field of study that gives computers the capability to learn without
being explicitly programmed. It is one of the most exciting technologies one could come
across: as the name suggests, it gives the computer something that makes it more similar to
humans, the ability to learn. Machine learning is actively used today, perhaps in many more
places than one would expect.

Data mining is the extraction of information and knowledge from huge amounts of data, and it
is an essential step in discovering knowledge from databases. There are numerous databases,
data marts, and data warehouses all over the world, and data mining is mainly used to extract
the hidden information they contain. Data mining is also called Knowledge Discovery in
Databases (KDD).

Data mining has four main techniques: classification, clustering, regression, and
association rules. These techniques can rapidly mine vast amounts of data and are needed in
many fields to extract useful information. Fields such as medicine, business, and education
hold vast amounts of data, and this data can be mined with these techniques to obtain more
useful information. Data mining techniques can be implemented through machine learning
algorithms, and each technique can be extended with certain machine learning models.

Data pre-processing is the process of preparing raw data and making it suitable for a
machine learning model. It is the first and a crucial step when creating a machine learning
model, because real-world data is rarely clean and well formatted. Before any operation is
performed on the data, it must be cleaned and put into a consistent format, and that is the
role of the data pre-processing step.


1.1 BREAST CANCER


Breast cancer is a disease in which cells in the breast grow out of control. There are
different kinds of breast cancer; the kind depends on which cells in the breast turn
cancerous. Breast cancer can begin in different parts of the breast. A breast is made up
of three main parts: lobules, ducts, and connective tissue (consisting of fibrous and
fatty tissue) that surrounds and holds everything together. Most breast cancers begin in
the ducts or lobules. Breast cancer can spread outside the breast through blood vessels
and lymph vessels. Breast cancer can be an estrogen-related cancer and is a prime cause
of death among women; it is the second most dangerous cancer after lung cancer. According
to statistics provided by the World Cancer Research Fund for the year 2019, over 2 million
new cases were recorded, with approximately 626,679 deaths. Of all cancers, breast cancer
accounts for 11.6% of new cancer cases and makes up 24.2% of cancers among women.


1.2 OBJECTIVE
The proposed machine-learning approaches could predict breast cancer as the early
detection of this disease could help slow down the progress of the disease and reduce
the mortality rate through appropriate therapeutic interventions at the right time.
Applying different machine learning approaches, accessibility to bigger datasets from
different institutions (multi-centre study), and considering key features from a variety
of relevant data sources could improve the performance of modelling.

The primary concern of this research is to answer queries relevant to the classification
of breast cancer through deep learning schemes using various multi-imaging modalities.
The following queries are considered while designing this comprehensive study.

1. Types of imaging modalities recently used for breast cancer classification.

2. Types of datasets (public and private) used to build deep learning
classification models.

3. Types of DL and ML classifiers recently used for breast cancer classification.

4. Challenges faced by the classifiers in accurately detecting masses.

5. Types of parameters used to evaluate breast cancer classifiers.


CHAPTER 2
LITERATURE SURVEY
Cancer is the second most common cause of death among women worldwide. It is a disorder in
which lethal cells, if left untreated, lead to indolent lesions and mortality. Abnormal
cells are created as a result of genetic mutations, growing out of control and becoming
cancerous due to changes in their deoxyribonucleic acid. A benign (noncancerous) tumour does
not invade neighbouring tissue, while a malignant (cancerous) tumour spreads to multiple
body sites via the lymphatic system and draws nutrients from body tissues. The most dominant
cancer types are lymphoma, sarcoma, carcinoma, leukaemia, and melanoma; carcinoma is the
most widely diagnosed form of cancer.

Breast tissue comprises various connective tissues, blood vessels, lymph nodes, and lymph
vessels. (Figure 1a) shows the anatomy of the female breast. Breast cancer often arises when
breast tissue grows abnormally and cell division is not controlled, resulting in the
formation of a tumour. The developing tumour can be invasive or non-invasive and usually
starts in the milk ducts or the lobules. Invasive cancer may reach the lymph nodes and
spread to different organs via blood vessels, although the cancerous cells often remain
separated from the tumour. Moreover, breast cancer is classified into various subtypes based
on morphology, shape, and structure.

Early identification of breast cancer can assist the prognosis process, successfully
mitigating serious complications of the disease and improving recovery. Various medical
imaging modalities, such as digital mammography breast X-ray images (DMG), ultrasound
sonograms (ULS), magnetic resonance imaging (MRI), biopsy (histological images), and
computerized thermography (CT), are used for breast cancer screening and classification.
The auto-detection of lesions, their volume, and their contour in mammography images is a
prominent sign, most significant for detecting the distorted edge of a malignant tumour and
the smooth edge of a benign one. (Figure 1b) demonstrates benign and malignant masses in a
digital mammogram. This truly helps radiologists investigate malignancy and quickly analyse
lesions, avoiding unnecessary biopsies. Initially, radiologists analyse the images manually,
and final decisions are suggested after mutual consensus with other experts. The
availability of many radiologists at the same time


in under-developed countries is a key issue. Moreover, the precise analysis of multi-class
images depends upon the experience and domain knowledge of the radiologist.

Furthermore, the initial identification of breast cancer needs comprehensive monitoring of
biochemical indicators and imaging modalities. CAD systems can serve as a second opinion for
resolving breast cancer multi-classification issues. They can serve as an inexpensive,
readily accessible, speedy, and consistent source of early diagnosis of breast cancer, and
they can assist radiologists in diagnosing breast cancer abnormalities, which can
significantly decrease the mortality ratio by 30% to 70%.

Recently, various machine learning (ML), artificial intelligence (AI), and neural network
schemes have been exercised for image processing. The key goal of a CAD system is to build
an authentic and reliable system that can limit experimental oversights and assist in
separating benign from malignant lesions with higher accuracy. These systems are used to
enhance image quality for human judgment and to automate the reading of images for better
understanding and interpretation. Currently, many articles on breast cancer detection,
segmentation, and classification using ML and AI techniques have been published. Most
previous studies emphasized ML schemes using binary classification for the detection of
particular cancers such as lung, brain, skin, stomach, kidney, and breast cancer.

Jaffar et al. and Khan et al. proposed novel deep-learning-based models for breast cancer
screening and classification using mammographic images. Qiu et al. proposed a deep learning
technique that classifies breast masses without lesion segmentation and feature selection.
Samala et al. performed binary breast cancer classification while reducing the computational
complexity for all types of mammographic images. Nascimento et al. extracted morphological
features from ULS images using binary classification. Youk et al. proposed a new ULS
technique, named elastography, to differentiate benign and malignant breast lesions. Other
authors developed deep-learning-based techniques for suspicious ROI segmentation and
classification using MRI modalities. Rasti et al. developed a robust DL model for ROI
segmentation and breast tumour classification using segmented DCE-MRI images. De Nazar et
al. proposed a model that selects a variable threshold value for the segmentation of breast
masses. Choi et al. designed a CAD model that extracts the ROI before breast cancer
classification; ROI extraction is the seclusion of abnormal breast tissue from irrelevant
regions, which increases accuracy but also increases the number of images needed for
training and testing. Casti et al. used a QDA-LDA model for auto-localization and
classification of asymmetric ROIs, because it relates directly to the accuracy of the
doctor's prediction and treatment. Nahid et al. [33] proposed an approach that extracts ROI
patches from HP images for the classification of invasive and non-invasive breast cancer by
CNN. Bejnordi et al. and Feng et al. classified biopsy breast WSIs into different categories
through deep convolutional neural networks and achieved the highest accuracy in binary
classification of cancerous slides. Punitha et al. used a depigmentation technique to
overcome the problem of merging neighbouring regions that have almost similar properties.
Strange et al. focused on the classification and distribution of microcalcification based
on a topological model and morphological aspects.


The key objective of this review is to assist researchers in developing a novel and robust
CAD tool that is computationally efficient and can help radiologists during the
classification of breast abnormalities. This comprehensive review has exploited key research
directions based on various multi-image modalities, image segmentation approaches, feature
extraction techniques, types of DL and ML algorithms, and the performance parameters used to
evaluate classification models. Statistical analysis of CAD systems, considering different
aspects, is also highlighted through graphical and tabular representations. The key research
findings are as follows:

As per the literature, there are huge variations in the shapes of abnormal breast tissues,
so benchmarks can drift during the screening process. Micro-calcification morphology is
another significant factor for defining the ROI, based on the distance between individual
micro-calcifications. A fixed-scale approach uses the distance between individual
calcifications to define a micro-calcification cluster, while the invariant-scale approach
is a novel pixel-level approach that visualizes the various morphology aspects (i.e.,
calcification cluster shape, size, density, and distribution) for the radiologist.
Furthermore, histogram-based methods with selection of an optimal threshold are an efficient
approach for the segmentation and classification of masses and calcifications; from the
literature, it is evident that no study has implemented this approach before, so a novel CAD
system based on it needs to be developed to classify calcifications and masses.
Content-based image retrieval is a new approach based on mammogram indexing and ROI patch
classification. The literature shows that no study has used indexing on ROI patches to
classify calcification and masses in mammograms; an indexing and ROI-classification-based
CAD system therefore needs to be developed with the help of expert radiologists to obtain
precise results. Furthermore, some challenges faced by DL algorithms in breast cancer
diagnostics relate to ultrasound images because of their low signal-to-noise ratio (SNR)
compared to other modalities. However, the echogram is a new ULS imaging technology that is
much cheaper for breast screening, so the development of new DL algorithms is a significant
task for breaking through echogram image analysis. CT and MRI modalities produce spatial 3D
data that are very large and need substantial computational resources; the design of
lightweight models is therefore an interesting research direction for both training and
inference.


CHAPTER 3
SOFTWARE AND HARDWARE REQUIREMENTS

SOFTWARE REQUIREMENTS
OPERATING SYSTEM
• Windows

SOFTWARE TOOLS
• Jupyter Notebook
• Python
• Anaconda

HARDWARE REQUIREMENTS
• Processor i3 and above
• 4 GB RAM
• 500GB hard disk


CHAPTER 4
SYSTEM DEVELOPMENT PROCESS
4.1 MODEL USED


4.1.1 REQUIREMENTS
This is the first phase of the model. This phase defines what needs to be designed, what its
functions are, and what its purpose is. Specifications of the input, the output, and the
final product are studied in this phase.

4.1.2 SYSTEM DESIGN


The requirement specifications of the previous phase are studied here and the system design
is prepared. System design is used to specify the hardware and system requirements of a
product, and it also helps define the overall architecture of the system. The software code
that has to be implemented in the next phase is planned here.

4.1.3 IMPLEMENTATION
The system is developed as small programs known as units, whose inputs are taken from the
previous phase. All of these units are integrated together at a later stage. Each unit is
first developed and tested separately in order to check its function; this type of testing
is known as unit testing.

4.1.4 INTEGRATION AND TESTING


The units developed in the previous phase are integrated into the system after unit testing
is performed on each of them. The software must then be tested in order to find errors or
flaws. Testing should be completed before the software is given to the client, so that the
client does not face any problems at the time of installation.

4.1.5 DEPLOYMENT OF THE SYSTEM


Once testing is done and no errors or flaws are found in the product, the product is
released to the market.

4.1.6 MAINTENANCE
This is the final step and occurs after installation of the product. In this phase,
modifications are made to the system in order to improve its performance.


4.2 DATASET USED

4.3 Data Source


The dataset used here for predicting breast cancer is taken from Kaggle, a platform hosting
datasets commonly used for implementing machine learning algorithms. The dataset is a real
dataset consisting of almost 600 instances with 9 relevant clinical parameters related to
breast cancer, such as Age, Menopause, Tumour Size, Node-caps, and Irradiation.

4.4 ALGORITHM USED


This section describes the six algorithms used in this system: Naïve Bayes, Decision Tree,
Support Vector Machine (SVM), K-Nearest Neighbours (KNN), Random Forest, and Logistic
Regression.

4.4.1 Naïve Bayes Classifier Algorithm


Naïve Bayes is a supervised algorithm that classifies a dataset on the basis of Bayes'
theorem, the mathematical rule used to compute conditional probabilities. Its fundamental
assumption is that the input variables are independent of each other. Naïve Bayes is a
simple and powerful algorithm for predictive modelling, and it is an effective and efficient
classifier that can handle massive, complicated, non-linear data. The name comprises two
parts, naïve and Bayes: the naïve part refers to the assumption that the presence of a
particular feature in a class is unrelated to the presence of any other feature.
The two words can be described as follows:
Naïve: It is called naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on
the basis

of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized as an
apple. Each feature individually contributes to identifying it as an apple, without
depending on the others.
Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.
Bayes' theorem, also known as Bayes' rule or Bayes' law, is used to determine the
probability of a hypothesis with prior knowledge. It depends on conditional probability.
The formula for Bayes' theorem is:

P(A|B) = P(B|A) x P(A) / P(B)

where P(A|B) is the posterior probability of hypothesis A given evidence B, P(B|A) is the
likelihood, P(A) is the prior probability, and P(B) is the probability of the evidence.
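As an illustrative sketch (not the project's actual code, which is shown only as screenshots
in the original report), Gaussian Naïve Bayes can be applied with scikit-learn, using its
bundled breast-cancer dataset as a stand-in for the Kaggle data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# GaussianNB applies Bayes' theorem per class, assuming each feature
# is independent of the others and normally distributed within a class
nb = GaussianNB()
nb.fit(X_train, y_train)
nb_accuracy = accuracy_score(y_test, nb.predict(X_test))
```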

4.4.2 Decision tree Classification Algorithm :


Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a
tree-structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the outcome.
In a decision tree, there are two types of nodes: decision nodes and leaf nodes. Decision
nodes are used to make decisions and have multiple branches, whereas leaf nodes are the
outputs of those decisions and do not contain any further branches.
The decisions or the test are performed on the basis of features of the given dataset.
It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions. It is called a decision tree because, similar to a tree, it starts with
the root node, which expands on further branches and constructs a tree-like structure.
In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
A decision tree simply asks a question and, based on the answer (Yes/No), further splits the
tree into subtrees.
• Root node - the main node; all other nodes branch out from it
• Interior node - handles the conditions on the feature variables
• Leaf node - carries the final result
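A minimal CART sketch with scikit-learn (again using the bundled breast-cancer dataset as a
stand-in for the project's data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# scikit-learn's DecisionTreeClassifier implements an optimized CART;
# criterion="gini" chooses the split that most reduces Gini impurity
tree = DecisionTreeClassifier(criterion="gini", random_state=42)
tree.fit(X_train, y_train)
tree_accuracy = accuracy_score(y_test, tree.predict(X_test))
```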


4.4.3 Support Vector Machine Algorithm :


Support Vector Machine, usually abbreviated SVM, is an elegant and powerful algorithm. The
objective of the SVM algorithm is to find a hyperplane in an N-dimensional space (N being
the number of features) that distinctly classifies the data points. To separate two classes
of data points, there are many possible hyperplanes that could be chosen; the objective is
to find the plane that has the maximum margin, i.e. the maximum distance between data points
of both classes. Maximizing the margin distance provides some reinforcement so that future
data points can be classified with more confidence.
Hyperplanes and support vectors: Hyperplanes are decision boundaries that help classify the
data points; data points falling on either side of the hyperplane can be attributed to
different classes. The dimension of the hyperplane depends on the number of features: if the
number of input features is 2, the hyperplane is just a line; if it is 3, the hyperplane
becomes a two-dimensional plane; it becomes difficult to visualize when the number of
features exceeds three. In the SVM algorithm, we look to maximize the margin between the
data points and the hyperplane, and the loss function that helps maximize the margin is
hinge loss.
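A hedged sketch of an SVM classifier in scikit-learn. SVMs are sensitive to feature scale,
so standardization is applied first; the bundled breast-cancer dataset again stands in for
the project's data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# A linear kernel looks for the maximum-margin separating hyperplane;
# C controls the trade-off between margin width and training errors
svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
svm.fit(X_train, y_train)
svm_accuracy = accuracy_score(y_test, svm.predict(X_test))
```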

4.4.4 K-Nearest Neighbour (KNN) Algorithm :


K-Nearest Neighbours is one of the simplest machine learning algorithms, based on the
supervised learning technique. The algorithm assumes similarity between the new case and the
available cases and puts the new case into the category most similar to the available
categories. K-NN stores all the available data and classifies a new data point based on
similarity; this means that when new data appears, it can easily be classified into a
well-suited category. It can be used for regression as well as classification, but it is
mostly used for classification problems. K-NN is a non-parametric algorithm, which means it
does not make any assumption about the underlying data. It is also called a lazy learner
algorithm because it does not learn from the training set immediately; instead, it stores
the dataset and performs an action on it at classification time. At the training phase the
KNN algorithm just stores the dataset, and when it gets new data, it classifies that data
into the category most similar to it.
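The lazy-learner behaviour described above can be sketched as follows (k = 5 is an
illustrative choice, and the bundled breast-cancer dataset stands in for the project's data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# fit() only stores the data; each prediction is the majority class
# among the k=5 nearest (scaled) training points
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
knn_accuracy = accuracy_score(y_test, knn.predict(X_test))
```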


4.4.5 Random Forest Algorithm :


Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both classification and regression problems. It is
based on the concept of ensemble learning, the process of combining multiple classifiers to
solve a complex problem and improve the performance of the model. As the name suggests, a
random forest is a classifier that contains a number of decision trees built on various
subsets of the given dataset and combines their outputs to improve predictive accuracy.
Instead of relying on one decision tree, the random forest takes the prediction from each
tree and, based on the majority vote of the predictions, produces the final output. A
greater number of trees in the forest leads to higher accuracy and helps prevent
overfitting.
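The ensemble idea above can be sketched in a few lines (100 trees is an illustrative
default; the bundled breast-cancer dataset stands in for the project's data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# An ensemble of 100 decision trees, each trained on a bootstrap sample
# with random feature subsets; the class is decided by majority vote
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf_accuracy = accuracy_score(y_test, rf.predict(X_test))
```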

4.4.6 Logistic Regression algorithm :


Logistic regression is one of the most popular machine learning algorithms and comes under
the supervised learning technique. It is used for predicting a categorical dependent
variable from a given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable; therefore the
outcome must be a categorical or discrete value, such as Yes or No, 0 or 1, or true or
false. Instead of giving an exact value of 0 or 1, however, it gives probabilistic values
that lie between 0 and 1.
Logistic regression is similar to linear regression except in how it is used: linear
regression is used for solving regression problems, whereas logistic regression is used for
solving classification problems.
In logistic regression, instead of fitting a straight regression line, we fit an "S"-shaped
logistic function, which predicts two maximum values (0 or 1). The curve from the logistic
function indicates the likelihood of something, such as whether cells are cancerous or not,
or whether a mouse is obese based on its weight.
Logistic regression is a significant machine learning algorithm because it can provide
probabilities and classify new data using continuous and discrete datasets. It can be used
to classify observations using different types of data and can easily determine the most
effective variables for the classification.
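The probabilistic output described above can be sketched with scikit-learn (the bundled
breast-cancer dataset again stands in for the project's data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# The sigmoid maps each linear score to a probability in (0, 1)
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)
proba = logreg.predict_proba(X_test)[:, 1]   # probability of class 1
lr_accuracy = accuracy_score(y_test, logreg.predict(X_test))
```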


CHAPTER 5
METHODOLOGY
5.1 IMPLEMENTATION OF THE CODE
Importing libraries

• A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a
table with rows and columns. Pandas is mainly used for data analysis and the associated
manipulation of tabular data in DataFrames.
• NumPy can be used to perform a wide variety of mathematical operations on arrays. It adds
powerful data structures to Python that guarantee efficient calculations with arrays and
matrices, and it supplies an enormous library of high-level mathematical functions that
operate on those arrays and matrices.
• Matplotlib is a comprehensive library for creating static, animated, and interactive
visualizations in Python. Matplotlib makes easy things easy and hard things possible, from
publication-quality plots to interactive figures that can zoom, pan, and update.
• Seaborn is a Python data visualization library based on Matplotlib. It provides a
high-level interface for drawing attractive and informative statistical graphics.
• Scikit-learn (sklearn) is one of the most useful and robust libraries for machine learning
in Python. It provides a selection of efficient tools for machine learning and statistical
modelling, including classification, regression, clustering, and dimensionality reduction,
via a consistent interface.
Importing the dataset

Data pre-processing:
Data pre-processing, a component of data preparation, describes any type of processing
performed on raw data to prepare it for another data processing procedure. It has
traditionally been an important preliminary step in the data mining process.

1. Information elements are collated on a number of individuals, typically for the purpose
of making comparisons or identifying patterns.


2. We can use df.head() to get the first n rows.

3. df.describe() generates descriptive statistics.

DATA CLEANING AND TRANSFORMATION


1. The pandas isnull() function returns a boolean mask indicating which values are NULL (NaN).


2. notnull() is a pandas function that examines one or multiple values to validate that they
are not null.

3. The dropna() method removes the rows that contain NULL values.

4. The fillna() method replaces NULL values with a specified value.


5. The interpolate() function is used to fill NA values in a DataFrame or Series.

6. dropna() can also remove the columns (rather than the rows) that contain NULL values.

7. The structure of the DataFrame provided by pandas is then described.


8. The value_counts() method returns the count of each unique value in the DataFrame.
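The original report shows these cleaning steps only as screenshots. A minimal sketch on a
toy DataFrame (the column names here are illustrative, not the project's actual dataset):

```python
import numpy as np
import pandas as pd

# Toy data with missing values (illustrative columns, not the real dataset)
df = pd.DataFrame({"age": [45.0, np.nan, 60.0],
                   "tumour_size": [2.1, 3.0, np.nan]})

missing = df.isnull().sum()       # missing values per column
filled = df.fillna(df.mean())     # replace NaN with each column's mean
interp = df.interpolate()         # or fill NaN by interpolating neighbours
dropped = df.dropna()             # or drop any row containing NaN
counts = df["age"].value_counts() # counts of unique values in a column
```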


Visualizing the data

1. The process of finding trends and correlations in our data by representing it pictorially
is called data visualization. To perform data visualization in Python, we can use various
data visualization modules such as Matplotlib, Seaborn, and Plotly.

2. Confusion matrix: the confusion matrix is a two-dimensional array that compares the
predicted and actual category labels. For binary classification these are the True Positive,
True Negative, False Positive, and False Negative categories.

3. Plotting the graph using the Seaborn library and finding the correlation.


4. Using the confusion matrix to compare the predicted values with the actual values.

5. Using the Seaborn library to plot the frequency distribution.

6. Using the Seaborn library to find the correlation between the features of the dataset.
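The confusion matrix described above can be computed with scikit-learn; the labels below are
made-up values for illustration:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1]   # actual labels (illustrative)
y_pred = [1, 0, 0, 1, 0, 1]   # predicted labels (illustrative)

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
```

Here the model gets both negatives right (TN = 2), misses one positive (FN = 1), and finds
the other three (TP = 3).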


DATA SPLITTING:
TRAINING AND TESTING DATA:
Data splitting is when data is divided into two or more subsets. Typically, with a two-part
split, one part is used to train the model and the other to evaluate or test it.
1. The training set is the portion of data used to train the model. The model should observe
and learn from the training set, optimizing its parameters.
2. The testing set is the portion of data held out for the final model and compared against
the previous sets of data. The testing set acts as an evaluation of the final model and
algorithm.
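A sketch of the split, using scikit-learn's built-in breast-cancer dataset as a stand-in for
the project's Kaggle data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)  # 569 samples, 30 features

# Hold out 20% of the rows for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```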

5.2 APPLYING ALGORITHMS:


5.2.1. LOGISTIC REGRESSION


5.2.2. Support Vector Classifier

5.2.3. Decision Tree Classifier

5.2.4. KNN - K-Nearest Neighbour


5.2.5. Naive Bayes

5.2.6. Random Forest Classifier


RESULTS

5.3.1. Logistic Regression

5.3.2. Support Vector Classifier

5.3.3. Decision Tree Classifier

5.3.4. KNN - K-Nearest Neighbour

5.3.5. Naive Bayes


5.3.6. Random Forest Classifier

RESULTS

➢ FINDING ALL MODEL SCORES AND ACCURACIES

➢ Using the Seaborn library to plot a classification accuracy comparison of the models
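The per-model comparison shown above as screenshots can be sketched as a loop. This uses
scikit-learn's bundled breast-cancer data in place of the project's Kaggle dataset, so the
accuracy figures will differ from the report's:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scale-sensitive models get a StandardScaler in their pipeline
models = {
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
```

The resulting `scores` dictionary is what a Seaborn bar plot of the accuracy comparison
would be drawn from.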


CONCLUSION :

Medical datasets can be classified not only with the previously mentioned machine learning
algorithms; there are many other algorithms and techniques that may perform better than
these. Producing an accurate classifier that performs efficiently for medical applications
is the main challenge we face in machine learning. Six algorithms were implemented in this
system: Naïve Bayes, Decision Tree, KNN, Random Forest, Logistic Regression, and SVM. Our
main aim for this research was to discover which algorithm performs fastest, most
accurately, and most efficiently. Logistic Regression surpassed all the other algorithms
with an accuracy of 85.5964%. Thus I conclude this project by saying that the Logistic
Regression classification algorithm is best suited for handling this medical dataset. In the
future, the designed system with the machine learning classification algorithms used here
can be applied to predict or diagnose other diseases, and the work can be extended or
improved for the automation of breast cancer analysis, including some other machine learning
algorithms.





CREDIT CARD FRAUD DETECTION

Abstract

It is vital that credit card companies are able to identify fraudulent credit card
transactions so that customers are not charged for items that they did not purchase. Such
problems can be tackled with Data Science and its importance, along with Machine Learning,
cannot be overstated. This project intends to illustrate the modelling of a data set using
machine learning with Credit Card Fraud Detection. The Credit Card Fraud Detection Problem
includes modelling past credit card transactions with the data of the ones that turned out to be
fraud. This model is then used to recognize whether a new transaction is fraudulent or
not. Our objective here is to detect 100% of the fraudulent transactions while minimizing
the incorrect fraud classifications. Credit Card Fraud Detection is a typical sample of
classification. In this process, we have focused on analysing and pre-processing data sets
as well as the deployment of multiple anomaly detection algorithms such as Local Outlier
Factor and Isolation Forest algorithm on the PCA transformed Credit Card Transaction data.


Chapter 1

INTRODUCTION

'Fraud' in credit card transactions is the unauthorized and unwanted usage of an account by
someone other than the owner of that account. Necessary prevention measures can be taken
to stop this abuse, and the behaviour of such fraudulent practices can be studied to minimize
it and to protect against similar occurrences in the future. In other words, credit card fraud
can be defined as a case where a person uses someone else's credit card for personal reasons
while the owner and the card-issuing authorities are unaware of the fact that the card is
being used. Fraud detection involves monitoring the activities of populations of users in
order to estimate, perceive or avoid objectionable behaviour, which consists of fraud,
intrusion and defaulting. This is a very relevant problem that demands the attention of
communities such as machine learning and data science, where the solution to this problem
can be automated.
This problem is particularly challenging from the perspective of learning, as it is characterized
by various factors such as class imbalance. The number of valid transactions far
outnumber fraudulent ones. Also, the transaction patterns often change their statistical
properties over the course of time. These are not the only challenges in the implementation
of a real-world fraud detection system, however. In real world examples, the massive
stream of payment requests is quickly scanned by automatic tools that determine which
transactions to authorize. Machine learning algorithms are employed to analyse all the
authorized transactions and report the suspicious ones. These reports are investigated by
professionals who contact the cardholders to confirm if the transaction was genuine or
fraudulent. The investigators provide a feedback to the automated system which is used to
train and update the algorithm to eventually improve the fraud-detection performance over
time.


OBJECTIVE

The key objective of any credit card fraud detection system is to identify suspicious events and
report them to an analyst while letting normal transactions be automatically processed.

For years, financial institutions have been entrusting this task to rule-based systems that employ
rule sets written by experts. But now they increasingly turn to a machine learning approach, as
it сan bring significant improvements to the process.

1. Higher accuracy of fraud detection. Compared to rule-based solutions, machine learning
tools have higher precision and return more relevant results as they consider multiple additional
factors. This is because ML technologies can consider many more data points, including the
tiniest details of behaviour patterns associated with a particular account.

2. Less manual work needed for additional verification. Enhanced accuracy reduces the
burden on analysts. “People are unable to check all transactions manually, even if we are
talking about a small bank,” Alexander Konduforov, data science competence leader at
AltexSoft, explains. “ML-driven systems filter out, roughly speaking, 99.9 percent of normal
patterns leaving only 0.1 percent of events to be verified by experts.”

3. Fewer false declines. False declines or false positives happen when a system identifies a
legitimate transaction as suspicious and wrongly cancels it.

4. Ability to identify new patterns and adapt to changes. Unlike rule-based systems, ML
algorithms are aligned with a constantly changing environment and financial conditions. They
enable analysts to identify new suspicious patterns and create new rules to prevent new types
of scams.


CHAPTER 2

LITERATURE SURVEY

Prajwal Save et al. [18] proposed a model based on a decision tree and a combination of
Luhn's and Hunt's algorithms. Luhn's algorithm is used to determine whether an incoming
transaction is fraudulent or not: it validates credit card numbers from the input, which is the
credit card number. Address mismatch and degree of outlierness are used to assess the
deviation of each incoming transaction from the cardholder's normal profile. In the final
step, the general belief is strengthened or weakened using Bayes' theorem, followed by
recombination of the calculated probability with the initial belief of fraud using an
advanced combination heuristic. Vimala Devi J. et al. [19] presented and implemented
three machine learning algorithms to detect counterfeit transactions: Support Vector
Machine, Random Forest and Decision Tree. Many measures are used to evaluate the
performance of classifiers or predictors; these metrics are either prevalence-dependent or
prevalence-independent. Furthermore, these techniques are used in credit card fraud
detection mechanisms, and the results of the algorithms were compared. Popat and
Chaudhary [20] presented supervised algorithms: deep learning, Logistic Regression,
Naïve Bayes, Support Vector Machine (SVM), neural networks, artificial immune
systems, K-Nearest Neighbour, data mining, Decision Tree, fuzzy-logic-based systems
and genetic algorithms are some of the techniques used. Credit card fraud detection
algorithms identify transactions that have a high probability of being fraudulent. They
compared machine learning algorithms for prediction, clustering and outlier detection.
Shiyang Xuan et al. [21] used a Random Forest classifier to train the behavioural
characteristics of credit card transactions. Two variants are used to train the normal
and fraudulent behaviour features: random forest based on random trees and random forest
based on CART. Performance measures are computed to assess the model's
effectiveness. Dornadula and Geetha [5] aggregated the transactions into respective
groups using the sliding-window method, i.e., some features from the window were
extracted to find cardholders' behavioural patterns. Features such as the maximum amount,
the minimum amount of a transaction, the average amount in the window, and even the
elapsed time are available. Sangeeta Mittal et al. [22] selected some popular machine
learning algorithms in the supervised and unsupervised categories to evaluate the
underlying problems. A range of supervised learning algorithms, from classical to modern,
was considered, including tree-based algorithms, classical and deep neural networks,
hybrid algorithms and Bayesian approaches. The effectiveness of machine learning
algorithms in detecting credit card fraud was assessed. On various metrics, a number
of popular algorithms in the supervised, ensemble and unsupervised categories were
evaluated. It was concluded that unsupervised algorithms handle dataset skewness better and
thus perform well across all metrics, both absolutely and in comparison with other
techniques. Deepa and Akila [17] used different algorithms for fraud detection, such as an
anomaly detection algorithm, K-Nearest Neighbour, Random Forest, K-Means and
Decision Tree. Based on a given scenario, they presented several techniques and predicted
the best algorithm to detect deceitful transactions. To predict the fraud result, the system
used various rules and algorithms to generate a fraud score for a given transaction.
Xiaohan Yu et al. [23] proposed a deep neural network algorithm for detecting credit card
fraud. The paper described the neural network approach as well as deep neural network
applications, along with pre-processing methods and focal loss for resolving data-skew
issues in the dataset. Siddhant Bagga et al. [24] presented several techniques for
determining whether a transaction is real or fraudulent. They evaluated and compared the
performance of nine techniques on credit card fraud data, including logistic regression,
KNN, random forest, quadrant discriminative analysis, naive Bayes, multilayer perceptron,
AdaBoost, ensemble learning and pipelining, using different parameters and metrics. The
ADASYN method is used to balance the dataset. Accuracy, recall, F1 score, balanced
classification rate and Matthews correlation coefficient are used to assess classifier
performance, in order to determine which technique best solves the issue based on various
metrics. Carrasco and Urban [25] used deep neural networks to test and measure their
ability to detect false positives by processing alerts generated by a fraud detection system.
Ten neural network architectures classified a set of alerts triggered by an FDS as either
valid alerts, representing real fraud cases, or incorrect alerts, representing false positives.
The optimal configuration achieved an alert reduction rate of 35.16 percent while capturing
91.79 percent of fraud cases, and a reduction rate of 41.47 percent while capturing 87.75
percent of fraud cases. Kibria and Sevkli [26] created a deep learning model using the grid
search technique. The built model's performance is compared with two traditional machine
learning algorithms, logistic regression (LR) and support vector machine (SVM): the
developed model is applied to the credit card dataset and the results are compared with the
logistic regression and support vector machine models. Borse, Suhas and Dhotre [27] used
machine learning's Naive Bayes classification to predict common or fraudulent
transactions. The accuracy, recall, precision, F1 score and AUC score of the Naive Bayes
classifier are all calculated. Asha R. B. et al. [14] proposed a deep-learning-based method
for detecting fraud in credit card transactions, using machine learning algorithms such as
support vector machine, k-nearest neighbour and artificial neural network to predict the
occurrence of fraud.


CHAPTER 3
SOFTWARE AND HARDWARE REQUIREMENTS

SOFTWARE REQUIREMENTS
OPERATING SYSTEM
• Windows

SOFTWARE TOOLS
• Jupyter Notebook
• Python
• Anaconda

HARDWARE REQUIREMENTS
• Processor i3 and above
• 4 GB RAM
• 500GB hard disk


CHAPTER 4

SYSTEM DEVELOPMENT PROCESS

4.1 MODEL USED


4.1.1 REQUIREMENTS
This is the first phase of the model. This phase defines what needs to be designed, what its
functions are and what the purpose is. The specification of the input, output, or final product
is studied in this phase.

4.1.2 SYSTEM DESIGN

The requirement specification of the previous phase is studied here and the system design is
prepared. System design is used to specify the hardware and system requirements of a
product. It also helps to define the overall architecture of the system. The software code that
has to be implemented in the next phase is outlined here.

4.1.3 IMPLEMENTATION
The system is developed as small programs known as units. The input for these programs
is taken from the previous phase. All these units are integrated together at a later stage.
Each unit is developed and tested separately in order to check its function; this type of
testing is known as unit testing.

4.1.4 INTEGRATION AND TESTING

The units developed in the previous phase are integrated into the system after unit testing
is performed for each unit. The designed software has to be tested in order to find errors or
flaws. Testing should be done before giving the software to the client so that the client does
not face any problems at the time of installation of the software.

4.1.5 DEPLOYMENT OF THE SYSTEM

Once testing is done and it is found that there are no errors or flaws in the product, the
product is released to the market.

4.1.6 MAINTENANCE
This is the final step and it occurs after installation of the product. In this phase, modifications
are made to the system in order to improve its performance.


4.1.7 DATASET USED


4.2 Data Source
The dataset contains transactions made by credit cards in September 2013 by European
cardholders. It presents transactions that occurred over two days, with 492 frauds out of
284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts
for 0.172% of all transactions. It contains only numerical input variables, which are the
result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot
provide the original features or more background information about the data. Features V1,
V2, … V28 are the principal components obtained with PCA; the only features which have
not been transformed with PCA are 'Time' and 'Amount'. The feature 'Time' contains the
seconds elapsed between each transaction and the first transaction in the dataset. The
feature 'Amount' is the transaction amount; this feature can be used for example-dependent
cost-sensitive learning. The feature 'Class' is the response variable and takes the value 1 in
case of fraud and 0 otherwise. Given the class imbalance ratio, we recommend measuring
accuracy using the Area Under the Precision-Recall Curve (AUPRC), since
confusion-matrix accuracy is not meaningful for unbalanced classification.
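Since AUPRC is the recommended metric for this imbalanced data, a small sketch of computing it with scikit-learn follows. Synthetic data stands in for the confidential transactions, and the class weights and the logistic model are illustrative choices, not the project's configuration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the card data: roughly 0.5% positive (fraud) class
X, y = make_classification(n_samples=20000, n_features=10,
                           weights=[0.995], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]  # fraud probability per transaction

# Average precision summarizes the precision-recall curve (AUPRC),
# which stays informative under heavy class imbalance
auprc = average_precision_score(y_te, scores)
```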

4.3 ALGORITHMS USED

This section describes the six algorithms used in this system: the Naïve Bayes classifier,
the Decision Tree classifier, the Support Vector Machine (SVM), KNN, the Random Forest
classifier and Logistic Regression.

4.4.1 Naïve Bayes Classifier Algorithm

The Naïve Bayes classifier is a supervised algorithm that classifies the dataset on the basis of
Bayes' theorem, the mathematical rule used to compute conditional probabilities. The
classifier rests on an independence assumption: it requires independent variables, which is its
fundamental assumption. Naïve Bayes is a simple and powerful algorithm for predictive
modelling, an effective and efficient classification algorithm that can handle massive,
complicated, non-linear data. The name comprises two parts, naïve and Bayes, where the
naïve part of the classifier assumes that the presence of a particular feature in a class is
unrelated to the presence of any other feature.
The two words can be described as follows:
Naïve: It is called naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the
basis of colour, shape and taste, then a red, spherical and sweet fruit is recognized as an apple. Hence


each feature individually contributes to identifying it as an apple without depending on the
others.
Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.
Bayes' theorem, also known as Bayes' rule or Bayes' law, is used to determine the
probability of a hypothesis given prior knowledge. It depends on conditional probability.
The formula for Bayes' theorem is:

P(A|B) = P(B|A) · P(A) / P(B)

where P(A|B) is the posterior probability of hypothesis A given evidence B, P(B|A) is the
likelihood, P(A) is the prior probability of A, and P(B) is the probability of the evidence.
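A minimal sketch of this classifier using scikit-learn's GaussianNB, with the bundled Wisconsin breast cancer data standing in for the project's dataset (the split parameters are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# GaussianNB assumes each feature is normally distributed within a class,
# and treats the features as conditionally independent (the "naïve" assumption)
nb = GaussianNB().fit(X_tr, y_tr)
acc = nb.score(X_te, y_te)  # fraction of correctly classified test rows
```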

4.4.2 Decision tree Classification Algorithm :


Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches represent
the decision rules and each leaf node represents the outcome.
In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision
nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the
output of those decisions and do not contain any further branches.
The decisions or the test are performed on the basis of features of the given dataset.
It is a graphical representation for getting all the possible solutions to a problem/decision based
on given conditions. It is called a decision tree because, similar to a tree, it starts with the root
node, which expands on further branches and constructs a tree-like structure.
In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
A decision tree simply asks a question and, based on the answer (Yes/No), further splits the
tree into subtrees.
• Root node - the main node; all other nodes perform their function on the basis of this node
• Interior node - handles the conditions on the dependent variables
• Leaf node - carries the final result
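A minimal sketch with scikit-learn's DecisionTreeClassifier, which implements an optimized version of the CART algorithm mentioned above. The Wisconsin data again stands in for the project's dataset, and `max_depth=4` is an illustrative cap on tree growth:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Gini impurity picks the split at each decision node;
# max_depth limits the tree to reduce overfitting
tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=42)
tree.fit(X_tr, y_tr)
acc = tree.score(X_te, y_te)
```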


4.4.3 Support Vector Machine Algorithm :


Support Vector Machine is usually abbreviated as SVM. It is an elegant and powerful
algorithm. The objective of the support vector machine algorithm is to find a hyperplane in
an N-dimensional space (N being the number of features) that distinctly classifies the data
points. To separate the two classes of data points, there are many possible hyperplanes that
could be chosen. Our objective is to find the plane that has the maximum margin, i.e. the
maximum distance between data points of both classes. Maximizing the margin distance
provides some reinforcement so that future data points can be classified with more
confidence. Hyperplanes and support vectors: hyperplanes are decision boundaries that help
classify the data points. Data points falling on either side of the hyperplane can be attributed
to different classes. The dimension of the hyperplane depends upon the number of features:
if the number of input features is 2, the hyperplane is just a line; if the number of input
features is 3, the hyperplane becomes a two-dimensional plane. It becomes difficult to
visualize when the number of features exceeds three. In the SVM algorithm, we are looking
to maximize the margin between the data points and the hyperplane. The loss function that
helps maximize the margin is the hinge loss.
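A minimal sketch with scikit-learn's SVC on the stand-in Wisconsin data. Standardizing the features first is a common practice (SVMs are sensitive to feature scale); the linear kernel and C value are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Standardize features, then fit a maximum-margin linear separator
svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
svm.fit(X_tr, y_tr)
acc = svm.score(X_te, y_te)
```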

4.4.4 K-Nearest Neighbour (KNN) Algorithm :


K-Nearest Neighbour is one of the simplest machine learning algorithms, based on the
supervised learning technique. The algorithm assumes similarity between the new case and
the available cases and puts the new case into the category most similar to the available
categories. K-NN stores all the available data and classifies a new data point based on
similarity; this means that when new data appears, it can easily be classified into a
well-suited category using the K-NN algorithm. It can be used for regression as well as
classification, but it is mostly used for classification problems. K-NN is a non-parametric
algorithm, which means it makes no assumption about the underlying data. It is also called
a lazy learner algorithm because it does not learn from the training set immediately; instead
it stores the dataset and performs an action on it at classification time. At the training phase,
the KNN algorithm just stores the dataset, and when it gets new data, it classifies that data
into the category most similar to the new data.
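A minimal sketch with scikit-learn's KNeighborsClassifier on the stand-in Wisconsin data; `n_neighbors=5` is an illustrative choice, and scaling is included because distance-based voting is sensitive to feature scale:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Each test point is classified by majority vote among its 5 nearest
# (standardized) training neighbours; fit() only stores the data (lazy learner)
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_tr, y_tr)
acc = knn.score(X_te, y_te)
```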

4.4.5 Random Forest Algorithm :


Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based
on the concept of ensemble learning, which is a process of combining multiple classifiers to
solve a complex problem and to improve the performance of the model. As the name suggests,
"Random Forest is a classifier that contains a number of decision trees on various subsets of
the given dataset and takes the average to improve the predictive accuracy of that dataset."
Instead of relying on one decision tree, the random forest takes the prediction from each tree
and, based on the majority vote of those predictions, predicts the final output. A greater
number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.
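A minimal sketch with scikit-learn's RandomForestClassifier on the stand-in Wisconsin data; the 100-tree ensemble size is an illustrative choice:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# An ensemble of 100 trees, each fitted on a bootstrap sample with a random
# feature subset; the final class is decided across the trees' votes
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_tr, y_tr)
acc = rf.score(X_te, y_te)
```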


4.4.6 Logistic Regression algorithm :


Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical dependent
variable using a given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or False,
etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic values which lie
between 0 and 1.
Logistic Regression is much like Linear Regression except in how they are used: Linear
Regression is used for solving regression problems, whereas Logistic Regression is used
for solving classification problems.
In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function,
which predicts two maximum values (0 or 1).
The curve from the logistic function indicates the likelihood of something such as whether the
cells are cancerous or not, a mouse is obese or not based on its weight, etc.
Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
Logistic Regression can be used to classify the observations using different types of data and
can easily determine the most effective variables used for the classification.
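A minimal sketch with scikit-learn's LogisticRegression on the stand-in Wisconsin data; `max_iter` is raised above the default because the unscaled features converge slowly, an illustrative rather than prescribed setting:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# The sigmoid maps the linear score to a probability between 0 and 1
lr = LogisticRegression(max_iter=10000)
lr.fit(X_tr, y_tr)
proba = lr.predict_proba(X_te)[:, 1]  # probability of the positive class
acc = lr.score(X_te, y_te)
```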


CHAPTER 5
METHODOLOGY
5.1 IMPLEMENTATION OF THE CODE
Importing libraries

• A pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array or
a table with rows and columns. Pandas is mainly used for data analysis and the associated
manipulation of tabular data in DataFrames.
• NumPy can be used to perform a wide variety of mathematical operations on arrays. It
adds powerful data structures to Python that guarantee efficient calculations with arrays
and matrices and it supplies an enormous library of high-level mathematical functions
that operate on these arrays and matrices.
• Matplotlib is a comprehensive library for creating static, animated, and interactive
visualizations in Python. Matplotlib makes easy things easy and hard things possible.
Create publication quality plots. Make interactive figures that can zoom, pan, update.
• Seaborn is a Python data visualization library based on matplotlib. It provides a high-
level interface for drawing attractive and informative statistical graphics
• Scikit-learn (sklearn) is the most useful and robust library for machine learning in
Python. It provides a selection of efficient tools for machine learning and statistical
modelling, including classification, regression, clustering and dimensionality reduction,
via a consistent interface in Python.
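The imports described above can be collected at the top of the notebook, along the lines of:

```python
import numpy as np               # array mathematics
import pandas as pd              # tabular data handling (DataFrame)
import matplotlib.pyplot as plt  # basic plotting
import seaborn as sns            # statistical graphics on top of matplotlib
from sklearn.model_selection import train_test_split  # scikit-learn utilities
```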

Importing data set

Data pre-processing:-

Data pre-processing, a component of data preparation, describes any type of processing
performed on raw data to prepare it for another data processing procedure. It has
traditionally been an important preliminary step for the data mining process.

1. Information elements are collated on a number of individuals, typically used for the
purposes of making comparisons or identifying patterns


2. We can use df.head() to get the first n rows

3. Generating descriptive statistics (df.describe())


DATA CLEANING AND TRANSFORMATION


1. The isnull() method returns True for each value that is NULL (missing).

2. notnull() is a pandas method that examines one or multiple values to validate that they
are not null.


3. The dropna() method removes the rows that contain NULL values.

4. The fillna() method replaces the NULL values with a specified value.


5. The interpolate() function is basically used to fill NA values in the data frame or series.

6. dropna(axis=...) removes the rows or the columns that contain NULL values.


7. Describing the data structure provided by pandas

8. A method that returns the count of the unique values in the data frame (value_counts())
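The cleaning methods listed above can be sketched on a tiny hypothetical frame (the real transaction data is confidential, so the column names and values here are made up):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the transaction data
df = pd.DataFrame({"Amount": [10.0, np.nan, 99.5], "Class": [0, 0, 1]})

missing = df.isnull().sum()              # count of NULLs per column
dropped = df.dropna()                    # drop rows containing NULLs
filled = df.fillna(df["Amount"].mean())  # replace NULLs with the column mean
interp = df.interpolate()                # fill NA by linear interpolation
counts = df["Class"].value_counts()      # frequency of each unique value
```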

Visualizing the data

1. The process of finding trends and correlations in our data by representing it pictorially is
called data visualization. To perform data visualization in Python, we can use various
data visualization modules such as Matplotlib, Seaborn and Plotly.


2. confusion matrix:
The confusion matrix is a two-dimensional array that compares the anticipated and actual
category labels. These are the True Positive, True Negative, False Positive, and False Negative
classification categories for binary classification.
3. Plotting the graph using the Seaborn library and finding the correlation

4. Using a confusion matrix and finding the predicted and actual values

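The confusion matrix step can be sketched as follows, with a hypothetical handful of actual and predicted labels (1 = fraud) in place of the real model output:

```python
from sklearn.metrics import confusion_matrix

# Actual vs predicted labels for a tiny hypothetical batch (1 = fraud)
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]

# ravel() unpacks the 2x2 matrix into the four classification categories
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```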

DATA SPLITTING:
TRAINING AND TESTING DATA:
Data splitting is when data is divided into two or more subsets. Typically, with a two-part split,
one part is used to evaluate or test the data and the other to train the model.
1. The training set is the portion of data used to train the model. The model should observe and
learn from the training set, optimizing any of its parameters
2. The testing set is the portion of data used to test the final model; it is compared against
the previous sets of data and acts as an evaluation of the final model and algorithm.
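Because frauds are so rare in this dataset, a stratified split keeps the fraud proportion equal in both subsets. A sketch on synthetic labels mimicking the dataset's roughly 0.17% fraud rate (the sizes and seed are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels mimicking the ~0.17% fraud rate of the real dataset
rng = np.random.default_rng(0)
y = (rng.random(20000) < 0.0017).astype(int)
X = rng.normal(size=(20000, 5))

# stratify=y preserves the fraud proportion in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
```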

5.2 APPLYING ALGORITHMS:


5.2.1. LOGISTIC REGRESSION


5.2.2. Support Vector Classifier

5.2.3. Decision Tree Classifier


5.2.4. KNN - K-Nearest Neighbour

5.2.5. Naive Bayes

5.2.6. Random Forest Classifier


5.3 RESULTS
5.3.1. Logistic Regression

5.3.2. Support Vector Classifier

5.3.3. Decision Tree Classifier


5.3.4. KNN - K-Nearest Neighbour

5.3.5. Naive Bayes

5.3.6. Random Forest Classifier


RESULTS

FINDING ALL MODELS' SCORES AND ACCURACY

Using the Seaborn library and plotting a classification accuracy comparison of models


CONCLUSION
Credit card fraud has become a serious concern worldwide, bringing huge financial losses.
This has urged credit card companies to invest in creating and developing techniques to
reveal and reduce fraud. The prime goal of this study is to identify algorithms that are
appropriate and can be adopted by credit card companies for identifying fraudulent
transactions more accurately, in less time and at lower cost. Different machine learning
algorithms are compared, including Logistic Regression, Decision Trees, Random Forest,
Naive Bayes and K-Nearest Neighbours. Because not all scenarios are the same, a
scenario-based approach can be used to determine which algorithm is the best fit for a given
scenario. All of the fraud detection techniques discussed in this survey have advantages
and disadvantages. Researchers use different performance measures and algorithms to
predict and reveal fraudulent transactions. Studies are refreshed and encouraged to improve
the fraud detection basis to determine the weight that is suitable with respect to cost factors,
tested accuracy and detection accuracy. Surveys of this kind will allow researchers to build
a hybrid approach that is most accurate for fraudulent credit card transaction detection.


REFERENCES

[1] S. H. Projects and W. Lovo, "JMU Scholarly Commons: Detecting credit card fraud: An analysis of fraud detection techniques," 2020.
[2] S. G and J. R. R, "A Study on Credit Card Fraud Detection using Data Mining Techniques," Int. J. Data Min. Tech. Appl., vol. 7, no. 1, pp. 21–24, 2018, doi:10.20894/ijdmta.102.007.001.004.
[3] "Credit Card Definition." https://www.investopedia.com/terms/c/creditcard.asp (accessed Apr. 03, 2021).
[4] K. J. Barker, J. D'Amato, and P. Sheridon, "Credit card fraud: awareness and prevention," J. Financ. Crime, vol. 15, no. 4, pp. 398–410, 2008, doi:10.1108/13590790810907236.
[5] V. N. Dornadula and S. Geetha, "Credit Card Fraud Detection using Machine Learning Algorithms," Procedia Comput. Sci., vol. 165, pp. 631–641, 2019, doi:10.1016/j.procs.2020.01.057.
[6] A. H. Alhazmi and N. Aljehane, "A Survey of Credit Card Fraud Detection Using Machine Learning," 2020 Int. Conf. Comput. Inf. Technol. (ICCIT 2020), pp. 10–15, 2020, doi:10.1109/ICCIT-144147971.2020.9213809.
[7] B. Wickramanayake, D. K. Geeganage, C. Ouyang, and Y. Xu, "A survey of online card payment fraud detection using data mining-based methods," arXiv, 2020.
[8] A. Agarwal, "Survey of Various Techniques used for Credit Card Fraud Detection," Int. J. Res. Appl. Sci. Eng. Technol., vol. 8, no. 7, pp. 1642–1646, 2020, doi:10.22214/ijraset.2020.30614.
[9] C. Reviews, "A Comparative Study: Credit Card Fraud," vol. 7, no. 19, pp. 998–1011, 2020.
[10] R. Sailusha, V. Gnaneswar, R. Ramesh, and G. Ramakoteswara Rao, "Credit Card Fraud Detection Using Machine Learning," Proc. Int. Conf. Intell. Comput. Control Syst. (ICICCS 2020), pp. 1264–1270, 2020, doi:10.1109/ICICCS48265.2020.9121114.



BITCOIN PRICE PREDICTION

Abstract
In this paper, we propose to predict the Bitcoin price accurately, taking into consideration the various parameters that affect its value. By gathering information from different reference papers and applying it in real time, we identified the advantages and disadvantages of existing Bitcoin price prediction approaches.
Each paper follows its own prediction methodology. Many achieve accurate prices, but often at high time complexity. To reduce the time complexity, this paper uses algorithms linked to machine learning, namely the Naïve Bayes algorithm, Decision Tree algorithm, Support Vector Machine (SVM) algorithm, KNN, Random Forest classifier and Logistic Regression, which return results quickly even on a larger database. For this purpose we draw a comparison between these algorithms; this survey should help upcoming researchers make an impact in their own papers. In the first part of the research, we aim to understand and find daily trends in the Bitcoin market while gaining insight into the optimal features surrounding the Bitcoin price. Our dataset consists of various features relating to the Bitcoin price and payment network, recorded daily over several years. After pre-processing the dataset, we apply data mining techniques to reduce the noise in the data. In the second part of the research, using the available information, we predict the sign of the daily price change with the highest possible accuracy.


Chapter 1
INTRODUCTION
Bitcoin is a cryptocurrency that is used worldwide for digital payment or simply for investment purposes. Bitcoin is decentralized, i.e. it is not owned by anybody. Transactions made with Bitcoin are simple because they are not tied to any country. Investment can be made through various marketplaces known as "bitcoin exchanges", which enable people to sell or buy Bitcoin using different currencies; Mt. Gox was once the largest Bitcoin exchange. Bitcoins are stored in a digital wallet, which is essentially like a virtual bank account. The record of all transactions, together with timestamp data, is stored in a structure called the blockchain. Each record in the blockchain is known as a block, and each block contains a pointer to a previous block of data. The data on the blockchain is encrypted, and during transactions the user's name is not revealed; only their wallet ID is made public. Bitcoin's value fluctuates much like a stock, though in a different way. There are various algorithms applied to stock market data for price forecasting; however, the parameters influencing Bitcoin are different. It is therefore important to predict the value of Bitcoin so that correct investment decisions can be made. The price of Bitcoin does not depend on business events or intervening governments, unlike the stock market. Hence, to predict its value, we feel it is necessary to use AI technology to forecast the price of Bitcoin.
Bitcoin refers to virtual money, widely used for both transaction and investment purposes. It is a decentralized currency, which implies that it is not owned by a single person or group. Bitcoins are simple to use since they are not attached to any country, and using a bitcoin exchange is the easiest way to invest in them; individuals can buy and sell bitcoins using a variety of currencies. As of January 2017, 170 hedge funds had been launched in cryptocurrencies, driving up demand for bitcoin in both trading and hedging futures. Many theories have been advanced to explain the causes of its high volatility, and these ideas have also been used to support the expectation that cryptocurrency values will continue to fluctuate in the future. Another way to look at this is to engage in automated bitcoin trading. Figure 1 shows a perspective view of BTC price prediction.


To forecast BTC values, machine learning and neural networks use numerical historical data. A recurrent neural network (RNN) is an artificial neural network with directed graph nodes and connections that are constructed progressively, similar to synapses in the brain. LSTM is an artificial RNN architecture commonly used in deep learning; rather than analysing single data points in isolation, it integrates information across the entire sequence. Virtual currency is a recently evolved worldwide phenomenon. It maintains a consistent identity, structure and function, and is increasingly recognized as a superior financial medium with significant potential as time progresses. Bitcoin was developed to reduce reliance on third parties such as banks, credit cards and governments, and to decrease transaction time and money transfer costs. Figure 2 shows the original data for the last five years of the BTC price from the registered website source mentioned in the figure.

Bitcoin is among the virtual currencies with a considerable future ahead of it. Most cryptocurrencies, especially the popular ones, are largely bitcoin clones. Because of this, bitcoin has gained a lot of interest, and several papers have been published using both statistical and machine learning techniques. Statistics is a collection of techniques developed over time to provide data summaries and quantify various features of a set of observations. To better comprehend ML algorithms, a firm knowledge of statistical techniques should be gained: statistical techniques extract relevant information by properly analysing the dataset, whereas ML looks for patterns in the dataset and attempts to draw conclusions much as humans would.

It is possible to build a time series dataset for Bitcoin using different choices. The bitcoin dataset has a granular temporal period, so different sampling periods yield distinct datasets. Another feature of the bitcoin ecosystem is that all transactions are transparent to everyone. Researchers may leverage the dataset's causal connections, in addition to existing blockchain characteristics such as volume, to incorporate new features from the blockchain. Cryptocurrency market participants analyse the influence of networks on the market, calculated from cryptocurrency exchange rates; as a result, bitcoin is considered the market leader, and consistent network effects were taken into account since they provide stronger evidence. The literature also discusses a well-known machine learning library with a very large user base, Sci-Kit Learn, which is very helpful for developing suitable algorithms. The authors highlight the library's simplicity and efficacy, explain how it integrates into the Python environment, and address the implementation challenges developers encounter while using this tool.


1.1 OBJECTIVE
This paper explains the working of the Linear Regression and Long Short-Term Memory (LSTM) models in predicting the value of Bitcoin. Due to its rising popularity, Bitcoin has become an investment vehicle, and it works on the blockchain technology that also gave rise to other cryptocurrencies. This makes its value very difficult to predict, and hence the predictor is tested with the help of machine learning algorithms and an artificial neural network model.
The primary concern of this research is to answer queries relevant to the prediction of the bitcoin price through machine learning and deep learning schemes. The following queries are considered while designing this comprehensive study:

• Types of data sources recently used for bitcoin price prediction.

• Types of datasets (public and private) used to build deep learning classification models.

• Types of DL and ML classifiers recently used for bitcoin price prediction.

• Challenges faced by the classifiers in accurately predicting prices.

• Types of parameters used to evaluate bitcoin prices.


CHAPTER 2
LITERATURE SURVEY
We have all wondered where bitcoin prices will be one year, two years, five years or even ten years from now. It is really difficult to anticipate, yet every one of us loves to try. Tremendous profits can be made by buying and selling bitcoins when done correctly; it has proven to be a fortune for many people in the past and is still making them a lot of money today. But this does not come without a downside: if not thought through and calculated properly, you can lose a lot of money too. You should have a good comprehension of how and precisely why bitcoin prices change (supply and demand, regulations, news, and so on), which means you should understand how people make their bitcoin predictions. Considering these factors, one must also think about the technology of bitcoin and its progress. This aside, we now have to deal with the technical parts, using various algorithms and technologies that can predict precise bitcoin prices. We came across various models currently in use, such as the Naïve Bayes algorithm, Decision Tree algorithm, Support Vector Machine (SVM) algorithm, KNN, Random Forest classifier and Logistic Regression, built on machine learning and deep neural network concepts. A time series is simply a sequence of numbers over time; because this is a time series dataset, the overall data must be split into two parts, inputs and outputs. Moreover, random forest compares favourably with classic linear statistical models, since it can easily handle multiple-input forecasting problems. In the second period of our examination we focus on the bitcoin price information alone, using data at 10-minute and 10-second intervals. This is because we saw an opportunity to evaluate price predictions at various levels of granularity and noisiness. This yielded results with 50 to 55% accuracy in predicting future bitcoin price changes over 10-minute intervals.


CHAPTER 3
SOFTWARE AND HARDWARE REQUIREMENTS

SOFTWARE REQUIREMENTS
OPERATING SYSTEM
• Windows

SOFTWARE TOOLS
• Jupyter Notebook
• Python
• Anaconda

HARDWARE REQUIREMENTS
• Processor i3 and above
• 4 GB RAM
• 500GB hard disk


CHAPTER 4
SYSTEM DEVELOPMENT PROCESS
4.1 MODEL USED

4.1.1 REQUIREMENTS
This is the first phase of the model. This phase defines what needs to be designed, what its functions are and what the purpose is. The specification of the input, the output and the final product is studied in this phase.

4.1.2 SYSTEM DESIGN


The requirement specifications of the previous phase are studied here and the system design is prepared. System design is used to specify the hardware and system requirements of a product, and it also helps define the overall architecture of the system. The software code to be implemented in the next phase is outlined here.

4.1.3 IMPLEMENTATION
The system is developed in small programs known as units, whose inputs are taken from the previous phase. All these units are integrated together at a later stage. Each unit is developed and tested separately in order to check its function; this type of testing is known as unit testing.

4.1.4 INTEGRATION AND TESTING


The units developed in the previous phase are integrated into the system after unit testing has been performed on each. The integrated software has to be tested in order to find errors or flaws. Testing should be done before giving the software to the client so that the client does not face any problems at installation time.

4.1.5 DEPLOYMENT OF THE SYSTEM


Once testing is done and no errors or flaws are found in the product, the product is released to the market.


4.1.6 MAINTENANCE
This is the final step and it occurs after installation of the product. In this phase, modifications are made to the system in order to improve its performance.

4.2 DATASET USED

4.3 Data Source


The dataset used here for predicting the bitcoin price is taken from Kaggle. Kaggle is a collection of datasets used for implementing machine learning algorithms. The dataset used here is a real dataset, consisting of almost 1600 instances of data across 6 columns. The dataset parameters taken for price prediction include the open value, close value, highest value, market cap, etc.

4.4 ALGORITHM USED


This section describes the six algorithms used in this system, namely the Naïve Bayes Classifier algorithm, Decision Tree Classification algorithm, Support Vector Machine (SVM) algorithm, KNN, Random Forest Classifier and Logistic Regression.


4.4.1 Naïve Bayes Classifier Algorithm


The Naïve Bayes classifier is a supervised algorithm that classifies a dataset on the basis of Bayes' theorem, a mathematical rule used to obtain a conditional probability. Its fundamental assumption is that the input variables are independent of one another. Naïve Bayes is a simple and powerful algorithm for predictive modelling, and an effective and efficient classifier even on large and complicated data.
The Naïve Bayes algorithm is composed of two words, Naïve and Bayes, which can be described as follows:
Naïve: it is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of colour, shape and taste, then a red, spherical and sweet fruit is recognized as an apple; each feature individually contributes to identifying it as an apple without depending on the others.
Bayes: it is called Bayes because it depends on the principle of Bayes' theorem, also known as Bayes' rule or Bayes' law, which is used to determine the probability of a hypothesis given prior knowledge. It depends on conditional probability. The formula for Bayes' theorem is:

P(A|B) = P(B|A) · P(A) / P(B)
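Worked through with hypothetical numbers, Bayes' theorem P(A|B) = P(B|A) · P(A) / P(B) can be applied directly; the probabilities below are invented for the example:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Hypothetical numbers: how likely is a fruit to be an apple, given it is red?
p_apple = 0.3            # prior P(apple)
p_red_given_apple = 0.8  # likelihood P(red | apple)
p_red = 0.4              # evidence P(red), over all fruit

p_apple_given_red = p_red_given_apple * p_apple / p_red
print(p_apple_given_red)  # posterior P(apple | red) ≈ 0.6
```

The Naïve Bayes classifier repeats exactly this computation per class, multiplying one likelihood term per (assumed independent) feature.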

4.4.2 Decision tree Classification Algorithm :


Decision Tree is a supervised learning technique that can be used for both classification and regression problems, though it is mostly preferred for classification. It is a tree-structured classifier in which internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome. A decision tree has two kinds of nodes: decision nodes, which make a decision and have multiple branches, and leaf nodes, which are the outputs of those decisions and contain no further branches.
The decisions or tests are performed on the basis of the features of the given dataset. A decision tree is a graphical representation for getting all possible solutions to a problem or decision based on the given conditions. It is called a decision tree because, similar to a tree, it starts at the root node, which expands into further branches to construct a tree-like structure. In order to build a tree, we use the CART algorithm, which stands for Classification and Regression Tree. A decision tree simply asks a question and, based on the answer (yes/no), further splits the tree into subtrees.
• Root node - the main node; all other nodes work on the basis of this node
• Interior node - tests conditions on the feature variables
• Leaf node - carries the final result
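A tree of this kind can be sketched with scikit-learn, whose `DecisionTreeClassifier` is an optimized CART implementation; the toy fruit data is hypothetical:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy features: [is_red, is_spherical]; label 1 = apple, 0 = not an apple.
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 0, 0, 0]

# scikit-learn's DecisionTreeClassifier implements an optimized CART variant.
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X, y)

# Internal nodes test features; each leaf carries the final class.
print(tree.predict([[1, 1]]))  # a red, spherical fruit
```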


4.4.3 Support Vector Machine Algorithm :


Support Vector Machine, usually abbreviated SVM, is an elegant and powerful algorithm. Its objective is to find a hyperplane in an N-dimensional space (N being the number of features) that distinctly classifies the data points. To separate two classes of data points, there are many possible hyperplanes that could be chosen; our objective is to find the plane with the maximum margin, i.e. the maximum distance between the data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.
Hyperplanes and support vectors: hyperplanes are decision boundaries that help classify the data points, and points falling on either side of the hyperplane can be attributed to different classes. The dimension of the hyperplane depends on the number of features: if the number of input features is 2, the hyperplane is just a line; if it is 3, the hyperplane becomes a two-dimensional plane; it becomes difficult to imagine when the number of features exceeds three. In the SVM algorithm we look to maximize the margin between the data points and the hyperplane, and the loss function that helps maximize the margin is the hinge loss.
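A minimal sketch of the maximum-margin idea, using two invented, linearly separable 2-D clusters (so the hyperplane is a line and the support vectors are inspectable):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters in 2-D, so the hyperplane is a line.
X = np.array([[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [4, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")  # fits the maximum-margin separating hyperplane
clf.fit(X, y)

print(clf.support_vectors_)                   # the points that pin the margin
print(clf.predict([[0.5, 0.5], [4.5, 4.5]]))  # classify two new points
```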

4.4.4 K-Nearest Neighbour (KNN) Algorithm :


K-Nearest Neighbour is one of the simplest machine learning algorithms, based on the supervised learning technique. The algorithm assumes similarity between a new case and the available cases, and puts the new case into the category most similar to the available categories. K-NN stores all the available data and classifies a new data point based on similarity, which means that when new data appears it can easily be assigned to a well-suited category. It can be used for regression as well as classification, but is mostly used for classification problems. K-NN is a non-parametric algorithm, meaning it makes no assumption about the underlying data. It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and performs an action on it at classification time. At the training phase the KNN algorithm just stores the dataset, and when it receives new data it classifies that data into the category most similar to the new data.
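The lazy, distance-based classification described above is small enough to sketch from scratch; the 2-D training points are hypothetical:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    dists = sorted((math.dist(x, query), label)
                   for x, label in zip(train_X, train_y))
    votes = [label for _, label in dists[:k]]   # labels of the k closest points
    return Counter(votes).most_common(1)[0][0]  # majority class

# Hypothetical 2-D training points forming two clusters.
train_X = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train_X, train_y, (2, 2)))  # the 3 nearest points are all "A"
```

Note that all the work happens at query time, which is exactly what "lazy learner" means.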

4.4.5 Random Forest Algorithm :


Random Forest is a popular machine learning algorithm belonging to the supervised learning technique, usable for both classification and regression problems. It is based on the concept of ensemble learning, the process of combining multiple classifiers to solve a complex problem and improve model performance. As the name suggests, a random forest is a classifier that contains a number of decision trees trained on various subsets of the given dataset and combines them to improve predictive accuracy. Instead of relying on one decision tree, the random forest takes the prediction from each tree and predicts the final output based on the majority vote. A greater number of trees in the forest generally leads to higher accuracy and helps prevent overfitting.
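The majority-vote step can be illustrated without any library; the per-tree votes below are invented for the example:

```python
from collections import Counter

# Hypothetical class votes from five decision trees for one sample
# (in the price-prediction setting: will the price go "up" or "down").
tree_votes = ["up", "down", "up", "up", "down"]

final_prediction, count = Counter(tree_votes).most_common(1)[0]
print(final_prediction, count)  # "up" wins 3 votes to 2
```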


4.4.6 Logistic Regression algorithm :


Logistic Regression is one of the most popular machine learning algorithms under the supervised learning technique. It is used for predicting a categorical dependent variable from a given set of independent variables.
Because logistic regression predicts a categorical dependent variable, the outcome must be a categorical or discrete value: yes or no, 0 or 1, true or false, etc. But instead of giving the exact value 0 or 1, it gives probabilistic values lying between 0 and 1.
Logistic Regression is similar to Linear Regression except in how it is used: Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems. In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1). The curve of the logistic function indicates the likelihood of something, such as whether cells are cancerous or not, or whether a mouse is obese based on its weight.
Logistic Regression is a significant machine learning algorithm because it can provide probabilities and classify new data using both continuous and discrete datasets, and it can easily determine the most effective variables for the classification.
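The "S"-shaped logistic function mentioned above is easy to sketch directly; the inputs are arbitrary illustrative values:

```python
import math

def sigmoid(z):
    """Logistic (S-shaped) function mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Probabilities for a few inputs; 0.5 is the usual classification cutoff.
for z in (-4, 0, 4):
    p = sigmoid(z)
    print(z, round(p, 3), "class 1" if p >= 0.5 else "class 0")
```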


CHAPTER 5
METHODOLOGY
5.1 IMPLEMENTATION OF THE CODE
Importing libraries:

• A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns. Pandas is mainly used for data analysis and the associated manipulation of tabular data in DataFrames.
• NumPy can be used to perform a wide variety of mathematical operations on arrays. It adds powerful data structures to Python that guarantee efficient calculations with arrays and matrices, and it supplies an enormous library of high-level mathematical functions that operate on them.
• Matplotlib is a comprehensive library for creating static, animated and interactive visualizations in Python. Matplotlib makes easy things easy and hard things possible, from publication-quality plots to interactive figures that can zoom, pan and update.
• Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
• Scikit-learn (sklearn) is a useful and robust library for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modelling, including classification, regression, clustering and dimensionality reduction, via a consistent interface.
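A typical version of the import cell described by these bullets might look like this (a sketch, not the project's exact code; the CSV file name in the final comment is hypothetical):

```python
import pandas as pd            # tabular data handling (DataFrame)
import numpy as np             # arrays and numerical routines
import matplotlib
matplotlib.use("Agg")          # non-interactive backend, runs headless
import matplotlib.pyplot as plt
import seaborn as sns          # statistical plots built on matplotlib
from sklearn.model_selection import train_test_split  # scikit-learn utility

# Loading the dataset would then be a one-liner, e.g.:
# df = pd.read_csv("bitcoin.csv")   # hypothetical file name
```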
Importing data set

Data pre-processing:
Data pre-processing, a component of data preparation, describes any type of processing performed on raw data to prepare it for another data processing procedure. It has traditionally been an important preliminary step for the data mining process.


1. A dataset is a collection of information elements collated on a number of individuals, typically used for making comparisons or identifying patterns.

2. data.head(n) returns the first n rows.

3. data.describe() generates descriptive statistics.

DATA CLEANING AND TRANSFORMATION


1. The isnull() function returns True for each entry that is null (missing).

DEPT OF CSE 2022-2023 PAGE NO:69


BITCOIN PRICE PREDICTION

2. notnull() is a pandas function that examines one or multiple values to validate that they are not null.

3. The dropna() method removes the rows that contain NULL values.

4. The fillna() method replaces NULL values with a specified value.


5. The interpolate() function is basically used to fill NA values in a DataFrame or Series.

6. dropna(axis=...) removes the rows or the columns that contain NULL values.

7. Describes the data structure provided by pandas (e.g. df.info()).

8. nunique() returns the count of unique values in the DataFrame.
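Assuming a small stand-in DataFrame with missing values (the real dataset has the price columns), the operations listed above can be demonstrated as follows:

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame with missing values (the real dataset has price columns).
df = pd.DataFrame({"open": [1.0, np.nan, 3.0, 4.0],
                   "close": [1.5, 2.5, np.nan, 4.5]})

print(df.isnull().sum())    # nulls per column
print(df.notnull().all())   # whether each column is entirely non-null
print(df.dropna())          # drop rows containing any null
print(df.fillna(0))         # replace nulls with a fixed value
print(df.interpolate())     # fill nulls by linear interpolation
print(df.nunique())         # count of unique values per column
```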


Visualizing the data


1. The process of finding trends and correlations in our data by representing it pictorially is called data visualization. To perform data visualization in Python, we can use various data visualization modules such as Matplotlib, Seaborn and Plotly.

2. Confusion matrix:
The confusion matrix is a two-dimensional array that compares the predicted and actual category labels. For binary classification these are the True Positive, True Negative, False Positive and False Negative categories.
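A minimal sketch with invented labels shows how the four categories land in the matrix, using scikit-learn's `confusion_matrix`:

```python
from sklearn.metrics import confusion_matrix

# Invented actual vs predicted labels (1 = positive class, 0 = negative).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# Rows are actual classes, columns are predictions:
#   cm[0, 0] true negatives,  cm[0, 1] false positives
#   cm[1, 0] false negatives, cm[1, 1] true positives
```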


3. Plotting a graph using the Seaborn library and finding the correlation.

4. Using the Seaborn library and finding the frequency distribution.

DATA SPLITTING:
TRAINING AND TESTING DATA:
Data splitting is when data is divided into two or more subsets. Typically, with a two-part split, one part is used to evaluate or test the model and the other to train it.
1. The training set is the portion of data used to train the model. The model should observe and learn from the training set, optimizing its parameters.
2. The testing set is the portion of data that is tested against the final model and compared with the previous sets of data. The testing set acts as an evaluation of the final model and algorithm.
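The split described above is usually done with scikit-learn's `train_test_split`; the array shapes and labels below are illustrative stand-ins for the real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 samples with 4 features each; labels are a hypothetical up/down flag.
X = np.arange(400).reshape(100, 4)
y = np.random.RandomState(0).randint(0, 2, size=100)

# 80% of the rows train the model; the held-out 20% evaluate it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible between runs.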


5.2 APPLYING ALGORITHMS:


5.2.1. Logistic Regression

5.2.2. Support Vector Classifier


5.2.3. Decision Tree Classifier

5.2.4. KNN - K-Nearest Neighbour

5.2.5. Naive Bayes

5.2.6. Random Forest Classifier


5.3 RESULTS
5.3.1. Logistic Regression

5.3.2. Support Vector Classifier

5.3.3. Decision Tree Classifier


5.3.4. KNN - K-Nearest Neighbour

5.3.5. Naive Bayes

5.3.6. Random Forest Classifier


RESULTS

FINDING ALL MODELS' SCORES AND ACCURACY

Using the Seaborn library to plot a classification accuracy comparison of the models.


CONCLUSION

The dataset can not only be classified with the previously mentioned machine learning algorithms; there are many other algorithms and techniques that may perform better than these. Producing an accurate classifier that performs efficiently for financial applications is the main challenge we face in machine learning. The main algorithms implemented in this system were the Naïve Bayes algorithm, Decision Tree algorithm, KNN, Random Forest classifier, Logistic Regression and the SVM algorithm. Our main aim in this research is to discover the algorithm that performs fastest, most accurately and most efficiently. Random Forest surpassed all the other algorithms with an accuracy of 11.311053984575%. We therefore conclude that the Random Forest classification algorithm is best suited for handling this type of dataset. In the future, the designed system, with the machine learning classification algorithms used here, can be applied to other price prediction problems. The work can also be extended or improved for the automation of bitcoin price analysis, including some other machine learning algorithms.

Future Scope
• To work on a better user interface so that people can access these data easily and effortlessly.
• Implementing an IoT model for smart automatic analysis.
• Implementing more algorithms to find the best method for predicting the cryptocurrency price.


REFERENCES
1) Sin E, Wang L. Bitcoin price prediction using ensembles of neural networks. In: 2017 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery. IEEE. 2017; p. 666–671. doi:10.1109/FSKD.2017.8393351.
2) Shikhara A, Singh AK, Nagaya S, Saini PK. Bitcoin Price Alert and Prediction System using various Models. IOP Conference Series: Materials Science and Engineering. 2021;1131(1):012009. Available from: https://dx.doi.org/10.1088/1757-899x/1131/1/012009.
3) Mittal R, Arora S, Bhatia MP. Automated cryptocurrencies prices prediction using machine learning. ICTACT Journal on Soft Computing. 2018;8(4):1758–1761. Available from: http://ictactjournals.in/paper/IJSC_Vol_8_Iss_4_Paper_8_1758_1761.pdf.
4) Nakamoto S. Bitcoin: A peer-to-peer electronic cash system. 2019. Available from: https://git.dhimmel.com/bitcoin-whitepaper/.
5) Sebastião H, Godinho P. Forecasting and trading cryptocurrencies with machine learning under changing market conditions. Financial Innovation. 2021;7(1):1–30. Available from: https://dx.doi.org/10.1186/s40854-020-00217-x.
6) Phaladisailoed T, Numnonda T. Machine learning models comparison for bitcoin price prediction. 10th International Conference on Information Technology and Electrical Engineering. 2018; p. 506–511. doi:10.1109/ICITEED.2018.8534911.
7) Jaquart P, Dann D, Weinhardt C. Short-term bitcoin market prediction via machine learning. The Journal of Finance and Data Science. 2021;7:45–66. doi:10.1016/j.jfds.2021.03.001.
8) Rane PV, Dhage SN. Systematic erudition of bitcoin price prediction using machine learning techniques. 5th International Conference on Advanced Computing & Communication Systems (ICACCS). 2019; p. 594–598. doi:10.1109/ICACCS.2019.8728424.
9) Roy R, Roy S, Hossain MN, Allam MZ, Nazmul N. Study on nonlinear partial differential equation by implementing MSE method. Global Scientific Journals. 2020;8(1):1651–1665.
10) McKinney W. Pandas: a foundational Python library for data analysis and statistics. Python for High Performance and Scientific Computing. 2011;14(9):1–9. Available from: https://www.dlr.de/sc/portaldata/15/resources/dokumente/pyhpc2011/submissions/pyhpc2011_submission_9.pdf.

