Abstract
Breast cancer is a leading cause of death among women. According to cancer reports,
breast cancer has been constantly increasing worldwide in past years and is one of the most
dreaded diseases for women. Although the medical field has an enormous amount of data, certain tools
and techniques are needed to handle those data; classification is one of the main
techniques often used. This system predicts the possibility of breast cancer arising using
classification techniques, and provides the chance of breast cancer occurring in
terms of a percentage. A real-time dataset is used in this system in order to obtain an exact
prediction. The datasets are processed in the Python programming language using six
machine learning algorithms: Naïve Bayes, Decision Tree,
Support Vector Machine (SVM), K-Nearest Neighbours (KNN), Random Forest, and Logistic Regression.
The aim of the system is to show which algorithms are best to use for prediction
tasks in the medical field. Algorithm results are reported in terms of the accuracy rate,
efficiency, and effectiveness of each algorithm.
Chapter 1
INTRODUCTION
Machine learning is the study of algorithms and statistical models that computer systems
use to perform a specific task effectively without explicit instructions. Machine
learning is a sub-field of artificial intelligence concerned with constructing algorithms
that can make accurate predictions about future results. Machine learning algorithms build
a mathematical model of sample data, known as the "training set", in order to make predictions
without being explicitly programmed to perform the task. Classification rules are typically
useful for medical problems and have been applied mainly in the medical field.
Machine Learning is the field of study that gives computers the capability to learn without
being explicitly programmed. ML is one of the most exciting technologies that one would have
ever come across. As is evident from the name, it gives the computer the quality that makes it
more similar to humans: the ability to learn. Machine learning is actively being used today,
perhaps in many more places than one would expect.
Data mining is the extraction of information and knowledge from huge amounts of data, and it
is an essential step in discovering knowledge from databases. There are numerous databases,
data marts, and data warehouses all over the world. Data mining is mainly used to extract hidden
information from large databases. Data mining is also called Knowledge Discovery in
Databases (KDD).
Data mining has four main techniques, namely classification, clustering, regression, and
association rules. Data mining techniques have the ability to rapidly mine vast amounts of data,
and are needed in many fields to extract useful information from large volumes of data.
Fields such as medicine, business, and education have vast amounts of data, and data from these
fields can be mined with those techniques to obtain more useful information. Data mining
techniques can be implemented through machine learning algorithms, and each technique can be
extended using certain machine learning models.
Data pre-processing is the process of preparing raw data and making it suitable for a
machine learning model. It is the first and most crucial step in creating a machine learning model.
When creating a machine learning project, the data we come across are not always clean and
formatted, and before doing any operation with the data it is mandatory to clean them and
put them in a formatted way. For this, we use the data pre-processing task.
1.2 OBJECTIVE
The proposed machine-learning approaches could predict breast cancer; early
detection of this disease could help slow down its progress and reduce
the mortality rate through appropriate therapeutic interventions at the right time.
Applying different machine learning approaches, having access to bigger datasets from
different institutions (multi-centre studies), and considering key features from a variety
of relevant data sources could improve the performance of modelling.
The primary concern of this research is to find answers to queries relevant to the
classification of breast cancer through deep learning schemes using various multi-imaging
modalities. The following queries are considered while designing this comprehensive study.
2. Types of dataset (public and private) used to build deep learning
classification models.
CHAPTER 2
LITERATURE SURVEY
Cancer is the second leading cause of death among women worldwide. Cancer is a
disorder of lethal cells which, if left untreated, leads to indolent lesions and mortality.
Abnormal cells are created as a result of a genetic mutation, grow out of control, and
become cancerous due to changes in their deoxyribonucleic acid. A benign (non-cancerous)
tumour does not invade neighbouring tissue, while a malignant (cancerous) tumour spreads to
multiple parts of the body via the lymphatic system and elicits nutrients from the body tissues.
The most dominant cancer types are lymphoma, sarcoma, carcinoma, leukaemia, and
melanoma. Carcinoma is the most widely diagnosed form of cancer.
The breast tissues comprise connective tissue, blood vessels, lymph nodes,
and lymph vessels. Figure 1a shows the anatomy of the female breast. Breast cancer often develops
when the breast tissues grow abnormally and cell division is not controlled, which results in the
formation of a tumour. The developed tumour can be invasive or non-invasive and usually
starts in the milk ducts or the lobules. Invasive cancer may reach the lymph nodes and spread to
different organs through blood vessels, although cancerous cells often remain separated from the
tumour. Moreover, breast cancer is classified into various subtypes based on
morphology, shape, and structure.
Early identification of breast cancer can assist the prognosis process, which can
successfully mitigate serious complications of the disease with higher recovery rates. Various
medical multi-imaging modalities such as digital mammography breast X-ray images
(DMG), ultrasound sonograms (ULS), magnetic resonance imaging (MRI), biopsy
(histological images), and computed tomography (CT) are exercised for breast cancer
screening and classification. The auto-detection of lesions, lesion volume, and lesion contour in
mammography images is a prominent sign, most significant in detecting the distorted
edge of a malignant tumour and the smooth edge of a benign one. Figure 1b demonstrates the
benign and malignant masses in a digital mammogram. This truly helps radiologists in
investigating malignancy and quickly analysing lesions to forbid avoidable biopsies.
Initially, the radiologists analyse the images manually, and final decisions are suggested after
mutual consensus with other experts. The availability of many radiologists at the same time
is a key issue in under-developed countries. Moreover, the precise analysis of multi-class
images depends upon the experience and domain knowledge of the radiologist.
Recently, various machine learning (ML), artificial intelligence (AI), and neural network
schemes have been exercised for image processing. The key aim of a CAD system is to
build an authentic and reliable system that can limit observational oversights and can assist in
separating benign and malignant lesions with higher accuracy. These systems are used to
enhance image quality for human judgment and to automate the readability of images
for better understanding and interpretation. Currently, various articles on breast cancer
detection, segmentation, and classification using ML and AI techniques have been
published. Most of the previous studies emphasized ML schemes using binary classification
for the detection of particular cancers such as lung, brain, skin, stomach,
kidney, and breast cancer.
Jaffar et al. and Khan et al. proposed novel deep-learning-based models for breast cancer
screening and classification using mammographic images. Qiu et al. proposed a technique
based on deep learning methods that classifies breast masses without lesion segmentation
and feature selection. Samala et al. performed breast cancer binary classification by reducing
the computational complexity for all types of mammographic images. Nascimento et al.
extracted morphological features from ULS images using binary classification. Youk et al.
proposed a new ULS technique, named Elastography, to differentiate the benign and
malignant lesions of breast cancer. Other authors developed deep-learning-based techniques
for suspicious ROI segmentation and classification using MRI modalities. Rasti et al.
developed a robust DL model for ROI segmentation and breast tumour classification using
segmented DCE-MRI images. De Nazar et al. proposed a model that selects a variable
threshold value for the segmentation of breast masses. Choi et al. designed a CAD
model that extracts the ROI before breast cancer classification. ROI extraction is the
seclusion of abnormal breast tissue from irrelevant regions, which increases the accuracy and
also the number of images available for training and testing. Casti et al. used a QDA-LDA model
for auto-localization and classification of asymmetric ROIs, because this is directly related to the
accuracy of the doctor's prediction and treatment. Nahid et al. [33] proposed an approach that
extracts ROI patches from HP images for the classification of invasive and non-invasive
breast cancer by CNN. Bejnordi et al. and Feng et al. classified breast biopsy
WSIs into different categories through deep convolutional neural networks and achieved the
highest accuracy in binary classification of cancerous slides. Punitha et al. used a
depigmentation technique to overcome the problem of merging neighbouring regions that
have similar properties. Strange et al. focused on the classification and distribution of
microcalcifications based on a topological model and morphological aspects.
The key objective of this review is to assist researchers in developing a novel and robust
CAD tool which is computationally efficient and can help radiologists during the classification
of breast abnormalities. This comprehensive review has exploited key research directions
based on various multi-image modalities, image segmentation approaches, feature extraction
techniques, types of DL and ML algorithms, and the performance parameters used to evaluate the
classification models. Statistical analysis of CAD systems considering different aspects is
also highlighted through graphical and tabular representations. The following are the key research
findings:
As per the literature, it is observed that there are huge variations in the shapes of (abnormal)
breast tissues, so benchmarks can be drawn from the screening process. Micro-calcification
morphology is another significant factor for defining the ROI, based on the
distance between individual micro-calcifications. A fixed-scale approach uses the distance
between individual calcifications to define a micro-calcification cluster, while the
invariant-scale approach is a novel pixel-level approach that visualizes various morphological
aspects (i.e., calcification cluster shape, size, density, and distribution) for the radiologist.
Furthermore, histogram-based methods with selection of an optimal threshold are an efficient
approach for the segmentation and classification of masses and calcifications. From the literature,
it is also evident that no study has implemented this approach before; a novel CAD
system needs to be developed based on this approach to classify calcifications and masses.
Content-based image retrieval is a new approach based on mammogram indexing and ROI
patch classification. From the literature, it is found that no study has used indexing on ROI
patches to classify calcification and masses in mammograms. An indexing and ROI
classification-based CAD system therefore needs to be developed with the help of expert
radiologists to get precise results. Furthermore, some challenges faced by DL algorithms for
breast cancer diagnostics relate to ultrasound images because of their low signal-to-noise ratio
(SNR) compared to other modalities. However, the echogram is a new ULS imaging technology,
which is much cheaper for breast screening, so the development of a new DL algorithm to break
through echogram image analysis is a significant task. CT and MRI image modalities produce
spatial 3D data which are very large in size and need higher computational resources; the design
of lightweight models is thus an interesting research direction for training and inference.
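As a concrete illustration of the histogram-based optimal-threshold idea above, Otsu's method selects the threshold that maximizes the between-class variance of the grey-level histogram. Below is a minimal NumPy sketch; it is an illustrative implementation, not code from any of the reviewed CAD systems:

```python
import numpy as np

def otsu_threshold(image, nbins=256):
    """Return the grey level that maximizes between-class variance."""
    hist, edges = np.histogram(image.ravel(), bins=nbins)
    centers = (edges[:-1] + edges[1:]) / 2
    hist = hist.astype(float)

    weight_lo = np.cumsum(hist)                  # pixels at or below each bin
    weight_hi = np.cumsum(hist[::-1])[::-1]      # pixels at or above each bin
    mean_lo = np.cumsum(hist * centers) / np.maximum(weight_lo, 1e-12)
    mean_hi = (np.cumsum((hist * centers)[::-1])
               / np.maximum(weight_hi[::-1], 1e-12))[::-1]

    # Between-class variance for a cut between bins i and i+1
    variance = weight_lo[:-1] * weight_hi[1:] * (mean_lo[:-1] - mean_hi[1:]) ** 2
    return centers[np.argmax(variance)]
```

Thresholding an image patch with `patch > otsu_threshold(patch)` then yields a binary mask separating dense (mass or calcification) pixels from background.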
CHAPTER 3
SOFTWARE AND HARDWARE REQUIREMENTS
SOFTWARE REQUIREMENTS
OPERATING SYSTEM
• Windows
SOFTWARE TOOLS
• Jupyter Notebook
• Python
• Anaconda
HARDWARE REQUIREMENTS
• Processor i3 and above
• 4 GB RAM
• 500GB hard disk
CHAPTER 4
SYSTEM DEVELOPMENT PROCESS
4.1 MODEL USED
4.1.1 REQUIREMENTS
This is the first phase of the model. This phase defines what needs to be designed, what
its functions are, and what the purpose is. Specification of the input, output, or final product
is studied in this phase.
4.1.3 IMPLEMENTATION
The system is developed with small programs known as units. The input for these programs
is taken from the previous phase. All these units are integrated together at a later stage.
Each unit is developed and tested separately in order to check its function, and
this type of testing is known as unit testing.
4.1.6 MAINTENANCE
This is the final step, and it occurs after installation of the product. In this phase,
modifications are made to the system in order to improve system performance.
Naïve: It assumes that the occurrence of a certain feature is independent of the occurrence
of other features. For example, if a fruit is identified on the basis of colour, shape, and taste,
then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature
individually contributes to identifying it as an apple, without depending on the others.
Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
The formula for Bayes' theorem is given as:
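In standard notation, Bayes' theorem is:

P(A|B) = P(B|A) · P(A) / P(B)

where P(A|B) is the posterior probability of hypothesis A given evidence B, P(B|A) is the likelihood of the evidence given the hypothesis, P(A) is the prior probability of the hypothesis, and P(B) is the probability of the evidence.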
CHAPTER 5
METHODOLOGY
5.1 IMPLEMENTATION OF THE CODE
Importing libraries
Data pre-processing:-
Data pre-processing, a component of data preparation, describes any type of processing
performed on raw data to prepare it for another data processing procedure. It has traditionally
been an important preliminary step for the data mining process.
1. A dataset consists of information elements collated on a number of individuals, typically
used for making comparisons or identifying patterns.
2. notnull() is a pandas function that examines one or multiple values to validate that they
are not null.
3. The dropna() method removes the rows that contain NULL values.
4. The fillna() method replaces NULL values with a specified value.
5. The interpolate() function fills NA values in a DataFrame or Series.
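A minimal sketch of the pandas functions listed above, on a small hypothetical table with missing values (the column names are illustrative, not from the project's dataset):

```python
import pandas as pd
import numpy as np

# Hypothetical patient records with missing values
df = pd.DataFrame({
    "age":  [45, 52, np.nan, 61],
    "size": [1.2, np.nan, 3.4, 2.0],
})

mask = df.notnull()        # True where values are present
dropped = df.dropna()      # keep only complete rows
filled = df.fillna(0)      # replace NaN with a fixed value
smooth = df.interpolate()  # fill NaN from neighbouring values (linear)
```

dropna() is appropriate when complete rows are plentiful; fillna() and interpolate() preserve the row count when data is scarce.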
2. Confusion matrix:
The confusion matrix is a two-dimensional array that compares the predicted and actual
category labels. For binary classification, its entries are the True Positive, True Negative,
False Positive, and False Negative counts.
3. Plotting the graph using the seaborn library and finding the correlation.
4. Using the confusion matrix to compare the predicted and actual values.
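A minimal sketch of computing the matrix with scikit-learn, on hypothetical labels (1 = malignant, 0 = benign):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted class labels
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(actual, predicted)
tn, fp, fn, tp = cm.ravel()          # the four binary-classification counts
accuracy = (tp + tn) / cm.sum()      # accuracy derived from the matrix
```

The same matrix can be drawn as a heatmap with seaborn's `heatmap(cm, annot=True)` for the correlation-style plot mentioned above.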
DATA SPLITTING:
TRAINING AND TESTING DATA:
Data splitting is when data is divided into two or more subsets. Typically, with a two-part
split, one part is used to evaluate or test the model and the other to train it.
1. The training set is the portion of data used to train the model. The model observes
and learns from the training set, optimizing its parameters.
2. The testing set is the portion of data evaluated with the final model and compared
against the previous sets of data. The testing set acts as an evaluation of the final model and
algorithm.
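The two-part split described above can be sketched with scikit-learn's train_test_split; here the library's built-in Wisconsin breast cancer dataset stands in for the project's real-time dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Built-in Wisconsin breast cancer data stands in for the project's dataset
X, y = load_breast_cancer(return_X_y=True)

# 80% of the rows train the model, 20% are held out for testing;
# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```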
RESULTS
CONCLUSION :
Medical datasets cannot only be classified with the previously mentioned machine learning
algorithms; there are many other algorithms and techniques which may perform better
than these. Producing an accurate classifier which performs efficiently for medical
applications is the main challenge we face in machine learning. The main algorithms
implemented in this system were Naïve Bayes, Decision Tree,
KNN, Random Forest, Logistic Regression, and SVM. Our main aim
for the research is to discover the algorithm which performs fastest, most accurately, and most
efficiently. Logistic Regression surpasses all the other algorithms with an accuracy of
85.5964%. Thus I conclude this project by saying that the Logistic Regression classification
algorithm is best suited to handling the medical data set. In the future, the designed
system with the machine learning classification algorithms used can be applied to predict or
diagnose other diseases. The work can be extended or improved for the automation of
breast cancer analysis, including some other machine learning algorithms.
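The comparison drawn in this conclusion can be reproduced in outline on scikit-learn's built-in Wisconsin breast cancer dataset, which stands in for the project's real-time dataset; the accuracies will therefore differ from the 85.5964% reported above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Feature scaling helps SVM, KNN, and logistic regression
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "Naive Bayes":         GaussianNB(),
    "Decision Tree":       DecisionTreeClassifier(random_state=42),
    "KNN":                 KNeighborsClassifier(),
    "Random Forest":       RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM":                 SVC(),
}
# Held-out accuracy for each of the six classifiers
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.4f}")
```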
REFERENCES
[1] Jemal A, Murray T, Ward E, Samuels A, Tiwari RC, Ghafoor A, Feuer EJ, Thun MJ.
Cancer statistics, 2005. CA: a cancer journal for clinicians. 2005 Jan 1;55(1):10-30.
[2] Polat K, Güneş S. Breast cancer diagnosis using least square support vector machine.
Digital Signal Processing. 2007 Jul 1;17(4):694- 701.
[3] Akay MF. Support vector machines combined with feature selection for breast cancer
diagnosis. Expert systems with applications. 2009 Mar 1;36(2):3240-7.
[4] Yeh WC, Chang WW, Chung YY. A new hybrid approach for mining breast cancer
pattern using discrete particle swarm optimization and statistical method. Expert Systems
with Applications. 2009 May 1;36(4):8204-11.
[5] Marcano-Cedeño A, Quintanilla-Domínguez J, Andina D. WBCD breast cancer database
classification applying artificial metaplasticity neural network. Expert Systems with
Applications. 2011 Aug 1;38(8):9573-9.
[6] Kaya Y, Uyar M. A hybrid decision support system based on rough set and extreme
learning machine for diagnosis of hepatitis disease. Applied Soft Computing. 2013 Aug
1;13(8):3429-38.
[7] Nahato KB, Harichandran KN, Arputharaj K. Knowledge mining from clinical datasets
using rough sets and backpropagation neural network. Computational and mathematical
methods in medicine. 2015;2015.
[8] Liu L, Deng M. An evolutionary artificial neural network approach for breast cancer
diagnosis. In Knowledge Discovery and Data Mining, 2010. WKDD'10. Third International
Conference on 2010 Jan 9 (pp. 593-596). IEEE.
[9] Chen HL, Yang B, Liu J, Liu DY. A support vector machine classifier with rough set-
based feature selection for breast cancer diagnosis. Expert Systems with Applications. 2011
Jul 1;38(7):9014-22
Abstract
It is vital that credit card companies are able to identify fraudulent credit card
transactions so that customers are not charged for items that they did not purchase. Such
problems can be tackled with Data Science, whose importance, along with Machine Learning,
cannot be overstated. This project intends to illustrate the modelling of a data set using
machine learning for Credit Card Fraud Detection. The Credit Card Fraud Detection problem
includes modelling past credit card transactions with knowledge of the ones that turned out to be
fraud. This model is then used to recognize whether a new transaction is fraudulent or
not. Our objective here is to detect 100% of the fraudulent transactions while minimizing
incorrect fraud classifications. Credit Card Fraud Detection is a typical example of
classification. In this process, we have focused on analysing and pre-processing data sets
as well as the deployment of multiple anomaly detection algorithms, such as Local Outlier
Factor and Isolation Forest, on the PCA-transformed Credit Card Transaction data.
Chapter 1
INTRODUCTION
OBJECTIVE
The key objective of any credit card fraud detection system is to identify suspicious events and
report them to an analyst while letting normal transactions be automatically processed.
For years, financial institutions have been entrusting this task to rule-based systems that employ
rule sets written by experts. But now they increasingly turn to a machine learning approach, as
it can bring significant improvements to the process.
2. Less manual work needed for additional verification. Enhanced accuracy reduces the
burden on analysts. “People are unable to check all transactions manually, even if we are
talking about a small bank,” Alexander Konduforov, data science competence leader at
AltexSoft, explains. “ML-driven systems filter out, roughly speaking, 99.9 percent of normal
patterns leaving only 0.1 percent of events to be verified by experts.”
3. Fewer false declines. False declines or false positives happen when a system identifies a
legitimate transaction as suspicious and wrongly cancels it.
4. Fewer false declines. False declines or false positives happen when a system identifies a
legitimate transaction as suspicious and wrongly cancels it. Ability to identify new patterns
and adapt to changes: unlike rule-based systems, ML algorithms can keep pace with a
constantly changing environment and financial conditions. They
enable analysts to identify new suspicious patterns and create new rules to prevent new types
of scams.
CHAPTER 2
LITERATURE SURVEY
Prajwal Save et al. [18] have proposed a model based on a decision tree and a combination
of Luhn's and Hunt's algorithms. Luhn's algorithm is used to determine whether an incoming
transaction is fraudulent or not; it validates credit card numbers via the input, which is the
credit card number. Address Mismatch and Degree of Outlierness are used to assess the
deviation of each incoming transaction from the cardholder's normal profile. In the final
step, the general belief is strengthened or weakened using Bayes' Theorem, followed by
recombination of the calculated probability with the initial belief of fraud using an
advanced combination heuristic. Vimala Devi J. et al. [19] presented and implemented three
machine-learning algorithms, the Support Vector Machine, Random Forest, and Decision Tree,
to detect counterfeit transactions. There are many measures used to evaluate the performance
of such classifiers or predictors; these metrics are either prevalence-dependent
or prevalence-independent. Furthermore, these techniques were applied in credit card fraud
detection mechanisms, and the results of the algorithms were compared. Popat and
Chaudhary [20] presented supervised algorithms: Deep Learning, Logistic Regression,
Naïve Bayes, Support Vector Machine (SVM), Neural Network, Artificial Immune
System, K-Nearest Neighbour, Data Mining, Decision Tree, Fuzzy-logic-based systems,
and Genetic Algorithm are some of the techniques used. Credit card fraud detection
algorithms identify transactions that have a high probability of being fraudulent; the authors
compared machine-learning algorithms for prediction, clustering, and outlier detection.
Shiyang Xuan et al. [21] used the Random Forest classifier to train the behavioural
characteristics of credit card transactions. Two types are used to train the normal
and fraudulent behaviour features: random forest based on random trees and random forest
based on CART. To assess the model's effectiveness, performance measures were
computed. Dornadula and Geetha [5] aggregated the transactions into respective groups
using the Sliding-Window method, i.e., some features from the window were
extracted to find the cardholder's behavioural patterns. Features such as the maximum amount,
the minimum amount of a transaction, the average amount in the window, and even the elapsed
time are available. Sangeeta Mittal et al. [22] selected some popular machine learning
algorithms in the supervised and unsupervised categories to evaluate the underlying problems.
A range of supervised learning algorithms, from classical to modern, have
been considered. These include tree-based algorithms, classical and deep neural networks,
hybrid algorithms, and Bayesian approaches. The effectiveness of machine-learning
algorithms in detecting credit card fraud has been assessed. On various metrics, a number
CHAPTER 3
SOFTWARE AND HARDWARE REQUIREMENTS
SOFTWARE REQUIREMENTS
OPERATING SYSTEM
• Windows
SOFTWARE TOOLS
• Jupyter Notebook
• Python
• Anaconda
HARDWARE REQUIREMENTS
• Processor i3 and above
• 4 GB RAM
• 500GB hard disk
CHAPTER 4
4.1.1 REQUIREMENTS
This is the first phase of the model. This phase defines what needs to be designed, what
its functions are, and what the purpose is. Specification of the input, output, or final product
is studied in this phase.
4.1.3 IMPLEMENTATION
The system is developed with small programs known as units. The input for these programs
is taken from the previous phase. All these units are integrated together at a later stage.
Each unit is developed and tested separately in order to check its function, and this
type of testing is known as unit testing.
4.1.6 MAINTENANCE
This is the final step, and it occurs after installation of the product. In this phase, modifications
are made to the system in order to improve system performance.
Hence each feature individually contributes to identifying it as an apple, without depending
on the others.
Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
The formula for Bayes' theorem is given as:
CHAPTER 5
METHODOLOGY
5.1 IMPLEMENTATION OF THE CODE
Importing libraries
Data pre-processing:-
1. A dataset consists of information elements collated on a number of individuals, typically
used for making comparisons or identifying patterns.
2. notnull() is a pandas function that examines one or multiple values to validate that they
are not null.
3. The dropna() method removes the rows that contain NULL values.
4. The fillna() method replaces NULL values with a specified value.
5. The interpolate() function fills NA values in a DataFrame or Series.
2. Confusion matrix:
The confusion matrix is a two-dimensional array that compares the predicted and actual
category labels. For binary classification, its entries are the True Positive, True Negative,
False Positive, and False Negative counts.
3. Plotting the graph using the seaborn library and finding the correlation.
4. Using the confusion matrix to compare the predicted and actual values.
DATA SPLITTING:
TRAINING AND TESTING DATA:
Data splitting is when data is divided into two or more subsets. Typically, with a two-part split,
one part is used to evaluate or test the model and the other to train it.
1. The training set is the portion of data used to train the model. The model observes and
learns from the training set, optimizing its parameters.
2. The testing set is the portion of data evaluated with the final model and compared against
the previous sets of data. The testing set acts as an evaluation of the final model and algorithm.
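The Local Outlier Factor and Isolation Forest detectors named in the abstract can be sketched on synthetic two-dimensional data standing in for the PCA-transformed transactions; the cluster positions and contamination rate here are illustrative, not taken from the project's data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# 500 "normal" transactions clustered near the origin, 10 far-off "frauds"
normal = rng.normal(0, 1, size=(500, 2))
fraud = rng.normal(8, 1, size=(10, 2))
X = np.vstack([normal, fraud])

# Both detectors label inliers +1 and outliers -1
iso = IsolationForest(contamination=0.02, random_state=0).fit(X)
iso_labels = iso.predict(X)

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
lof_labels = lof.fit_predict(X)
```

Both methods are unsupervised: they need no fraud labels at training time, which suits the heavily imbalanced credit card setting.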
5.3 RESULTS
5.3.1 Logistic Regression
CONCLUSION
Credit card fraud has become a serious concern worldwide and brings huge financial losses.
This has urged credit card companies to invest in creating and developing
techniques to reveal and reduce fraud. The prime goal of this study is to identify algorithms
that are appropriate, and can be adopted by credit card companies, for identifying fraudulent
transactions more accurately, in less time, and at lower cost. Different machine learning
algorithms are compared, including Logistic Regression, Decision Trees, Random Forest,
Naïve Bayes, and K-Nearest Neighbours. Because not all scenarios are the same, a
scenario-based approach can be used to determine which algorithm is the best fit for a given
scenario. All of the fraud detection techniques discussed in this survey have advantages
and disadvantages. Researchers use different performance measures and algorithms to
predict and identify fraudulent transactions. Further studies are encouraged to improve the
fraud detection basis by determining weights suitable for the cost factors, the tested accuracy,
and the detection accuracy. Surveys of this kind will allow researchers to build a hybrid
approach most accurate for fraudulent credit card transaction detection.
REFERENCES
[1] S. H. Projects and W. Lovo, "JMU Scholarly Commons: Detecting credit card fraud: An
analysis of fraud detection techniques," 2020.
[2] S. G and J. R. R, "A Study on Credit Card Fraud Detection using Data Mining
Techniques," Int. J. Data Min. Tech. Appl., vol. 7, no. 1, pp. 21–24, 2018,
doi: 10.20894/ijdmta.102.007.001.004.
[3] "Credit Card Definition." https://www.investopedia.com/terms/c/creditcard.asp (accessed
Apr. 03, 2021).
[4] K. J. Barker, J. D'Amato, and P. Sheridon, "Credit card fraud: awareness and prevention,"
J. Financ. Crime, vol. 15, no. 4, pp. 398–410, 2008, doi: 10.1108/13590790810907236.
[5] V. N. Dornadula and S. Geetha, "Credit Card Fraud Detection using Machine Learning
Algorithms," Procedia Comput. Sci., vol. 165, pp. 631–641, 2019, doi: 10.1016/j.procs.2020.01.057.
[6] A. H. Alhazmi and N. Alekhine, "A Survey of Credit Card Fraud Detection Using Machine
Learning," 2020 Int. Conf. Comput. Inf. Technol. ICCIT 2020, pp. 10–15, 2020, doi:
10.1109/ICCIT-144147971.2020.9213809.
[7] B. Wickramanayake, D. K. Garganega, C. Ouyang, and Y. Xu, "A survey of online card
payment fraud detection using data mining-based methods," arXiv, 2020.
[8] A. Agarwal, "Survey of Various Techniques used for Credit Card Fraud Detection," Int.
J. Res. Appl. Sci. Eng. Technol., vol. 8, no. 7, pp. 1642–1646, 2020, doi:
10.22214/ijraset.2020.30614.
[9] C. Reviews, "A Comparative Study: Credit Card Fraud," vol. 7, no. 19, pp. 998–1011,
2020.
[10] R. Sailusha, V. Gnaneswar, R. Ramesh, and G. Ramakoteswara Rao, "Credit Card Fraud
Detection Using Machine Learning," Proc. Int. Conf. Intell. Comput. Control Syst. ICICCS
2020, pp. 1264–1270, 2020, doi: 10.1109/ICICCS48265.2020.9121114.
Abstract
In this paper, we propose to predict the Bitcoin price accurately, taking into consideration
various parameters that affect the Bitcoin value. By gathering information from different
reference papers and applying it in real time, I found the advantages and disadvantages of
bitcoin price prediction.
Every paper has its own set of methodologies for bitcoin price prediction. Many papers
achieve an accurate price and some do not, but the time complexity of those predictions is
high. To reduce the time complexity, in this paper we use algorithms linked to artificial
intelligence: Naïve Bayes, Decision Tree, Support Vector Machine (SVM), KNN, Random
Forest, and Logistic Regression. Individually these do not offer great time management,
but obtaining results from a larger database is quick. For this purpose we draw a comparison
between the algorithms; this survey paper helps upcoming researchers to make an impact in
their own papers. In the first stage of the research, we aim to understand and find daily trends
in the Bitcoin market while gaining insight into the optimal features surrounding the Bitcoin
price. Our data set consists of various features relating to the Bitcoin price and payment
network over the course of several years, recorded daily. After pre-processing the dataset, we
apply some data mining techniques to reduce the noise in the data. In the second stage of our
research, using the available information, we predict the sign of the daily price change with the
highest possible accuracy.
Chapter 1
INTRODUCTION
Bitcoin is a cryptocurrency which is used worldwide for digital payment or simply for
investment purposes. Bitcoin is decentralized, i.e. it is not owned by anyone. Transactions
made with Bitcoin are simple because they are not tied to any country. Investment can be done
through various marketplaces known as "bitcoin exchanges", which enable people to buy and
sell Bitcoins using various currencies. The largest Bitcoin exchange is Mt. Gox. Bitcoins are
stored in a digital wallet, which is essentially like a virtual bank account. The record of all
transactions, along with the timestamp data, is stored in a structure called the blockchain. Each
record in a blockchain is known as a block, and each block contains a pointer to a previous
block of data. The data on the blockchain is encrypted; during transactions the user's name is
not revealed, only their wallet ID is made public. Bitcoin's value fluctuates much like a stock,
though in a different way. There are various algorithms applied to stock market data for price
prediction; however, the parameters influencing Bitcoin are different. It is therefore important
to predict the value of Bitcoin so that correct investment decisions can be made. Unlike the
stock market, the price of Bitcoin does not depend on business events or intervening
governments. Hence, to predict its value, we feel it is necessary to use AI technology to
forecast the price of Bitcoin.
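The hash-pointer structure described above (each block pointing to its predecessor) can be sketched as a toy in Python. This is an illustrative example only, not Bitcoin's actual block format; the field names here are invented for the sketch.

```python
import hashlib
import json

def make_block(data, prev_hash):
    # Each block stores its payload plus the hash of the previous
    # block, forming the chain of pointers described above.
    block = {"data": data, "prev_hash": prev_hash}
    block["hash"] = hashlib.sha256(
        json.dumps(block, sort_keys=True).encode()).hexdigest()
    return block

genesis = make_block("genesis", "0" * 64)
block1 = make_block("tx: wallet A -> wallet B", genesis["hash"])

# Altering an earlier block would change its hash and thereby
# invalidate every later block's pointer.
print(block1["prev_hash"] == genesis["hash"])  # True
```

Because each block's hash covers the previous block's hash, tampering anywhere in the chain is detectable by re-verifying the pointers.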
Bitcoin refers to virtual money which is widely utilized for both transaction and investment
purposes. Bitcoin is a decentralized currency, which implies that it is not owned by a single
person or group. Bitcoins are simple to use since they are not attached to any country. Using a
bitcoin exchange is the best way to invest in bitcoins; individuals can buy and sell bitcoins
using a variety of currencies. As of January 2017, 170 hedge funds had been launched in
cryptocurrencies, driving up the demand for Bitcoin in both trading and hedging futures. Many
conspiracy theories have been advanced to explain the causes of its high volatility, and these
ideas have further been used to support the claim that cryptocurrency values will continue to
fluctuate in the future. Another way to look at this is to engage in automated bitcoin trading.
Figure 1 shows the perspective view of BTC price prediction.
To forecast BTC values, machine learning and neural network models utilize numerical
historical data. A recurrent neural network is an artificial neural network with directed graph
nodes and connections that are constructed progressively, similar to synapses in the real brain.
LSTM is an artificial RNN architecture commonly used in deep learning which, in addition to
analysing single data points, can integrate entire sequences of data. Virtual currency is a
recently evolved worldwide phenomenon; thus it maintains a consistent identity, structure and
function. On the other hand, it is increasingly recognized as a superior financial medium with
significant potential as time progresses. The development of Bitcoin was intended to reduce
the use of third parties like banks, credit cards and governments, and to decrease transaction
time and money transfer costs. Figure 2 shows the original data for the last 5 years of the BTC
price from a registered website source, as mentioned in the figure below.
Bitcoin is among the virtual currencies with a considerable future ahead of it. Most
cryptocurrencies, especially the most popular ones, are largely Bitcoin clones. Because of this
it has gained a lot of interest, and several papers have been published utilizing both statistical
and machine learning techniques. Statistics is a collection of techniques developed over time
to provide data summaries and quantify various features of a collection, such as a specific set
of observations. To better comprehend ML algorithms, a firm grounding in statistical
techniques should be gained. While statistical techniques work by obtaining relevant
information through proper analysis of the dataset, ML looks for patterns in the dataset and
attempts to draw conclusions much as humans would. It is possible to build a time-series
dataset for Bitcoin using different choices: theoretically, the Bitcoin dataset has a granular
temporal period, so different sampling periods yield distinct datasets that may be gathered.
Another feature of the Bitcoin ecosystem is that all transactions are transparent to everyone.
Researchers may leverage the dataset's causal connections, in addition to existing blockchain
characteristics such as volume, to incorporate new features from the blockchain.
Cryptocurrency market participants are those who analyse the influence of networks on the
market, calculated from cryptocurrency exchange rates. As a result, Bitcoin is considered the
market leader, and consistent network effects were taken into account since they provide
stronger evidence. The researchers also discuss a well-known machine learning library, the
one with the highest number of users; this machine learning tool is very helpful for developing
suitable algorithms. The authors discuss the library's simplicity and efficacy to explain the
benefits of Sci-Kit Learn, how the library is integrated into the Python environment, and the
implementation challenges that developers encounter while using this tool.
1.1 OBJECTIVE
This paper explains the working of the linear regression and Long Short-Term Memory
(LSTM) models in predicting the value of Bitcoin. Due to its rising popularity, Bitcoin has
become an investment vehicle and works on blockchain technology, which has also given rise
to other cryptocurrencies. This makes its value very difficult to predict, and hence this
predictor is tested with the help of a machine learning algorithm and an artificial neural
network model.
The primary concern of this research is to find answers to queries relevant to the prediction of
the Bitcoin price through deep learning schemes. The following queries are considered while
designing this comprehensive study.
• Types of datasets (public and private) used to build deep learning classification models.
• Types of DL and ML classifiers recently used for Bitcoin price prediction.
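As a minimal sketch of the linear-regression side of this objective, the model below predicts each day's close from the previous five closes. The price series here is synthetic stand-in data (a random walk), not the report's real dataset, and the five-day lag window is an assumption for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for a daily closing-price series (random walk);
# the real project would load its historical BTC dataset instead.
rng = np.random.default_rng(0)
prices = np.cumsum(rng.normal(0, 50, 400)) + 20000

lag = 5  # predict today's close from the previous five closes
X = np.array([prices[i:i + lag] for i in range(len(prices) - lag)])
y = prices[lag:]

# Chronological split: fit on the first 300 days, score on the rest.
model = LinearRegression().fit(X[:300], y[:300])
print("held-out R^2:", model.score(X[300:], y[300:]))
```

The chronological split (rather than a shuffled one) matters here: shuffling a time series would let the model train on days that come after the ones it is tested on.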
CHAPTER 2
LITERATURE SURVEY
We have all wondered where Bitcoin prices will be one year, two years, five years or even ten
years from now. It is really difficult to predict, yet every one of us loves to do it. Tremendous
profits can be made by buying and selling Bitcoins, when done correctly. It has proven to be a
fortune for many people in the past and is still making them a lot of money today. But this
does not come without a downside: if not thought through and calculated properly, you can
lose a lot of money too. You should have a thorough comprehension of how and precisely why
Bitcoin prices change (supply and demand, regulations, news, and so forth), which implies you
should understand how individuals make their Bitcoin predictions. Besides these factors, one
must also think about the technology of Bitcoin and its progress. This aside, we now have to
deal with the technical parts using various algorithms and technologies which can predict
precise Bitcoin prices. We came across various models currently in use, such as the Naïve
Bayes algorithm, Decision Tree algorithm, Support Vector Machine (SVM) algorithm, KNN,
random forest classifier and logistic regression, along with machine learning and deep neural
network concepts. Normally a time series is a sequence of numbers along time; since this is a
time-series dataset, the overall data should be split into two parts, inputs and outputs.
Moreover, random forest compares well with classic statistical linear models, since it can
easily handle multiple-input forecasting problems. In the second period of our examination we
focus on the Bitcoin price information alone and use data at 10-minute and 10-second time
frames, because we saw an opportunity to evaluate price predictions at various levels of
granularity and noisiness. This yielded results with 50 to 55% accuracy in predicting future
Bitcoin price changes using 10-minute time intervals.
CHAPTER 3
SOFTWARE AND HARDWARE REQUIREMENTS
SOFTWARE REQUIREMENTS
OPERATING SYSTEM
• Windows
SOFTWARE TOOLS
• Jupyter Notebook
• Python
• Anaconda
HARDWARE REQUIREMENTS
• Processor: i3 and above
• RAM: 4 GB
• Hard disk: 500 GB
CHAPTER 4
SYSTEM DEVELOPMENT PROCESS
4.1 MODEL USED
4.1.1 REQUIREMENTS
This is the first phase of the model. This phase defines what needs to be designed, what its
functions are and what its purpose is. The specification of the input, the output and the final
product is studied in this phase.
4.1.3 IMPLEMENTATION
The system is developed as small programs known as units. The input for these programs is
taken from the previous phase. All these units are integrated together at a later stage. Each unit
is developed and tested separately in order to check its function; this type of testing is known
as unit testing.
4.1.6 MAINTENANCE
This is the final phase and it occurs after installation of the product. In this phase,
modifications are made to the system in order to improve its performance.
CHAPTER 5
METHODOLOGY
5.1 IMPLEMENTATION OF THE CODE
Importing libraries:
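The import cell from the original notebook is not reproduced here; a typical set of imports for the pipeline described in this chapter (assuming the pandas, seaborn and scikit-learn stack named elsewhere in this report) would look like:

```python
# Core data handling and plotting libraries used throughout the notebook.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn utilities for splitting, modelling and evaluation.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
```

The exact cell in the original project may differ; this is a representative sketch.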
Data pre-processing:
Data pre-processing, a component of data preparation, describes any type of processing
performed on raw data to prepare it for another data processing procedure. It has traditionally
been an important preliminary step in the data mining process.
1. Information elements are collated across a number of records, typically for the purpose of
making comparisons or identifying patterns.
2. notnull() is a pandas function that examines one or more values to validate that they are not
null.
3. The dropna() method removes the rows that contain NULL values.
4. The fillna() method replaces the NULL values with a specified value.
5. The interpolate() function is used to fill NA values in the DataFrame or Series.
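A small illustration of the four pandas calls listed above, on a toy price column with one missing value (the column name and values are invented for the example):

```python
import numpy as np
import pandas as pd

# Toy price column with one missing value.
df = pd.DataFrame({"close": [100.0, np.nan, 104.0, 106.0]})

print(df["close"].notnull())   # flags which rows are non-null
print(df.dropna())             # drops the NaN row entirely
print(df.fillna(0))            # replaces the NaN with a constant
print(df.interpolate())        # linear fill: the gap becomes 102.0
```

Note these methods return new objects; the original DataFrame is unchanged unless the result is assigned back.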
2. Confusion matrix:
The confusion matrix is a two-dimensional array that compares the predicted and the actual
category labels. For binary classification, these are the True Positive, True Negative, False
Positive and False Negative classification categories.
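For example, with toy up/down labels (invented for the illustration), scikit-learn's confusion_matrix returns the 2×2 array just described:

```python
from sklearn.metrics import confusion_matrix

# Toy binary labels: 1 = price went up, 0 = price went down.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# scikit-learn lays the binary matrix out as:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```

Here the model makes one false positive and one false negative, so the off-diagonal entries are both 1.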
3. Plotting the graph using the seaborn library and finding the correlation.
DATA SPLITTING:
TRAINING AND TESTING DATA:
Data splitting is when data is divided into two or more subsets. Typically, with a two-part
split, one part is used to train the model and the other to evaluate or test it.
1. The training set is the portion of data used to train the model. The model observes and
learns from the training set, optimizing its parameters.
2. The testing set is the portion of data held out for the final model and compared against the
previous sets of data. The testing set acts as an evaluation of the final model and algorithm.
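A minimal sketch of such a split with scikit-learn's train_test_split, on placeholder arrays; passing shuffle=False is an assumption worth noting for time-series data like daily prices, since shuffling would leak future observations into the training set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # placeholder feature rows
y = np.arange(10)                 # placeholder targets

# 80/20 split; shuffle=False keeps chronological order.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

With shuffle=False the test set is always the last 20% of the rows, i.e. the most recent observations.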
5.3 RESULTS
5.3.1 Logistic regression
FINDING ALL MODELS SCORES AND ACCURACY
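The score-and-accuracy comparison across all six classifiers named in this report can be sketched as below. The dataset here is synthetic stand-in data and the hyperparameters are scikit-learn defaults, not necessarily those used in the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic up/down labels derived from five lagged returns, standing
# in for the report's real pre-processed Bitcoin dataset.
rng = np.random.default_rng(0)
X = rng.normal(0, 1, (500, 5))
y = (X.sum(axis=1) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(),
}
# Fit each classifier and report its held-out accuracy.
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    print(name, round(accuracy_score(y_te, clf.predict(X_te)), 3))
```

Because every model shares the same train/test split, the printed accuracies are directly comparable, which is the point of this section.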
CONCLUSION:
The dataset cannot only be classified with the previously mentioned machine learning
algorithms; there are many algorithms and techniques which may perform better than these.
Producing an accurate classifier which performs efficiently for such applications is the main
challenge we face in machine learning. The main algorithms implemented in this system were
the Naïve Bayes algorithm, Decision Tree algorithm, KNN, Random Forest classifier, logistic
regression and the SVM algorithm. Our main aim in this research is to discover the algorithm
which performs fastest, most accurately and most efficiently. Random forest surpassed all the
other algorithms with an accuracy of 11.311053984575%. We thus conclude this project by
saying the Random Forest classification algorithm is best suited for handling this type of
dataset. In the future, the designed system with the machine learning classification algorithms
used here can be applied to other price prediction tasks. The work can be extended or
improved towards the automation of Bitcoin price analysis, including some other machine
learning algorithms.
Future Scope
• To work on a better user interface so that people can access these data easily and effortlessly.
• Implementing an IoT model for smart automatic analysis.
• Implementing more algorithms to find the best method for predicting cryptocurrency prices.