Kritik Bansal - 1802911018 - Report PDF
BACHELOR OF TECHNOLOGY
IN
Submitted by
KRITIK BANSAL
1802911018
PART-A
1. Introduction
2. Project Overview
3. Implementation (with snapshots)
4. Conclusion
5. Daily Log
PART-B
1. Summary of Paper-1
2. Summary of Paper-2
3. Summary of Paper-3
4. Summary of Paper-4
5. Summary of Paper-5
PART-C
1. Certificate of MOOC-1
2. Certificate of MOOC-2
CERTIFICATE
PART-A
Introduction
'Fraud' in credit card transactions is the unauthorized and unwanted use of an account by someone
other than the owner of that account. Preventive measures can be taken to stop this abuse, and the
behavior of such fraudulent practices can be studied to minimize it and to protect against similar
occurrences in the future. In other words, credit card fraud can be defined as a case where a person
uses someone else's credit card for personal reasons while the owner and the card-issuing
authorities are unaware that the card is being used. Fraud detection involves monitoring the
activities of populations of users in order to estimate, perceive, or avoid objectionable behavior,
which consists of fraud, intrusion, and defaulting. This is a highly relevant problem that demands
the attention of communities such as machine learning and data science, where the solution to this
problem can be automated. The problem is particularly challenging from a learning perspective, as
it is characterized by factors such as class imbalance: valid transactions far outnumber fraudulent
ones. Moreover, the statistical properties of transaction patterns often change over the course of
time.
These are not the only challenges in implementing a real-world fraud detection system, however. In
practice, the massive stream of payment requests is quickly scanned by automatic tools that
determine which transactions to authorize. Machine learning algorithms are employed to analyze all
the authorized transactions and report the suspicious ones. These reports are investigated by
professionals who contact the cardholders to confirm whether the transaction was genuine or
fraudulent. The investigators provide feedback to the automated system, which is used to train and
update the algorithm so that fraud-detection performance improves over time.
The objective of this project is to successfully identify the fraudulent transactions among the
roughly 284,000 transactions in the dataset available on Kaggle.
SOFTWARE USED – R
The dataset contains transactions made with credit cards in September 2013 by European
cardholders. It presents transactions that occurred over two days, with 492 frauds out of
284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts
for 0.172% of all transactions.
It contains only numerical input variables, which are the result of a PCA transformation.
Unfortunately, due to confidentiality issues, the original features and more background
information about the data cannot be provided. Features V1, V2, ..., V28 are the principal
components obtained with PCA; the only features that have not been transformed with PCA are
'Time' and 'Amount'. The feature 'Time' contains the seconds elapsed between each transaction
and the first transaction in the dataset. The feature 'Amount' is the transaction amount and
can be used for example-dependent cost-sensitive learning. The feature 'Class' is the response
variable and takes the value 1 in case of fraud and 0 otherwise.
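The schema described above can be sketched in R. The snippet below uses a small synthetic stand-in for the real Kaggle file (the actual data would be read with something like `read.csv("creditcard.csv")`); the column names `Time`, `V1`..`V28`, `Amount`, and `Class` follow the description, but the generated values are purely illustrative.

```r
# Synthetic stand-in mimicking the dataset schema (illustrative only)
set.seed(42)
n <- 1000
creditcard <- data.frame(
  Time   = sort(runif(n, 0, 172800)),      # seconds elapsed over ~2 days
  V1     = rnorm(n),                       # PCA components (V1..V28 in full data)
  V2     = rnorm(n),
  Amount = round(rexp(n, rate = 1/50), 2), # transaction amount
  Class  = rbinom(n, 1, 0.00172)           # ~0.172% positive class, as reported
)

# Class imbalance check, mirroring the figure quoted for the real data
prop_fraud <- mean(creditcard$Class == 1)
print(table(creditcard$Class))
cat(sprintf("Fraud fraction: %.4f%%\n", 100 * prop_fraud))
```

Printing the class table first is a quick sanity check that the imbalance really is as severe as the dataset description claims.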
IMPLEMENTATION
Converting the Class integer into a factor and printing the summary of the
dataset
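A minimal sketch of this step (the report's screenshot is not reproduced; the data frame name `creditcard` and the factor labels are assumptions, with a toy data frame standing in for the real dataset):

```r
# Toy stand-in for the real dataset
creditcard <- data.frame(
  Amount = c(12.5, 3.99, 250.0, 7.25),
  Class  = c(0L, 0L, 1L, 0L)
)

# Convert the integer response into a factor with readable labels,
# so classifiers treat it as a categorical outcome
creditcard$Class <- factor(creditcard$Class,
                           levels = c(0, 1),
                           labels = c("valid", "fraud"))

# Print the dataset summary, as in the report
print(summary(creditcard))
```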
Counting the missing values
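This check can be sketched as follows (toy data with one deliberate `NA`; the real dataset is reported to contain no missing values):

```r
# Toy data frame with a deliberate missing value
creditcard <- data.frame(
  Amount = c(12.5, NA, 250.0),
  Class  = c(0L, 0L, 1L)
)

# Total missing values across the data frame, and per-column counts
total_na <- sum(is.na(creditcard))
per_col  <- colSums(is.na(creditcard))
print(total_na)   # 1
print(per_col)
```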
The code prints out the number of false positives it detected and compares it with the actual
values. This is used to calculate the accuracy and precision of the algorithms. For faster
testing, 20% of the dataset is used; the complete dataset is also used at the end, and both
results are printed. These results, along with the classification report for each algorithm,
are given in the output, where class 0 means the transaction was determined to be valid and
class 1 means it was determined to be fraudulent. The result is matched against the actual
class values to check for false positives. Since the entire dataset consists of only two days'
transaction records, it is only a fraction of the data that could be made available if this
project were used on a commercial scale. Being based on machine learning algorithms, the
program will only improve its performance over time as more data is fed into it.
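The evaluation described above can be sketched as follows. The predicted and actual class vectors are illustrative stand-ins for real model output; the point is how a confusion matrix yields the accuracy and precision figures the report refers to.

```r
# Illustrative actual vs. predicted class labels (0 = valid, 1 = fraud)
actual    <- c(0, 0, 1, 0, 1, 0, 0, 1, 0, 0)
predicted <- c(0, 0, 1, 1, 0, 0, 0, 1, 0, 0)

cm <- table(Actual = actual, Predicted = predicted)
print(cm)

tp <- cm["1", "1"]  # true positives  (fraud caught)
fp <- cm["0", "1"]  # false positives (valid flagged as fraud)
fn <- cm["1", "0"]  # false negatives (fraud missed)
tn <- cm["0", "0"]  # true negatives

accuracy  <- (tp + tn) / sum(cm)
precision <- tp / (tp + fp)
cat(sprintf("Accuracy: %.2f, Precision: %.2f\n", accuracy, precision))
```

With heavy class imbalance, precision (and recall) are far more informative than raw accuracy, since predicting "valid" for everything would already score above 99% accuracy on this dataset.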
DAILY LOG
Name of Student Kritik Bansal
20  Multiple linear regression with ANOVA test and predictions on this dataset.  08.07.2020
21  One-way ANOVA, two-way ANOVA, ANOVA with replicates and ANOVA without replicates.  09.07.2020
22  Extreme Gradient Boosting algorithm.  10.07.2020
29  ARIMA model in time series and plotting ACF and PACF plots.  27.07.2020
30  Capturing the annualized market volatility using Apple stock price data.  28.07.2020
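The last log entry, annualized volatility from stock prices, can be sketched as below. Synthetic daily prices stand in for the Apple data mentioned in the log; the standard approach scales the standard deviation of daily log returns by the square root of the number of trading days in a year (252 is the usual convention).

```r
# Synthetic daily closing prices standing in for the Apple stock data
set.seed(1)
prices <- 100 * cumprod(1 + rnorm(252, mean = 0.0005, sd = 0.02))

# Daily log returns, then annualize: sd(daily) * sqrt(252 trading days)
log_ret <- diff(log(prices))
ann_vol <- sd(log_ret) * sqrt(252)
cat(sprintf("Annualized volatility: %.1f%%\n", 100 * ann_vol))
```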
PART-B
Research Paper-1
Summary-
Research Paper-2
Summary-
Research Paper -3
Title- Performance of Some Factor Analysis Techniques
Summary-
By way of orientation, the first section of this paper will present a brief
conceptual review of factor analysis. In the second section the scientific context
of the method will be discussed. The major uses of factor analysis will be listed
and its relation to induction and deduction, description and inference, causation
and explanation, and classification and theory will be considered. To aid
understanding, the third section will outline the geometrical and algebraic factor
models, and the fourth section will define the factor matrices and their
elements, the vehicles for presenting factor results. Since comprehending factor
rotation is important for interpreting the findings, the fifth and final section is
devoted to clarifying its significance.
Research Paper -4
Title- Dependent Ranked Set Sampling Designs for Parametric Estimation
with Applications
Summary-
We derive the likelihood function of neoteric ranked set sampling (NRSS), a
dependent sampling method, and of double neoteric ranked set sampling
(DNRSS), which combines an independent sampling method in the first stage
with a dependent sampling method in the second stage, and compare them for
the estimation of the parameters of the inverse Weibull (IW) distribution. An
intensive simulation was carried out to compare the one-stage and two-stage
designs. The results showed that likelihood estimation based on ranked set
sampling (RSS), an independent sampling method, and on the NRSS and DNRSS
designs provides more efficient estimators than the usual simple random
sampling design. Moreover, DNRSS is slightly more efficient than the NRSS and
RSS designs for estimating the IW distribution parameters. Ranked set
sampling (RSS) is an advanced data collection method, used when the exact
measurement of an observation is difficult and/or expensive, in a number of
research areas, e.g., environment, bioinformatics, and ecology. In this method,
random sets are drawn from a population and the units in each set are ranked
with a ranking mechanism based on a visual inspection or a concomitant
variable. Because of the importance of working with a good design and easy
analysis, there is a need for a software tool that provides sampling designs
and statistical inferences based on RSS and its modifications.
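The basic RSS mechanism summarized above can be illustrated with a short sketch: draw m random sets of size m, rank each set, and measure only the i-th ranked unit from the i-th set. Here ranking is done on the actual values, standing in for the visual inspection or concomitant variable the paper describes; the population is a hypothetical exponential one.

```r
# Illustrative one-cycle ranked set sample of size m from a population
set.seed(7)
rss_sample <- function(population, m) {
  sapply(1:m, function(i) {
    s <- sample(population, m)  # draw a random set of size m
    sort(s)[i]                  # measure only the i-th order statistic
  })
}

pop   <- rexp(10000, rate = 1/5)  # hypothetical population
x_rss <- rss_sample(pop, m = 5)   # 5 measured units, one per set
print(x_rss)
```

Because each measured unit is a different order statistic, the RSS sample spreads across the distribution more evenly than a simple random sample of the same size, which is the source of the efficiency gains the paper reports.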
Research Paper-5
Summary-