
M818A: Machine Learning and Cyber Security-A

Tutor-Marked Assignment, Fall 2021-2022

Cut-Off Date: 28-Nov-21 Total Marks: 20

Plagiarism Warning:
As per AOU rules and regulations, all students are required to submit their own TMA work and
avoid plagiarism. The AOU has implemented sophisticated techniques for plagiarism detection.
You must provide all references in case you use and quote another person's work in your TMA.
You will be penalized for any act of plagiarism as per the AOU's rules and regulations.

Declaration of No Plagiarism by Student (to be signed and submitted by student with TMA
work):

I hereby declare that this submitted TMA work is a result of my own efforts and I have not
plagiarized any other person's work. I have provided all references of information that I have
used and quoted in my TMA work.

Name of Student:………………………………

Signature:………………………………………

Part I: Decision tree and association rule mining
A bank is interested in knowing which customers are engaged in credit card fraud activity. To
address this goal, they decided to use machine learning techniques. They collected data about
customers and stored it in a CSV file. The data is available in the attached file
(fraud_ds.csv, size = 340MB).
You are required to do the following tasks:

A. Pre-processing (2 points)
1. Perform any necessary preprocessing steps. Identify the attributes that need to be
numerical and report any analysis that allows you to choose the right categorical-to-
numerical conversion. (2 pts)
First we need to import all the required libraries for the preprocessing phase, along with the
dataset needed for the task:
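A minimal sketch of this step, assuming pandas and numpy are the libraries used and that fraud_ds.csv sits in the working directory:

# Import the libraries used throughout the preprocessing phase
import pandas as pd
import numpy as np

# Load the dataset into a DataFrame
df = pd.read_csv('fraud_ds.csv')

# Quick look at the shape and the first few rows
print(df.shape)
print(df.head())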

Now we are going to drop all the unnecessary columns from our dataset, since these columns
are not relevant to our goal of fraud detection:
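A minimal sketch of the drop, assuming hypothetical identifier columns such as 'first', 'last', 'street' and 'trans_num' are the ones judged irrelevant; the list should be replaced with the columns actually dropped:

# Columns that carry no predictive value for fraud detection (illustrative names)
cols_to_drop = ['first', 'last', 'street', 'trans_num']

# Drop them, ignoring any name that is not present in the file
df = df.drop(columns=cols_to_drop, errors='ignore')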

“DOB” preprocessing: we converted the date of birth into age in years.
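A minimal sketch of the conversion, assuming the column is named 'dob' and that a fixed reference date is acceptable for computing an approximate age in years:

# Parse the date of birth and derive an approximate age in years
df['dob'] = pd.to_datetime(df['dob'])
df['age'] = (pd.Timestamp('2021-01-01') - df['dob']).dt.days // 365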

Also we will create 3 columns from the “trans_date_trans_time” column. These columns are:
Day, Month, and Year.
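A minimal sketch, assuming the “trans_date_trans_time” column can be parsed directly by pandas:

# Parse the transaction timestamp once
df['trans_date_trans_time'] = pd.to_datetime(df['trans_date_trans_time'])

# Derive the three new columns
df['day'] = df['trans_date_trans_time'].dt.day
df['month'] = df['trans_date_trans_time'].dt.month
df['year'] = df['trans_date_trans_time'].dt.year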

Now I will bin the age (derived from dob) into 3 bins (YOUNG - ADULT - OLD) as follows.
I have chosen the minimum age to be 17 and the maximum to be 97 after running the code:

The 3 bins are split as such:

17 to 30 Young / 30 to 65 Adult / 65 to 97 Old
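A minimal sketch of the binning, using pandas' cut function with the edges quoted above:

# Check the range of the age column (reported as 17 and 97)
print(df['age'].min(), df['age'].max())

# Bin age into the three groups
df['age_group'] = pd.cut(df['age'],
                         bins=[17, 30, 65, 97],
                         labels=['YOUNG', 'ADULT', 'OLD'],
                         include_lowest=True)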


Now we will preprocess the City Population and split it into bins as follows:

I have decided to divide the population metric into 2 bins, above and below average.
We find the average by using the mean() function.
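A minimal sketch, assuming the column is named 'city_pop':

# Average city population used as the splitting point
avg_pop = df['city_pop'].mean()

# Two bins: below average and above average
df['pop_group'] = np.where(df['city_pop'] <= avg_pop, 'Below_Avg', 'Above_Avg')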

Amt preprocessing

#In order to achieve high accuracy I have divided amt into 4 bins
# Low_Amt 1 to 20
# MediumLow_Amt 20 to 210
# MediumHigh_Amt 210 to 500
# High_Amt 500 to 4248
The min and max numbers were chosen by running the min() and max() functions as shown below,
and the splitting point was chosen by using the mean() function.
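A minimal sketch using the edges quoted above, assuming the column is named 'amt':

# Check the range of the amount column
print(df['amt'].min(), df['amt'].max())

# Bin the amount into the four groups
df['amt_group'] = pd.cut(df['amt'],
                         bins=[1, 20, 210, 500, 4248],
                         labels=['Low_Amt', 'MediumLow_Amt', 'MediumHigh_Amt', 'High_Amt'],
                         include_lowest=True)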

Preprocessing of the “trans_date_trans_time” metric: we convert the “08/03/2020 6:18”
format to 24-hour format.

# trans_time will be split into 5 bins:
# After Midnight 0 to 5am
# Morning 5am to 11am
# Noon 11am to 1pm
# Afternoon 1pm to 6pm
# Night 6pm to 11pm

The bins will be divided as follows:
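A minimal sketch of the time binning, assuming the hour is taken from the parsed timestamp; the last edge is extended to 24 so that transactions between 11pm and midnight also fall into the Night bin (an assumption, since the stated ranges stop at 11pm):

# Extract the hour in 24-hour format
df['trans_hour'] = df['trans_date_trans_time'].dt.hour

# Bin the hour into the five periods listed above
df['time_group'] = pd.cut(df['trans_hour'],
                          bins=[0, 5, 11, 13, 18, 24],
                          labels=['After_Midnight', 'Morning', 'Noon', 'Afternoon', 'Night'],
                          right=False)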

Finally we will check if there are any Null values by running the following code:
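A minimal sketch of the check:

# Count missing values per column
print(df.isnull().sum())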

B. Machine Learning (18 points)

In this part, you will build a classification model using decision trees. You will use the
preprocessed dataset from part A, question 1 in order to build a classification model. The
model will then be applied to a new set of prospects whose cards the bank managers may
want to cancel because of suspected fraud.

Your tasks in this problem are the following:

1. Load the data preprocessed in part I, question 1. Preprocess again by encoding
categorical attributes into numeric. (1 pt)

For this question we will convert the categorical attributes to numerical values by using a
LabelEncoder for all the labels, and for the class we will use a LabelBinarizer.
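A minimal sketch, assuming the preprocessed DataFrame df from part A is still in memory and that the class column is named 'is_fraud' (an assumption):

from sklearn.preprocessing import LabelEncoder, LabelBinarizer

# Encode every categorical feature column with its own LabelEncoder
for col in df.select_dtypes(include=['object', 'category']).columns:
    if col != 'is_fraud':
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))

# Binarize the class label
lb = LabelBinarizer()
df['is_fraud'] = lb.fit_transform(df['is_fraud']).ravel()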

2. Split the transformed data into training and test sets (using 80%-20% randomized split).
(1 pt)
By using a randomized 80%-20% split, we will split the data into training and test sets
as follows:
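A minimal sketch, again assuming the class column is named 'is_fraud' and that a fixed random_state is used only for reproducibility:

from sklearn.model_selection import train_test_split

X = df.drop(columns=['is_fraud'])
y = df['is_fraud']

# Randomized 80%-20% split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)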

3. Use scikit-learn to build a decision tree (using default parameters). Compare the
performance scores on the test and the training data sets. (1 pt) What does the comparison
tell you? (1 pt)

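A minimal sketch of this step, assuming the split from question 2 and scikit-learn's default DecisionTreeClassifier (the exact numbers shown in the original screenshots may differ):

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Decision tree with default parameters
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

# Compare performance on training and test data
print('Train accuracy:', accuracy_score(y_train, dt.predict(X_train)))
print('Test accuracy :', accuracy_score(y_test, dt.predict(X_test)))
print(confusion_matrix(y_test, dt.predict(X_test)))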
As we can see, the accuracy is 1 on the training set and 0.999 on the testing set, and looking at the
confusion matrix we can see that we have 1 false positive and 1 false negative. We conclude that
there is no overfitting nor underfitting. This means we have very good results for both the training
and the testing sets.

4. In the following, you will use 10-fold cross validation to determine the best depth of the
decision tree (the depth of the tree can be specified as an input parameter; see documentation
at: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

First we need to import the library
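A minimal sketch of the import, assuming scikit-learn's cross_val_score is the helper used:

from sklearn.model_selection import cross_val_score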

To this end, you will perform cross validation on the training dataset created in part (2)
and leave the test data set apart for final evaluation.

4.1. Specify the number of depths you want to test. You may want to look at the
tree created in question 3 in order to make a choice. (1pt)

Looking at the tree from question 3, the range of depths I chose is 1 to 10.
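A minimal sketch of that choice:

# Candidate depths to evaluate with cross validation
depths = list(range(1, 11))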

4.2. Iterate over the different depths and for each, perform 10-fold cross validation
with a decision tree having the corresponding depth (1pt)

4.3. Collect the mean accuracy score for each iteration (1pt)

4.4. Plot the obtained accuracy scores for different iterations and explain what
should be the optimal depth of the tree. (Optional as bonus) (1pt)

4.5. Build a decision tree with the depth selected above using the training
dataset. (1pt)
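A minimal sketch covering sub-questions 4.2 to 4.5, assuming the depths list defined above, the training split from question 2, and the imports from the earlier steps; matplotlib is used for the optional plot:

import matplotlib.pyplot as plt

mean_scores = []
for d in depths:
    # 10-fold cross validation for a tree of the current depth (4.2)
    tree_d = DecisionTreeClassifier(max_depth=d)
    scores = cross_val_score(tree_d, X_train, y_train, cv=10)
    # Collect the mean accuracy for this depth (4.3)
    mean_scores.append(scores.mean())

# Plot accuracy against depth and read off the best value (4.4)
plt.plot(depths, mean_scores, marker='o')
plt.xlabel('Tree depth')
plt.ylabel('Mean 10-fold CV accuracy')
plt.show()

# Build the final tree with the depth that maximised the CV accuracy (4.5)
best_depth = depths[int(np.argmax(mean_scores))]
final_tree = DecisionTreeClassifier(max_depth=best_depth)
final_tree.fit(X_train, y_train)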

5. For each row in the test set, what is the probability of a user being classified as fake
(fake = yes)? Hint: use predict_proba. Check documentation at:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html (3pts)
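A minimal sketch, assuming the tree built in question 4.5 and that the positive (fraud) class corresponds to the second column returned by predict_proba:

# Probability of each test row belonging to the positive (fraud) class
proba_fake = final_tree.predict_proba(X_test)[:, 1]
print(proba_fake)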

6. Convert the decision tree into an equivalent set of rules. (2pts)
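A minimal sketch, using scikit-learn's export_text helper to print the tree as a set of nested if/else rules:

from sklearn.tree import export_text

# Text representation of the decision rules learned by the tree
rules = export_text(final_tree, feature_names=list(X_train.columns))
print(rules)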

Submission
Attach your code with the explanation (.docx format) for each question as follows:
Code: Fname_Lname_TMA_code.ipynb
Explanation : Fname_Lname_exp.docx

End of Questions
