
71762108005 21AD46

COIMBATORE INSTITUTE OF TECHNOLOGY


COIMBATORE – 641014

DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE


FOUNDATIONS IN DATA SCIENCE
TUTORIAL – 2

SUBMITTED BY

T. BALA SAATVIK

71762108005


DESCRIPTION:

The Australian Credit Dataset consists of 20 variables that describe the demographic and
socio-economic characteristics of 1000 loan applicants and one outcome variable that
indicates whether the applicants are a “good credit risk” (i.e. likely to repay the loan) or a
“bad credit risk” (i.e. unlikely to repay the loan). A predictive model, developed based on
this dataset, is expected to provide guidance for a bank manager to decide whether to
approve a loan based on the profile of a loan applicant.

PROBLEM STATEMENT:

In Phases 2 and 3 of the Data Analytics Lifecycle, the data science team assesses the
quality of the dataset, learns the relationships between variables, and subsequently selects
key variables and the most suitable model based on the goal of the project. Use the
following steps to prepare the dataset for model building.

1. Perform an exploratory analysis on the AUS_CREDIT dataset. Show your analysis in
the spreadsheet named “EXPLORATORY”.

2. Discuss the relationship between the predictor variables and the outcome variable. Are
there any surprises in the data?

3. Select the key variables based on your analysis in Questions 1 and 2 for the prediction
of the creditability of loan applicants.

4. Identify an analytical model (such as regression, classification, clustering, etc.) suitable
for solving the creditability prediction problem. Defend your answer with an explanation.

SOLUTION:

Exploratory Data Analysis

i) Load the Data:
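The loading step can be sketched with pandas as follows. This is a minimal sketch, not the code used in the report: the file name "AUS_CREDIT.csv", the column names, and the three sample rows are assumptions made for illustration only. If the data is kept in a spreadsheet, `pd.read_excel` would take the place of `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: three made-up rows so the
# sketch runs on its own. The column names are assumptions.
SAMPLE_CSV = io.StringIO(
    "Age,Credit_Amount,Duration,Creditability\n"
    "35,2500,12,1\n"
    "22,6000,36,0\n"
    "48,1200,6,1\n"
)

# With the real data this would be something like
# pd.read_csv("AUS_CREDIT.csv"), where the file name is assumed.
df = pd.read_csv(SAMPLE_CSV)

print(df.head())  # first rows, to confirm the columns parsed as expected
print(df.shape)   # (number of rows, number of columns)
```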


OUTPUT:

ii) Basic description about data:
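A basic description of a DataFrame is usually obtained with `info()` and `describe()`. The sketch below assumes a small made-up DataFrame with hypothetical column names; it is not the report's data or code.

```python
import pandas as pd

# Toy data for illustration only; column names are assumptions.
df = pd.DataFrame({
    "Age": [35, 22, 48, 27],
    "Credit_Amount": [2500, 6000, 1200, 4000],
    "Purpose": ["car", "education", "furniture", "car"],
    "Creditability": [1, 0, 1, 0],
})

# Column names, dtypes and non-null counts.
df.info()

# Count, mean, std, min, quartiles and max for the numeric columns.
# By default the text column "Purpose" is left out.
summary = df.describe()
print(summary)
```

Passing `include="all"` to `describe()` would also summarize the categorical columns (count, number of unique values, most frequent value).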

OUTPUT:


iii) Find number of duplicate values:
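Counting duplicates in pandas is typically done with `duplicated()`, which flags every row that repeats an earlier row. The sketch below uses a made-up DataFrame with one deliberate duplicate to show what a non-zero count looks like; the column names are assumptions.

```python
import pandas as pd

# Toy data for illustration only: the last row repeats the second one.
df = pd.DataFrame({
    "Age": [35, 22, 48, 22],
    "Credit_Amount": [2500, 6000, 1200, 6000],
})

# duplicated() marks rows that are exact copies of an earlier row.
n_duplicates = int(df.duplicated().sum())
print(n_duplicates)

# If duplicates were present they could be dropped like this.
deduplicated = df.drop_duplicates()
```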

OUTPUT: 0. This means that the dataset contains no duplicate rows.

iv) Find number of unique values:
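The number of distinct values per column can be obtained with `nunique()`. This is a minimal sketch on made-up data with hypothetical column names, not the report's own code.

```python
import pandas as pd

# Toy data for illustration only; column names are assumptions.
df = pd.DataFrame({
    "Age": [35, 22, 48, 22],
    "Purpose": ["car", "car", "education", "furniture"],
    "Creditability": [1, 0, 1, 0],
})

# One count of distinct values per column.
unique_counts = df.nunique()
print(unique_counts)
```

Columns with only a handful of distinct values are usually categorical, while columns with many distinct values are usually continuous or identifiers.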

OUTPUT:


v) Check for missing values:
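A common way to check for missing values is to count the nulls in each column. The sketch below assumes a small made-up DataFrame with hypothetical column names.

```python
import pandas as pd

# Toy data for illustration only; column names are assumptions.
df = pd.DataFrame({
    "Age": [35, 22, 48, 27],
    "Credit_Amount": [2500, 6000, 1200, 4000],
    "Creditability": [1, 0, 1, 0],
})

# isnull() gives True for every missing cell; sum() counts them per column.
missing_per_column = df.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```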

OUTPUT:

There are no missing values in the dataset, so no imputation or removal of incomplete records is required.

vi) Visualizing the count of unique values of Age attribute:
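One way to draw such a chart is to count the occurrences of each age and plot them as bars. This is a sketch under stated assumptions: the column name "Age" and the six sample values are made up, and matplotlib is assumed to be the plotting library.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({"Age": [22, 35, 22, 48, 35, 22]})

# Number of applicants for each distinct age, ordered by age.
age_counts = df["Age"].value_counts().sort_index()

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(age_counts.index.astype(str), age_counts.values)
ax.set_xlabel("Age")
ax.set_ylabel("Number of applicants")
ax.set_title("Count of applicants per age")
fig.tight_layout()
# In an interactive session, plt.show() displays the figure.
```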


OUTPUT:

vii) EDA Univariate Analysis:
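Univariate analysis looks at one variable at a time. The sketch below shows two typical checks, the class balance of the outcome and the skewness of a numeric predictor. The data and column names are made up, and these checks are an assumption about what such an analysis could contain, not a reproduction of the report's plots.

```python
import pandas as pd

# Toy data for illustration only; column names are assumptions.
df = pd.DataFrame({
    "Credit_Amount": [2500, 6000, 1200, 3100, 900],
    "Creditability": [1, 0, 1, 1, 0],
})

# Share of each outcome class (1 = good credit risk, 0 = bad credit risk).
class_share = df["Creditability"].value_counts(normalize=True)

# Positive skew means a long tail of large credit amounts.
amount_skew = df["Credit_Amount"].skew()

print(class_share)
print(amount_skew)
```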

OUTPUT:


viii) Statistical Analysis:
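One simple statistical check is to compare the mean of each numeric predictor between the two outcome classes. The sketch below is an assumption about what this step could look like; the data and column names are made up.

```python
import pandas as pd

# Toy data for illustration only; column names are assumptions.
df = pd.DataFrame({
    "Age": [35, 22, 48, 27],
    "Credit_Amount": [2500, 6000, 1200, 4000],
    "Creditability": [1, 0, 1, 0],
})

# Mean of each numeric predictor within each outcome class.
group_means = df.groupby("Creditability")[["Age", "Credit_Amount"]].mean()
print(group_means)
```

A large gap between the class means of a predictor suggests it may help separate good from bad credit risks, although it does not prove a causal link.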

OUTPUT:


ix) EDA Multivariate Analysis:
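Multivariate analysis often starts from a correlation matrix of the numeric variables, including the outcome. This is a minimal sketch on made-up data with hypothetical column names; the report's own figure may have been produced differently.

```python
import pandas as pd

# Toy data for illustration only; column names are assumptions.
df = pd.DataFrame({
    "Age": [35, 22, 48, 27],
    "Duration": [12, 36, 6, 24],
    "Creditability": [1, 0, 1, 0],
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr()
print(corr)

# Correlation of each predictor with the outcome, strongest first.
with_outcome = corr["Creditability"].drop("Creditability")
print(with_outcome.sort_values(key=abs, ascending=False))
```

Categorical predictors would need to be encoded numerically, or examined with cross-tabulations, before they can appear in such a matrix.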

OUTPUT:

Relationship between the Predictor Variables and the Outcome Variable:

• Applicants with a good Account Status are more likely to be approved for a loan.

• Applicants with a Savings Account are more likely to be approved for a loan.

• Applicants who are employed full-time are more likely to be approved for a loan.

• Applicants who own real estate are more likely to be approved for a loan.


• Applicants with a proper Credit History are more likely to be approved for a loan.

• Younger applicants are less likely to be approved for a loan.

Surprises in Data:

• One notable surprise was that the “payment status of previous credit” variable, which
indicates how well the applicant has repaid previous credit, was not as strongly
correlated with the outcome variable as expected. This suggests that other factors,
such as current income and employment status, may be more important in determining
creditworthiness than past payment history.

• Another surprise was that the age variable had a relatively weak correlation with the
outcome variable. This is surprising because age is often considered an important
factor in determining creditworthiness, as older applicants are generally considered
more financially stable and reliable.

• Finally, the “purpose of credit” variable had a relatively high number of unique values,
indicating that loan applicants have a wide range of reasons for seeking credit. This
suggests that the bank may need to consider a variety of factors when evaluating
loan applications, rather than relying solely on traditional creditworthiness
indicators.

Key Variables:

Based on the exploratory analysis, we can select the following key variables for the
prediction of the creditability of loan applicants:
1. Credit Amount
2. Credit Duration
3. Employment Length
4. Age
5. Credit History
6. Credit Purpose
7. Savings Account
8. Installment Rate

These variables were chosen based on their association with the outcome variable observed
in the exploratory analysis and their relevance in determining creditworthiness.


Analytical Model:

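Because the outcome variable is binary (good or bad credit risk), this is a classification
problem, and a decision tree classifier is chosen as the analytical model.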
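A decision tree classifier can be trained and scored with scikit-learn roughly as follows. This is a sketch under stated assumptions, not the code that produced the reported accuracy: the twelve rows, the column names, the 75/25 split, and the `max_depth=3` setting are all made up for illustration.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data for illustration only; column names are assumptions.
df = pd.DataFrame({
    "Age": [22, 25, 47, 35, 52, 23, 41, 30, 28, 60, 33, 26],
    "Credit_Amount": [6000, 7200, 1500, 2500, 1800, 8000,
                      2200, 3000, 6500, 1200, 2700, 7000],
    "Duration": [36, 48, 12, 18, 6, 60, 12, 24, 42, 6, 18, 48],
    "Creditability": [0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0],
})

X = df[["Age", "Credit_Amount", "Duration"]]  # predictors
y = df["Creditability"]                       # outcome (1 = good, 0 = bad)

# Hold out 25% of the rows for testing, keeping the class ratio similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# A shallow tree stays easy to read and is less prone to overfitting.
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy: {accuracy:.3f}")
```

Categorical predictors such as credit purpose or savings account status would need to be encoded as numbers (for example with `pd.get_dummies`) before being passed to the classifier.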

OUTPUT: Accuracy: 0.684

The decision tree model is a suitable choice for the AUS_CREDIT dataset for several
reasons:

1. Interpretability: Decision trees are easy to interpret and understand, which makes
them suitable for explaining the factors that contribute to the decision-making
process.

2. Non-parametric: Decision trees are a non-parametric method, which means that
they do not require any assumptions about the underlying data distribution.

3. Handling of non-linear relationships: Decision trees can handle non-linear
relationships between the predictor variables and the outcome variable, which can
be present in real-world datasets.

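4. Handling of missing values: Some decision tree implementations can handle missing
values without imputing them, which can save time and effort in the data preparation
process. This dataset has no missing values, so the benefit applies mainly to future data.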

5. Flexibility: Decision trees can be used for both classification and regression
problems, making them a versatile model that can be applied to various types of
datasets.

Overall, decision trees provide a simple yet powerful way to model complex relationships
between variables and are a popular choice for predictive modeling in data science.
