DS Report 2

71762108005 21AD46
COIMBATORE INSTITUTE OF TECHNOLOGY

COIMBATORE – 641014
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

FOUNDATIONS IN DATA SCIENCE
TUTORIAL – 2
SUBMITTED BY
T. BALA SAATVIK
71762108005
DESCRIPTION:
The Australian Credit Dataset consists of 20 variables that describe the demographic and
socio-economic characteristics of 1000 loan applicants and one outcome variable that
indicates whether the applicants are a “good credit risk” (i.e. likely to repay the loan) or a
“bad credit risk” (i.e. unlikely to repay the loan). A predictive model, developed based on
this dataset, is expected to provide guidance for a bank manager to decide whether to
approve a loan based on the profile of a loan applicant.
PROBLEM STATEMENT:
In Phases 2 and 3 of the Data Analytics Lifecycle, the data science team assesses the
quality of the dataset, learns the relationships between variables, and subsequently selects
key variables and the most suitable model based on the goal of the project. Use the
following steps to prepare the dataset for model building.
1. Perform an exploratory analysis on the AUS_CREDIT dataset. Show your analysis in
the spreadsheet named “EXPLORATORY”.
2. Discuss the relationship between the predictor variables and the outcome variable. Are
there any surprises in the data?
3. Select the key variables based on your analysis in Questions 1 and 2 for the prediction
of the creditability of loan applicants.
4. Identify an analytical model (such as regression, classification, clustering, etc.) suitable
for solving the creditability prediction problem. Defend your answer with an explanation.
SOLUTION:
Exploratory Data Analysis
i) Load the Data:
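The code for this step is not preserved in the report. A minimal sketch of loading a CSV with pandas follows; the in-memory text, the column names, and the helper load_credit_data are hypothetical stand-ins, since the real file name and schema of AUS_CREDIT are not given here.

```python
import io

import pandas as pd

# Hypothetical stand-in for the AUS_CREDIT file. The column names and
# values below are invented for illustration only.
CSV_TEXT = """account_status,duration_months,credit_amount,age,creditability
no_account,6,1000,60,1
low_balance,48,6000,22,0
no_account,12,2000,49,1
negative,42,8000,45,1
"""


def load_credit_data(source):
    """Read a CSV (file path or file-like object) into a DataFrame."""
    return pd.read_csv(source)


df = load_credit_data(io.StringIO(CSV_TEXT))
print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first few records
```

With the real dataset, a file path would be passed in place of the StringIO object.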
OUTPUT:
ii) Basic description about data:
OUTPUT:
iii) Find number of duplicate values:
OUTPUT: 0. This means that no duplicate rows are present in the dataset.
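A duplicate check of this kind can be sketched with pandas as follows. The toy frame and its column names are hypothetical; unlike the report's dataset, it deliberately contains one repeated row so that the count is nonzero.

```python
import pandas as pd

# Toy rows with hypothetical column names; the third row repeats the first.
df = pd.DataFrame({
    "duration_months": [12, 24, 12, 36],
    "credit_amount": [1000, 2500, 1000, 4000],
    "creditability": [1, 0, 1, 1],
})

# duplicated() flags every row that is identical to an earlier row.
n_duplicates = int(df.duplicated().sum())
print(n_duplicates)

# drop_duplicates() keeps the first occurrence of each repeated row.
deduplicated = df.drop_duplicates()
print(len(deduplicated))
```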
iv) Find number of unique values:
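One way to count distinct values per column, assuming the data is held in a pandas DataFrame, is nunique(). The frame below is a hypothetical toy example, not AUS_CREDIT data.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "purpose": ["car", "furniture", "car", "education"],
    "age": [25, 40, 25, 31],
    "creditability": [1, 0, 1, 1],
})

# Number of distinct non-missing values in each column.
unique_counts = df.nunique()
print(unique_counts)
```

Columns with only a handful of distinct values are usually categorical, which matters later when choosing how to summarize or encode them.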
OUTPUT:
v) Check for missing values:
OUTPUT:
There are no missing values in the dataset, so no imputation or removal of incomplete
records is required.
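The missing-value check can be sketched as below, assuming pandas. The toy frame is hypothetical and deliberately contains gaps, unlike the report's dataset, to show what a nonzero result looks like.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two deliberate gaps.
df = pd.DataFrame({
    "age": [25, np.nan, 31],
    "savings": ["low", "high", None],
    "creditability": [1, 0, 1],
})

# Count of missing entries in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# True if any column has at least one missing entry.
has_missing = bool(missing_per_column.any())
print(has_missing)
```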
vi) Visualizing the count of unique values of Age attribute:
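The counts behind such a chart can be computed as in this sketch, which assumes pandas and uses invented ages rather than values from AUS_CREDIT.

```python
import pandas as pd

# Invented ages for illustration only.
ages = pd.Series([23, 35, 23, 41, 35, 23], name="age")

# Frequency of each distinct age, ordered by age rather than by frequency.
age_counts = ages.value_counts().sort_index()
print(age_counts)
```

The resulting series maps each age to its frequency and could be passed to a plotting library to draw a bar chart.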
OUTPUT:
vii) EDA Univariate Analysis:
OUTPUT:
viii) Statistical Analysis:
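The report does not show which statistics were computed. A common starting point, assuming pandas, is describe(), sketched here on hypothetical toy columns.

```python
import pandas as pd

# Hypothetical numeric columns; values are invented.
df = pd.DataFrame({
    "duration_months": [6, 12, 24, 48],
    "credit_amount": [1000, 2000, 3000, 6000],
})

# Count, mean, standard deviation, min, quartiles, and max per numeric column.
summary = df.describe()
print(summary)
```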
OUTPUT:
ix) EDA Multivariate Analysis:
OUTPUT:
Relationship between the Predictor Variables and the Outcome Variable:
• Applicants with a good Account Status are more likely to be classified as a good credit risk.
• Applicants with a Savings Account are more likely to be classified as a good credit risk.
• Applicants who are employed full-time are more likely to be classified as a good credit risk.
• Applicants who own Real Estate are more likely to be classified as a good credit risk.
• Applicants with a sound Credit History are more likely to be classified as a good credit risk.
• Younger applicants are less likely to be classified as a good credit risk.
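Statements like those in the list above can be checked by comparing the share of good-risk applicants across the categories of a predictor. The sketch below assumes pandas and a 0/1 outcome column; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: 1 = good credit risk, 0 = bad credit risk.
df = pd.DataFrame({
    "savings": ["none", "none", "high", "high", "none", "high"],
    "creditability": [0, 1, 1, 1, 0, 1],
})

# Share of good-risk applicants within each savings category.
good_rate = df.groupby("savings")["creditability"].mean()
print(good_rate)

# Same information as a row-normalized contingency table.
table = pd.crosstab(df["savings"], df["creditability"], normalize="index")
print(table)
```

A large gap between categories suggests the predictor is associated with the outcome; on real data the gap should be judged against the number of applicants in each category.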
Surprises in Data:
• There were a few surprises in the data. One notable surprise was that the variable
for payment status of previous credit, which indicates how well the applicant has
repaid earlier credit, was not as strongly correlated with the outcome variable as
expected. This suggests that other factors, such as current income and employment
status, may be more important in determining creditworthiness than past payment
history.
• Another surprise was that the age variable had a relatively weak correlation with the
outcome variable. This is surprising because age is often considered an important
factor in determining creditworthiness, as older applicants are generally considered
more financially stable and reliable.
• Finally, the purpose of credit variable had a relatively high number of unique values,
indicating that loan applicants have a wide range of reasons for seeking credit. This
suggests that the bank may need to consider a variety of factors when evaluating
loan applications, rather than relying solely on traditional creditworthiness
indicators.
Key Variables:
Based on the exploratory analysis, we can select the following key variables for the
prediction of the creditability of loan applicants:
1. Credit amount
2. Credit Duration
3. Employment Length
4. Age
5. Credit History
6. Credit Purpose
7. Savings Account
8. Installment Rate
These variables were chosen for their association with the outcome variable in the
exploratory analysis and for their relevance in determining creditworthiness.
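One simple way to rank numeric predictors by their association with a 0/1 outcome is the absolute Pearson correlation, sketched below with pandas. The data and column names are hypothetical, and the report does not state which measure was actually used.

```python
import pandas as pd

# Hypothetical toy data: 1 = good credit risk, 0 = bad credit risk.
df = pd.DataFrame({
    "duration_months": [6, 12, 24, 36, 48, 60],
    "age": [30, 45, 28, 52, 33, 41],
    "creditability": [1, 1, 1, 0, 0, 0],
})

# Absolute correlation of each predictor with the outcome, strongest first.
correlations = (
    df.drop(columns="creditability")
    .corrwith(df["creditability"])
    .abs()
    .sort_values(ascending=False)
)
print(correlations)
```

Pearson correlation only applies to numeric columns. Categorical predictors such as Credit Purpose would need a different measure, for example a contingency table of category against outcome.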
Analytical Model:
Because the outcome variable is a binary label (good or bad credit risk), this is a
classification problem. A Decision Tree was chosen as the analytical model.
OUTPUT: Accuracy: 0.684
The decision tree model is a suitable choice for the AUS_CREDIT dataset for several
reasons:
1. Interpretability: Decision trees are easy to interpret and understand, which makes
them suitable for explaining the factors that contribute to the decision-making
process.
2. Non-parametric: Decision trees are a non-parametric method, which means that
they do not require any assumptions about the underlying data distribution.
3. Handling of non-linear relationships: Decision trees can handle non-linear
relationships between the predictor variables and the outcome variable, which can
be present in real-world datasets.
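4. Handling of missing values: Some decision tree algorithms (for example, C4.5, or
CART with surrogate splits) can handle missing values without imputing them, which
can save time and effort in data preparation. This dataset has no missing values,
so the point is secondary here.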
5. Flexibility: Decision trees can be used for both classification and regression
problems, making them a versatile model that can be applied to various types of
datasets.
Overall, decision trees provide a simple yet powerful way to model complex relationships
between variables and are a popular choice for predictive modeling in data science.
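The report does not preserve the modeling code, the train/test split, or the tree's hyperparameters. The following is a minimal sketch of fitting and scoring a decision tree classifier, assuming scikit-learn. The data is synthetic with hypothetical column names, so the printed accuracy is unrelated to the 0.684 reported above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: column names and values are hypothetical and
# are NOT taken from AUS_CREDIT.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "credit_amount": rng.integers(250, 15000, size=n),
    "duration_months": rng.integers(4, 72, size=n),
    "age": rng.integers(19, 75, size=n),
    "savings": rng.choice(["none", "low", "high"], size=n),
})
# Toy labelling rule so that the tree has a pattern to learn.
data["creditability"] = (
    (data["duration_months"] < 36) | (data["savings"] == "high")
).astype(int)

# One-hot encode the categorical predictor; scikit-learn trees need numeric input.
X = pd.get_dummies(data.drop(columns="creditability"), columns=["savings"])
y = data["creditability"]

# Hold out 25% of the rows for evaluation (an assumed split, not the report's).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# A shallow tree stays easy to read; max_depth=3 is an arbitrary choice.
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

y_pred = tree.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Accuracy of always predicting the most common class in the test set.
baseline = float(y_test.value_counts(normalize=True).max())
print(f"accuracy={accuracy:.3f}, majority-class baseline={baseline:.3f}")
```

Comparing accuracy against the majority-class baseline helps judge whether the model adds value over always predicting the most common label, which matters when the two classes are unevenly represented.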