
Report on Data Mining, Predictive Analytics, and Machine Learning

1. Introduction

In this report, we explore the application of data mining, predictive analytics, and machine
learning techniques to identify risky customers and predict potential defaults in a banking
context. The goal is to build models that can differentiate between good and bad credit-standing
customers based on historical data and attributes.

2. Data Preprocessing

We began by preprocessing the dataset to ensure the quality and suitability of the data for
model training. Missing values were addressed through imputation techniques, using means
and medians where appropriate. Outliers were identified and treated through methods like
winsorization to avoid undue influence on the models. Additionally, features were standardized
to ensure uniform scales across variables, aiding the performance of our models.
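
A minimal sketch of this preprocessing pipeline is shown below. The numeric column names ('Duration', 'CreditAmount') and the 5th/95th percentile winsorization cut-offs are illustrative assumptions, not values taken from the actual dataset.

import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.read_excel('CreditRisk.xlsx')
numeric_cols = ['Duration', 'CreditAmount']  # assumed numeric column names

# Impute missing values with each column's median
data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].median())

# Winsorize: clip extreme values to the 5th and 95th percentiles (assumed cut-offs)
for col in numeric_cols:
    lower, upper = data[col].quantile([0.05, 0.95])
    data[col] = data[col].clip(lower, upper)

# Standardize numeric features to zero mean and unit variance
scaler = StandardScaler()
data[numeric_cols] = scaler.fit_transform(data[numeric_cols])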

3. Model Development

We developed two primary models: a Decision Tree and a Neural Network.

3.1 Decision Tree Model

The Decision Tree model is a supervised learning algorithm that splits the data into subsets
based on the most significant attribute at each node. We designed a Decision Tree with a
maximum depth of 3, aiming to prevent overfitting while capturing important patterns in the
data.

3.2 Neural Network Model

Our Neural Network model is a multilayer perceptron with two hidden layers, containing 64 and
32 units respectively, activated by ReLU functions. The output layer employs a sigmoid activation
to predict the likelihood of a customer being risky. The model was trained using the Adam
optimizer and binary cross-entropy loss.

4. Model Performance

Both models were evaluated on a held-out testing set, and accuracy, precision, recall, and
F1-score were calculated for each.

Question 1

a)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the dataset from Excel
data = pd.read_excel('CreditRisk.xlsx')

# Preprocess categorical variables using one-hot encoding
data_encoded = pd.get_dummies(data, columns=['CheckingAcct', 'CreditHist', 'Purpose',
                                             'SavingsAcct', 'Employment', 'Gender',
                                             'Personal Status', 'Housing', 'Job',
                                             'Telephone', 'Foreign'])

# Separate features (X) and target (y)
X = data_encoded.drop('CreditStanding', axis=1)
y = data_encoded['CreditStanding']

# Split the data into training and testing sets
# (test_size=25 is an integer, so 25 rows are held out for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=25, random_state=42)

# Create a Decision Tree model (adjust max_depth as needed)
decision_tree_model = DecisionTreeClassifier(max_depth=3)

# Fit the model to the training data
decision_tree_model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = decision_tree_model.predict(X_test)

# Calculate metrics, treating 'Bad' as the positive class
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label='Bad')
recall = recall_score(y_test, y_pred, pos_label='Bad')
f1 = f1_score(y_test, y_pred, pos_label='Bad')

# Print the metrics
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")

Accuracy: 0.68
Precision: 0.71
Recall: 0.71
F1-score: 0.71

b)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from keras.models import Sequential
from keras.layers import Dense

# Load the dataset from Excel
data = pd.read_excel('CreditRisk.xlsx')

# Preprocess categorical variables using label encoding
# (note: apply() label-encodes every column, including the numeric ones)
label_encoder = LabelEncoder()
data_encoded = data.apply(label_encoder.fit_transform)

# Separate features (X) and target (y)
X = data_encoded.drop('CreditStanding', axis=1)
y = data_encoded['CreditStanding']

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split the data into training and testing sets (25 rows held out, as in part a)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=25, random_state=42)

# Create a Neural Network model
model = Sequential()
model.add(Dense(units=64, activation='relu', input_dim=X_train.shape[1]))
model.add(Dense(units=32, activation='relu'))
model.add(Dense(units=1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=1)

Epoch 1/50
13/13 [==============================] - 1s 2ms/step - loss: 0.6907 - accuracy: 0.5475
Epoch 2/50
13/13 [==============================] - 0s 2ms/step - loss: 0.6340 - accuracy: 0.6550
Epoch 3/50
13/13 [==============================] - 0s 2ms/step - loss: 0.6008 - accuracy: 0.6975
Epoch 4/50
13/13 [==============================] - 0s 2ms/step - loss: 0.5784 - accuracy: 0.7150
Epoch 5/50
13/13 [==============================] - 0s 2ms/step - loss: 0.5583 - accuracy: 0.7275
Epoch 6/50
13/13 [==============================] - 0s 1ms/step - loss: 0.5453 - accuracy: 0.7300
Epoch 7/50
13/13 [==============================] - 0s 1ms/step - loss: 0.5339 - accuracy: 0.7400
Epoch 8/50
13/13 [==============================] - 0s 2ms/step - loss: 0.5248 - accuracy: 0.7450
Epoch 9/50
13/13 [==============================] - 0s 2ms/step - loss: 0.5158 - accuracy: 0.7575
Epoch 10/50
13/13 [==============================] - 0s 2ms/step - loss: 0.5090 - accuracy: 0.7625
Epoch 11/50
13/13 [==============================] - 0s 2ms/step - loss: 0.5034 - accuracy: 0.7600
Epoch 12/50
13/13 [==============================] - 0s 2ms/step - loss: 0.4955 - accuracy: 0.7625
Epoch 13/50
13/13 [==============================] - 0s 2ms/step - loss: 0.4890 - accuracy: 0.7650
Epoch 14/50
13/13 [==============================] - 0s 2ms/step - loss: 0.4832 - accuracy: 0.7750
Epoch 15/50
13/13 [==============================] - 0s 2ms/step - loss: 0.4779 - accuracy: 0.7800
Epoch 16/50
13/13 [==============================] - 0s 2ms/step - loss: 0.4707 - accuracy: 0.7825
Epoch 17/50
13/13 [==============================] - 0s 1ms/step - loss: 0.4646 - accuracy: 0.7900
Epoch 18/50
13/13 [==============================] - 0s 2ms/step - loss: 0.4583 - accuracy: 0.7900
Epoch 19/50
13/13 [==============================] - 0s 2ms/step - loss: 0.4528 - accuracy: 0.7900
Epoch 20/50
13/13 [==============================] - 0s 2ms/step - loss: 0.4476 - accuracy: 0.8075
Epoch 21/50
13/13 [==============================] - 0s 1ms/step - loss: 0.4412 - accuracy: 0.8050
Epoch 22/50
13/13 [==============================] - 0s 2ms/step - loss: 0.4354 - accuracy: 0.8125
Epoch 23/50
13/13 [==============================] - 0s 2ms/step - loss: 0.4292 - accuracy: 0.8150
Epoch 24/50
13/13 [==============================] - 0s 2ms/step - loss: 0.4255 - accuracy: 0.8250
Epoch 25/50
13/13 [==============================] - 0s 2ms/step - loss: 0.4171 - accuracy: 0.8200
Epoch 26/50
13/13 [==============================] - 0s 2ms/step - loss: 0.4114 - accuracy: 0.8200
Epoch 27/50
13/13 [==============================] - 0s 2ms/step - loss: 0.4059 - accuracy: 0.8325
Epoch 28/50
13/13 [==============================] - 0s 2ms/step - loss: 0.4007 - accuracy: 0.8375
Epoch 29/50
13/13 [==============================] - 0s 3ms/step - loss: 0.3954 - accuracy: 0.8325
Epoch 30/50
13/13 [==============================] - 0s 2ms/step - loss: 0.3948 - accuracy: 0.8475
Epoch 31/50
13/13 [==============================] - 0s 2ms/step - loss: 0.3820 - accuracy: 0.8450
Epoch 32/50
13/13 [==============================] - 0s 2ms/step - loss: 0.3787 - accuracy: 0.8425
Epoch 33/50
13/13 [==============================] - 0s 2ms/step - loss: 0.3721 - accuracy: 0.8500
Epoch 34/50
13/13 [==============================] - 0s 2ms/step - loss: 0.3675 - accuracy: 0.8475
Epoch 35/50
13/13 [==============================] - 0s 1ms/step - loss: 0.3628 - accuracy: 0.8600
Epoch 36/50
13/13 [==============================] - 0s 1ms/step - loss: 0.3566 - accuracy: 0.8600
Epoch 37/50
13/13 [==============================] - 0s 2ms/step - loss: 0.3505 - accuracy: 0.8525
Epoch 38/50
13/13 [==============================] - 0s 2ms/step - loss: 0.3456 - accuracy: 0.8625
Epoch 39/50
13/13 [==============================] - 0s 2ms/step - loss: 0.3405 - accuracy: 0.8650
Epoch 40/50
13/13 [==============================] - 0s 2ms/step - loss: 0.3367 - accuracy: 0.8675
Epoch 41/50
13/13 [==============================] - 0s 2ms/step - loss: 0.3314 - accuracy: 0.8725
Epoch 42/50
13/13 [==============================] - 0s 2ms/step - loss: 0.3235 - accuracy: 0.8750
Epoch 43/50
13/13 [==============================] - 0s 2ms/step - loss: 0.3212 - accuracy: 0.8775
Epoch 44/50
13/13 [==============================] - 0s 2ms/step - loss: 0.3137 - accuracy: 0.8800
Epoch 45/50
13/13 [==============================] - 0s 2ms/step - loss: 0.3099 - accuracy: 0.8850
Epoch 46/50
13/13 [==============================] - 0s 1ms/step - loss: 0.3073 - accuracy: 0.8875
Epoch 47/50
13/13 [==============================] - 0s 2ms/step - loss: 0.3035 - accuracy: 0.8800
Epoch 48/50
13/13 [==============================] - 0s 2ms/step - loss: 0.2949 - accuracy: 0.8975
Epoch 49/50
13/13 [==============================] - 0s 2ms/step - loss: 0.2900 - accuracy: 0.8950
Epoch 50/50
13/13 [==============================] - 0s 2ms/step - loss: 0.2848 - accuracy: 0.8975

<keras.callbacks.History at 0x7d4ba0d0c760>

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Predictions from the trained model (probabilities thresholded at 0.5)
y_pred = (model.predict(X_test) > 0.5).astype('int')

# Calculate metrics
# (note: with the label-encoded target, pos_label defaults to 1, which may not
# correspond to the 'Bad' class used as the positive label in part a)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print the metrics
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")

1/1 [==============================] - 0s 48ms/step


Accuracy: 0.68
Precision: 0.60
Recall: 0.82
F1-score: 0.69

c) Cross-Validation:

from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score

# Function to create a Keras model
def create_model():
    model = Sequential()
    model.add(Dense(units=64, activation='relu', input_dim=X_train.shape[1]))
    model.add(Dense(units=32, activation='relu'))
    model.add(Dense(units=1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Create a KerasClassifier
keras_classifier = KerasClassifier(build_fn=create_model, epochs=50, batch_size=32, verbose=0)

# Cross-validation for Neural Network model
nn_scores = cross_val_score(keras_classifier, X_scaled, y, cv=10)
print("Neural Network Cross-Validation Scores:", nn_scores)

<ipython-input-18-821ab814bf16>:14: DeprecationWarning: KerasClassifier is deprecated, use Sci-Keras (https://github.com/adriangb/scikeras) instead. See https://www.adriangb.com/scikeras/stable/migration.html for help migrating.
  keras_classifier = KerasClassifier(build_fn=create_model, epochs=50, batch_size=32, verbose=0)

Neural Network Cross-Validation Scores: [0.60465115 0.79069769 0.72093022 0.67441863 0.79069769 0.59523809
 0.64285713 0.71428573 0.71428573 0.66666669]
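
The fold scores can be summarized by their mean and standard deviation (the mean here works out to roughly 0.69). A one-line summary such as the following would report it:

# Summarize the 10-fold cross-validation scores
print(f"Mean CV accuracy: {nn_scores.mean():.3f} (std: {nn_scores.std():.3f})")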

d)

1. Feature Engineering:

To identify irrelevant or redundant features, you can use techniques like analyzing feature
importance from the Decision Tree model or performing correlation analysis. You can remove
features that don't contribute much to the model's predictive power.
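
As a sketch, the fitted Decision Tree's feature importances and a simple correlation screen could be inspected as follows; the variables reuse those from part (a), and the code assumes the one-hot encoded features are numeric.

import numpy as np
import pandas as pd

# Rank features by the importance scores of the Decision Tree fitted in part (a)
importances = pd.Series(decision_tree_model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))

# Correlation screen: largest absolute pairwise correlations between encoded features
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.stack().sort_values(ascending=False).head(10))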

2. Data Preprocessing:

a) Handling Missing Values: Depending on the extent of missing values, you might choose to
impute them (fill them in) using techniques like mean, median, or advanced imputation
methods.

b) Outlier Handling: Identify and handle outliers by either removing them or transforming them
using techniques like winsorization or log transformation.

c) Normalization/Standardization: Apply normalization (scaling features to a range) or
standardization (centering features around the mean and scaling to unit variance) to ensure that
features are on similar scales, which can help some algorithms perform better.

3. Hyperparameter Tuning:

You can use techniques like grid search or random search to find the optimal hyperparameters
for both the Decision Tree and Neural Network models. For example, you can experiment with
different values of max_depth for the Decision Tree, and with different architectures, learning
rates, and dropout rates for the Neural Network.

from sklearn.model_selection import GridSearchCV

# Define hyperparameters to tune
param_grid = {
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10]
}

# Create GridSearchCV object
grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=10)

# Fit the grid search to the training data
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

Best Hyperparameters: {'max_depth': 5, 'min_samples_split': 2}
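
For the Neural Network, a comparable (hand-rolled) search could loop over candidate learning rates and dropout rates, scoring each configuration on a validation split. This is only a sketch: the candidate values below are illustrative assumptions, not tuned results.

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import Adam

best_acc, best_config = 0.0, None
# Illustrative grid of candidate values (assumptions, not tuned results)
for lr in [1e-2, 1e-3]:
    for dropout in [0.0, 0.3]:
        model = Sequential([
            Dense(64, activation='relu', input_dim=X_train.shape[1]),
            Dropout(dropout),
            Dense(32, activation='relu'),
            Dense(1, activation='sigmoid'),
        ])
        model.compile(optimizer=Adam(learning_rate=lr),
                      loss='binary_crossentropy', metrics=['accuracy'])
        history = model.fit(X_train, y_train, epochs=50, batch_size=32,
                            validation_split=0.2, verbose=0)
        val_acc = history.history['val_accuracy'][-1]
        if val_acc > best_acc:
            best_acc, best_config = val_acc, (lr, dropout)

print("Best (learning rate, dropout):", best_config, "- validation accuracy:", round(best_acc, 3))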

4. Ensemble Methods:

For the Decision Tree model, you can experiment with ensemble methods like Random Forest,
which combines multiple decision trees to improve predictive accuracy and control overfitting.
For the Neural Network model, you can experiment with different architectures, activation
functions, and regularization techniques (such as dropout or L2 regularization) to create
ensembles of neural networks.
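
A brief sketch of the Random Forest alternative, reusing the one-hot encoded split from part (a), might look like the following; n_estimators=200 and max_depth=5 are illustrative choices, not tuned values.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Random Forest: an ensemble of decision trees fitted on the part (a) split
rf_model = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
print("Random Forest accuracy:", round(accuracy_score(y_test, rf_pred), 2))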
