
Decision tree classifier

Dishant Kumar Yadav 2021BCS0136

Implementation:

General Terms: Let us first discuss a few statistical concepts used in this post.

Entropy: The entropy of a dataset is a measure of its impurity. Entropy can also be thought of
as a measure of uncertainty. We should try to minimize entropy: the goal of
machine learning models is to reduce uncertainty, or entropy, as far as possible.
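
To make this concrete, here is a minimal sketch (using a hypothetical label column of 9 positive and 5 negative samples, not the diabetes data) that computes the entropy directly from the definition:

import numpy as np

# Hypothetical label column: 9 positive and 5 negative samples
labels = np.array([1] * 9 + [0] * 5)
_, counts = np.unique(labels, return_counts=True)
probabilities = counts / counts.sum()
entropy = -np.sum(probabilities * np.log2(probabilities))
print(entropy)  # about 0.940 bits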

Information Gain: Information gain is a measure of how much information a feature gives us
about the classes. The Decision Tree algorithm will always try to maximize information gain.
A feature that perfectly partitions the data gives the maximum information. The feature with the
highest information gain is used for the first split.
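
Continuing the hypothetical 14-sample example above, the sketch below computes the information gain of a candidate split that sends 8 samples (6 positive, 2 negative) to one branch and 6 samples (3 positive, 3 negative) to the other: the parent entropy minus the weighted entropy of the branches.

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Hypothetical parent set and the two branches produced by a candidate feature
parent_labels = np.array([1] * 9 + [0] * 5)   # entropy ~0.940
branch_a = np.array([1] * 6 + [0] * 2)        # entropy ~0.811
branch_b = np.array([1] * 3 + [0] * 3)        # entropy 1.0

weighted_entropy = (len(branch_a) / len(parent_labels)) * entropy(branch_a) + \
                   (len(branch_b) / len(parent_labels)) * entropy(branch_b)
information_gain = entropy(parent_labels) - weighted_entropy
print(information_gain)  # about 0.048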

Import Libraries:


We are going to import the NumPy and pandas libraries.

# Import the required libraries


import pandas as pd
import numpy as np

from google.colab import files

uploaded = files.upload()



diabetes11.csv(text/csv) - 7491 bytes, last modified: 17/1/2024 - 100% done
Saving diabetes11.csv to diabetes11.csv

import shutil

# Move the uploaded file 'diabetes11.csv' into the /content directory


shutil.move('diabetes11.csv', '/content/diabetes11.csv')
'/content/diabetes11.csv'

import os

# List files in the /content directory


os.listdir('/content')

['.config',
'diabetes (1).csv',
'diabetes.csv',
'diabetes11.csv',
'sample_data']

import pandas as pd

# Read the CSV file into a DataFrame


df = pd.read_csv('/content/diabetes11.csv')

# Display the first few rows of the DataFrame


df.head()
index  Glucose  BloodPressure  diabetes
0      148      72             1
1      85       66             0
2      183      64             1
3      89       66             0
4      137      40             1


# Define the calculate entropy function
def calculate_entropy(df_label):
    classes, class_counts = np.unique(df_label, return_counts=True)
    entropy_value = np.sum([(-class_counts[i] / np.sum(class_counts)) *
                            np.log2(class_counts[i] / np.sum(class_counts))
                            for i in range(len(classes))])
    return entropy_value

# Define the calculate information gain function
def calculate_information_gain(dataset, feature, label):
    # Calculate the entropy of the whole dataset with respect to the label
    dataset_entropy = calculate_entropy(dataset[label])
    values, feat_counts = np.unique(dataset[feature], return_counts=True)

    # Calculate the weighted feature entropy: call calculate_entropy on the
    # subset of rows that take each value of the feature
    weighted_feature_entropy = np.sum([(feat_counts[i] / np.sum(feat_counts)) *
                                       calculate_entropy(dataset.where(dataset[feature]
                                       == values[i]).dropna()[label])
                                       for i in range(len(values))])
    feature_info_gain = dataset_entropy - weighted_feature_entropy
    return feature_info_gain

# Set the features and label
features = df.columns[:-1]
label = 'diabetes'
parent = None
features

Index(['Glucose', 'BloodPressure'], dtype='object')
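
As a quick check, the helpers above can be called directly on the DataFrame; the exact printed values depend on the full contents of diabetes11.csv, so they are not reproduced here:

# Entropy of the label column and information gain of each feature
print(calculate_entropy(df[label]))

for feature in features:
    print(feature, calculate_information_gain(df, feature, label))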

import numpy as np

def create_decision_tree(dataset, df, features, label, parent=None):
    # Class counts in the full DataFrame and unique classes in the current subset
    datum = np.unique(df[label], return_counts=True)
    unique_data = np.unique(dataset[label])

    # If the subset is pure, return its single class
    if len(unique_data) <= 1:
        return unique_data[0]

    # If the subset is empty, return the majority class
    elif len(dataset) == 0:
        return unique_data[np.argmax(datum[1])]

    # If no features are left to split on, return the parent node's class
    elif len(features) == 0:
        return parent

    else:
        parent = unique_data[np.argmax(datum[1])]

        # Call the calculate_information_gain function for every feature
        # and pick the feature with the highest information gain
        item_values = [calculate_information_gain(dataset, feature, label) for feature in features]
        optimum_feature = features[np.argmax(item_values)]