[Figure: two candidate splits on the play-golf data (e.g. "High Humidity?"), with the Yes/No counts of "Played Golf" in each branch: (4, 0)/(0, 1) and (1, 0)/(0, 4)]

Outlook | Temperature | Humidity | Wind   | Played
--------|-------------|----------|--------|-------
Sunny   | Hot         | High     | Weak   | No
Sunny   | Hot         | High     | Weak   | No
Rain    | Mild        | High     | Strong | Yes
Sunny   | Mild        | High     | Weak   | No
Rain    | Mild        | High     | Weak   | No
Practice Program 1
• Preferred Software: Jupyter Notebook from the Anaconda package
• Required File: play.csv; download it from Slack or the web server, and copy the
file location, since it will be needed to run the program.
• Required Modules: NumPy, Pandas, Scikit-Learn, etc.
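A minimal sketch of what Practice Program 1 might look like. The column names are assumed from the play-golf table above, and the inline DataFrame is only a stand-in for the real play.csv file:

```python
# Sketch of Practice Program 1 (assumed: play.csv has the columns shown
# in the table above: Outlook, Temperature, Humidity, Wind, Played).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Inline stand-in for play.csv; replace with pd.read_csv("play.csv").
df = pd.DataFrame({
    "Outlook":     ["Sunny", "Sunny", "Rain",   "Sunny", "Rain"],
    "Temperature": ["Hot",   "Hot",   "Mild",   "Mild",  "Mild"],
    "Humidity":    ["High",  "High",  "High",   "High",  "High"],
    "Wind":        ["Weak",  "Weak",  "Strong", "Weak",  "Weak"],
    "Played":      ["No",    "No",    "Yes",    "No",    "No"],
})

# Scikit-Learn trees need numeric inputs, so one-hot encode the
# categorical columns first.
X = pd.get_dummies(df.drop(columns="Played"))
y = df["Played"]

clf = DecisionTreeClassifier(criterion="gini").fit(X, y)
print(clf.score(X, y))
```

A fully grown tree separates this small sample perfectly, since no two identical rows have conflicting labels.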
Decision Tree Algorithm: Numeric Data
• The "Iris" dataset, originally published at the UCI Machine Learning Repository as the
Iris Data Set. This small dataset from 1936 is often used for testing machine
learning algorithms and visualizations (for example, scatter plots). Each row of the
table represents an iris flower, including its species and the dimensions of its
botanical parts, sepal and petal, in centimeters.
• This dataset is built into the Scikit-Learn module.
• To split the dataset and find the root node:
• First, we have to sort out the classes and their occurrences. In the Iris dataset
there are three classes of flowers, named Setosa, Versicolor, and Virginica.
• Second, we have to sort a suitable feature in ascending order, take the average
of every two consecutive values, and compute the Gini Index for each of those
average values.
• We split the dataset at the value with the lowest Gini Index.
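A minimal sketch of the Gini computation in the second and third steps (the helper names are illustrative, not from the lecture code). It reproduces the Gini Index of the candidate split "Petal Length <= 3.15 cm" discussed on the next slide:

```python
# Weighted Gini Index of a binary split, as used in the steps above.
def gini(counts):
    """Gini impurity of one branch from its per-class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def split_gini(left_counts, right_counts):
    """Size-weighted Gini Index of the two branches of a split."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * gini(left_counts) + (n_right / n) * gini(right_counts)

# Candidate "Petal Length <= 3.15 cm": the YES branch holds 50 Setosa
# and 1 other flower; the NO branch holds 0 Setosa and 99 others.
print(round(split_gini([50, 1], [0, 99]), 3))  # 0.013
```

The perfect split "Petal Length <= 2.45 cm" gives `split_gini([50, 0], [0, 100]) == 0.0`, which is why it is chosen as the root.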
For example, the Iris dataset is sorted by Petal Length in ascending order, and the
average of every two consecutive lengths is taken. From this we can find:
Candidate split: Petal Length <= 2.45 cm?
  YES branch (Setosa? Yes: 50, No: 0)
  NO branch (Setosa? Yes: 0, No: 100)
  Gini Index = 0

Candidate split: Petal Length <= 3.15 cm?
  YES branch (Setosa? Yes: 50, No: 1)
  NO branch (Setosa? Yes: 0, No: 99)
  Gini Index = 0.013
So the entire dataset will be split according to Petal Length <= 2.45 cm, and this
becomes the root node. The procedure is then repeated on the remaining dataset,
this time choosing the Petal Width column.
Candidate split: Petal Width <= 1.75 cm?
  YES branch (Versicolor? Yes: 49, No: 5)
  NO branch (Versicolor? Yes: 1, No: 45)
  Gini Index = 0.110

Another candidate Petal Width threshold:
  YES branch (Versicolor? Yes: 50, No: 16)
  NO branch (Versicolor? Yes: 0, No: 34)
  Gini Index = 0.24
So Petal Width <= 1.75 cm should be selected as the second internal node. For further
splitting we can choose Petal Length and repeat the process.
The final decision tree could look like this:

Petal Length <= 2.45 cm?
  YES: Setosa (Yes: 50, No: 0)
  NO: Petal Width <= 1.75 cm?
    YES: further split into Versicolor leaves (Yes: 47, No: 1) and (Yes: 2, No: 4)
    NO: further split into Virginica leaves (Yes: 2, No: 1) and (Yes: 0, No: 43)
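A tree of this shape can be reproduced with Scikit-Learn's built-in Iris dataset; `export_text` prints the learned rules. Note that the exact splits chosen can differ between runs and library versions when candidates tie, so `random_state` is fixed here and the thresholds may not match the slide exactly:

```python
# Fit a Gini-based decision tree on the built-in Iris dataset and print
# its learned rules; thresholds near 2.45 and 1.75 typically appear.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(iris.data, iris.target)

rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
print("training accuracy:", clf.score(iris.data, iris.target))
```

An unpruned tree fits the Iris training data perfectly, which illustrates the overfitting risk discussed later.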
Practice Program 2
• Preferred Software: Jupyter Notebook from the Anaconda package
• Required Database: the Iris dataset, which is built into the Scikit-Learn
module. For calculation purposes, the file is also attached in Slack and on the
web server.
• Required Modules: NumPy, Pandas, Scikit-Learn, etc.
Decision Tree Algorithm: Mixed Data
• To split the dataset and find the root node:
• First, we have to sort out the classes and their occurrences, and select the
best class to split the dataset.
• Second, we have to sort a suitable feature in ascending order, take the average
of every two consecutive values, and compute the Gini Index for each of those
average values.
• We split the dataset at the value with the lowest Gini Index.
• For subsequent branch nodes we follow either the numeric-data algorithm or the
categorical-data algorithm, deciding by the value of the Gini Index or the
Information Gain computed from the entropy.
• A node becomes a leaf when only one class remains. Splitting of parent
nodes and pruning of branch nodes depend on the value of the Gini Index.
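The Information Gain criterion mentioned above can be sketched as follows. The helper names are illustrative, and the Wind counts (Weak: 6 Yes / 2 No, Strong: 3 Yes / 3 No) are the classic 14-row play-golf numbers, assumed here for illustration rather than taken from the 5-row excerpt earlier:

```python
# Entropy and Information Gain for a categorical split.
from math import log2

def entropy(counts):
    """Entropy of a node from its per-class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

def information_gain(parent_counts, branch_counts):
    """Parent entropy minus the size-weighted entropy of the branches."""
    n = sum(parent_counts)
    weighted = sum(sum(b) / n * entropy(b) for b in branch_counts)
    return entropy(parent_counts) - weighted

# Parent: 9 Yes / 5 No, split by Wind into Weak (6, 2) and Strong (3, 3).
print(round(information_gain([9, 5], [[6, 2], [3, 3]]), 3))  # 0.048
```

The feature (or threshold) with the highest Information Gain is chosen, just as the lowest Gini Index is chosen under the Gini criterion.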
Practice Program 3
• Preferred Software: Jupyter Notebook from the Anaconda package
• Required Database: download it from the link
https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/drug200.csv
• Required Modules: NumPy, Pandas, Scikit-Learn, etc.
Advantages & Disadvantages of Decision Tree

Advantages
• Easy to understand
• Useful in data exploration
• Implicitly performs feature selection
• Little effort needed for data preparation

Disadvantages
• Overfitting
• Not fit for continuous variables
• Variance can give different results
Original dataset vs. bootstrapped dataset (columns: Chest Pain, Good Blood Circulation, Blocked Arteries, Weight, Heart Disease):

Original dataset:         Bootstrapped dataset:
No  No  No  125  No       No  No  No  125  No
Yes Yes Yes 180  Yes      Yes Yes Yes 180  Yes
Yes Yes No  210  No       Yes No  Yes 167  Yes
Yes No  Yes 167  Yes      Yes No  Yes 167  Yes

The row (Yes, Yes, No, 210, No) was never sampled, so it forms the Out-of-Bag dataset.
Step 2: Create a decision tree from the bootstrapped dataset, using a random subset of variables at each step.
Now we take the Out-of-Bag data and run it through the six trees in the random forest we made. Running the six trees separately, the following results come out.

Sample (Chest Pain: No, Good Blood Circulation: No, Blocked Arteries: No, Weight: 168, Heart Disease: Yes)
  Votes for Heart Disease: Yes: 5, No: 1

Now suppose we take two Out-of-Bag samples to compare:

Sample (No, No, No, 168, Heart Disease: Yes)
  Votes: Yes: 4, No: 2
Sample (No, No, No, 125, Heart Disease: No)
  Votes: Yes: 5, No: 1

The proportion of Out-of-Bag samples that are incorrectly classified is the "Out-of-Bag error". We can measure the accuracy of the random forest by the Out-of-Bag error.
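Scikit-Learn can report this score directly: with `oob_score=True`, each sample is evaluated only by the trees that did not see it during bootstrapping. A minimal sketch on the built-in Iris dataset (the parameter values are illustrative):

```python
# Out-of-Bag error with Scikit-Learn's RandomForestClassifier.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                random_state=0)
forest.fit(iris.data, iris.target)

# oob_score_ is the accuracy on out-of-bag samples; the error is its
# complement: the proportion of OOB samples classified incorrectly.
oob_error = 1.0 - forest.oob_score_
print("Out-of-Bag error:", round(oob_error, 3))
```

This gives a built-in accuracy estimate without needing a separate test set.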
Advantages & Disadvantages of Random Forest

Advantages
• Produces a highly accurate classifier
• Runs efficiently on large databases
• Estimates which variables are important in the classification

Disadvantages
• Random forests have been observed to overfit for some datasets with noisy
classification/regression tasks.