# -*- coding: utf-8 -*-
"""
Lab: Classification Trees
@author: Brian James
"""
# The purpose of this lab is to get some practice using scikit-learn to
# build classification trees.
# Instructions:
# 1. Explain why decision trees are non-parametric models.
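# A parametric model (e.g. logistic regression) has a fixed number of
# parameters chosen before seeing the data; a decision tree's structure
# (its splits and leaves) is determined by the training data itself, so
# its complexity can grow with the data. A self-contained sketch on
# synthetic data (not part of the lab dataset) makes this visible:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
leaf_counts = []
for n in (50, 500, 5000):
    X = rng.normal(size=(n, 2))        # synthetic features
    y = rng.integers(0, 2, size=n)     # random labels force the tree to memorize
    tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    leaf_counts.append(tree.get_n_leaves())
    print(n, leaf_counts[-1])
```

# With no max_depth limit, the number of leaves (the tree's effective
# parameter count) keeps growing as n grows -- that is what makes the
# model non-parametric.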
# 2. Create a Python file. Use this code to read and preprocess the data
# (see step 7 for how to install graphviz).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split
import graphviz
df = pd.read_csv('https://raw.githubusercontent.com/grbruns/cst383/master/College.csv',
                 index_col=0)
# 3. Convert the 'Private' column to a numeric column with values 0 and 1
# (1 for private colleges).
# map 'Yes'/'No' to 1/0 (1 = private)
df2 = df.copy()
df2['Private'] = (df2['Private'] == 'Yes').astype(int)
# 4. Do a little exploration of the data to remember what it's like,
# e.g. use df.info(), df.describe().
df2.describe()
# Out[109]:
# Private Apps ... Expend Grad.Rate
# count 777.000000 777.000000 ... 777.000000 777.00000
# mean 0.727156 3001.638353 ... 9660.171171 65.46332
# std 0.445708 3870.201484 ... 5221.768440 17.17771
# min 0.000000 81.000000 ... 3186.000000 10.00000
# 25% 0.000000 776.000000 ... 6751.000000 53.00000
# 50% 1.000000 1558.000000 ... 8377.000000 65.00000
# 75% 1.000000 3624.000000 ... 10830.000000 78.00000
# max 1.000000 48094.000000 ... 56233.000000 118.00000
# [8 rows x 18 columns]
df2.info()
# Reports a DataFrame with 777 entries (one per college, indexed by name,
# from Abilene Christian University to York College of Pennsylvania) and
# 18 columns, along with each column's dtype and non-null count.
# 5. We will try to predict whether a college is public or private. Select
# a few predictors, create NumPy arrays X and y, and then do a training/test
# split. Try hard to remember how to do this from memory. If you can't,
# refer to the hints.
predictors = ['Outstate', 'F.Undergrad']
X = df[predictors].values
y = (df['Private'] == 'Yes').values.astype(int)
# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
                                                    random_state=0)
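# Note that about 73% of the colleges are private (see the describe()
# output above), so the two classes are imbalanced. train_test_split
# accepts a stratify= argument that preserves the class ratio in both
# splits; a self-contained sketch on synthetic labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

ys = np.array([1] * 73 + [0] * 27)   # mimic ~73% positive labels
Xs = np.arange(100).reshape(-1, 1)   # dummy feature column
Xtr, Xte, ytr, yte = train_test_split(Xs, ys, test_size=0.30,
                                      random_state=0, stratify=ys)
# both splits keep roughly the same fraction of positives
print(ytr.mean(), yte.mean())
```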
# 6. Train a tree classifier using Scikit‐Learn's DecisionTreeClassifier.
# Use the training data you created in the previous step.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)
# 7. Install graphviz by entering conda install python-graphviz at the
# Anaconda prompt. Then plot your tree using graphviz. Try playing with
# some of the options of export_graphviz().
target_names = ['Public', 'Private']
dot_data = export_graphviz(clf, precision=2,
                           feature_names=predictors,
                           proportion=True,
                           class_names=target_names,
                           filled=True, rounded=True,
                           special_characters=True)
# plot it
graph = graphviz.Source(dot_data)
graph
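# If installing graphviz is a hassle, sklearn.tree.plot_tree draws the
# same diagram with matplotlib alone. Sketched here on a small synthetic
# tree so the snippet runs on its own; in the lab you would pass your
# fitted clf and your predictors list instead:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is needed
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# small synthetic stand-in for the college data
Xs, ys = make_classification(n_samples=100, n_features=2, n_informative=2,
                             n_redundant=0, random_state=0)
demo = DecisionTreeClassifier(max_depth=2, random_state=0).fit(Xs, ys)

fig, ax = plt.subplots(figsize=(8, 5))
plot_tree(demo, feature_names=["f0", "f1"],
          class_names=["Public", "Private"],
          filled=True, rounded=True, ax=ax)
fig.savefig("tree.png")
```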
# 8. Use your classification tree to predict whether examples in your test
# data are public or private. Compute the confusion matrix and the accuracy
# of your predictions.
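# One way to finish step 8, sketched on synthetic data so it runs on its
# own; in the lab, swap in your clf, X_test, and y_test from above:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the college data
Xs, ys = make_classification(n_samples=300, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(Xs, ys, test_size=0.30, random_state=0)
model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(Xtr, ytr)

y_pred = model.predict(Xte)
cm = confusion_matrix(yte, y_pred)   # rows: actual class, cols: predicted
acc = accuracy_score(yte, y_pred)    # fraction of correct predictions
print(cm)
print("accuracy:", acc)
```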
# 9. If you still have time, do the following:
# ● try building more classification trees, using different sets of input
#   features
# ● look at, and play with, the hyperparameters available in
#   DecisionTreeClassifier, especially max_depth
# ● see how much the classification tree that you produce depends on your
#   particular training set
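# For the max_depth experiment, a simple loop over depths shows the
# underfitting/overfitting tradeoff; synthetic data keeps the sketch
# self-contained (in the lab, use your own train/test split):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

Xs, ys = make_classification(n_samples=500, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(Xs, ys, test_size=0.30, random_state=0)

train_scores, test_scores = [], []
for depth in (1, 2, 4, 8, None):     # None = grow the tree without limit
    m = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(Xtr, ytr)
    train_scores.append(m.score(Xtr, ytr))
    test_scores.append(m.score(Xte, yte))
    print(depth, train_scores[-1], test_scores[-1])
```

# Training accuracy can only go up as the tree gets deeper, while test
# accuracy typically peaks at a moderate depth and then degrades -- a
# good reason to tune max_depth rather than leave it unlimited.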
# Hints:
# 1. -
# 2. -
# 3. df['Private'] = (df['Private'] == 'Yes').astype(int)
# 4. -
# 5.
# predictors = ['Outstate', 'F.Undergrad']
# X = df[predictors].values
# y = df['Private'].values
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
# random_state=0)
# 6. -
# 7.
# dot_data = export_graphviz(clf, precision=2,
# feature_names=predictors,
# proportion=True,
# class_names=target_names,
# filled=True, rounded=True,
# special_characters=True)
# # plot it
# graph = graphviz.Source(dot_data)
# graph
# 8. -
# 9. -