
Decision Trees and K-NN in Python

Link to the file in Google Colab

Introduction

We already did the exploratory analysis... now what?

We are going to try the algorithms we already know to predict whether an individual on the Titanic would have survived or not.

Import the required libraries

!apt install -y graphviz


!pip install graphviz
!pip install matplotlib seaborn --upgrade

Reading package lists... Done


Building dependency tree
Reading state information... Done
graphviz is already the newest version (2.42.2-3build2).
0 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: graphviz in /usr/local/lib/python3.10/dist-packages (0.20.1)
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (3.7.1)
Requirement already satisfied: seaborn in /usr/local/lib/python3.10/dist-packages (0.12.2)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (1.0.7)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (4.39.3)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (1.4.4)
Requirement already satisfied: numpy>=1.20 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (1.22.4)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (23.1)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (8.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (2.8.2)

Requirement already satisfied: pandas>=0.25 in /usr/local/lib/python3.10/dist-packages (from seaborn) (1.5.3)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=0.25->seaborn) (2022
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.7->matplotlib

import pandas as pd
import numpy as np
import seaborn as sns  # visualization
import matplotlib.pyplot as plt  # visualization
from matplotlib import rcParams
%matplotlib inline
sns.set(color_codes=True)
pd.set_option('display.max_columns', None)
rcParams['figure.figsize'] = 12,8

from sklearn import tree


from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

Load the dataset into a Pandas DataFrame

Pandas is the most important Python library for working with tabular data. In this particular case I am reading the dataset from the internet, but Google Colab also offers the option of uploading local files or connecting to Google Drive (a sketch of the Drive option follows below).
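As a hedged aside (not part of the original notebook), this is roughly what the Google Drive option looks like in Colab; the path under MyDrive is only a hypothetical example.

from google.colab import drive

# Hedged sketch (not in the original notebook): mount Google Drive and read the CSV
# from there instead of downloading it from the internet.
drive.mount('/content/drive')
# df = pd.read_csv('/content/drive/MyDrive/titanic.csv')  # hypothetical location of the file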

df_path = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv" #"https://s3-api.us-geo.objectstorage


df = pd.read_csv(df_path)
df

     Survived  Pclass  Name                                                Sex     Age   Siblings/Spouses Aboard  Parents/Children Aboard  Fare
0    0         3       Mr. Owen Harris Braund                              male    22.0  1                        0                        7.2500
1    1         1       Mrs. John Bradley (Florence Briggs Thayer) Cum...   female  38.0  1                        0                        71.2833
2    1         3       Miss. Laina Heikkinen                               female  26.0  0                        0                        7.9250
3    1         1       Mrs. Jacques Heath (Lily May Peel) Futrelle         female  35.0  1                        0                        53.1000
4    0         3       Mr. William Henry Allen                             male    35.0  0                        0                        8.0500
..   ...       ...     ...                                                 ...     ...   ...                      ...                      ...
882  0         2       Rev. Juozas Montvila                                male    27.0  0                        0                        13.0000

(output truncated in the export)

We modify the dataset based on the EDA we did

Rename the columns
Change the data types

df['Pclass'] = df['Pclass'].astype(str)
df = df.rename(columns={"Pclass": "Boarding_class", "Siblings/Spouses Aboard": "siblings", "Parents/Children Aboard": "parent_children"})

What do we do about the outliers?


As we saw earlier, an outlier (atypical data point) is a point or group of points that differs from the rest. Outliers can hurt model performance and keep us from seeing the real distribution of the variable.

We will revisit this for every model we build, but to start with, decision trees are largely indifferent to outliers because of the way they split the data and how the calculation for each branch is done (based on the class). For the moment we won't take any action; the sketch below just visualizes them.
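As a quick, hedged illustration (not in the original notebook), boxplots of the numeric columns make the outliers we are keeping visible:

# Hedged sketch: visualize the outliers we are deliberately keeping, e.g. in Fare and Age.
fig, axes = plt.subplots(1, 2)
sns.boxplot(x=df.Fare, ax=axes[0])
sns.boxplot(x=df.Age, ax=axes[1])
plt.show()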

State of the "class"

Let's take another look at our class variable

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887 entries, 0 to 886
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Survived 887 non-null int64
1 Boarding_class 887 non-null object
2 Name 887 non-null object
3 Sex 887 non-null object
4 Age 887 non-null float64
5 siblings 887 non-null int64
6 parent_children 887 non-null int64
7 Fare 887 non-null float64
dtypes: float64(2), int64(3), object(3)
memory usage: 55.6+ KB

sns.countplot(x=df.Survived)

<Axes: xlabel='Survived', ylabel='count'>

We prepare our training dataset

Before creating our train and test sets, we have to deal with the categorical variables; most models do not support them, so we need to convert them into something they can work with. We have the following:

Name = for now we will keep it until we train the model; we reserve it to interpret the results later.
Boarding_class and Sex = we will transform them with one-hot encoding

df_model = pd.concat([df, pd.get_dummies(df.Sex), pd.get_dummies(df.Boarding_class, prefix="class")], axis=1)


df_model

   Survived  Boarding_class  Name                                                Sex     Age   siblings  parent_children  Fare     female  m...
0  0         3               Mr. Owen Harris Braund                              male    22.0  1         0                7.2500   0
1  1         1               Mrs. John Bradley (Florence Briggs Thayer) Cum...   female  38.0  1         0                71.2833  1
2  1         3               Miss. Laina Heikkinen                               female  26.0  0         0                7.9250   1
3  1         1               Mrs. Jacques Heath (Lily May Peel) Futrelle         female  35.0  1         0                53.1000  1

(the male, class_1, class_2 and class_3 columns and the remaining rows are cut off in the export)

Now we separate the features from the class into different dataframes, removing the columns we no longer need. Then we apply the scikit-learn function to generate the train and test sets.

features = df_model.columns.tolist()
features.remove("Survived")
features.remove("Boarding_class")
features.remove("Sex")
X = df_model.loc[:, features]
y = df_model.loc[:, ["Survived"]]

   Name                                                Age   siblings  parent_children  Fare     female  male  class_1  class_2  class_3
0  Mr. Owen Harris Braund                              22.0  1         0                7.2500   0       1     0        0        1
1  Mrs. John Bradley (Florence Briggs Thayer) Cum...   38.0  1         0                71.2833  1       0     1        0        0
2  Miss. Laina Heikkinen                               26.0  0         0                7.9250   1       0     0        0        1
3  Mrs. Jacques Heath (Lily May Peel) Futrelle         35.0  1         0                53.1000  1       0     1        0        0
4  Mr. William Henry Allen                             35.0  0         0                8.0500   0       1     0        0        1


     Survived
0    0
1    1
2    1
3    1
4    0
..   ...
882  0
883  1
884  0
885  1
886  0

887 rows × 1 columns

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, train_size = .75, stratify=y)

Let's look at the parameters:

X = our feature set
y = our label column
random_state = the train/test split is done randomly; this value acts as a seed so the same split can be reproduced later
train_size = the size of the training set, in this case 75% of the total
stratify = since the class is imbalanced, this parameter keeps a similar ratio of positives in the two sets

Let's check the classes in the output

X_train.set_index("Name", inplace=True)
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 665 entries, Mrs. William (Margaret Norton) Rice to Mr. Reginald Charles Coleridge
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 665 non-null float64
1 siblings 665 non-null int64
2 parent_children 665 non-null int64
3 Fare 665 non-null float64
4 female 665 non-null uint8
5 male 665 non-null uint8
6 class_1 665 non-null uint8
7 class_2 665 non-null uint8
8 class_3 665 non-null uint8
dtypes: float64(2), int64(2), uint8(5)
memory usage: 29.2+ KB

X_train

                                                   Age   siblings  parent_children  Fare     female  male  class_1  class_2  class_3
Name
Mrs. William (Margaret Norton) Rice                39.0  0         5                29.1250  1       0     0        0        1
Mrs. Jacques Heath (Lily May Peel) Futrelle        35.0  1         0                53.1000  1       0     1        0        0
Mrs. Lizzie (Elizabeth Anne Wilkinson) Faunthorpe  29.0  1         0                26.0000  1       0     0        1        0
Mrs. (Lutie Davis) Parrish                         50.0  0         1                26.0000  1       0     0        1        0
Mr. James Clinch Smith                             56.0  0         0                30.6958  0       1     1        0        0
...                                                ...   ...       ...              ...      ...     ...   ...      ...      ...
Mr. Youssef Samaan                                 16.0  2         0                21.6792  0       1     0        0        1

X_test.set_index("Name", inplace=True)
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 222 entries, Miss. Annie Jessie Harper to Miss. Margit Elizabeth Skoog
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 222 non-null float64
1 siblings 222 non-null int64
2 parent_children 222 non-null int64
3 Fare 222 non-null float64
4 female 222 non-null uint8

5 male 222 non-null uint8
6 class_1 222 non-null uint8
7 class_2 222 non-null uint8
8 class_3 222 non-null uint8
dtypes: float64(2), int64(2), uint8(5)
memory usage: 9.8+ KB

Of the 887 records, 665 are in train and 222 in test. Now let's look at the labels.

sns.countplot(x=y_train.Survived)

<Axes: xlabel='Survived', ylabel='count'>

sns.countplot(x=y_test.Survived)

<Axes: xlabel='Survived', ylabel='count'>

As we can see, the ratio between the survival values was preserved from the original set in both the training and the test set.
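A small, hedged check (not in the original notebook) makes the same point numerically:

# Hedged sketch: compare the survival ratio in the full set, the training set and the test set.
for name, labels in [("full", y.Survived), ("train", y_train.Survived), ("test", y_test.Survived)]:
    print(name, labels.value_counts(normalize=True).round(3).to_dict())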

Decision tree

What do we know about trees?

SUPERVISED learning algorithm - we need a labeled training dataset

Eager learning - the final model is a self-contained artifact that no longer depends on the training data (unlike K-NN)

They can do classification or regression (predict a label or a value)

A decision tree is built from:

Decision nodes - correspond to features (columns)

Leaf nodes - correspond to the class labels

The root node of the tree should be the variable that best predicts the class

base_tree = tree.DecisionTreeClassifier(criterion="entropy", random_state = 0)

This is the definition of the tree; for now it only has 2 parameters:

criterion = the formula applied to pick the best attribute at each split. For now we use "entropy", which is the one we saw in class
random_state = again, the seed for the random number generator, to ensure the results are reproducible

Now we train the tree

base_tree.fit(X_train, y_train)

DecisionTreeClassifier(criterion='entropy', random_state=0)

tree.plot_tree(base_tree)

[Output: the rendered tree figure plus a long list of matplotlib Text objects, one per node, showing the split feature, entropy, number of samples and class counts - truncated in this export.]

Since the image is too small to inspect, we are going to export it and use a graph viewer to review it.

We open the .dot file with a text editor and copy its contents.

fn=X_train.columns.tolist()
cn=['Deceased', 'Survived']
tree.export_graphviz(base_tree,
out_file="base_tree.dot",
feature_names = fn,
class_names=cn,
filled = True)
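As a hedged alternative (not part of the original notebook), the graphviz package installed at the top of the notebook can also render the exported file directly in Colab, without an external viewer:

import graphviz

# Hedged sketch: read the .dot file we just exported and render it inline.
with open("base_tree.dot") as f:
    dot_source = f.read()
graphviz.Source(dot_source)  # the last expression in a Colab cell is displayed as the rendered graph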

As you can see, the more variables there are, the more complex the tree becomes and the harder it is to visualize. It is also possible that, by generating so many nodes, it is overfitting.
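One quick, hedged way to check that suspicion (not in the original notebook) is to compare accuracy on the training data against the held-out test data:

# Hedged sketch: a fully grown tree usually scores close to 1.0 on its own training data;
# a large gap against the test score is a sign of overfitting.
train_acc = accuracy_score(y_train, base_tree.predict(X_train))
test_acc = accuracy_score(y_test, base_tree.predict(X_test))
print(f"train accuracy: {train_acc:.3f} - test accuracy: {test_acc:.3f}")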

Now let's make a prediction and look at the confusion matrix

y_pred = base_tree.predict(X_test)

from sklearn.metrics import confusion_matrix

matrix = confusion_matrix(y_test, y_pred)


ax = sns.heatmap(matrix, annot=True, cmap='Blues', fmt='g')

ax.set_title('Matriz de Confusión\n\n');
ax.set_xlabel('\nPredicción')
ax.set_ylabel('Real ');

ax.xaxis.set_ticklabels(['0','1'])
ax.yaxis.set_ticklabels(['0','1'])

plt.show()

From the matrix we can see that, of the 86 survival cases, 57 are predicted correctly and 29 are not. Likewise, of the 136 who did not survive, 16 are predicted as survivors.
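To make that reading explicit, here is a hedged sketch (not in the original notebook) that unpacks the four cells of the matrix; with scikit-learn's default ordering, rows are the actual class and columns the predicted class:

# Hedged sketch: unpack the confusion matrix (row 0 = did not survive, row 1 = survived).
tn, fp, fn, tp = matrix.ravel()
print(f"true negatives={tn}, false positives={fp}, false negatives={fn}, true positives={tp}")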

Now let's look at the Accuracy metric

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

precision recall f1-score support

0 0.81 0.88 0.84 136


1 0.78 0.66 0.72 86

accuracy 0.80 222


macro avg 0.79 0.77 0.78 222
weighted avg 0.80 0.80 0.79 222

The classification report from the scikit-learn library shows the most common classification metrics:

Accuracy
Precision
Recall
F1

The last three are calculated per class, so the report also gives us 2 averages:

Macro Average: the simple average of the per-class metrics. For example, for Precision it is (0.81 + 0.78) / 2 ≈ 0.79
Weighted Average: the average of the metric with each value weighted by the class's share of the support. Here class 0 has 136/222 ≈ 0.61 and class 1 has 86/222 ≈ 0.39, so the calculation is (136/222) * 0.81 + (86/222) * 0.78 ≈ 0.7983 ≈ 0.80.

The weighted average is less useful in problems with imbalanced classes like ours, since it gives more weight to the metric of the majority class, which is usually not what we want. It is more useful in multiclass problems.
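The same two averages can also be reproduced directly from the predictions; a minimal, hedged sketch (not in the original notebook):

from sklearn.metrics import precision_score

# Hedged sketch: macro = unweighted mean of the per-class precisions,
# weighted = mean weighted by each class's support.
print(precision_score(y_test, y_pred, average='macro'))     # approximately 0.79
print(precision_score(y_test, y_pred, average='weighted'))  # approximately 0.80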

For more information about the metrics you can check the following link

So, our tree has 80% Accuracy, and the rest of the metrics are on average close to that, so we can keep using Accuracy even though the class is imbalanced.

This first model will be our Baseline; let's see if we can improve it a bit with hyperparameters

We are going to run a hyperparameter search. Since we don't want to use the test set on any model that is not final, and our training set is already fairly small to split into train and test again, we are going to use Cross Validation.

More info about the technique here
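Before the grid search, here is a minimal, hedged sketch (not in the original notebook) of what cross-validating the baseline tree on the training set alone looks like, reusing the KFold and cross_val_score imports from the top of the notebook:

# Hedged sketch: 5-fold cross-validation of the baseline tree on the training data only.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(base_tree, X_train, y_train.values.ravel(), scoring="accuracy", cv=cv)
print(scores, scores.mean())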


base_tree.get_params()

{'ccp_alpha': 0.0,
'class_weight': None,
'criterion': 'entropy',
'max_depth': None,
'max_features': None,
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'random_state': 0,
'splitter': 'best'}

This is the list of hyperparameters of the classification tree; for more information about them, check the scikit-learn documentation

Now we are going to define the list of values for them, or "grid", over which the search will be performed. We include the baseline's values in it, since some combination could reproduce them.

parameters = {"splitter": ["best", "random"],
              "max_depth": [1, 3, 5, 7, 9, 11, 12, None],
              "max_features": ["auto", "log2", "sqrt", None],  # "auto" is deprecated in recent scikit-learn versions
              "max_leaf_nodes": [None, 10, 50, 80, 90],
              "min_samples_leaf": [1, 2, 4, 9],
              "min_samples_split": [1, 3, 5],  # 1 is not a valid value; those candidate fits fail and are scored as NaN
              "min_weight_fraction_leaf": [0.0, 0.1, 0.5, 0.9],
              }

from sklearn.model_selection import GridSearchCV

tuning_model=GridSearchCV(base_tree, param_grid=parameters, scoring='accuracy', cv=3, verbose=3)

tuning_model.fit(X_train, y_train)

[Output: many pages of GridSearchCV verbose logging, one "[CV k/3] END max_depth=..., max_features=..., ..." line per parameter combination and fold - truncated in this export.]
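The rest of the notebook is cut off in this export. The usual next step after GridSearchCV is to inspect the best combination it found and evaluate it on the held-out test set; a hedged sketch of that follow-up:

# Hedged sketch (not in the original notebook): best hyperparameters, their cross-validated
# accuracy, and the accuracy of the refitted best tree on the test set.
print(tuning_model.best_params_)
print("best CV accuracy:", tuning_model.best_score_)
y_pred_tuned = tuning_model.best_estimator_.predict(X_test)
print("test accuracy:", accuracy_score(y_test, y_pred_tuned))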

