
Machine Learning in Python 2

Dr. Hafeez
Prepare for Modeling
Pre-processing data
• Raw data is often not in the best shape for modeling
• Pre-processing is therefore required
• It exposes the inherent structure in the data to the modeling algorithm
• What does Python offer for pre-processing data?
scikit-learn offers
• Two standard idioms for transforming data (a minimal sketch follows this list)
– Fit and multiple transform
– Combined fit-and-transform
• Techniques to prepare data for modeling
– Standardize numerical data (mean=0 and stdev=1)
• Through scale and center options
– Normalize numerical data (0-1)
• Through range option
– Explore advanced feature engineering
• Binarizing (a sketch follows the standardization example below)
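A minimal sketch of the two idioms, shown here with MinMaxScaler (the 0-1 rescaling transform listed above); the toy array is an assumption for illustration only:

import numpy
from sklearn.preprocessing import MinMaxScaler

data = numpy.array([[1.0, 20.0], [2.0, 40.0], [3.0, 60.0]])

# idiom 1: fit once, then transform (reusable on new data later)
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(data)
rescaled = scaler.transform(data)

# idiom 2: combined fit-and-transform in a single call
rescaled2 = MinMaxScaler().fit_transform(data)

The first idiom is preferred when the same transform must later be applied to unseen data, e.g. scaler.transform(new_data).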
Example: Pima Indians diabetes dataset
• Calculate the parameters needed to standardize the data
• Create a standardized copy of the input data
• Standardize data (mean=0, stdev=1)
• from sklearn.preprocessing import StandardScaler
• import pandas
• import numpy
• url = "https://goo.gl/bDdBiA"
• names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
Code
• dataframe = pandas.read_csv(url, names=names)
• array = dataframe.values
• # separate array into input and output components
• X = array[:,0:8]
• Y = array[:,8]
• scaler = StandardScaler().fit(X)
• rescaledX = scaler.transform(X)
• # summarize transformed data
• numpy.set_printoptions(precision=3)
• print(rescaledX[0:5,:])
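For the binarizing technique listed under advanced feature engineering, a minimal sketch in the same style, reusing X from the code above; the threshold of 0.0 is an assumption for illustration:

from sklearn.preprocessing import Binarizer

# values above the threshold map to 1, values at or below map to 0
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)
print(binaryX[0:5,:])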
 
Resampling Methods: Algorithm Evaluation

• The data split used to train a machine learning algorithm is called the training dataset
Problem:
• Such a split cannot by itself provide reliable estimates of the model's accuracy on new/unseen data
• Yet the whole point of creating the model is to make predictions on new data
Solution:
• Use resampling methods
Resampling Methods: Algorithm Evaluation

• Use statistical methods called resampling methods
• Split your training data into further subsets
• Use some of the subsets for training and the remaining subsets to estimate the accuracy of the model on unseen data
Resampling Methods: Algorithm Evaluation

• In a nutshell:
• Split the dataset into training and test sets
• Estimate accuracy of an ML algorithm using k-fold cross-validation
– Splits the training data into k subsets
• Estimate accuracy of an ML algorithm using leave-one-out cross-validation (a sketch follows the 10-fold example below)
Next, use scikit-learn to estimate the accuracy of logistic regression on the Pima Indians diabetes dataset using 10-fold cross-validation
Evaluate using cross validation
• from pandas import read_csv
• from sklearn.model_selection import KFold
• from sklearn.model_selection import cross_val_score
• from sklearn.linear_model import LogisticRegression
• url = "https://goo.gl/bDdBiA"
• names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
Evaluate using cross validation
• dataframe = read_csv(url, names=names)
• array = dataframe.values
• X = array[:,0:8]
• Y = array[:,8]
• # shuffle=True is required when random_state is set in recent scikit-learn
• kfold = KFold(n_splits=10, shuffle=True, random_state=7)
• model = LogisticRegression(solver='liblinear')
• results = cross_val_score(model, X, Y, cv=kfold)
• print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
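The slides above also mention leave-one-out cross-validation; a minimal sketch of that variant, reusing X, Y and model from the code above:

from sklearn.model_selection import LeaveOneOut

# leave-one-out: each sample is held out once as the test set
loocv = LeaveOneOut()
results = cross_val_score(model, X, Y, cv=loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))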
Algorithm evaluation metrics
• Metrics for evaluating ML algorithms in the scikit-learn library
– Specified via the scoring parameter of cross_val_score()
– Sensible defaults exist for regression and classification problems
• Practice the accuracy and kappa metrics on a classification problem
• Practice generating a confusion matrix and a classification report (a sketch follows this list)
• Practice the RMSE and R^2 metrics on a regression problem
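A minimal sketch of the confusion matrix and classification report mentioned above; the train/test split (rather than cross-validation) and the 33% test size are assumptions, since confusion_matrix needs a single set of predictions:

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

url = "https://goo.gl/bDdBiA"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

# hold out a third of the data for testing (assumed split ratio)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
model = LogisticRegression(solver='liblinear')
model.fit(X_train, Y_train)
predicted = model.predict(X_test)
print(confusion_matrix(Y_test, predicted))
print(classification_report(Y_test, predicted))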
Algorithm evaluation metrics
• Calculate the LogLoss metric on the Pima Indians onset of diabetes dataset
• from pandas import read_csv
• from sklearn.model_selection import KFold
• from sklearn.model_selection import cross_val_score
• from sklearn.linear_model import LogisticRegression
• url = "https://goo.gl/bDdBiA"
• names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
• dataframe = read_csv(url, names=names)
• array = dataframe.values
• X = array[:,0:8]
• Y = array[:,8]
Algorithm evaluation metrics
• # shuffle=True is required when random_state is set
• kfold = KFold(n_splits=10, shuffle=True, random_state=7)
• model = LogisticRegression(solver='liblinear')
• scoring = 'neg_log_loss'
• results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
• print("Logloss: %.3f (%.3f)" % (results.mean(), results.std()))
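Note that scikit-learn reports this metric as neg_log_loss (log loss multiplied by -1), so the printed mean is negative; values closer to 0 indicate better-calibrated probability predictions.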
Spot-Check ML Algorithm
• It is difficult to know beforehand which ML algorithm will perform best on the data
• Trial and error, also called spot-checking
• The scikit-learn library provides tools to compare the estimated accuracy of these algorithms
• Spot-check linear algorithms on a dataset
– Linear regression, logistic regression and linear discriminant analysis
• Spot-check non-linear algorithms on a dataset
– kNN, SVM and CART
• Spot-check sophisticated ensemble algorithms on a dataset
– Random forest and stochastic gradient boosting
Spot-checking example
• k-Nearest Neighbors algorithm on the Boston House Price dataset
• # kNN Regression
• from pandas import read_csv
• from sklearn.model_selection import KFold
• from sklearn.model_selection import cross_val_score
• from sklearn.neighbors import KNeighborsRegressor
• url = "https://goo.gl/FmJUSM"
• names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
Spot-checking example
• k-Nearest Neighbors algorithm on the Boston House Price dataset
• dataframe = read_csv(url, delim_whitespace=True, names=names)
• array = dataframe.values
• X = array[:,0:13]
• Y = array[:,13]
• kfold = KFold(n_splits=10, shuffle=True, random_state=7)
• model = KNeighborsRegressor()
• scoring = 'neg_mean_squared_error'
• results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
• print(results.mean())
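The neg_mean_squared_error scores are negated MSE values, so the printed mean is negative. A small follow-up, assuming the results array from above, to recover the RMSE mentioned earlier:

import numpy
# per-fold RMSE: negate the scores back to MSE, then take the square root
rmse = numpy.sqrt(-results)
print(rmse.mean())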
Model comparison and selection
• Next, compare the estimated performance of different algorithms and select the best one
• Compare linear algos with each other for a
given dataset
• Compare non-linear algos with each other for
a given dataset
• Create plots of the results comparing algorithms (a sketch follows the example code)
Model comparison and selection
• The example shows logistic regression and linear discriminant analysis on the Pima Indians diabetes dataset
• # Compare Algorithms
• from pandas import read_csv
• from sklearn.model_selection import KFold
• from sklearn.model_selection import cross_val_score
• from sklearn.linear_model import LogisticRegression
• from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
• # load dataset
• url = "https://goo.gl/bDdBiA"
• names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
Model comparison and selection
• dataframe = read_csv(url, names=names)
• array = dataframe.values
• X = array[:,0:8]
• Y = array[:,8]
• # prepare models
• models = []
• models.append(('LR', LogisticRegression(solver='liblinear')))
• models.append(('LDA', LinearDiscriminantAnalysis()))
• # evaluate each model in turn
• results = []
• names = []
• scoring = 'accuracy'
Model comparison and selection
• for name, model in models:
– kfold = KFold(n_splits=10, shuffle=True, random_state=7)
– cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
– results.append(cv_results)
– names.append(name)
– msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
– print(msg)
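A minimal sketch of the comparison plot mentioned earlier, drawn from the results and names lists collected in the loop above; the matplotlib boxplot is an assumption, as the original slides do not show the plotting code:

import matplotlib.pyplot as plt

# box-and-whisker plot of the per-fold accuracy scores for each algorithm
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
ax.boxplot(results)
ax.set_xticklabels(names)
plt.show()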
Algorithm Tuning
