Professional Documents
Culture Documents
You now have had the opportunity to work with a range of supervised machine learning techniques for both
classification and regression. Before you apply these in the project, let's do one more example to see how the
machine learning process works from beginning to end with another popular dataset.
We will start out by reading in the dataset and our necessary libraries. You will then gain an understanding of
how to optimize a number of models using grid searching as you work through the notebook.
In [1]:
import check_file as ch
%matplotlib inline
---------------------------------------------------------------------------
~\AppData\Local\Temp/ipykernel_6488/1324965867.py in <module>
11 sns.set(style="ticks")
12
14
15 get_ipython().run_line_magic('matplotlib', 'inline')
Because this course has been aimed at understanding machine learning techniques, we have largely ignored
items related to parts of the data analysis process that come before building machine learning models -
exploratory data analysis, feature engineering, data cleaning, and data wrangling.
Step 1: Let's do a few steps here. Take a look at some of usual summary statistics calculated to
accurately match the values to the appropriate key in the dictionary below.
In [2]:
In [3]:
diabetes.describe()
Out[3]:
In [4]:
diabetes.isnull().sum()
diabetes.hist();
In [5]:
sns.heatmap(diabetes.corr(), annot=True)
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3755fbf9e8>
In [7]:
Step 2: Since our dataset here is quite clean, we will jump straight into the machine learning.
Our goal here is to be able to predict cases of diabetes. First, you need to identify the y vector
and X matrix. Then, the following code will divide your dataset into training and test data.
In [8]:
y = diabetes['Outcome']
X = diabetes[['Pregnancies','Glucose', 'BloodPressure', 'SkinThickness','Insulin', 'BMI', '
Now that you have a training and testing dataset, we need to create some models that and ultimately find the
best of them. However, unlike in earlier lessons, where we used the defaults, we can now tune these models to
be the very best models they can be.
It can often be difficult (and extremely time consuming) to test all the possible hyperparameter combinations to
find the best models. Therefore, it is often useful to set up a randomized search.
In practice, randomized searches across hyperparameters have shown to be more time confusing, while still
optimizing quite well. One article related to this topic is available here
(https://blog.h2o.ai/2016/06/hyperparameter-optimization-in-h2o-grid-search-random-search-and-the-future/).
The documentation for using randomized search in sklearn can be found here (http://scikit-
learn.org/stable/auto_examples/model_selection/plot_randomized_search.html#sphx-glr-auto-examples-model-
selection-plot-randomized-search-py) and here (http://scikit-
learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html).
In order to use the randomized search effectively, you will want to have a pretty reasonable understanding of
the distributions that best give a sense of your hyperparameters. Understanding what values are possible for
your hyperparameters will allow you to write a grid search that performs well (and doesn't break).
Step 3: In this step, I will show you how to use randomized search, and then you can set up
grid searches for the other models in Step 4. However, you will be helping, as I don't remember
exactly what each of the hyperparameters in SVMs do. Match each hyperparameter to its
corresponding tuning functionality.
In [9]:
# build a classifier
clf_rf = RandomForestClassifier()
Step 4: Now that you have seen how to run a randomized grid search using random forest, try
this out for the AdaBoost and SVC classifiers. You might also decide to try out other classifiers
that you saw earlier in the lesson to see what works best.
In [10]:
In [11]:
Step 5: Use the test below to see if your best model matched, what we found after running the
grid search.
In [12]:
a = 'randomforest'
b = 'adaboost'
c = 'supportvector'
ch.check_best(best_model)
Nice! It looks like your best model matches the best model I found as well!
It makes sense to use f1 score to determine best in this case given the imba
lance of classes. There might be justification for precision or recall bein
g the best metric to use as well - precision showed to be best with adaboost
again. With recall, SVMs proved to be the best for our models.
Once you have found your best model, it is also important to understand why it is performing well. In regression
models where you can see the weights, it can be much easier to interpret results.
Step 6: Despite the fact that your models here are more difficult to interpret, there are some
ways to get an idea of which features are important. Using the "best model" from the previous
question, find the features that were most important in helping determine if an individual would
have diabetes or not. Do your conclusions match what you might have expected during the
exploratory phase of this notebook?
In [13]:
# Show your work here - the plot below was helpful for me
# https://stackoverflow.com/questions/44101458/random-forest-feature-importance-chart-using
features = diabetes.columns[:diabetes.shape[1]]
importances = random_search.best_estimator_.feature_importances_
indices = np.argsort(importances)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance');
sol_seven = {
'The variable that is most related to the outcome of diabetes' :f, # letter here,
'The second most related variable to the outcome of diabetes' :c, # letter here,
'The third most related variable to the outcome of diabetes' :a,
# letter here,
'The fourth most related variable to the outcome of diabetes' :d # letter here
}
ch.check_q_seven(sol_seven)
That's right! Some of these were expected, but some were a bit unexpected t
oo!
Step 8: Now provide a summary of what you did through this notebook, and how you might
explain the results to a non-technical individual. When you are done, check out the solution
notebook by clicking the orange icon in the upper left.
Three advanced modeling techniques were used to predict whether or not a patient has diabetes. The most
successful of these techniques proved to be an AdaBoost Classification technique, which had the following
metrics:
Based on the initial look at the data, it is unsurprising that Glucose, BMI, and Age were important in
understanding if a patient has diabetes. These were consistent with more sophisticated approaches. Interesting
findings were that pregnancy looked to be correlated when initially looking at the data. However, this was likely
due to its large correlation with age.