Using Bayes' rule above, we label a new case X with the class
Cj that achieves the highest posterior probability.
• Naive Bayes in scikit-learn: scikit-learn implements
three naive Bayes variants, each based on a different
probability distribution:
• Bernoulli, multinomial, and Gaussian.
• The first one is a binary distribution, useful
when a feature can be present or absent.
• The second one is a discrete distribution
and is used whenever a feature must be
represented by a whole number (for
example, in natural language processing, it
can be the frequency of a term),
• while the third is a continuous distribution
characterized by its mean and variance.
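• A minimal sketch of how the three variants are instantiated (which one to use depends on the feature type, as described above):

from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

bnb = BernoulliNB()    # binary (present/absent) features
mnb = MultinomialNB()  # discrete count features (e.g., term frequencies)
gnb = GaussianNB()     # continuous features, modeled by mean and variance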
Bernoulli naive Bayes
• If X is a random variable and is Bernoulli-
distributed, it can assume only two
values (for simplicity, let's call them 0
and 1) and their probability is:

P(X = x) = p^x (1 − p)^(1 − x), that is, P(X = 1) = p and P(X = 0) = 1 − p
• We're going to generate a dummy dataset.
Bernoulli naive Bayes expects binary feature
vectors; however, the class BernoulliNB has a
binarize parameter, which allows us to specify
a threshold that will be used internally to
transform the features:
from sklearn.datasets import make_classification

>>> nb_samples = 300
>>> X, Y = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0)
• We have decided to use 0.0 as a binary threshold,
so each point can be characterized by the
quadrant where it's located.
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split

>>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)
>>> bnb = BernoulliNB(binarize=0.0)
>>> bnb.fit(X_train, Y_train)
>>> bnb.score(X_test, Y_test)
0.85333333333333339
The score is rather good, but if we want to understand how the
binary classifier worked, it's useful to see how the data has been
internally binarized:
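• A minimal sketch of that internal step, using sklearn.preprocessing.binarize with the same 0.0 threshold (BernoulliNB performs an equivalent transformation internally):

from sklearn.preprocessing import binarize

# every feature value greater than 0.0 becomes 1, the rest become 0
X_binarized = binarize(X_train, threshold=0.0)
print(X_train[:3])
print(X_binarized[:3])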
• Now, checking the naive Bayes predictions, we obtain:
>>> import numpy as np
>>> data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
>>> bnb.predict(data)
array([0, 0, 1, 1])
(Output truncated: the two-column feature array x, with rows such as [108.3, 858.1], and the binary label array y of 0s and 1s used to fit the classifier below.)
Fitting a Support Vector Machine
• Now we'll fit a Support Vector Machine
classifier to these points. While the
mathematical details of the likelihood model
are interesting, we'll leave you to read about
those elsewhere. Instead, we'll just treat the
scikit-learn algorithm as a black box that
accomplishes the above task.
# import support vector classifier
from sklearn.svm import SVC  # "Support Vector Classifier"

clf = SVC(kernel='linear')

# fitting x samples and y classes
clf.fit(x, y)
• After being fitted, the model can then be used
to predict new values:
clf.predict([[120, 990]])
array([ 0.])
clf.predict([[85, 550]])
array([ 1.])
• The model assigns the first point to class 0 and the second to class 1. The fitted classifier defines the optimal separating hyperplane, which we can visualize with matplotlib, as sketched below.
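• A minimal plotting sketch, assuming the fitted clf and the (n, 2) arrays x and y from above; it draws the separating hyperplane (level 0) and the margins (levels ±1):

import numpy as np
import matplotlib.pyplot as plt

# evaluate the decision function on a grid covering the data
xx, yy = np.meshgrid(np.linspace(x[:, 0].min(), x[:, 0].max(), 200),
                     np.linspace(x[:, 1].min(), x[:, 1].max(), 200))
zz = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.scatter(x[:, 0], x[:, 1], c=y)
plt.contour(xx, yy, zz, levels=[-1, 0, 1], linestyles=['--', '-', '--'])
plt.show()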
Kernel Methods and
Nonlinear Classification
• Often we want to capture nonlinear
patterns in the data
• Nonlinear Regression: Input-output
relationship may not be linear
• Nonlinear Classification: Classes may not be
separable by a linear boundary
• Linear models (e.g., linear regression,
linear SVM) are just not rich enough
• Kernels: Make linear models work in
nonlinear settings
• By mapping data to a higher-dimensional space where it
exhibits linear patterns, then applying the linear model
in the new input space
• Mapping ≡ changing the feature representation
• Note: Such mappings can be expensive to
compute in general
• Kernels give such mappings for (almost) free
• In most cases, the mappings need not even
be computed
• .. using the Kernel Trick!
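• A small numeric check of the trick for the quadratic kernel k(x, z) = (x⊤z)² in two dimensions, whose explicit map is φ(x) = (x₁², √2·x₁x₂, x₂²); the example vectors are arbitrary:

import numpy as np

def phi(x):
    # explicit feature map for the quadratic kernel in 2-D
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(np.dot(x, z) ** 2)        # kernel evaluated directly: 1.0
print(np.dot(phi(x), phi(z)))   # dot product in F: also 1.0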
Classifying non-linearly separable data
Let's look at another example:
Kernel k(x, z) takes two inputs and gives their similarity in F space:

φ : X → F
k : X × X → R,  k(x, z) = φ(x)⊤φ(z)

There must exist a Hilbert space F for which k defines a dot product.

Mercer's Condition:

∫∫ f(x) k(x, z) f(z) dx dz ≥ 0   (∀f ∈ L₂)

Equivalently, the kernel (Gram) matrix K is symmetric positive semi-definite.
Some Examples of Kernels
The following are the most popular kernels for real-valued vector inputs.

Linear (trivial) Kernel:
k(x, z) = x⊤z (the mapping function φ is the identity - no mapping)

Quadratic Kernel:
k(x, z) = (x⊤z)² or (1 + x⊤z)²

Polynomial Kernel (of degree d):
k(x, z) = (x⊤z)^d or (1 + x⊤z)^d

Radial Basis Function (RBF) Kernel:
k(x, z) = exp[−γ||x − z||²]

γ is a hyperparameter (also called the kernel bandwidth).
The RBF kernel corresponds to an infinite-dimensional
feature space F (i.e., you can't actually write down the
vector φ(x)).
Note: Kernel hyperparameters (e.g., d, γ) are typically chosen via cross-validation.
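• These kernels map onto scikit-learn's SVC parameters as sketched below (SVC's polynomial kernel is (γ·x⊤z + coef0)^d and its RBF kernel is exp(−γ||x − z||²)); the hyperparameter values are illustrative only:

from sklearn.svm import SVC

svc_linear = SVC(kernel='linear')                              # x^T z
svc_quad = SVC(kernel='poly', degree=2, gamma=1.0, coef0=1.0)  # (1 + x^T z)^2
svc_poly = SVC(kernel='poly', degree=3, gamma=1.0, coef0=1.0)  # (1 + x^T z)^3
svc_rbf = SVC(kernel='rbf', gamma=0.1)                         # exp(-0.1 ||x - z||^2)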
Using Kernels
Kernels can turn a linear model into a nonlinear one
The kernelized (soft-margin) SVM is trained by solving the dual problem:

max over α:  Σₙ αₙ − (1/2) Σₘ Σₙ αₘαₙ yₘyₙ k(xₘ, xₙ)

subject to  Σₙ₌₁ᴺ αₙyₙ = 0,  0 ≤ αₙ ≤ C;  n = 1, . . . , N

Prediction for a new input x then needs only the support vectors:

y = sign( Σₙ∈SV αₙyₙ k(xₙ, x) + b )
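• The support-vector-only prediction can be checked against scikit-learn's fitted attributes: dual_coef_ holds the products αₙyₙ and support_vectors_ holds the xₙ. A sketch, assuming X_train and Y_train from the earlier examples (the gamma value is arbitrary):

import numpy as np
from sklearn.svm import SVC

gamma = 0.5
clf = SVC(kernel='rbf', gamma=gamma)
clf.fit(X_train, Y_train)

# recompute the decision function for one point by summing RBF kernel
# evaluations over the support vectors only
x_new = X_train[:1]
k = np.exp(-gamma * np.sum((clf.support_vectors_ - x_new) ** 2, axis=1))
manual = np.dot(clf.dual_coef_[0], k) + clf.intercept_[0]

print(manual, clf.decision_function(x_new)[0])  # the two values match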
>>> gs.best_estimator_
SVC(C=2.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape=None,
degree=3, gamma='auto', kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
>>> gs.best_score_
0.87
As expected from the geometry of our dataset, the best kernel is a radial
basis function, which yields 87% accuracy.
(In a separate example on a different dataset, the best estimator was
polynomial-based with degree=2, and the corresponding accuracy was 0.96.)
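• The grid search that produced the output above isn't shown; a minimal sketch of a kernel/C search that could yield such a result (the exact grid is an assumption):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = [{'kernel': ['linear', 'poly', 'rbf'],
               'C': [0.5, 1.0, 1.5, 2.0]}]
gs = GridSearchCV(estimator=SVC(), param_grid=param_grid,
                  scoring='accuracy', cv=10)
gs.fit(X, Y)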
Controlled support vector machines
* With real datasets, SVM can extract a very large
number of support vectors to increase accuracy, and
that can slow down the whole process.
* To allow finding out a trade-off between precision
and number of support vectors, scikit-learn provides
an implementation called NuSVC, where the
parameter nu (bounded between 0—not included—
and 1) can be used to control at the same time the
number of support vectors (greater values will
increase their number) and training errors (lower
values reduce the fraction of errors).
• Let's consider an example with a linear kernel and a simple
dataset. In the following figure, there's a scatter plot of our
set:
• Let's start by checking the number of support
vectors for a standard SVM:

from sklearn.svm import SVC, NuSVC
from sklearn.model_selection import cross_val_score

>>> svc = SVC(kernel='linear')
>>> svc.fit(X, Y)
>>> svc.support_vectors_.shape
(242L, 2L)

• Trying NuSVC with its default nu=0.5 first:

>>> nusvc = NuSVC(kernel='linear', nu=0.5)
>>> nusvc.fit(X, Y)
>>> cross_val_score(nusvc, X, Y, scoring='accuracy', cv=10).mean()
0.80633213285314143
• As expected, the behavior is similar to a standard SVC. Let's
now reduce the value of nu:

>>> nusvc = NuSVC(kernel='linear', nu=0.15)
>>> nusvc.fit(X, Y)
>>> nusvc.support_vectors_.shape
(78L, 2L)
>>> cross_val_score(nusvc, X, Y, scoring='accuracy', cv=10).mean()
0.67584393757503003
• In this case, the number of support vectors is
lower than before, but the accuracy has also
suffered from this choice. Instead of trying
different values by hand, we can look for the best
choice with a grid search:
import numpy as np
import multiprocessing
from sklearn.model_selection import GridSearchCV

>>> param_grid = [{'nu': np.arange(0.05, 1.0, 0.05)}]
>>> gs = GridSearchCV(estimator=NuSVC(kernel='linear'), param_grid=param_grid, scoring='accuracy', cv=10, n_jobs=multiprocessing.cpu_count())
>>> gs.fit(X, Y)
GridSearchCV(cv=10, error_score='raise',
    estimator=NuSVC(cache_size=200, class_weight=None, coef0=0.0,
        decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
        max_iter=-1, nu=0.5, probability=False, random_state=None,
        shrinking=True, tol=0.001, verbose=False),
    fit_params={}, iid=True, n_jobs=8,
    param_grid=[{'nu': array([ 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 ,
        0.45, 0.5 , 0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95])}],
    pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
    scoring='accuracy', verbose=0)
>>> gs.best_estimator_
NuSVC(cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
    max_iter=-1, nu=0.5, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

• So the grid search confirms the default nu=0.5 as the best choice for this dataset.
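• Since refit=True (the default), the fitted GridSearchCV object can itself be used for prediction with the best estimator it found:

>>> Y_pred = gs.predict(X)  # equivalent to gs.best_estimator_.predict(X)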