
Anomaly detection
Problem motivation

Machine Learning

Anomaly detection example

Aircraft engine features:
  x1 = heat generated
  x2 = vibration intensity
  ...

Dataset: {x^(1), x^(2), ..., x^(m)}
New engine: x_test

[Scatter plot of the dataset with x1 (heat) on the horizontal axis and x2 (vibration) on the vertical axis; the new engine is plotted against the existing examples.]

Density estimation

Dataset: {x^(1), x^(2), ..., x^(m)}
Is x_test anomalous?

Model p(x) from the data:
  p(x_test) < ε  →  flag anomaly
  p(x_test) ≥ ε  →  OK

[Scatter plot of the dataset, x1 (heat) vs. x2 (vibration); points far from the bulk of the data are flagged as anomalies.]

Anomaly detection example

Fraud detection:
  x^(i) = features of user i's activities
  Model p(x) from data.
  Identify unusual users by checking which have p(x) < ε.

Manufacturing

Monitoring computers in a data center:
  x^(i) = features of machine i
  x1 = memory use, x2 = number of disk accesses/sec,
  x3 = CPU load, x4 = CPU load/network traffic.
  ...
Anomaly detection
Gaussian distribution

Machine Learning

Gaussian (Normal) distribution

Say x ∈ ℝ. If x is distributed Gaussian with mean μ and variance σ², we write x ~ N(μ, σ²)
(the "~" is read "distributed as"; σ is the standard deviation).

p(x; μ, σ²) = (1 / (√(2π) σ)) exp( −(x − μ)² / (2σ²) )
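As a quick illustration, here is a minimal sketch of this density in NumPy; the function name gaussian_pdf and the example values are just for illustration, not part of the lecture.

import numpy as np

def gaussian_pdf(x, mu, sigma2):
    # Univariate Gaussian density p(x; mu, sigma^2).
    return (1.0 / np.sqrt(2.0 * np.pi * sigma2)) * np.exp(-(x - mu) ** 2 / (2.0 * sigma2))

# Example: density of the standard normal N(0, 1) at a few points.
print(gaussian_pdf(np.array([-1.0, 0.0, 1.0]), mu=0.0, sigma2=1.0))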


Gaussian distribution example

The area under the curve (the shaded area) always integrates to one; that is a basic property of probability distributions.


Parameter estimation

Dataset: {x^(1), x^(2), ..., x^(m)},  x^(i) ∈ ℝ

μ  = (1/m) Σ_{i=1}^{m} x^(i)
σ² = (1/m) Σ_{i=1}^{m} (x^(i) − μ)²

Given a data set like this, the Gaussian distribution the data came from might be estimated as roughly the Gaussian with μ at the center of the data and σ (the standard deviation) controlling the width of the bell curve. Under that Gaussian the data has a very high probability of lying in the central region and a low probability of lying further out, so this seems like a reasonable estimate of μ and σ², and a reasonable fit to the data.
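A minimal sketch of these two estimates in NumPy; the array x below is hypothetical 1-D data. Note the 1/m (not 1/(m−1)) normalization, matching the formulas above.

import numpy as np

# Hypothetical 1-D dataset x^(1), ..., x^(m).
x = np.array([4.9, 5.1, 5.0, 4.8, 5.3, 5.2, 4.7, 5.0])

m = len(x)
mu = x.mean()                      # mu = (1/m) * sum_i x^(i)
sigma2 = ((x - mu) ** 2).mean()    # sigma^2 = (1/m) * sum_i (x^(i) - mu)^2
# Equivalently, np.var(x) uses the same 1/m normalization by default.

print(mu, sigma2)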

Anomaly detection
Algorithm

Machine Learning

The problem of estimating the distribution p(x) is sometimes called the problem of density estimation.

Anomaly detection algorithm

1.  Choose features x_j that you think might be indicative of anomalous examples.
2.  Fit parameters μ_1, ..., μ_n, σ_1², ..., σ_n²:
      μ_j  = (1/m) Σ_{i=1}^{m} x_j^(i)
      σ_j² = (1/m) Σ_{i=1}^{m} (x_j^(i) − μ_j)²
3.  Given a new example x, compute p(x):
      p(x) = Π_{j=1}^{n} p(x_j; μ_j, σ_j²)
           = Π_{j=1}^{n} (1 / (√(2π) σ_j)) exp( −(x_j − μ_j)² / (2σ_j²) )

    Anomaly if p(x) < ε
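A compact sketch of these three steps for an n-feature dataset, assuming NumPy; X_train, x_new, and the threshold epsilon are placeholders you would supply (the made-up numbers below are only for illustration).

import numpy as np

def fit_gaussian_params(X):
    # Step 2: per-feature mu_j and sigma_j^2 from an (m, n) training matrix.
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)          # 1/m normalization, matching the slide
    return mu, sigma2

def p(x, mu, sigma2):
    # Step 3: p(x) = product over features of the univariate Gaussian densities.
    densities = (1.0 / np.sqrt(2.0 * np.pi * sigma2)) * np.exp(-(x - mu) ** 2 / (2.0 * sigma2))
    return np.prod(densities)

# Illustrative usage with made-up numbers.
X_train = np.random.RandomState(0).normal(loc=[5.0, 3.0], scale=[1.0, 0.5], size=(1000, 2))
mu, sigma2 = fit_gaussian_params(X_train)

epsilon = 1e-3                      # threshold, e.g. chosen on a cross validation set
x_new = np.array([9.0, 1.0])        # hypothetical new example
print("anomaly" if p(x_new, mu, sigma2) < epsilon else "ok")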

[Plot of p(x) over the two features: the high-p(x) region is labeled "OK" and the low-p(x) region is labeled "Anomaly".]

Anomaly detection example

[Plot of the fitted p(x): examples near the center of the data have high probability; examples far out have low probability.]


Anomaly detection
Developing and evaluating an anomaly detection system

Machine Learning

The importance of real-number evaluation

When developing a learning algorithm (choosing features, etc.), making decisions is much easier if we have a way of evaluating our learning algorithm.

Assume we have some labeled data of anomalous and non-anomalous examples (y = 0 if normal, y = 1 if anomalous).

Training set: x^(1), x^(2), ..., x^(m)  (assume these are normal, i.e. not anomalous, examples)
Cross validation set: (x_cv^(1), y_cv^(1)), ..., (x_cv^(m_cv), y_cv^(m_cv))
Test set: (x_test^(1), y_test^(1)), ..., (x_test^(m_test), y_test^(m_test))


Aircraft engines motivating example

10000 good (normal) engines
20    flawed engines (anomalous)

Training set: 6000 good engines
CV:   2000 good engines (y = 0), 10 anomalous (y = 1)
Test: 2000 good engines (y = 0), 10 anomalous (y = 1)

Alternative (less recommended; reusing the same examples for both the CV and test sets is not good ML practice):
Training set: 6000 good engines
CV:   4000 good engines (y = 0), 10 anomalous (y = 1)
Test: the same 4000 good engines (y = 0), the same 10 anomalous (y = 1)

Algorithm evaluation

Fit the model p(x) on the training set {x^(1), ..., x^(m)}.
On a cross validation/test example (x, y), predict
  y = 1 if p(x) < ε (anomaly)
  y = 0 if p(x) ≥ ε (normal)

Possible evaluation metrics (these are ways to evaluate an anomaly detection algorithm on your cross validation set or test set):
 - True positive, false positive, false negative, true negative
 - Precision/Recall
 - F1-score

Can also use the cross validation set to choose the parameter ε.

Try many different values of ε, and then pick the value of ε that, say, maximizes the F1-score, or that otherwise does well on your cross validation set.
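A sketch of that selection loop, assuming NumPy; p_cv is a placeholder for the densities p(x) already computed on the cross validation set (e.g. with the p function from the algorithm sketch above) and y_cv for the 0/1 labels.

import numpy as np

def f1_score(y_true, y_pred):
    # F1 = 2 * precision * recall / (precision + recall), with y = 1 meaning anomaly.
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_epsilon(p_cv, y_cv):
    # Try many candidate epsilons; keep the one with the best F1 on the CV set.
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        preds = (p_cv < eps).astype(int)     # predict anomaly when p(x) < epsilon
        f1 = f1_score(y_cv, preds)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1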
Anomaly detection
Anomaly detection vs. supervised learning

Machine Learning

Anomaly detection vs. supervised learning

Anomaly detection:
 - Very small number of positive examples (y = 1). (0-20 is common.)
 - Large number of negative (y = 0) examples.
 - Many different "types" of anomalies. Hard for any algorithm to learn from the positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we've seen so far.

Supervised learning:
 - Large number of positive and negative examples.
 - Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to the ones in the training set.

Anomaly detection vs. supervised learning

Anomaly detection:
 - Fraud detection
 - Manufacturing (e.g. aircraft engines)
 - Monitoring machines in a data center

Supervised learning:
 - Email spam classification
 - Weather prediction (sunny/rainy/etc.)
 - Cancer classification

(If you have equal numbers of positive and negative examples, then we would tend to treat all of these as supervised learning problems.)


For many other problems faced by various technology companies and so on, we actually are in settings where we have very few, or sometimes zero, positive training examples. There are so many different types of anomalies that we've never seen them before. For those sorts of problems, very often the algorithm that is used is an anomaly detection algorithm.
Anomaly detection
Choosing what features to use

Machine Learning

This looks vaguely Gaussian.

If this is what the data looks like, what is often done is to play with different transformations of the data in order to make it look more Gaussian. The algorithm will usually work okay even if you don't, but if you use these transformations to make your data more Gaussian, it might work a bit better.

All of these are examples of transformation parameters that you can play with in order to make your data look a little bit more Gaussian.

To summarize: if you plot a histogram of the data and find that it looks pretty non-Gaussian, it's worth playing around a little bit with different transformations like these, to see if you can make your data look a little bit more Gaussian before you feed it to your learning algorithm.
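A small sketch of that kind of exploration, assuming NumPy and Matplotlib; the skewed feature is synthetic, and the log(1 + x) transform is just one example of the transformations the lecture has in mind.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=5000)   # hypothetical skewed (non-Gaussian) feature

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(x, bins=50)
ax1.set_title("raw feature (non-Gaussian)")

ax2.hist(np.log1p(x), bins=50)              # log(1 + x): one transformation to try
ax2.set_title("log(1 + x) (closer to Gaussian)")

plt.tight_layout()
plt.show()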

Non-Gaussian features

How do you come up with features for an anomaly detection algorithm? The process is similar to error analysis for supervised learning: train a complete algorithm, run it on a cross validation set, look at the examples it gets wrong, and see if you can come up with extra features that help the algorithm do better on the examples it got wrong in the CV set.
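One way to do that inspection, sketched with NumPy; X_cv, y_cv, p_cv, and epsilon are placeholders for your cross validation examples, their labels, the computed densities, and the chosen threshold.

import numpy as np

def cv_error_analysis(X_cv, y_cv, p_cv, epsilon):
    # Return the CV examples the detector gets wrong, for manual inspection.
    preds = (p_cv < epsilon).astype(int)
    missed_anomalies = X_cv[(y_cv == 1) & (preds == 0)]   # anomalies with deceptively high p(x)
    false_alarms = X_cv[(y_cv == 0) & (preds == 1)]       # normal examples with low p(x)
    return missed_anomalies, false_alarms

# Inspecting these examples often suggests an extra feature that takes on an
# unusually large or small value on the anomalies the current model misses.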

Monitoring computers in a data center

Choose features that might take on unusually large or small values in the event of an anomaly.

x1 = memory use of computer
x2 = number of disk accesses/sec
x3 = CPU load
x4 = network traffic
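As a sketch of this idea (assuming NumPy, with purely illustrative column indices and numbers), one could add a combined feature like the CPU-load/network-traffic ratio mentioned on the earlier data-center slide, since that ratio becomes unusually large when a machine is busy but serving little traffic.

import numpy as np

# Hypothetical feature matrix: columns are x1=memory, x2=disk accesses/sec,
# x3=CPU load, x4=network traffic (one row per machine per time window).
X = np.array([
    [0.62, 110.0, 0.30, 48.0],
    [0.58,  95.0, 0.95,  2.0],   # busy CPU but almost no traffic -- suspicious
    [0.60, 120.0, 0.35, 52.0],
])

cpu = X[:, 2]
net = X[:, 3]
x5 = cpu / net                    # new feature: CPU load / network traffic
X_aug = np.column_stack([X, x5])  # refit the per-feature Gaussians on the augmented data

print(x5)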
