
Introduction

Welcome
Machine Learning

Machine learning is one of the most exciting recent technologies. You probably use it dozens of times a day without even knowing it.
Every time you use a web search engine like Google to search the internet, one of the reasons it works so well is that a learning algorithm, implemented by Google or Microsoft, has learned how to rank web pages. Every time you use Facebook's or Apple's photo tagging application and it recognizes your friends in photos, that's also machine learning. Every time you read your email and your spam filter saves you from having to wade through tons of spam, that's also a learning algorithm.


There is the AI dream of someday building machines as intelligent as you or me. We're a long way from that goal, but many AI researchers believe that the best way toward it is through learning algorithms that try to mimic how the human brain learns.

So why is machine learning so prevalent today? It turns out that machine learning is a field that grew out of the field of AI.
Machine Learning
- Grew out of work in AI
- New capability for computers
Examples:
- Database mining
  Large datasets from growth of automation/web.
  E.g., web click data, medical records, biology, engineering
- Applications that can't be programmed by hand
  E.g., autonomous helicopter, handwriting recognition, most of Natural Language Processing (NLP), computer vision
- Self-customizing programs
  E.g., Amazon, Netflix product recommendations
- Understanding human learning (brain, real AI)
Introduction

What is machine learning?
Machine Learning
Machine Learning definition

Even among machine learning practitioners, there isn't a well-accepted definition of what is and what isn't machine learning.
Machine Learning
• Arthur Lee Samuel was an American pioneer in the field of computer gaming and artificial intelligence.
• He coined the term "machine learning" in 1959.
• The Samuel Checkers-playing Program was among the world's first successful self-learning programs, and as such a very early demonstration of the fundamental concept of artificial intelligence (AI).
Wikipedia
Machine Learning definition
• Arthur Samuel (1959). Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.
- He wrote a checkers-playing program, and the amazing thing was that Arthur Samuel himself wasn't a very good checkers player.
- What he did was have the program play tens of thousands of games against itself, and by watching which board positions tended to lead to wins and which tended to lead to losses, the checkers-playing program learned over time which board positions are good and which are bad.
- Eventually it learned to play checkers better than Arthur Samuel himself was able to. This was a remarkable result.
Arthur Samuel himself turned out not to be a very good checkers player. But a computer has the patience to play tens of thousands of games against itself; no human has the patience to play that many games. By doing this, the computer gained so much checkers-playing experience that it eventually became a better checkers player than Arthur himself.
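As a rough, hypothetical illustration of that self-play idea (this is not Samuel's actual checkers program), the sketch below plays a made-up toy game against itself many times and tallies how often each position appears in a winning game; positions with a high win fraction are the "good" positions.

# Toy sketch of learning position values from self-play; the "game" here is an
# invented random walk: start at 0, step +/-1, win if you reach +5 before -5.
import random
from collections import defaultdict

wins = defaultdict(int)    # how often a position appeared in a winning game
visits = defaultdict(int)  # how often a position appeared at all

for _ in range(10_000):
    position, trajectory = 0, []
    while abs(position) < 5:
        trajectory.append(position)
        position += random.choice((-1, 1))
    won = position == 5
    for p in set(trajectory):
        visits[p] += 1
        wins[p] += int(won)

# Estimated "goodness" of each position: fraction of games through it that were won.
for p in sorted(visits):
    print(p, round(wins[p] / visits[p], 2))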
Tom M. Mitchell
• Tom Michael Mitchell is an American computer scientist and University Professor at Carnegie Mellon University (CMU).
• He is a former Chair of the Machine Learning Department at CMU.
• Mitchell is known for his contributions to the advancement of machine learning, artificial intelligence, and cognitive neuroscience, and is the author of the textbook Machine Learning.
Wikipedia
Tom Mitchell (1998) Well-posed Learning Problem:
“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”

For the checkers-playing example:
— The experience E would be the experience of having the program play tens of thousands of games against itself.
— The task T would be the task of playing checkers.
— The performance measure P would be the probability that it wins the next game of checkers against some new opponent.

Quiz: Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. What is the task T in this setting?

1) Classifying emails as spam or not spam.
2) Watching you label emails as spam or not spam.
3) The number (or fraction) of emails correctly classified as spam/not spam.
4) None of the above: this is not a machine learning problem.


Machine learning algorithms:
- Supervised learning: teach the computer how to do something
- Unsupervised learning: let it learn by itself
- Others: reinforcement learning, recommender systems

Also talk about: practical advice for applying learning algorithms.
Introduction

Supervised Learning
Machine Learning
Let's say you plot the data set and it looks like this: on the horizontal axis is the size of different houses in square feet, and on the vertical axis is the price of different houses in thousands of dollars.

Given this data, suppose you have a friend who owns a house that is, say, 750 square feet. They are hoping to sell the house and want to know how much they can get for it. So, how can a learning algorithm help?
Housing price prediction
[Figure: scatter plot of housing prices; horizontal axis: Size in feet² (0 to 2500); vertical axis: Price ($) in 1000's (0 to 400)]

Should we fit (1) a straight line through the data, or (2) maybe a quadratic function, a second-order polynomial? How do we choose, and how do we decide?

We gave the algorithm a data set with the "right answers": we told it what the right price is for each house, and we are trying to predict a continuous-valued output.

Supervised Learning: "right answers" given.
Regression: predict continuous-valued output (price).
Either of these fits would be a fine example of a learning algorithm.
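As a minimal regression sketch (assuming NumPy is available; the house sizes, prices, and the 750-square-foot query below are made-up illustrative numbers), the following fits both a straight line and a quadratic to the data by least squares and predicts a price for the friend's house:

# Hypothetical training data: house sizes (square feet) and prices (in $1000s).
import numpy as np

sizes = np.array([500, 750, 1000, 1250, 1500, 2000, 2500], dtype=float)
prices = np.array([110, 150, 190, 220, 245, 300, 360], dtype=float)

# Fit a straight line (degree 1) and a quadratic (degree 2) by least squares.
line_coeffs = np.polyfit(sizes, prices, deg=1)
quad_coeffs = np.polyfit(sizes, prices, deg=2)

# Predict the price of the friend's 750-square-foot house with both models.
query = 750.0
print("linear prediction:   ", np.polyval(line_coeffs, query))
print("quadratic prediction:", np.polyval(quad_coeffs, query))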
Let's say you want to look at medical records and try to predict whether a breast cancer is malignant or benign. If someone discovers a breast tumor, a lump in their breast, a malignant tumor is harmful and dangerous, while a benign tumor is harmless. So obviously, people care a lot about this.

Suppose we have collected a data set: on the horizontal axis is the size of the tumor, and on the vertical axis is one or zero, yes or no, indicating whether the tumors we've seen before were malignant (one) or not malignant, i.e. benign (zero).
Breast cancer (malignant, benign)
[Figure: horizontal axis: Tumor Size; vertical axis: Malignant? 1 (Y) / 0 (N)]

Classification: discrete-valued output (0 or 1).
In classification problems, there is another way to plot this data. In this example we have one feature: Tumor Size.
The machine learning question is: can you estimate the probability, the chance, that a tumor is malignant versus benign?
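A minimal one-feature classification sketch, assuming scikit-learn is installed; the tumor sizes and malignant/benign labels below are made up for illustration. It fits a logistic regression model and estimates the probability that a new tumor is malignant:

# One feature (tumor size), binary label: 0 = benign, 1 = malignant.
import numpy as np
from sklearn.linear_model import LogisticRegression

tumor_size = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
malignant = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(tumor_size, malignant)

# Estimated probability that a 2.7 cm tumor is malignant (class 1).
print(clf.predict_proba([[2.7]])[0, 1])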
Other features: Age, Tumor Size, Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape.

One of the most interesting learning algorithms we'll see in this course is one that can deal with not just two, three, or five features, but an infinite number of features.
Quiz: You're running a company, and you want to develop learning algorithms to address each of two problems.

— Problem 1: You have a large inventory of identical items. You want to predict how many of these items will sell over the next 3 months.
— Problem 2: You'd like software to examine individual customer accounts, and for each account decide if it has been hacked/compromised.

Should you treat these as classification or as regression problems?

1) Treat both as classification problems.
2) Treat problem 1 as a classification problem, problem 2 as a regression problem.
3) Treat problem 1 as a regression problem, problem 2 as a classification problem.
4) Treat both as regression problems.


Statistical Learning Problems
- Identify the risk factors for prostate cancer.
- Classify a recorded phoneme based on a log-periodogram.
- Predict whether someone will have a heart attack on the basis of demographic, diet and clinical measurements.
- Customize an email spam detection system.
- Identify the numbers in a handwritten zip code.
- Classify a tissue sample into one of several cancer classes, based on a gene expression profile.
- Establish the relationship between salary and demographic variables in population survey data.
- Classify the pixels in a LANDSAT image, by usage.
Statistical Learning, Trevor Hastie and Robert Tibshirani. Stanford Online
Philosophy
- It is important to understand the ideas behind the various techniques, in order to know how and when to use them.
- One has to understand the simpler methods first, in order to grasp the more sophisticated ones.
- It is important to accurately assess the performance of a method, to know how well or how badly it is working [simpler methods often perform as well as fancier ones!].
- This is an exciting research area, with important applications in science, industry and finance.
- Statistical learning is a fundamental ingredient in the training of a modern data scientist.
Statistical Learning, Trevor Hastie and Robert Tibshirani. Stanford Online
Introduction

Unsupervised Learning
Machine Learning
Supervised Learning
[Figure: labeled data points plotted on axes x1 and x2]
Unsupervised Learning
[Figure: unlabeled data points plotted on axes x1 and x2]

This is data that doesn't have any labels. Can you find some structure in the data? An unsupervised learning algorithm might decide that the data lives in two different clusters. This is called a clustering algorithm.
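A minimal clustering sketch, assuming scikit-learn is available; the unlabeled two-dimensional points (x1, x2) are synthetic. K-means is used here as one common clustering algorithm, not necessarily the specific one the course covers:

# Generate two blobs of unlabeled points and let k-means find the two clusters.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
cluster_b = rng.normal(loc=[3.0, 3.0], scale=0.3, size=(50, 2))
X = np.vstack([cluster_a, cluster_b])  # unlabeled data: columns are x1, x2

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5])          # cluster assignment of the first few points
print(kmeans.cluster_centers_)     # the two discovered cluster centers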
Unsupervised Learning
• Another important class of problems involves situations in which we only observe input variables, with no corresponding output.
• No outcome variable, just a set of predictors (features) measured on a set of samples.
• The objective is more fuzzy: find groups of samples that behave similarly, find features that behave similarly, find linear combinations of features with the most variation.
• It is difficult to know how well you are doing.
• Different from supervised learning, but can be useful as a pre-processing step for supervised learning.
Statistical Learning, Trevor Hastie and Robert Tibshirani. Stanford Online
• In a marketing setting, we might have demographic information for a number of current or potential customers.
• We may wish to understand which types of customers are similar to each other by grouping individuals according to their observed characteristics.
• This is known as a clustering problem.
• Unlike in the previous examples, here we are not trying to predict an output variable.
An Introduction to Statistical Learning. With applications in R. Springer
Google News

What Google News does is, every day, it goes and looks at tens of thousands or hundreds of thousands of news stories on the web and automatically clusters them together.
[Figure: DNA microarray data (genes × individuals)]

Here's an example of DNA microarray data. The idea is to take a group of different individuals and, for each of them, measure how much they do or do not have a certain gene; technically, you measure how much certain genes are expressed. The colors (red, green, gray, and so on) show the degree to which different individuals do or do not express a specific gene.
[Source: Daphne Koller]
Here's a bunch of data where the different types of people are unknown. Unsupervised learning can automatically find structure in the data and cluster the individuals into types that we don't know in advance.
[Source: Daphne Koller]


Other clustering applications: organizing computing clusters, social network analysis, market segmentation, astronomical data analysis.
Image credit: NASA/JPL-Caltech/E. Churchwell (Univ. of Wisconsin, Madison)
The Cocktail Party Problem
BrainFacts
• The essence of the cocktail party problem can be formulated as a deceptively simple question: "How do we recognize what one person is saying when others are speaking at the same time?"
• Finding answers to this question has been an important goal of human hearing research for several decades.
• At the root of the cocktail party problem is the fact that the human voices present in a noisy social setting often overlap in frequency and in time.
The "Cocktail Party Problem": What Is It? How Can It Be Solved? And Why Should Animal Behaviorists Study It?. NCBI
Cocktail party problem
[Figure: Speaker #1 and Speaker #2 speaking simultaneously, recorded by Microphone #1 and Microphone #2]

Each microphone records a different, overlapping combination of the two speakers' voices.
[Audio demo: the two microphone recordings (Microphone #1, Microphone #2) and the two separated outputs (Output #1, Output #2). Audio clips courtesy of Te-Won Lee.]


Cocktail party problem algorithm

These two microphone recordings are given to an unsupervised learning algorithm called the cocktail party algorithm, and the algorithm tries to find structure in the data. What the cocktail party algorithm does is separate out the two audio sources that were added (summed) together to form the recordings.

The algorithm can be expressed in a single line of Octave code:

[W,s,v] = svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');

[Source: Sam Roweis, Yair Weiss & Eero Simoncelli]
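This kind of blind source separation is commonly done with independent component analysis (ICA). The following is a minimal sketch using scikit-learn's FastICA on two synthetic "voices" mixed into two "microphone" signals; it illustrates the general idea and is not a reimplementation of the one-line Octave solution above:

# Two synthetic source signals, mixed into two observed "microphone" channels,
# then unmixed with FastICA (recovered sources come back up to scale and order).
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 1, 2000)
s1 = np.sin(2 * np.pi * 5 * t)            # "speaker #1": sine wave
s2 = np.sign(np.sin(2 * np.pi * 3 * t))   # "speaker #2": square wave
S = np.c_[s1, s2]                         # shape (n_samples, 2)

A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                # mixing matrix: how each mic hears each speaker
X = S @ A.T                               # the two microphone recordings

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(X)          # estimated sources
print(recovered.shape)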


Quiz: Of the following examples, which would you address using an unsupervised learning algorithm? (Check all that apply.)

1) Given email labeled as spam/not spam, learn a spam filter.
2) Given a set of news articles found on the web, group them into sets of articles about the same story.
3) Given a database of customer data, automatically discover market segments and group customers into different market segments.
4) Given a dataset of patients diagnosed as either having diabetes or not, learn to classify new patients as having diabetes or not.
A Brief History of Statistical Learning
• Though the term statistical learning is fairly new, many of the concepts that underlie the field were developed long ago.
• At the beginning of the nineteenth century, Legendre and Gauss published papers on the method of least squares, which implemented the earliest form of what is now known as linear regression.
• The approach was first successfully applied to problems in astronomy. Linear regression is used for predicting quantitative values, such as an individual's salary. In order to predict qualitative values, such as whether a patient survives or dies, or whether the stock market increases or decreases, Fisher proposed linear discriminant analysis in 1936.
• In the 1940s, various authors put forth an alternative approach, logistic regression.
• In the early 1970s, Nelder and Wedderburn coined the term generalized linear models for an entire class of statistical learning methods that include both linear and logistic regression as special cases.
An Introduction to Statistical Learning. With applications in R. Springer
• By the end of the 1970s, many more techniques for learning from data were available. However, they were almost exclusively linear methods, because fitting non-linear relationships was computationally infeasible at the time.
• By the 1980s, computing technology had finally improved sufficiently that non-linear methods were no longer computationally prohibitive. In the mid-1980s, Breiman, Friedman, Olshen and Stone introduced classification and regression trees, and were among the first to demonstrate the power of a detailed practical implementation of a method, including cross-validation for model selection.
• In recent years, progress in statistical learning has been marked by the increasing availability of powerful and relatively user-friendly software.
An Introduction to Statistical Learning. With applications in R. Springer
