
CH 1:

INTRODUCTION TO
MACHINE LEARNING
What is ML?

 Learning algorithms are useful in many tasks
1. Classification: Determine which discrete category the example belongs to
2. Recognizing patterns: Speech recognition, facial identity, etc.
3. Recommender systems: Noisy data, commercial pay-off (e.g., Amazon, Netflix)
4. Information retrieval: Find documents or images with similar content
5. Computer vision: detection, segmentation, depth estimation, optical flow, etc.
6. Robotics: perception, planning, etc.
Other
7. Recognizing anomalies: Unusual sequences of credit card transactions, panic situation at an airport
8. Spam filtering, fraud detection: The enemy adapts so we must adapt too
9. Many more! Any …
Big Data
 Widespread use of personal computers, wireless communication, and the IoT leads to “big data”

 We are both producers and consumers of data

 Data is not random, it has structure, e.g., customer behavior

 We need “big theory” to extract that structure from data for
(a) understanding the process, and
(b) making predictions for the future
Big Data Sources and Applications
 Retail: Market basket analysis, Customer relationship
management (CRM)

 Finance: Credit scoring, fraud detection

 Manufacturing: Control, robotics

 Medicine: Medical diagnosis

 Telecommunications: Spam filters, intrusion detection

 Bioinformatics: DNA, biological data

 Web mining: Search engines


 ...
ML & Data Mining
Data is what we collect and store, and knowledge is what helps us
to make informed decisions.

The extraction of knowledge from data is called data mining.

Data mining can also be defined as the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules.

The ultimate goal of data mining is to discover knowledge.


Why Data Mining and Not Traditional Data Analysis?

 The Explosive Growth of Data: from terabytes to petabytes


 Data collection and data availability
◼ Automated data collection tools, database systems, Web, computerized society

 Major sources of abundant data


◼ Business: Web, e-commerce, transactions, stocks, …
◼ Science: Remote sensing, bioinformatics, scientific simulation, …
◼ Society and everyone: news, digital cameras, YouTube

 We are drowning in data, but starving for knowledge!

 Tremendous amount of data


 Algorithms must be highly scalable to handle terabytes of data
 High-dimensionality of data
 Micro-array may have tens of thousands of dimensions
 High complexity of data

 Data streams and sensor data


 Time-series data, temporal data, sequence data
 Structured data, graphs, social networks and multi-linked data
 Heterogeneous databases and legacy databases
 Spatial, spatiotemporal, multimedia, text and Web data
 Software programs, scientific simulations
Knowledge Discovery (KDD) Process

– Data mining is the core of the knowledge discovery process

Databases → Data Cleaning → Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation
KDD Process: Several Key Steps

Learning the application domain
– relevant prior knowledge and goals of the application
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of the effort!)
Data reduction and transformation
– find useful features, dimensionality/variable reduction, invariant representation
KDD Process: Several Key Steps
 Choosing functions of data mining
 summarization, classification, regression, association,
clustering
 Choosing the mining algorithm(s)
 Data mining: search for patterns of interest
 Pattern evaluation and knowledge presentation
 visualization, transformation, removing redundant
patterns, etc.
 Use of discovered knowledge
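The cleaning and selection steps above can be sketched as small transformations over records; a minimal sketch in which the field names and values are hypothetical, for illustration only:

```python
# Minimal sketch of KDD preprocessing: cleaning (fill missing values)
# and selection (keep only task-relevant fields).

def clean(records, field, default):
    """Data cleaning: replace missing values of `field` with a default."""
    return [{**r, field: r.get(field) if r.get(field) is not None else default}
            for r in records]

def select(records, fields):
    """Selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

raw = [
    {"age": 52, "income": 31000, "smoker": None},
    {"age": 34, "income": None,  "smoker": "no"},
]
cleaned = clean(clean(raw, "smoker", "unknown"), "income", 0)
target = select(cleaned, ["age", "income"])
```

In a real pipeline these steps would run against a data warehouse rather than an in-memory list, but the shape of the transformations is the same.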
Data Mining: Confluence of Multiple Disciplines

Database Technology · Statistics · Machine Learning · Visualization · Pattern Recognition · Algorithms · Other Disciplines
Architecture: Typical Data Mining System

Graphical User Interface
Pattern Evaluation
Data Mining Engine (both consult the Knowledge Base)
Database or Data Warehouse Server: data cleaning, integration, and selection
Data sources: Database, Data Warehouse, World-Wide Web, Other Info Repositories
Data warehouse
Modern organisations must respond quickly to
any change in the market. This requires rapid
access to current data normally stored in
operational databases.

However, an organisation must also determine which trends are relevant. This task is accomplished with access to historical data that are stored in large databases called data warehouses.
The main characteristic of a data warehouse
is its capacity. A data warehouse is really big
– it includes millions, even billions, of data
records.
The data stored in a data warehouse is
– time dependent: linked together by the times of recording, and
– integrated: all relevant information from the operational databases is combined and structured in the warehouse.
How is data mining applied in practice?
Many companies use data mining today, but
refuse to talk about it.
In direct marketing, data mining is used for
targeting people who are most likely to buy
certain products and services.
In trend analysis, it is used to determine trends in
the marketplace, for example, to model the stock
market. In fraud detection, data mining is used
to identify insurance claims, cellular phone calls
and credit card purchases that are most likely to
be fraudulent.
Data mining tools
Data mining is based on intelligent
technologies already discussed. It often
applies such tools as neural networks and
neuro-fuzzy systems.

However, the most popular tool used for data mining is a decision tree.
Example: Decision trees
A decision tree can be defined as a map of
the reasoning process. It describes a data set
by a tree-like structure.

Decision trees are particularly good at solving classification problems.
A decision tree consists of nodes, branches and
leaves.
The top node is called the root node. The tree
always starts from the root node and grows down
by splitting the data at each level into new nodes.
The root node contains the entire data set (all data
records), and child nodes hold respective subsets
of that set.
All nodes are connected by branches.
Nodes that are at the end of branches are called
terminal nodes, or leaves.
An example of a decision tree

Household: responded 112, not responded 888, total 1000
└─ Homeownership?
   ├─ Yes: responded 9, not responded 334, total 343
   └─ No: responded 103, not responded 554, total 657
      └─ Household Income?
         ├─ ≤ $20,700: responded 14, not responded 158, total 172
         └─ ≥ $20,701: responded 89, not responded 396, total 485
            └─ Savings Accounts?
               ├─ Yes: responded 86, not responded 188, total 274
               └─ No: responded 3, not responded 208, total 211
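The example tree can be read as a set of nested rules; a minimal sketch, where the function and parameter names are invented and the leaf values are the response rates (responded / total) from the figure:

```python
# Walking the mailing-response decision tree as nested if/else rules.
# Leaf values are responded/total counts from the example figure.

def predict_response_rate(homeowner, income, savings_account):
    if homeowner:
        return 9 / 343          # Homeownership = Yes
    if income <= 20700:
        return 14 / 172         # No homeowner, income <= $20,700
    if savings_account:
        return 86 / 274         # No homeowner, higher income, has savings
    return 3 / 211              # No homeowner, higher income, no savings
```

Reading the tree this way shows why decision trees are called a "map of the reasoning process": each root-to-leaf path is one human-readable rule.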
Case study:
Profiling people with high blood
pressure
A typical task for decision trees is to
determine conditions that may lead to
certain outcomes.

Blood pressure can be categorised as optimal, normal or high. Optimal pressure is below 120/80, normal is between 120/80 and 130/85, and hypertension is diagnosed when blood pressure is over 140/90.
A data set for a hypertension study
Community Health Survey: Hypertension Study (California, U.S.A.)
Gender  Male
Female
Age 18 – 34 years
35 – 50 years
 51 – 64 years
65 or more years
Race  Caucasian
African American
Hispanic
Asian or Pacific Islander
Marital Status Married
Separated
 Divorced
Widowed
Never Married
Household Income Less than $20,700
$20,701 − $45,000
 $45,001 − $75,000
$75,001 and over
A data set for a hypertension study
(continued)
Community Health Survey: Hypertension Study (California, U.S.A.)
Alcohol Consumption Abstain from alcohol
Occasional (a few drinks per month)
 Regular (one or two drinks per day)
Heavy (three or more drinks per day)
Smoking Nonsmoker
1 – 10 cigarettes per day
 11 – 20 cigarettes per day
More than one pack per day
Caffeine Intake Abstain from coffee
 One or two cups per day
Three or more cups per day
Salt Intake Low-salt diet
 Moderate-salt diet
High-salt diet
Physical Activities None
 One or two times per week
Three or more times per week
Weight  93 kg
Height  170 cm

Blood Pressure Optimal


Normal
 High
Data cleaning
Decision trees are as good as the data they
represent. Unlike neural networks and
fuzzy systems, decision trees do not
tolerate noisy and polluted data. Therefore,
the data must be cleaned before we can
start data mining.

We might find that such fields as Alcohol Consumption or Smoking have been left blank or contain incorrect information.
Data enriching
From such variables as weight and height we
can easily derive a new variable, obesity.
This variable is calculated with a body-
mass index (BMI), that is, the weight in
kilograms divided by the square of the
height in metres. Men with BMIs of 27.8 or
higher and women with BMIs of 27.3 or
higher are classified as obese.
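The enrichment rule above can be written directly; a minimal sketch using the BMI formula and the obesity thresholds given in the text:

```python
def bmi(weight_kg, height_m):
    """Body-mass index: weight in kilograms divided by the square of
    the height in metres."""
    return weight_kg / height_m ** 2

def is_obese(weight_kg, height_m, gender):
    """Thresholds from the text: BMI >= 27.8 for men, >= 27.3 for women."""
    threshold = 27.8 if gender == "male" else 27.3
    return bmi(weight_kg, height_m) >= threshold
```

For example, a man weighing 93 kg at 1.70 m has a BMI of about 32.2 and would be recorded as Obese in the derived variable.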
A data set for a hypertension study
(continued)
Community Health Survey: Hypertension Study (California, U.S.A.)
Obesity  Obese
Not Obese
Growing a decision tree

Blood Pressure (root): optimal 319 (32%), normal 528 (53%), high 153 (15%), total 1000
└─ Age?
   ├─ 18–34 years: optimal 88 (56%), normal 64 (41%), high 5 (3%), total 157
   ├─ 35–50 years: optimal 208 (35%), normal 340 (57%), high 48 (8%), total 596
   ├─ 51–64 years: optimal 21 (12%), normal 90 (52%), high 62 (36%), total 173
   └─ 65 or more years: optimal 2 (3%), normal 34 (46%), high 38 (51%), total 74
Growing a decision tree (continued)

51–64 years: optimal 21 (12%), normal 90 (52%), high 62 (36%), total 173
└─ Obesity?
   ├─ Obese: optimal 3 (3%), normal 53 (49%), high 51 (48%), total 107
   └─ Not Obese: optimal 18 (27%), normal 37 (56%), high 11 (17%), total 66
Growing a decision tree (continued)

Obese: optimal 3 (3%), normal 53 (49%), high 51 (48%), total 107
└─ Race?
   ├─ Caucasian: optimal 2 (5%), normal 24 (55%), high 17 (40%), total 43
   ├─ African American: optimal 0 (0%), normal 13 (35%), high 24 (65%), total 37
   ├─ Hispanic: optimal 0 (0%), normal 11 (58%), high 8 (42%), total 19
   └─ Asian: optimal 1 (12%), normal 5 (63%), high 2 (25%), total 8
Solution space of the hypertension
study
The solution space is first divided into four
rectangles by age, then age group 51-64 is
further divided into those who are
overweight and those who are not. And
finally, the group of obese people is divided
by race.
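The splitting step that grows the tree can be sketched as follows: at each node, try every candidate attribute and keep the one whose partition gives the lowest weighted entropy of the class label. This is a common criterion (as in ID3-style trees), not necessarily the exact one used in the study; the toy records below are invented, not the survey data:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def split_score(records, attr, label="bp"):
    """Weighted entropy of the label after splitting on `attr` (lower is better)."""
    groups = {}
    for r in records:
        groups.setdefault(r[attr], []).append(r[label])
    n = len(records)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def best_split(records, attrs):
    return min(attrs, key=lambda a: split_score(records, a))

data = [
    {"age": "51-64", "obese": True,  "bp": "high"},
    {"age": "51-64", "obese": True,  "bp": "high"},
    {"age": "51-64", "obese": False, "bp": "high"},
    {"age": "18-34", "obese": True,  "bp": "optimal"},
    {"age": "18-34", "obese": False, "bp": "optimal"},
    {"age": "18-34", "obese": False, "bp": "optimal"},
]
```

On these toy records splitting by age yields pure groups (weighted entropy 0), so age is chosen first, mirroring how Age was the first split in the hypertension tree.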
Hypertension study: forcing a split

Blood Pressure (root): optimal 319 (32%), normal 528 (53%), high 153 (15%), total 1000
└─ Age?
   ├─ 18–34 years: optimal 88 (56%), normal 64 (41%), high 5 (3%), total 157
   ├─ 35–50 years: optimal 208 (35%), normal 340 (57%), high 48 (8%), total 596
   │  └─ Gender?
   │     ├─ Male: optimal 111 (36%), normal 168 (55%), high 28 (9%), total 307
   │     └─ Female: optimal 97 (34%), normal 172 (59%), high 20 (7%), total 289
   ├─ 51–64 years: optimal 21 (12%), normal 90 (52%), high 62 (36%), total 173
   │  └─ Gender?
   │     ├─ Male: optimal 11 (13%), normal 48 (56%), high 27 (31%), total 86
   │     └─ Female: optimal 10 (12%), normal 42 (48%), high 35 (40%), total 87
   └─ 65 or more years: optimal 2 (3%), normal 34 (46%), high 38 (51%), total 74
What is Machine Learning?

 Optimize a performance criterion using example data or past experience.
 Role of Statistics: Inference from a sample
 Role of Computer science: Efficient algorithms to
  Solve the optimization problem
  Represent and evaluate the model for inference
Supervised Learning
 To learn an unknown target function f
 Input: a training set of labeled examples (xj,yj)
where yj = f(xj)
◼ E.g., xj is an image, f(xj) is the label “giraffe”
◼ E.g., xj is a seismic signal, f(xj) is the label “explosion”

 Output: hypothesis h that is “close” to f, i.e., predicts


well on unseen examples (“test set”)
 Many possible hypothesis families for h
 Linear models, logistic regression, neural networks, decision
trees, examples (nearest-neighbor), etc
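One of the simplest hypothesis families listed above is nearest-neighbor: predict the label of the closest training example. A minimal sketch on hypothetical 2-D training pairs (x_j, y_j), using the seismic-signal labels from the slide as placeholder classes:

```python
# 1-nearest-neighbor hypothesis h: the label of the closest training point.

def nearest_neighbor(train, x):
    """train: list of ((features...), label) pairs; x: a feature tuple."""
    def dist2(a, b):
        # squared Euclidean distance (no sqrt needed for comparisons)
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(train, key=lambda pair: dist2(pair[0], x))[1]

train = [((0.0, 0.0), "explosion"), ((1.0, 1.0), "earthquake")]
```

Generalization to unseen examples (the "test set") depends entirely on how representative the training pairs are, which is why h is only "close" to f.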
Classification

 Example: Credit scoring
 Differentiating between low-risk and high-risk customers from their income and savings

Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
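The discriminant above translates directly into code. In practice the thresholds θ1 and θ2 would be learned from labeled customer data; the values below are invented placeholders:

```python
# Credit-scoring discriminant: IF income > θ1 AND savings > θ2
# THEN low-risk ELSE high-risk. Threshold values are assumptions.

THETA1 = 30000   # income threshold (hypothetical, normally learned)
THETA2 = 10000   # savings threshold (hypothetical, normally learned)

def credit_risk(income, savings):
    if income > THETA1 and savings > THETA2:
        return "low-risk"
    return "high-risk"
```

The rule carves the (income, savings) plane into an axis-aligned low-risk rectangle and a high-risk remainder, which is exactly the kind of region a one-level decision tree produces.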
Classification: Applications

 Pattern recognition
 Face recognition: Pose, lighting, occlusion (glasses,
beard), make-up, hair style
 Character recognition: Different handwriting styles.
 Speech recognition: Temporal dependency.
 Medical diagnosis: From symptoms to illnesses
 Biometrics: Recognition/authentication using physical
and/or behavioral characteristics: Face, iris,
signature, etc
 Outlier/novelty detection:
Face Recognition
Training examples of a person

Test images
Regression

 Example: Price of a used car
 x: car attributes
 y: price
 Linear model: y = wx + w0
 General model: y = g(x | θ), where g(·) is the model and θ its parameters
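The linear model y = wx + w0 can be fitted by least squares; in one dimension the closed-form solution is w = cov(x, y) / var(x) and w0 = mean(y) − w · mean(x). A minimal sketch on synthetic data (not real car prices):

```python
# One-dimensional least-squares fit of y = w*x + w0.

def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    w = cov / var
    return w, my - w * mx

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]   # synthetic data lying exactly on y = 2x + 1
w, w0 = fit_line(xs, ys)
```

With noisy real data the fitted line would not pass through every point; least squares picks the (w, w0) minimizing the summed squared errors, which is the "performance criterion" being optimized.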

Real data are messy
Application: satellite image analysis
Application: Discovering DNA motifs
...TTGGAACAACCATGCACGGTTGATTCGTGCCTGTGACCGCGCGCCTCACACGGAAGACGCAGCCACCGGTTGTGATG
TCATAGGGAATTCCCCATGTCGTGAATAATGCCTCGAATGATGAGTAATAGTAAAACGCAGGGGAGGTTCTTCAGTAGTA
TCAATATGAGACACATACAAACGGGCGTACCTACCGCAGCTCAAAGCTGGGTGCATTTTTGCCAAGTGCCTTACTGTTAT
CTTAGGACGGAAATCCACTATAAGATTATAGAAAGGAAGGCGGGCCGAGCGAATCGATTCAATTAAGTTATGTCACAAGG
GTGCTATAGCCTATTCCTAAGATTTGTACGTGCGTATGACTGGAATTAATAACCCCTCCCTGCACTGACCTTGACTGAAT
AACTGTGATACGACGCAAACTGAACGCTGCGGGTCCTTTATGACCACGGATCACGACCGCTTAAGACCTGAGTTGGAGTT
GATACATCCGGCAGGCAGCCAAATCTTTTGTAGTTGAGACGGATTGCTAAGTGTGTTAACTAAGACTGGTATTTCCACTA
GGACCACGCTTACATCAGGTCCCAAGTGGACAACGAGTCCGTAGTATTGTCCACGAGAGGTCTCCTGATTACATCTTGAA
GTTTGCGACGTGTTATGCGGATGAAACAGGCGGTTCTCATACGGTGGGGCTGGTAAACGAGTTCCGGTCGCGGAGATAAC
TGTTGTGATTGGCACTGAAGTGCGAGGTCTTAAACAGGCCGGGTGTACTAACCCAAAGACCGGCCCAGCGTCAGTGA...

CS 194-10 Fall 2011, Stuart Russell Lecture 1 8/25/11


Application: social network analysis

HP Labs email data: 500 users, 20k connections, evolving over time
Supervised Learning: Uses

 Prediction of future cases: Use the rule to predict


the output for future inputs
 Knowledge extraction: The rule is easy to
understand
 Compression: The rule is simpler than the data it
explains
 Outlier detection: Exceptions that are not covered
by the rule, e.g., fraud
Unsupervised Learning

 Learning “what normally happens”


 No output
 Clustering: Grouping similar instances
 Example applications
 Customer segmentation in CRM
 Image compression: Color quantization

 Bioinformatics: Learning motifs
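Color quantization can be sketched as k-means clustering: group similar values and replace each by its cluster mean. A minimal one-dimensional sketch with made-up grayscale pixel values:

```python
# 1-D k-means for color quantization: cluster values, report cluster means.

def kmeans_1d(values, centers, iters=20):
    for _ in range(iters):
        # assignment step: each value goes to its nearest center
        groups = [[] for _ in centers]
        for v in values:
            i = min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            groups[i].append(v)
        # update step: each center moves to the mean of its group
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return centers

pixels = [10, 12, 14, 200, 210, 220]          # hypothetical grayscale values
centers = kmeans_1d(pixels, centers=[0.0, 255.0])
```

With k = 2 the six pixel values compress to two representative levels, which is the essence of quantization: the image is then stored as cluster indices plus the small palette of centers.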


Reinforcement Learning

 Learning a policy: A sequence of outputs


 No supervised output but delayed reward
 Credit assignment problem
 Game playing
 Robot in a maze
 Multiple agents, partial observability, ...
Reinforcement Learning
• It is a field of machine learning derived from behavioral psychology.

• It is applied in virtual reality and games to deal better with the environment.

• In dynamic programming, it solves complex problems by dividing them into branches and trying to solve each branch on its own to make the process simpler.

• It is learning through interaction with the environment, where learning proceeds through sequential events.

• The basic RL model consists of a set of environment states S, a set of actions A, and a set of corresponding scores (rewards) R.

• The decision-maker interacts with the environment repeatedly, in order to increase the reward.
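The S/A/R loop above can be sketched with Q-learning, one standard RL algorithm, on an invented 3-state chain environment (states 0, 1, 2; actions 0 = left, 1 = right; reward 1 only on reaching terminal state 2). All parameters here are illustrative choices:

```python
import random

def step(state, action):
    """Hypothetical chain environment: right moves toward state 2 (reward 1)."""
    nxt = state + 1 if action == 1 else max(state - 1, 0)
    reward = 1.0 if nxt == 2 else 0.0
    return nxt, reward

def q_learn(episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(3) for a in (0, 1)}
    for _ in range(episodes):
        s = 0
        while s != 2:
            # epsilon-greedy action selection: explore sometimes
            if rng.random() < eps:
                a = rng.choice((0, 1))
            else:
                a = max((0, 1), key=lambda act: q[(s, act)])
            s2, r = step(s, a)
            best_next = 0.0 if s2 == 2 else max(q[(s2, 0)], q[(s2, 1)])
            # delayed reward handled via the discounted bootstrap target
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = s2
    return q

q = q_learn()
policy = [max((0, 1), key=lambda a: q[(s, a)]) for s in (0, 1)]
```

The learned policy chooses "right" in both non-terminal states, illustrating credit assignment: the reward arrives only at the end, yet earlier states learn discounted values that point toward it.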
ML of EHR (Electronic Health Records)

You might also like