
CH 1:

INTRODUCTION TO
MACHINE LEARNING
What is ML?

 Learning algorithms are useful in many tasks
1. Classification: Determine which discrete category the example belongs to
2. Recognizing patterns: Speech recognition, facial identity, etc.
3. Recommender systems: Noisy data, commercial pay-off (e.g., Amazon, Netflix)
4. Information retrieval: Find documents or images with similar content
5. Computer vision: detection, segmentation, depth estimation, optical flow, etc.
6. Robotics: perception, planning, etc.
Other
7. Recognizing anomalies: Unusual sequences of credit card transactions, panic situation at an airport
8. Spam filtering, fraud detection: The enemy adapts so we must adapt too
9. Many more! Any …
Big Data
 Widespread use of personal computers, wireless communication, and the IoT leads to “big data”

 We are both producers and consumers of data

 Data is not random, it has structure, e.g., customer behavior

 We need “big theory” to extract that structure from data for
(a) understanding the process, and
(b) making predictions for the future
Big Data Sources and Applications
 Retail: Market basket analysis, Customer relationship
management (CRM)

 Finance: Credit scoring, fraud detection

 Manufacturing: Control, robotics

 Medicine: Medical diagnosis

 Telecommunications: Spam filters, intrusion detection

 Bioinformatics: DNA, biological data

 Web mining: Search engines


 ...
ML & Data Mining
Data is what we collect and store, and knowledge is what helps us
to make informed decisions.

The extraction of knowledge from data is called data mining.

Data mining can also be defined as the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules.

The ultimate goal of data mining is to discover knowledge.


Why Data Mining and Not Traditional Data Analysis?

 The Explosive Growth of Data: from terabytes to petabytes


 Data collection and data availability
◼ Automated data collection tools, database systems, Web, computerized society

 Major sources of abundant data


◼ Business: Web, e-commerce, transactions, stocks, …
◼ Science: Remote sensing, bioinformatics, scientific simulation, …
◼ Society and everyone: news, digital cameras, YouTube

 We are drowning in data, but starving for knowledge!

 Tremendous amount of data


 Algorithms must be highly scalable to handle terabytes of data
 High-dimensionality of data
 Micro-array may have tens of thousands of dimensions
 High complexity of data

 Data streams and sensor data


 Time-series data, temporal data, sequence data
 Structured data, graphs, social networks and multi-linked data
 Heterogeneous databases and legacy databases
 Spatial, spatiotemporal, multimedia, text and Web data
 Software programs, scientific simulations
Knowledge Discovery (KDD) Process

– Data mining is the core of the knowledge discovery process

Databases → Data Cleaning → Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation
KDD Process: Several Key Steps

Learning the application domain
– relevant prior knowledge and goals of the application
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of the effort!)
Data reduction and transformation
– find useful features, dimensionality/variable reduction, invariant representation
KDD Process: Several Key Steps
 Choosing functions of data mining
 summarization, classification, regression, association,
clustering
 Choosing the mining algorithm(s)
 Data mining: search for patterns of interest
 Pattern evaluation and knowledge presentation
 visualization, transformation, removing redundant
patterns, etc.
 Use of discovered knowledge
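The cleaning and selection steps above can be sketched as small transformations over records; a minimal sketch in which the field names and values are hypothetical, for illustration only:

```python
# Minimal sketch of KDD preprocessing: cleaning (fill missing values)
# and selection (keep only task-relevant fields).

def clean(records, field, default):
    """Data cleaning: replace missing values of `field` with a default."""
    return [{**r, field: r.get(field) if r.get(field) is not None else default}
            for r in records]

def select(records, fields):
    """Selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

raw = [
    {"age": 52, "income": 31000, "smoker": None},
    {"age": 34, "income": None,  "smoker": "no"},
]
cleaned = clean(clean(raw, "smoker", "unknown"), "income", 0)
target = select(cleaned, ["age", "income"])
```

In a real pipeline these steps would run against a data warehouse rather than an in-memory list, but the shape of the transformations is the same.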
Data Mining: Confluence of Multiple Disciplines

Database Technology · Statistics · Machine Learning · Visualization · Pattern Recognition · Algorithms · Other Disciplines
Architecture: Typical Data Mining System

Graphical User Interface
Pattern Evaluation
Data Mining Engine (both consult the Knowledge Base)
Database or Data Warehouse Server: data cleaning, integration, and selection
Data sources: Database, Data Warehouse, World-Wide Web, Other Info Repositories
Data warehouse
Modern organisations must respond quickly to
any change in the market. This requires rapid
access to current data normally stored in
operational databases.

However, an organisation must also determine which trends are relevant. This task is accomplished with access to historical data that are stored in large databases called data warehouses.
The main characteristic of a data warehouse
is its capacity. A data warehouse is really big
– it includes millions, even billions, of data
records.
The data stored in a data warehouse is
– time dependent: linked together by the times of recording, and
– integrated: all relevant information from the operational databases is combined and structured in the warehouse.
How is data mining applied in practice?
Many companies use data mining today, but
refuse to talk about it.
In direct marketing, data mining is used for
targeting people who are most likely to buy
certain products and services.
In trend analysis, it is used to determine trends in
the marketplace, for example, to model the stock
market. In fraud detection, data mining is used
to identify insurance claims, cellular phone calls
and credit card purchases that are most likely to
be fraudulent.
Data mining tools
Data mining is based on intelligent
technologies already discussed. It often
applies such tools as neural networks and
neuro-fuzzy systems.

However, the most popular tool used for data mining is a decision tree.
Example: Decision trees
A decision tree can be defined as a map of
the reasoning process. It describes a data set
by a tree-like structure.

Decision trees are particularly good at solving classification problems.
A decision tree consists of nodes, branches and
leaves.
The top node is called the root node. The tree
always starts from the root node and grows down
by splitting the data at each level into new nodes.
The root node contains the entire data set (all data
records), and child nodes hold respective subsets
of that set.
All nodes are connected by branches.
Nodes that are at the end of branches are called
terminal nodes, or leaves.
An example of a decision tree

Household: responded 112, not responded 888, total 1000
└─ Homeownership?
   ├─ Yes: responded 9, not responded 334, total 343
   └─ No: responded 103, not responded 554, total 657
      └─ Household Income?
         ├─ ≤ $20,700: responded 14, not responded 158, total 172
         └─ ≥ $20,701: responded 89, not responded 396, total 485
            └─ Savings Accounts?
               ├─ Yes: responded 86, not responded 188, total 274
               └─ No: responded 3, not responded 208, total 211
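The example tree can be read as a set of nested rules; a minimal sketch, where the function and parameter names are invented and the leaf values are the response rates (responded / total) from the figure:

```python
# Walking the mailing-response decision tree as nested if/else rules.
# Leaf values are responded/total counts from the example figure.

def predict_response_rate(homeowner, income, savings_account):
    if homeowner:
        return 9 / 343          # Homeownership = Yes
    if income <= 20700:
        return 14 / 172         # No homeowner, income <= $20,700
    if savings_account:
        return 86 / 274         # No homeowner, higher income, has savings
    return 3 / 211              # No homeowner, higher income, no savings
```

Reading the tree this way shows why decision trees are called a "map of the reasoning process": each root-to-leaf path is one human-readable rule.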
Case study:
Profiling people with high blood
pressure
A typical task for decision trees is to
determine conditions that may lead to
certain outcomes.

Blood pressure can be categorised as optimal, normal or high. Optimal pressure is below 120/80, normal is between 120/80 and 130/85, and hypertension is diagnosed when blood pressure is over 140/90.
A data set for a hypertension study
Community Health Survey: Hypertension Study (California, U.S.A.)
Gender  Male
Female
Age 18 – 34 years
35 – 50 years
 51 – 64 years
65 or more years
Race  Caucasian
African American
Hispanic
Asian or Pacific Islander
Marital Status Married
Separated
 Divorced
Widowed
Never Married
Household Income Less than $20,700
$20,701 − $45,000
 $45,001 − $75,000
$75,001 and over
A data set for a hypertension study
(continued)
Community Health Survey: Hypertension Study (California, U.S.A.)
Alcohol Consumption Abstain from alcohol
Occasional (a few drinks per month)
 Regular (one or two drinks per day)
Heavy (three or more drinks per day)
Smoking Nonsmoker
1 – 10 cigarettes per day
 11 – 20 cigarettes per day
More than one pack per day
Caffeine Intake Abstain from coffee
 One or two cups per day
Three or more cups per day
Salt Intake Low-salt diet
 Moderate-salt diet
High-salt diet
Physical Activities None
 One or two times per week
Three or more times per week
Weight  93 kg
Height  170 cm

Blood Pressure Optimal


Normal
 High
Data cleaning
Decision trees are as good as the data they
represent. Unlike neural networks and
fuzzy systems, decision trees do not
tolerate noisy and polluted data. Therefore,
the data must be cleaned before we can
start data mining.

We might find that such fields as Alcohol Consumption or Smoking have been left blank or contain incorrect information.
Data enriching
From such variables as weight and height we
can easily derive a new variable, obesity.
This variable is calculated with a body-
mass index (BMI), that is, the weight in
kilograms divided by the square of the
height in metres. Men with BMIs of 27.8 or
higher and women with BMIs of 27.3 or
higher are classified as obese.
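The enrichment rule above can be written directly; a minimal sketch using the BMI formula and the obesity thresholds given in the text:

```python
def bmi(weight_kg, height_m):
    """Body-mass index: weight in kilograms divided by the square of
    the height in metres."""
    return weight_kg / height_m ** 2

def is_obese(weight_kg, height_m, gender):
    """Thresholds from the text: BMI >= 27.8 for men, >= 27.3 for women."""
    threshold = 27.8 if gender == "male" else 27.3
    return bmi(weight_kg, height_m) >= threshold
```

For example, a man weighing 93 kg at 1.70 m has a BMI of about 32.2 and would be recorded as Obese in the derived variable.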
A data set for a hypertension study
(continued)
Community Health Survey: Hypertension Study (California, U.S.A.)
Obesity  Obese
Not Obese
Growing a decision tree

Blood Pressure (root): optimal 319 (32%), normal 528 (53%), high 153 (15%), total 1000
└─ Age?
   ├─ 18–34 years: optimal 88 (56%), normal 64 (41%), high 5 (3%), total 157
   ├─ 35–50 years: optimal 208 (35%), normal 340 (57%), high 48 (8%), total 596
   ├─ 51–64 years: optimal 21 (12%), normal 90 (52%), high 62 (36%), total 173
   └─ 65 or more years: optimal 2 (3%), normal 34 (46%), high 38 (51%), total 74
Growing a decision tree (continued)

51–64 years: optimal 21 (12%), normal 90 (52%), high 62 (36%), total 173
└─ Obesity?
   ├─ Obese: optimal 3 (3%), normal 53 (49%), high 51 (48%), total 107
   └─ Not Obese: optimal 18 (27%), normal 37 (56%), high 11 (17%), total 66
Growing a decision tree (continued)

Obese: optimal 3 (3%), normal 53 (49%), high 51 (48%), total 107
└─ Race?
   ├─ Caucasian: optimal 2 (5%), normal 24 (55%), high 17 (40%), total 43
   ├─ African American: optimal 0 (0%), normal 13 (35%), high 24 (65%), total 37
   ├─ Hispanic: optimal 0 (0%), normal 11 (58%), high 8 (42%), total 19
   └─ Asian: optimal 1 (12%), normal 5 (63%), high 2 (25%), total 8
Solution space of the hypertension
study
The solution space is first divided into four
rectangles by age, then age group 51-64 is
further divided into those who are
overweight and those who are not. And
finally, the group of obese people is divided
by race.
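The splitting step that grows the tree can be sketched as follows: at each node, try every candidate attribute and keep the one whose partition gives the lowest weighted entropy of the class label. This is a common criterion (as in ID3-style trees), not necessarily the exact one used in the study; the toy records below are invented, not the survey data:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def split_score(records, attr, label="bp"):
    """Weighted entropy of the label after splitting on `attr` (lower is better)."""
    groups = {}
    for r in records:
        groups.setdefault(r[attr], []).append(r[label])
    n = len(records)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def best_split(records, attrs):
    return min(attrs, key=lambda a: split_score(records, a))

data = [
    {"age": "51-64", "obese": True,  "bp": "high"},
    {"age": "51-64", "obese": True,  "bp": "high"},
    {"age": "51-64", "obese": False, "bp": "high"},
    {"age": "18-34", "obese": True,  "bp": "optimal"},
    {"age": "18-34", "obese": False, "bp": "optimal"},
    {"age": "18-34", "obese": False, "bp": "optimal"},
]
```

On these toy records splitting by age yields pure groups (weighted entropy 0), so age is chosen first, mirroring how Age was the first split in the hypertension tree.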
Hypertension study: forcing a split

Blood Pressure (root): optimal 319 (32%), normal 528 (53%), high 153 (15%), total 1000
└─ Age?
   ├─ 18–34 years: optimal 88 (56%), normal 64 (41%), high 5 (3%), total 157
   ├─ 35–50 years: optimal 208 (35%), normal 340 (57%), high 48 (8%), total 596
   │  └─ Gender?
   │     ├─ Male: optimal 111 (36%), normal 168 (55%), high 28 (9%), total 307
   │     └─ Female: optimal 97 (34%), normal 172 (59%), high 20 (7%), total 289
   ├─ 51–64 years: optimal 21 (12%), normal 90 (52%), high 62 (36%), total 173
   │  └─ Gender?
   │     ├─ Male: optimal 11 (13%), normal 48 (56%), high 27 (31%), total 86
   │     └─ Female: optimal 10 (12%), normal 42 (48%), high 35 (40%), total 87
   └─ 65 or more years: optimal 2 (3%), normal 34 (46%), high 38 (51%), total 74
What is Machine Learning?

 Optimize a performance criterion using example data or past experience.
 Role of Statistics: Inference from a sample
 Role of Computer science: Efficient algorithms to
  Solve the optimization problem
  Represent and evaluate the model for inference
Supervised Learning
 To learn an unknown target function f
 Input: a training set of labeled examples (xj,yj)
where yj = f(xj)
◼ E.g., xj is an image, f(xj) is the label “giraffe”
◼ E.g., xj is a seismic signal, f(xj) is the label “explosion”

 Output: hypothesis h that is “close” to f, i.e., predicts


well on unseen examples (“test set”)
 Many possible hypothesis families for h
 Linear models, logistic regression, neural networks, decision
trees, examples (nearest-neighbor), etc
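One of the simplest hypothesis families listed above is nearest-neighbor: predict the label of the closest training example. A minimal sketch on hypothetical 2-D training pairs (x_j, y_j), using the seismic-signal labels from the slide as placeholder classes:

```python
# 1-nearest-neighbor hypothesis h: the label of the closest training point.

def nearest_neighbor(train, x):
    """train: list of ((features...), label) pairs; x: a feature tuple."""
    def dist2(a, b):
        # squared Euclidean distance (no sqrt needed for comparisons)
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(train, key=lambda pair: dist2(pair[0], x))[1]

train = [((0.0, 0.0), "explosion"), ((1.0, 1.0), "earthquake")]
```

Generalization to unseen examples (the "test set") depends entirely on how representative the training pairs are, which is why h is only "close" to f.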
Classification

 Example: Credit scoring
 Differentiating between low-risk and high-risk customers from their income and savings

Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
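The discriminant above translates directly into code. In practice the thresholds θ1 and θ2 would be learned from labeled customer data; the values below are invented placeholders:

```python
# Credit-scoring discriminant: IF income > θ1 AND savings > θ2
# THEN low-risk ELSE high-risk. Threshold values are assumptions.

THETA1 = 30000   # income threshold (hypothetical, normally learned)
THETA2 = 10000   # savings threshold (hypothetical, normally learned)

def credit_risk(income, savings):
    if income > THETA1 and savings > THETA2:
        return "low-risk"
    return "high-risk"
```

The rule carves the (income, savings) plane into an axis-aligned low-risk rectangle and a high-risk remainder, which is exactly the kind of region a one-level decision tree produces.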
Classification: Applications

 Pattern recognition
 Face recognition: Pose, lighting, occlusion (glasses,
beard), make-up, hair style
 Character recognition: Different handwriting styles.
 Speech recognition: Temporal dependency.
 Medical diagnosis: From symptoms to illnesses
 Biometrics: Recognition/authentication using physical
and/or behavioral characteristics: Face, iris,
signature, etc
 Outlier/novelty detection:
Face Recognition
Training examples of a person

Test images
Regression

 Example: Price of a used car
 x: car attributes
 y: price
 Linear model: y = wx + w0
 General model: y = g(x | θ), where g(·) is the model and θ its parameters
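The linear model y = wx + w0 can be fitted by least squares; in one dimension the closed-form solution is w = cov(x, y) / var(x) and w0 = mean(y) − w · mean(x). A minimal sketch on synthetic data (not real car prices):

```python
# One-dimensional least-squares fit of y = w*x + w0.

def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    w = cov / var
    return w, my - w * mx

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]   # synthetic data lying exactly on y = 2x + 1
w, w0 = fit_line(xs, ys)
```

With noisy real data the fitted line would not pass through every point; least squares picks the (w, w0) minimizing the summed squared errors, which is the "performance criterion" being optimized.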

Real data are messy
Application: satellite image analysis
Application: Discovering DNA motifs
...TTGGAACAACCATGCACGGTTGATTCGTGCCTGTGACCGCGCGCCTCACACGGAAGACGCAGCCACCGGTTGTGATG
TCATAGGGAATTCCCCATGTCGTGAATAATGCCTCGAATGATGAGTAATAGTAAAACGCAGGGGAGGTTCTTCAGTAGTA
TCAATATGAGACACATACAAACGGGCGTACCTACCGCAGCTCAAAGCTGGGTGCATTTTTGCCAAGTGCCTTACTGTTAT
CTTAGGACGGAAATCCACTATAAGATTATAGAAAGGAAGGCGGGCCGAGCGAATCGATTCAATTAAGTTATGTCACAAGG
GTGCTATAGCCTATTCCTAAGATTTGTACGTGCGTATGACTGGAATTAATAACCCCTCCCTGCACTGACCTTGACTGAAT
AACTGTGATACGACGCAAACTGAACGCTGCGGGTCCTTTATGACCACGGATCACGACCGCTTAAGACCTGAGTTGGAGTT
GATACATCCGGCAGGCAGCCAAATCTTTTGTAGTTGAGACGGATTGCTAAGTGTGTTAACTAAGACTGGTATTTCCACTA
GGACCACGCTTACATCAGGTCCCAAGTGGACAACGAGTCCGTAGTATTGTCCACGAGAGGTCTCCTGATTACATCTTGAA
GTTTGCGACGTGTTATGCGGATGAAACAGGCGGTTCTCATACGGTGGGGCTGGTAAACGAGTTCCGGTCGCGGAGATAAC
TGTTGTGATTGGCACTGAAGTGCGAGGTCTTAAACAGGCCGGGTGTACTAACCCAAAGACCGGCCCAGCGTCAGTGA...

CS 194-10 Fall 2011, Stuart Russell Lecture 1 8/25/11


Application: social network analysis

HP Labs email data: 500 users, 20k connections, evolving over time
Supervised Learning: Uses

 Prediction of future cases: Use the rule to predict


the output for future inputs
 Knowledge extraction: The rule is easy to
understand
 Compression: The rule is simpler than the data it
explains
 Outlier detection: Exceptions that are not covered
by the rule, e.g., fraud
Unsupervised Learning

 Learning “what normally happens”


 No output
 Clustering: Grouping similar instances
 Example applications
 Customer segmentation in CRM
 Image compression: Color quantization

 Bioinformatics: Learning motifs
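Color quantization can be sketched as k-means clustering: group similar values and replace each by its cluster mean. A minimal one-dimensional sketch with made-up grayscale pixel values:

```python
# 1-D k-means for color quantization: cluster values, report cluster means.

def kmeans_1d(values, centers, iters=20):
    for _ in range(iters):
        # assignment step: each value goes to its nearest center
        groups = [[] for _ in centers]
        for v in values:
            i = min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            groups[i].append(v)
        # update step: each center moves to the mean of its group
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return centers

pixels = [10, 12, 14, 200, 210, 220]          # hypothetical grayscale values
centers = kmeans_1d(pixels, centers=[0.0, 255.0])
```

With k = 2 the six pixel values compress to two representative levels, which is the essence of quantization: the image is then stored as cluster indices plus the small palette of centers.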


Reinforcement Learning

 Learning a policy: A sequence of outputs


 No supervised output but delayed reward
 Credit assignment problem
 Game playing
 Robot in a maze
 Multiple agents, partial observability, ...
Reinforcement Learning
• It is a field of machine learning derived from behavioral psychology.

• It is applied in virtual reality and games to deal better with the environment.

• In dynamic programming, it solves complex problems by dividing them into branches and trying to solve each branch on its own to make the process simpler.

• It is learning through interaction with the environment, where learning proceeds through sequential events.

• The basic RL model consists of a set of environment states S, a set of actions A, and a set of corresponding scores (rewards) R.

• The decision-maker interacts with the environment repeatedly, in order to increase the reward.
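The S/A/R loop above can be sketched with Q-learning, one standard RL algorithm, on an invented 3-state chain environment (states 0, 1, 2; actions 0 = left, 1 = right; reward 1 only on reaching terminal state 2). All parameters here are illustrative choices:

```python
import random

def step(state, action):
    """Hypothetical chain environment: right moves toward state 2 (reward 1)."""
    nxt = state + 1 if action == 1 else max(state - 1, 0)
    reward = 1.0 if nxt == 2 else 0.0
    return nxt, reward

def q_learn(episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(3) for a in (0, 1)}
    for _ in range(episodes):
        s = 0
        while s != 2:
            # epsilon-greedy action selection: explore sometimes
            if rng.random() < eps:
                a = rng.choice((0, 1))
            else:
                a = max((0, 1), key=lambda act: q[(s, act)])
            s2, r = step(s, a)
            best_next = 0.0 if s2 == 2 else max(q[(s2, 0)], q[(s2, 1)])
            # delayed reward handled via the discounted bootstrap target
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = s2
    return q

q = q_learn()
policy = [max((0, 1), key=lambda a: q[(s, a)]) for s in (0, 1)]
```

The learned policy chooses "right" in both non-terminal states, illustrating credit assignment: the reward arrives only at the end, yet earlier states learn discounted values that point toward it.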
ML of EHR (Electronic Health Records)

You might also like