
Advanced Statistics & Data Mining

Week 1-2

Dr. Muhammad Nadeem Majeed

nadeem.majeed@uettaxila.edu.pk

1
Course Objectives
This is a course for students on the topic of
Statistical Analysis and Data Mining. Topics include
statistical analysis, data mining applications, data
preparation, data reduction, and various data
mining techniques (such as association, clustering,
classification, and anomaly detection).

2
Outline
Course Logistics
Data Mining Introduction
Four Key Characteristics
Combination of Theory and Application
Engineering Process
Collection of Functionalities
Interdisciplinary field
How do we categorize data mining systems?
History of Data Mining
Research Issues
Curse of Dimensionality

3
Artificial Intelligence in Sci Fi
Intelligence
The ability to solve problems
Consider the following sequence:
1, 3, 7, 13, 21, __
What is the next number?
Intelligence is to reason in a logical way to
reach a conclusion.
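The puzzle above can be reasoned out mechanically: the differences between consecutive terms (2, 4, 6, 8, ...) grow by a constant 2, so the next difference is 10 and the next term is 31. A minimal sketch of that reasoning in Python (the function name is illustrative):

```python
def next_term(seq):
    """Extend a sequence whose first differences grow by a constant step."""
    diffs = [b - a for a, b in zip(seq, seq[1:])]  # 2, 4, 6, 8 for this sequence
    step = diffs[1] - diffs[0]                     # the differences grow by 2
    return seq[-1] + diffs[-1] + step

print(next_term([1, 3, 7, 13, 21]))  # 31
```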
Intelligence
Ability to solve problems
Ability to plan and schedule
Ability to memorize and process information
Ability to answer fuzzy questions
Ability to learn
Ability to recognize
Ability to understand
Ability to perceive
And many more

Can only human beings and animals possess these qualities?


But what if
A machine searches through a maze and finds a path?
A machine solves problems like the next number in
the sequence?
A machine develops plans?
A machine diagnoses and prescribes?
A machine answers ambiguous questions?
A machine recognizes fingerprints?
A machine understands?
A machine perceives?
A machine does MANY MORE SUCH THINGS
[The automation of] activities that we associate with human thinking,
activities such as decision making, problem solving, learning (Bellman,
1978)

The exciting new effort to make computers think ... machines with minds, in
the full and literal sense (Haugeland, 1985)

The study of the computations that make it possible to perceive, reason and
act (Winston, 1992)

The art of creating machines that perform functions that require intelligence
when performed by people (Kurzweil, 1990)

The branch of computer science that is concerned with the automation of
intelligent behavior (Luger and Stubblefield, 1993)

9
Artificial Intelligence VS. Human Intelligence
14
Can you identify the profitable routes from an airline
reservation system?
Can you detect fraud from transactional
data?
Vision-based biometrics

How the Afghan Girl was identified by her iris patterns (read the story
on Wikipedia)
Google Car
AISIGHT
Where am I
SPAM
Massive volumes of data from sensors and networks of sensors

Large Synoptic Survey Telescope (LSST):
40 TB/day (an SDSS every two days),
100+ PB in its 10-year lifetime
Machine Learning
- Grew out of work in AI
- New capability for computers
Examples:
- Database mining
Large datasets from growth of automation/web.
E.g., Web click data, medical records, biology,
engineering
- Applications we can't program by hand.
E.g., Autonomous helicopter, handwriting
recognition, most of Natural Language
Processing (NLP), Computer Vision.
Machine Learning definition
Machine Learning: Field of study that gives
computers the ability to learn without being
explicitly programmed. (Arthur Samuel, 1959)

Well-posed Learning Problem: A computer
program is said to learn from experience E with
respect to some task T and some performance
measure P, if its performance on T, as
measured by P, improves with experience E.
(Tom Mitchell, 1998)
Why Data Mining?
Motivation: Necessity is the Mother of Invention
Data explosion problem
Applications generate huge amounts of data
WWW, computer systems/programs, biology experiments,
Business transactions, Scientific computation and simulation,
Medical and personal data, Surveillance video and pictures,
Satellite sensing, Digital media,
Technologies are available to collect and store data
Bar codes, scanners, satellites, cameras etc.
Databases, data warehouses, variety of repositories
We are drowning in data, but starving for knowledge!
31
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amounts of data
What is not data mining?
(Deductive) query processing.
Expert systems or small ML/statistical programs

Key Characteristics
Combination of Theory and Application
Engineering Process
Data Pre-processing and Post-processing, Interpretation
Collection of Functionalities
Different Tasks and Algorithms
Interdisciplinary Field
32
Real Example from NBA
AS (Advanced Scout) software from IBM Research
Coach can assess the effectiveness of certain coaching
decisions
Good/bad player matchups
Plays that work well against a given team
Raw Data: Play-by-play information recorded by
teams
Who is on court
Who took a shot, the type of shot, the outcome, any
rebounds

33
Potential Applications
Data analysis and decision support
Market analysis and management
Target marketing, customer relationship
management (CRM), market basket analysis,
cross selling, market segmentation
Risk analysis and management
Forecasting, customer retention, improved
underwriting, quality control, competitive analysis
Fraud detection and detection of unusual patterns
(outliers)

34
Potential Applications
Other Applications
Text mining (news group, email, documents)
and Web mining
Stream data mining
System and Network Management
Multimedia Applications
Music, Image, Video
DNA and bio-data analysis

35
Example: Use in retailing
Goal: Improved business efficiency
Improve marketing (advertise to the most likely buyers)
Inventory reduction (stock only needed quantities)
Information source: Historical business data
Example: Supermarket sales records
Date  Time   Register  Fish  Turkey  Cranberries  Wine  ...
12/6  13:15  2         N     Y       Y            N     ...
12/6  13:16  3         Y     N       N            Y     ...

Size ranges from 50k records (research studies) to terabytes (years of
data from chains)
Data is already being warehoused
Sample question: what products are generally
purchased together?
The answers are in the data, if only we could see
them
36
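The "purchased together" question can be sketched with a simple pair count over transactions. This is a minimal illustration, not the full Apriori algorithm; the transactions below are hypothetical, echoing the supermarket example above:

```python
from itertools import combinations
from collections import Counter

# Hypothetical transactions in the spirit of the supermarket example.
transactions = [
    {"turkey", "cranberries"},
    {"fish", "wine"},
    {"turkey", "cranberries", "wine"},
    {"turkey", "cranberries"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support = fraction of transactions containing the pair.
support = {p: c / len(transactions) for p, c in pair_counts.items()}
print(max(support, key=support.get))  # ('cranberries', 'turkey')
```

Real association-rule miners extend this idea to larger itemsets and add a confidence threshold on top of support.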
Other Applications
Network System management
Event Mining Research at IBM
Astronomy
JPL and the Palomar Observatory discovered
22 quasars with the help of data mining
Internet Web Surf-Aid
IBM Surf-Aid applies data mining algorithms to
Web access logs for market-related pages to
discover customer preference and behavior,
analyze the effectiveness of Web marketing,
improve Web site organization, etc.
37
Market Analysis and Management (1)

Where are the data sources for analysis?


Credit card transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle studies
Target marketing
Find clusters of model customers who share the same
characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
Conversion of single to a joint bank account: marriage, etc.
Cross-market analysis
Associations/co-relations between product sales
Prediction based on the association information
38
Market Analysis and Management (2)
Customer profiling
data mining can tell you what types of customers buy
what products (clustering or classification)
Identifying customer requirements
identifying the best products for different customers
use prediction to find what factors will attract new
customers
Provides summary information
various multidimensional summary reports
statistical summary information (data central tendency
and variation)

39
Corporate Analysis and Risk Management
Finance planning and asset evaluation
cash flow analysis and prediction
contingent claim analysis to evaluate assets
cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
Resource planning:
summarize and compare the resources and
spending
Competition:
monitor competitors and market directions
group customers into classes and a class-based
pricing procedure
set pricing strategy in a highly competitive market
40
Fraud Detection and Management (1)
Applications
widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
Approach
use historical data to build models of fraudulent behavior
and use data mining to help identify similar instances
Examples
auto insurance: detect a group of people who stage
accidents to collect on insurance
money laundering: detect suspicious money transactions (US
Treasury's Financial Crimes Enforcement Network)
medical insurance: detect "professional patients" and rings of
doctors and rings of references
41
Fraud Detection and Management (2)
Detecting inappropriate medical treatment
The Australian Health Insurance Commission identified that in
many cases blanket screening tests were requested (saving
Australian $1M/yr).
Detecting telephone fraud
Telephone call model: destination of the call, duration, time
of day or week. Analyze patterns that deviate from an
expected norm.
British Telecom identified discrete groups of callers with
frequent intra-group calls, especially mobile phones, and
broke a multimillion dollar fraud.
Retail
Analysts estimate that 38% of retail shrink is due to
dishonest employees.
42
Data Mining: An Engineering Process
Data mining: interactive and iterative process.
[Figure: KDD pipeline — Data → (Selection) → Target Data → (Preprocessing) →
Preprocessed Data → (Mining Algorithms) → Patterns →
(Interpretation/Evaluation) → Knowledge]
adapted from: U. Fayyad et al. (1995), From Knowledge Discovery to Data
Mining: An Overview, Advances in Knowledge Discovery and
Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press
43
Steps of a KDD Process
Learning the application domain
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation
Find useful features, dimensionality/variable reduction, invariant representation.
Choosing functions of data mining
summarization, classification, regression, association, clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge

44
Architecture of a Typical Data Mining
System

[Figure: layered architecture — Graphical user interface → Pattern evaluation →
Data mining engine (backed by a Knowledge base) → Database or data
warehouse server → Data cleaning & data integration / Filtering →
Databases and Data Warehouse]
45
Data Mining: On What Kind of
Data?
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
Object-oriented and object-relational databases
Spatial databases
Time-series data and temporal data
Text databases and multimedia databases
Heterogeneous and legacy databases
WWW

46
What Can Data Mining Do?

Cluster
Classify
Categorical, Regression
Semi-supervised
Summarize
Summary statistics, Summary rules
Link Analysis / Model Dependencies
Association rules
Sequence analysis
Time-series analysis, Sequential associations
Detect Deviations
47
Learning?
Definitions of learning from dictionary:
To get knowledge of by study,
experience, or being taught
To become aware by information or
from observation
To commit to memory
To be informed of, ascertain; to receive instruction

48
Machine Learning

Machine learning involves adaptive
mechanisms that enable computers to
learn from experience,
learn by example, and
learn by analogy.

Learning
capabilities can improve the
performance of an intelligent system over
time.
49
A Generic System
[Figure: system diagram — inputs x1 ... xN enter the system, outputs
y1 ... yM leave it; h1, h2, ..., hK are hidden inside]

Input Variables: x = (x1, x2, ..., xN)
Hidden Variables: h = (h1, h2, ..., hK)
Output Variables: y = (y1, y2, ..., yM)
50
Another definition
Machine Learning algorithms discover the
relationships between the variables of a system
(input, output and hidden) from direct samples
of the system
These algorithms originate from many fields:
Statistics, mathematics, theoretical computer
science, physics, neuroscience, etc

51
When are ML algorithms
NOT needed?
When the relationships between all system
variables (input, output, and hidden) are
completely understood!

This is NOT the case for almost any real system!

52
Machine Learning

Learning

53
Machine Learning

Supervised
Learning

Learning

54
Machine Learning

Unsupervised Supervised
Learning Learning

Learning

55
Machine Learning

Unsupervised Supervised
Learning Learning

Learning

Reinforcement
Learning
56
Carpentry

57
Machine Learning

Unsupervised Supervised
Learning Learning
Today!
Learning

Reinforcement
Learning
58
Supervised Learning

Given labeled data. Predict output.


(Learning with a teacher)

Carpentry of Supervised Learning

59
What does Data Look Like?

60
Data

M observations:
For each observation (i) we have x(i) and y(i)

61
The Data and goal
Data: A set of data records (also called examples, instances or cases)
described by
k attributes: A1, A2, ..., Ak.
a class: Each example is labelled with a pre-defined class.
Goal: To learn a classification model from the data that can be used to
predict the classes of new (future, or test) cases/instances.

62
Supervised Learning

Training
Set

Learning
Algorithm

63
Supervised Learning

Training
Set

Learning
Algorithm

x h

64
Supervised Learning

Training
Set

Learning
Algorithm

x h predicted
y
65
The learning task
Learn a classification model from the data
Use the model to classify future loan applications into
Yes (approved) and
No (not approved)
What is the class for the following case/instance?

66
Learning the Target Function
Like human learning from past experiences.
A computer does not have experiences.
A computer system learns from data, which represent
some past experiences of an application domain.
Our focus: learn a target function that can be used to
predict the values of a discrete class attribute, e.g.,
approved or not-approved, and high-risk or low-risk.
The task is commonly called: Supervised learning,
classification, or inductive learning.

67
Formally, What is Learning?
Given
a data set D,
a task T, and
a performance measure M,
a computer system is said to learn from D to perform the task T if, after
learning, the system's performance on T improves as measured by M.
In other words, the learned model helps the system to perform T better
as compared to no learning.

68
Supervised vs. Unsupervised
Supervised learning: classification is seen as supervised learning from
examples.
Supervision: The data (observations, measurements, etc.) are labeled with pre-
defined classes, as if a teacher gives the classes (supervision).
Test data are classified into these classes too.
Unsupervised learning (clustering)
Class labels of the data are unknown
Given a set of data, the task is to establish the existence of classes or clusters in
the data

69
Classification Vs. Regression

Supervised Learning Input:
A description of an instance, x ∈ X, where X is the input feature space, and C = the set of classes
Training Set: {(x1, c1), (x2, c1), (x5, c3), ..., (x6, c2), ...}
Test Set: x
Supervised Learning Task:
The category of x: c(x) ∈ C, where c(x) is a
classification/regression function
Classification:
A fixed set of classes:
C = {c1, c2, ..., cn}
Regression:
C = a continuous variable

70
Some Learning algorithms

*Just an introduction; we will cover them later

71
Classification
Learn a method for predicting the instance class from pre-
labeled (classified) instances

Many approaches:
Regression,
Decision Trees,
Bayesian,
Neural Networks,
...

Given a set of points from known classes,
what is the class of a new point?
72
Classification: Decision Trees
if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue

[Figure: 2-D scatter plot with axis splits at X = 2, X = 5, and Y = 3]

73
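The decision tree above translates directly into nested rules; written as a function (a sketch, with an illustrative name):

```python
def classify(x, y):
    """The decision tree from the slide, written as nested rules."""
    if x > 5:
        return "blue"
    elif y > 3:
        return "blue"
    elif x > 2:
        return "green"
    else:
        return "blue"

print(classify(6, 1))  # blue  (x > 5)
print(classify(3, 1))  # green (2 < x <= 5, y <= 3)
print(classify(1, 1))  # blue  (falls through to the default)
```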
Classification: Neural Nets

Can select more complex regions
Can be more accurate
Also can overfit the data: find patterns in random noise

74
Linear Regression


w0 + w1 x + w2 y >= 0
Regression computes wi from
data to minimize squared
error to fit the data
Not flexible enough

75
Examples

76
Example: The weather problem
Given past data, can you come up with the rules for Play / Not Play?

Outlook   Temperature  Humidity  Windy  Play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     mild         normal    false  yes
rainy     mild         normal    true   no
overcast  mild         normal    true   yes
sunny     mild         high      false  no
sunny     mild         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no

77
witten&eibe
The weather problem

Given this data, what are the rules for play/not play?

Outlook Temperature Humidity Windy Play


Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild Normal False Yes

78
witten&eibe
The weather problem

Conditions for playing


Outlook Temperature Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild Normal False Yes

If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
79
witten&eibe
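The five rules above, evaluated in order, can be written as a small function and checked against the weather table (a sketch; the function name is illustrative):

```python
def play(outlook, humidity, windy):
    """The rule set from the slide, evaluated top to bottom."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # default rule

# Spot-check against rows of the weather table.
print(play("sunny", "high", False))     # no
print(play("rainy", "normal", True))    # no
print(play("overcast", "high", False))  # yes
```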
Weather data with mixed attributes
Outlook Temperature Humidity Windy Play
sunny 85 85 false no
sunny 80 90 true no
overcast 83 86 false yes
rainy 70 96 false yes
rainy 68 80 false yes
rainy 65 70 true no
overcast 64 65 true yes
sunny 72 95 false no
sunny 69 70 false yes
rainy 75 80 false yes
sunny 75 70 true yes
overcast 72 90 true yes
overcast 81 75 false yes
rainy 71 91 true no
80
Weather data with mixed attributes

How will the rules change when some attributes have
numeric values?

Outlook Temperature Humidity Windy Play


Sunny 85 85 False No
Sunny 80 90 True No
Overcast 83 86 False Yes
Rainy 75 80 False Yes

81
Weather data with mixed attributes

Rules with mixed attributes


Outlook Temperature Humidity Windy Play
Sunny 85 85 False No
Sunny 80 90 True No
Overcast 83 86 False Yes
Rainy 75 80 False Yes

If outlook = sunny and humidity > 83 then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes
82
witten&eibe
The contact lenses data
Age Spectacle prescription Astigmatism Tear production rate Recommended lenses

Young Myope No Reduced None


Young Myope No Normal Soft
Young Myope Yes Reduced None
Young Myope Yes Normal Hard
Young Hypermetrope No Reduced None
Young Hypermetrope No Normal Soft
Young Hypermetrope Yes Reduced None
Young Hypermetrope Yes Normal Hard
Pre-presbyopic Myope No Reduced None
Pre-presbyopic Myope No Normal Soft
Pre-presbyopic Myope Yes Reduced None
Pre-presbyopic Myope Yes Normal Hard
Pre-presbyopic Hypermetrope No Reduced None
Pre-presbyopic Hypermetrope No Normal Soft
Pre-presbyopic Hypermetrope Yes Reduced None
Pre-presbyopic Hypermetrope Yes Normal None
Presbyopic Myope No Reduced None
Presbyopic Myope No Normal None
Presbyopic Myope Yes Reduced None
Presbyopic Myope Yes Normal Hard
Presbyopic Hypermetrope No Reduced None
Presbyopic Hypermetrope No Normal Soft
Presbyopic Hypermetrope Yes Reduced None
Presbyopic Hypermetrope Yes Normal None

83
witten&eibe
A complete and correct rule set
If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no
and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no
and tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope
and astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no
and tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes
and tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes
and tear production rate = normal then recommendation = hard
If age = pre-presbyopic
and spectacle prescription = hypermetrope
and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope
and astigmatic = yes then recommendation = none

84
witten&eibe
A decision tree for this problem

85
witten&eibe
Classifying iris flowers

Sepal length Sepal width Petal length Petal width Type


1 5.1 3.5 1.4 0.2 Iris setosa
2 4.9 3.0 1.4 0.2 Iris setosa

51 7.0 3.2 4.7 1.4 Iris versicolor
52 6.4 3.2 4.5 1.5 Iris versicolor

101 6.3 3.3 6.0 2.5 Iris virginica
102 5.8 2.7 5.1 1.9 Iris virginica

If petal length < 2.45 then Iris setosa
If sepal width < 2.10 then Iris versicolor
...
86
witten&eibe
Predicting CPU performance

Example: 209 different computer configurations

Cycle time (ns)  Main memory (Kb)  Cache (Kb)  Channels      Performance
      MYCT       MMIN    MMAX     CACH        CHMIN  CHMAX  PRP
1     125        256     6000     256         16     128    198
2     29         8000    32000    32          8      32     269
...
208   480        512     8000     32          0      0      67
209   480        1000    4000     0           0      0      45

Linear regression function:

PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX
      + 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX
87
witten&eibe
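Plugging the first configuration (MYCT=125, MMIN=256, MMAX=6000, CACH=256, CHMIN=16, CHMAX=128) into the regression function gives roughly 337, compared with the actual PRP of 198. A linear fit minimizes overall squared error across all 209 rows rather than matching each row exactly:

```python
def predict_prp(myct, mmin, mmax, cach, chmin, chmax):
    """The linear regression function from the slide."""
    return (-55.9 + 0.0489 * myct + 0.0153 * mmin + 0.0056 * mmax
            + 0.6410 * cach - 0.2700 * chmin + 1.480 * chmax)

# First row of the table: actual PRP is 198, prediction is about 336.9.
print(round(predict_prp(125, 256, 6000, 256, 16, 128), 1))  # 336.9
```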
Soybean classification
Group        Attribute                Number of values  Sample value
Environment  Time of occurrence       7                 July
             Precipitation            3                 Above normal
Seed         Condition                2                 Normal
             Mold growth              2                 Absent
Fruit        Condition of fruit pods  4                 Normal
             Fruit spots              5                 ?
Leaves       Condition                2                 Abnormal
             Leaf spot size           3                 ?
Stem         Condition                2                 Abnormal
             Stem lodging             2                 Yes
Roots        Condition                3                 Normal
Diagnosis                             19                Diaporthe stem canker

88
witten&eibe
Discriminative Vs. Generative
Learning approaches

89
Assumption in learning
Assumption: The distribution of training examples is identical to the
distribution of test examples (including future unseen examples).

In practice, this assumption is often violated to a certain degree.
Strong violations will clearly result in poor classification accuracy.
To achieve good accuracy on the test data, training examples must be
sufficiently representative of the test data.

90
Evaluation Methodologies

91
Cross Validation

Train Set Test Set

92
N Fold Cross Validation

Train Train Train Test Train


Set Set Set Set Set

93
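The N-fold scheme pictured above can be sketched in a few lines: split the data into N folds and let each fold serve as the test set exactly once (a minimal sketch using round-robin assignment; real splitters usually shuffle first):

```python
def n_fold_splits(data, n):
    """Yield (train, test) pairs; each fold is the test set exactly once."""
    folds = [data[i::n] for i in range(n)]  # simple round-robin assignment
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(10))
for train, test in n_fold_splits(data, 5):
    print(len(train), len(test))  # 8 2, five times
```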
Steps in Supervised Learning
Learning (training): Learn a model using the training data
Testing: Test the model using unseen test data to assess the model accuracy

Accuracy = Number of correct classifications / Total number of test cases

94
Metrics

Accuracy
F1
Precision
Recall
AUC (Area Under the Curve)
ROC (Receiver Operating Characteristic)
Efficiency (time, memory)

95
                 YES (Actual)  NO (Actual)
YES (Predicted)  a             b
NO (Predicted)   c             d

Measurements
Precision p = a / (a+b)
Recall r = a / (a+c)
F1 value F1 = 2rp / (r+p)
Tradeoff between Precision and Recall
kNN tends to have higher precision than recall,
especially when k becomes larger.
96
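The three measures above follow directly from the confusion-matrix cells; a minimal sketch with illustrative counts:

```python
def prf(a, b, c):
    """Precision, recall, and F1 from confusion-matrix cells a, b, c."""
    p = a / (a + b)   # of everything predicted YES, how much was right
    r = a / (a + c)   # of everything actually YES, how much we caught
    f1 = 2 * r * p / (r + p)  # harmonic mean of precision and recall
    return p, r, f1

p, r, f1 = prf(a=8, b=2, c=4)
print(p, round(r, 3), round(f1, 3))  # 0.8 0.667 0.727
```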
AUC

97
Overfitting?

98
Overfitting?

[Figure: training-set vs. testing-set performance (AUC) as a function of model
complexity — training performance keeps rising while testing performance
peaks and then falls]
99
When your model has too many parameters relative
to the number of data points, you're prone to
overestimate the utility of your model.
Overfitting means that you are fitting your model to
the noise instead of the underlying signal.
An over-fit model is a model that is overly bound
to the training data.
This means that it does an excellent job of 'predicting' the training data and a
very poor job of predicting any other data (test data).

100
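The point above can be demonstrated with a degree-4 polynomial forced through five slightly noisy samples of the line y = x (illustrative numbers, not from the slides): the fit "predicts" the training points perfectly but extrapolates badly.

```python
def lagrange(xs, ys, x):
    """Evaluate the interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

xs = [0, 1, 2, 3, 4]
ys = [0.1, 0.9, 2.2, 2.8, 4.3]  # y = x plus a little noise

# Zero training error: the curve passes through every training point...
print(max(abs(lagrange(xs, ys, x) - y) for x, y in zip(xs, ys)))  # ~0.0
# ...but at the held-out point x = 5 (true y = 5) it predicts 11.1.
print(round(lagrange(xs, ys, 5), 1))  # 11.1
```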
Which one is overfitted?

101
Curse of Dimensionality

102
An increase in the number of dimensions leads to a rapid increase in volume.
This means that as the dimensions increase, we need to collect exponentially
larger quantities of data (to be statistically significant). This exponential
increase of data is the curse: it limits our ability to store, compute, and
make decisions quickly.
The classic inverse problem is just a linear equation Ax = b; we seek
solutions like x = inverse(A) * b.
The curse of dimensionality simply means that you have
way more features, or dimensions, than you have data
points, and consequently, you cannot actually invert A (it is
singular) to obtain a unique solution. A standard solution is
to add some additional information (e.g., regularization, a
Bayesian prior).

103
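A tiny numeric illustration of that fix (pure Python, 2x2 case, values chosen by hand): A below has rank 1, so inverse(A)*b does not exist, but adding a small ridge term λI to AᵀA makes the system solvable.

```python
def ridge_solve_2x2(A, b, lam):
    """Solve (A^T A + lam*I) x = A^T b, written out by hand for 2x2."""
    g11 = A[0][0]**2 + A[1][0]**2 + lam
    g12 = A[0][0]*A[0][1] + A[1][0]*A[1][1]
    g22 = A[0][1]**2 + A[1][1]**2 + lam
    r1 = A[0][0]*b[0] + A[1][0]*b[1]
    r2 = A[0][1]*b[0] + A[1][1]*b[1]
    det = g11 * g22 - g12 * g12  # nonzero once lam > 0
    return [(g22 * r1 - g12 * r2) / det, (g11 * r2 - g12 * r1) / det]

A = [[1.0, 1.0], [1.0, 1.0]]  # rank 1: plain inversion is impossible
x = ridge_solve_2x2(A, [2.0, 2.0], lam=0.1)
print([round(v, 3) for v in x])  # [0.976, 0.976] — and A @ x is close to b
```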
Features

104
Features

Features can be good or bad
Use the training set to find them
Features are domain dependent
Feature selection algorithms are used to find
good features (when you have many more
than expected)
E.g., Principal Component Analysis (PCA)

105
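A bare-bones 2-D PCA can be written with the closed-form eigendecomposition of a 2x2 covariance matrix (a minimal sketch for intuition only; the function name and data are illustrative, and real code would use a linear-algebra library):

```python
def principal_direction(points):
    """Unit vector along the largest-variance direction of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Covariance matrix entries (divide by n-1).
    sxx = sum((p[0] - mx)**2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    syy = sum((p[1] - my)**2 for p in points) / (n - 1)
    # Largest eigenvalue of [[sxx, sxy], [sxy, syy]] (closed form for 2x2).
    lam = (sxx + syy) / 2 + (((sxx - syy) / 2)**2 + sxy**2) ** 0.5
    vx, vy = sxy, lam - sxx  # an eigenvector for lam (assumes sxy != 0)
    norm = (vx * vx + vy * vy) ** 0.5
    return (vx / norm, vy / norm)

# Points on the line y = x: the principal direction is (1, 1)/sqrt(2).
print(principal_direction([(0, 0), (1, 1), (2, 2)]))
```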
Good Features

106
Good Features

107
Good Features

108
Detect Multiple Faces?

109
Sliding Window

110
Sliding Window

111
No Face

112
No Face

113
No Face

114
Maybe Face

115
Face!

116
The End?

117