
Big Data Analytics

Lecture 6 Week 7 – Data Mining I

Dr. Dhanan Utomo


d.utomo@hw.ac.uk

Lecture Overview
• What is Data Mining?
Definition, origin, purpose, DD, applications, data to be mined
• Data preparation
• Classification of DM Tasks
• Classification
(Decision trees, rule-based classification)
• Clustering
K-means, k-medoids, hierarchical methods
• Association analysis
• Outlier analysis
• Reading list & references
What is Data Mining?
• Process of extracting hidden and useful knowledge or
patterns in large databases
• Automates the information discovery process

• Interesting knowledge (novel, previously unknown, implicit, valid, useful, understandable, validates a hypothesis)
• Rules (e.g. Association and dependency rules)
• Patterns (e.g. Sequential patterns, temporal sequences)
• Models
• Characterisation
• Deviation, etc.
Evolution of Database
Technology and DM

• 1960s and earlier: Data collection, database creation
• 1970s to early 1980s: Database Management Systems
• 1990s – 2000s: DM and data warehousing

Origins of DM: data mining emerged at the confluence of statistics, AI / machine learning, and database systems.
DM and Statistics
➢ Statistical models as outcome of DM
Example: In DM tasks such as data characterization and
classification, statistical models of target classes can be built.
➢ DM built upon statistical models
Example: Statistics may be used to model noise and missing
values in data.
➢ Statistics for mining patterns in the data set and understanding the underlying mechanism that generates or affects these patterns
➢ Statistical methods to verify data mining results
Example: The results of a classification or prediction model should
be verified by statistical hypothesis testing
Why Data Mining?
• Statistical analysis deals with structured data to solve structured problems,
but DM is used to solve unstructured business problems
• Automatic analysis of massive data
• DM covers the entire process of data analysis, including data cleaning and
preparation and visualization of the results, and how to produce
predictions in real-time, etc.
• DM can handle the following situations:
– Sample size is relatively large
– Mixture of different type of variables
– Existence of outliers and missing values
– Irrelevant and redundant attributes
Knowledge discovery (mining) in databases (KDD)
• Process of extracting hidden and useful knowledge or patterns in large databases
• Automates the information discovery process
Goals of DM
• Prediction: To foresee the possible future situation on the basis of
previous events.
Given sales records from previous years, can we predict what amount of goods we need to have in stock for the forthcoming season?
• Description: What is the reason that some events occur?
What are the reasons for the cars of one producer to sell better than equal products of other producers?
• Verification: We think that some relationship between entities occurs.
Can we check if (and how) the threat of cancer is related to environmental conditions?
• Exception detection: There may be situations (records) in our databases
that correspond to something unusual.
Is it possible to identify credit card transactions that are in fact frauds?
Data to be mined
As a general technology, DM can be applied to any kind of data as long
as it is meaningful for the target application

• Data warehouses
• Relational databases
• Transactional databases
• Time series data and temporal data
• Spatial data
• Financial transactions
• Credit card transactions
• Customer complaint calls
• Text data
• Multi-media databases
• Web data
A Multi-Dimensional View of
DM Classification
Example DM Applications
Business:
• Sales forecasting
• Investment analysis
• Customer Relationship Management (CRM)
– Customer profiling, customer purchasing trends, customer churn
analysis, customer segmentation for target marketing
• Market basket analysis
• Market segmentation
• Quality control
• Fraud detection (identify credit applicants with high risk, intrusion
detection, etc.)
• Risk analysis (e.g. Customer retention, quality control)
• Product benchmarking
• Energy usage analysis
Example DM Applications
Web: Web mining
• Search engines, rank web pages, mining purchasing data for
recommendations
• Social network analysis
• E-commerce

Pharmaceutical companies, Insurance and Health care, Medicine


• Drug development
• Identify successful medical therapies
• Claims analysis, fraudulent behavior
• Medical diagnostic tools
• Predict office visits
• Medical data analysis
• Biological data analysis
Basic DM Tasks

Predictive Tasks:
• Classification
• Regression
• Time Series Analysis
• Prediction

Descriptive Tasks:
• Clustering
• Summarisation
• Characterisation
• Discrimination
• Association Rules
• Sequential Pattern Discovery
Basic DM Tasks
Example
DATA PREPARATION
DATA PREPROCESSING
• Real-world data is dirty (missing, incomplete, noisy,
inconsistent)
• Low-quality data → low-quality DM results
• Data quality factors: accuracy, completeness, consistency,
timeliness, believability, interpretability

• Data cleaning
• Data integration
• Data reduction
• Data transformation
Data in Real World:
• Lots of potentially incorrect and inconsistent data, e.g., instrument faulty, human or
computer error, transmission error
– incomplete: Lacking certain attributes of interest, or containing only aggregate data
• e.g., Education=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Job=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“52”, Birthday=“03/07/1986”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data): Jan. 1 as everyone’s birthday?

Possible reasons:
• Faulty data collection instrument
• Data entry problems
• Data transmission problems
• Missing data
• Technology limitation, etc.
Reference: (Han et al., 2012)
DATA PREPROCESSING
Data cleaning: Remove noisy data, remove outliers, fill in
missing values, correct inconsistencies in data
1) Ignore the tuple
2) Fill in the missing value manually
3) Use a global constant to fill in the missing value
4) Use a measure of central tendency for the attribute (e.g., the
mean or median) to fill in the missing value
5) Use the attribute mean or median for all samples belonging to the
same class as the given tuple
6) Use the most probable value to fill in the missing value
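Strategies 3–5 can be sketched in a few lines of Python. This is an illustrative sketch only, using a small hypothetical data set (attribute names and values are made up, not from the lecture):

```python
# Toy data set: one numeric attribute with missing values, plus a class label.
rows = [
    {"income": 30, "class": "low"},
    {"income": None, "class": "low"},
    {"income": 90, "class": "high"},
    {"income": 110, "class": "high"},
    {"income": None, "class": "high"},
]

def mean(vals):
    return sum(vals) / len(vals)

# Strategy 4: fill every missing value with the overall attribute mean.
overall = mean([r["income"] for r in rows if r["income"] is not None])

# Strategy 5: fill with the mean of tuples belonging to the same class.
by_class = {}
for r in rows:
    if r["income"] is not None:
        by_class.setdefault(r["class"], []).append(r["income"])
class_mean = {c: mean(v) for c, v in by_class.items()}

filled = [r["income"] if r["income"] is not None else class_mean[r["class"]]
          for r in rows]
print(filled)  # [30, 30.0, 90, 110, 100.0]
```

Class-conditional imputation (strategy 5) usually distorts the data less than a single global constant, since it preserves differences between classes.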
DATA PREPROCESSING
Data integration: Merging of data from multiple sources into a
coherent data store such as a data warehouse.
Data reduction:
– Reduce data size by aggregating
– Dimension reduction / eliminating unimportant attributes (e.g. Principal Component Analysis, wavelet transforms)
– Numerosity reduction (e.g. regression, histograms, clustering, sampling, data cube aggregation)
– Data compression
– Discretization and concept hierarchy generation
– Clustering, etc.
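As a small illustration of dimension reduction, here is a minimal PCA sketch using NumPy and SVD. The toy data and variable names are assumptions for illustration, not lecture material:

```python
import numpy as np

# Toy two-attribute data set; the two attributes are strongly correlated,
# so one principal component captures most of the variance.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

Xc = X - X.mean(axis=0)               # centre each attribute
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X1 = Xc @ Vt[0]                       # project onto the first principal component
explained = S[0] ** 2 / (S ** 2).sum()  # fraction of total variance retained
```

Here six two-dimensional objects are reduced to one dimension (`X1`) while retaining well over 90% of the variance, which is the sense in which PCA "eliminates unimportant attributes".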
DATA PREPROCESSING
Example:
– Histogram for data reduction

Individual values are


combined into several classes
DATA PREPROCESSING
Example: Discretization Without Using Class Labels (Binning vs. Clustering)

• Equal interval width (binning): the value range is divided into k intervals of equal width
• Equal frequency (binning): the data set is divided into k bins that each hold the same number of values
• K-means clustering: the data are grouped based on closeness to the group mean

Reference: Han et al., 2011
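The two binning schemes can be sketched in plain Python. The data values below are toy numbers chosen for illustration, not from the slides:

```python
# Discretize a small sorted data set into k bins, two ways.
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
k = 3

# Equal interval width: each bin spans the same range of values,
# so bins may hold very different numbers of values.
width = (max(data) - min(data)) / k            # (34 - 4) / 3 = 10.0
equal_width = [[] for _ in range(k)]
for v in data:
    i = min(int((v - data[0]) / width), k - 1)  # clamp max value into last bin
    equal_width[i].append(v)

# Equal frequency: each bin holds the same number of values,
# so bins may span very different ranges.
n = len(data) // k
equal_freq = [data[i * n:(i + 1) * n] for i in range(k)]

print(equal_width)  # [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]
print(equal_freq)   # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
```

Note how the equal-width result is skewed by the spread of the data, while equal-frequency bins stay balanced.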


DATA TRANSFORMATION
Data Transformation
Data transformation:
Smoothing: remove noise from the data (binning, regression,
clustering, etc.)
Attribute/feature construction: New attributes are constructed and
added from the given set of attributes
Aggregation: e.g., aggregate daily sales data to compute monthly or annual amounts
Discretization: Raw values of a numeric attribute (e.g., age) are
replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual
labels (e.g., youth, adult, senior).
Concept hierarchy generation for nominal data: Attributes such as
street can be generalized to higher-level concepts, like city or country.
Data Transformation
Data transformation:
Normalization (!): Attribute data are scaled to fall within a smaller range to avoid dependence on the choice of measurement units

✓ Min-max normalization: v′ = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
Data Transformation
✓ z-score normalization: v′ = (v − Ā) / σ_A

where Ā and σ_A are the mean and standard deviation of attribute A, respectively.

Example: if Ā = 54,000 and σ_A = 16,000, a value v = 73,600 is normalized to (73,600 − 54,000) / 16,000 = 1.225.
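Both normalizations can be written as small Python functions. This is a sketch; the income values are toy numbers for illustration:

```python
def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    """Min-max normalization: rescale v from [lo, hi] to [new_lo, new_hi]."""
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mean, std):
    """z-score normalization: standard deviations of v from the mean."""
    return (v - mean) / std

incomes = [12000, 73600, 98000]
scaled = [min_max(v, min(incomes), max(incomes)) for v in incomes]
z = z_score(73600, 54000, 16000)
print(z)  # 1.225
```

Min-max normalization preserves the relationships among the original values but needs the min and max in advance; z-score normalization is useful when the actual min and max are unknown or when outliers dominate them.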
CLASSIFICATION OF DM
TASKS
Characterization and Discrimination
Describe classes or concepts in summarised, concise and
precise terms
Data characterization: summarization of the general characteristics or
features of objects in a target class of data
• Resulting descriptions can be represented by “Generalized relations of
characteristic rules”
• Outputs of characterisation can be represented in:
– Pie charts
– Bar charts
– Curves
– Multidimensional data cubes
– Multidimensional tables (e.g. Crosstabs).
Example: Summarise the characteristics of customers who spend more
than £1000 in a year in Sainsbury’s (married, 40-50 years old, employed,
occupation?)
Characterization and Discrimination
Data discrimination: comparison of the general features of the target
class data objects against the general features of objects from one or
multiple contrasting classes

Techniques used for data discrimination are similar to those used for data
characterization with the exception that data discrimination results include
comparative measures.

Example: Compare the general characteristics of customers who spent


more than £1000 in a year in Sainsbury’s with those who spent less
PREDICTION
• Based on fitting a curve through the data to find a relationship from
the predictor to the predicted
• Most prediction techniques are based on mathematical models:
– Simple statistical models such as regression
– Non-linear statistics such as power series
– Neural Networks
Regression: Predict values of a dependent variable, Y, based on its
relationship with values of at least one independent variable, X.

Reference: https://gerardnico.com/data_mining/simple_regression
CLASSIFICATION
• Extracting models (functions) that map items into predefined classes
• The models are derived based on the analysis of a set of training data
(i.e., data objects for which the class labels are known)
• Examples: detecting spam email messages based on title and
content, categorising cancer cells based on MRI scans, classifying
galaxies based on their shapes (NASA)
• Difference between prediction and classification?

1) Learning/training step:
• Classification algorithm builds the classifier by analyzing a
predetermined set of data classes or concepts
2) Classification step
CLASSIFICATION
CLASSIFICATION

The derived model may be represented in various forms, such as:


• Rule based classification (i.e., IF-THEN rules)
• Decision trees
• Mathematical formulations
• Naïve Bayesian classification
• Neural networks
• Support vector machines
• K-nearest-neighbor classification
Example

Reference: (Han et al., 2012)


Decision Tree
• Flowchart-like tree structure, where each node denotes a test on an
attribute value, each branch represents an outcome of the test, and
tree leaves represent class labels
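A learned tree can be read directly as nested tests. The sketch below hand-codes a small tree in the spirit of the classic "buys_computer" textbook example; the attribute names and splits are illustrative, not a tree learned from lecture data:

```python
def classify(age, student, credit_rating):
    """Walk the tree: each `if` is a node's attribute test,
    each return value is a leaf's class label."""
    if age == "youth":
        return "yes" if student == "yes" else "no"
    elif age == "middle_aged":
        return "yes"                      # this branch is a pure leaf
    else:  # senior
        return "yes" if credit_rating == "fair" else "no"

print(classify("youth", "yes", "fair"))       # yes
print(classify("senior", "no", "excellent"))  # no
```

Classifying a new tuple is just a walk from the root to a leaf, which is why decision trees are fast at prediction time and easy to explain.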
Rule-Based Classification
• Learned model is represented as a set of IF-THEN rules.

• Coverage and accuracy of a rule R:

coverage(R) = n_covers / |D|
accuracy(R) = n_correct / n_covers

D: class-labeled data set
|D|: number of tuples in D
n_covers: number of tuples covered by R
n_correct: number of tuples correctly classified by R
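The two measures can be computed directly from their definitions. A sketch with a hypothetical rule and a four-tuple toy data set (attribute names are illustrative):

```python
# Rule R: IF age = "youth" AND student = "yes" THEN buys_computer = "yes"
D = [
    {"age": "youth",  "student": "yes", "buys_computer": "yes"},
    {"age": "youth",  "student": "yes", "buys_computer": "no"},
    {"age": "youth",  "student": "no",  "buys_computer": "no"},
    {"age": "senior", "student": "yes", "buys_computer": "yes"},
]

# Tuples whose attributes satisfy the rule antecedent ("covered" by R).
covered = [t for t in D if t["age"] == "youth" and t["student"] == "yes"]
n_covers = len(covered)
# Covered tuples whose class label matches the rule consequent.
n_correct = sum(t["buys_computer"] == "yes" for t in covered)

coverage = n_covers / len(D)      # 2 / 4 = 0.5
accuracy = n_correct / n_covers   # 1 / 2 = 0.5
```

So this rule fires on half the data set (coverage 0.5) but is right only half the time it fires (accuracy 0.5).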
CLUSTERING
• Partitioning a set of data objects (or observations) into
subsets
• Each subset is called a cluster
– Maximize intra-cluster similarity
– Minimize inter-cluster similarity
• Example applications:
– Cluster suppliers into groups that have similar characteristics
– Discover clusters/Sub-clusters in handwritten character
recognition system
CLUSTERING

Reference: (Han et al., 2012)


Partitioning Methods
• The simplest and most fundamental version of cluster analysis
• The number of clusters is given as background knowledge and is the starting point for partitioning methods
• Given k, find a partition of k clusters that optimizes a chosen criterion, e.g. a dissimilarity function based on distance, or minimizing the sum of squared errors
• Each cluster has at least one object, each object belongs to one
cluster
• Most famous methods: k-means and k-medoids

Reference: (Han et al., 2012)


Partitioning Methods
K-means method:
• Each cluster’s center is represented by the mean value of the objects in the cluster (center of the cluster)
• Tries to partition n objects into k clusters in which each object belongs to the cluster with the nearest mean.
Partitioning Methods

K-medoids method:
• Each cluster is represented by one of the objects in the
cluster
Hierarchical Methods
• In some situations we may want to partition our data into groups at
different levels such as in a hierarchy.
• A hierarchical clustering method works by grouping data objects
into a hierarchy or “tree” of clusters.
• Useful for data summarization and visualization
• Example:
– Handwritten character recognition: A set of handwriting samples may
be first partitioned into general groups where each group corresponds
to a unique character. Some groups can be further partitioned into
subgroups since a character may be written in multiple substantially
different ways.
– Study of evolution: Hierarchical clustering may group animals
according to their biological features to uncover evolutionary paths,
which are a hierarchy of species.
Hierarchical Methods
Decomposition is formed in a bottom-up (merging) or top-down (splitting) approach

• Agglomerative methods start with individual objects as clusters,


which are iteratively merged to form larger clusters.

• For the merging step, it finds the two clusters that are closest to
each other (according to some similarity measure), and combines
the two to form one cluster.

• Two clusters are merged per iteration, where each cluster contains
at least one object, therefore an agglomerative method requires at
most n iterations.
Hierarchical Methods
Decomposition is formed in a bottom-up (merging) or top-down (splitting) approach

• Divisive methods start by placing all objects in one cluster, then divide the root cluster into several smaller sub-clusters, and recursively partition those clusters into smaller ones.

• The partitioning process continues until each cluster at the lowest


level is coherent enough—either containing only one object, or the
objects within a cluster are sufficiently similar to each other.

• In either agglomerative or divisive hierarchical clustering, a user


can specify the desired number of clusters as a termination
condition.
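The agglomerative procedure above can be sketched directly: start with singleton clusters and repeatedly merge the closest pair until k clusters remain. The toy data and the choice of single-link (minimum) distance are assumptions for illustration:

```python
def single_link(a, b):
    """Single-link distance: smallest gap between any pair of objects."""
    return min(abs(x - y) for x in a for y in b)

def agglomerative(data, k):
    clusters = [[x] for x in data]        # start: each object is its own cluster
    while len(clusters) > k:              # at most n - k merge iterations
        # Find the two closest clusters...
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs,
                   key=lambda p: single_link(clusters[p[0]], clusters[p[1]]))
        # ...and combine them into one cluster.
        clusters[i] += clusters.pop(j)
    return clusters

print(agglomerative([1, 2, 9, 10, 30], 3))  # [[1, 2], [9, 10], [30]]
```

Here the termination condition is the desired number of clusters, as described above; recording the order of merges instead would yield the full dendrogram.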
Hierarchical Methods

Reference: (Han et al., 2012)


Hierarchical clustering with R
ASSOCIATION ANALYSIS
• A frequent itemset: set of items that often appear together in a
transactional data set
Example; milk and bread, which are frequently bought together in
grocery stores by many customers

• Sequential pattern: a frequently occurring subsequence


Example; pattern that customers tend to purchase first a laptop,
followed by a digital camera, and then a memory card

Discovery of associations and correlations among items in


large transactional or relational data sets
Association Rule Mining

Given a set of transactions, find rules that will predict the


occurrence of an item based on the occurrences of other
items in the transaction

Market-Basket transactions:

TID | Items
1 | Bread, Milk
2 | Bread, Diaper, Beer, Eggs
3 | Milk, Diaper, Beer, Coke
4 | Bread, Milk, Diaper, Beer
5 | Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
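Support and confidence for a rule such as {Diaper} → {Beer} follow directly from counting transactions. A sketch using the five market-basket transactions above:

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def count(itemset):
    """Number of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions)

# support({Diaper, Beer}) = fraction of transactions with both items
support_db = count({"Diaper", "Beer"}) / len(transactions)      # 3/5 = 0.6
# confidence(Diaper -> Beer) = of the transactions with Diaper,
# how many also contain Beer
confidence_db = count({"Diaper", "Beer"}) / count({"Diaper"})   # 3/4 = 0.75
```

So the rule {Diaper} → {Beer} has 60% support and 75% confidence on this data set; frequent-itemset algorithms such as Apriori search for all rules exceeding user-set thresholds on these two measures.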

Reference: Tan, Steinbach & Kumar, Introduction to Data Mining


SEQUENTIAL PATTERN
MINING
Sequence database: A sequence database consists of ordered
elements or events

Example areas: Customer shopping habits, natural disasters,


medical treatments, phone call patterns, DNA sequences
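Checking whether an ordered pattern occurs as a subsequence of an event sequence is the core test in sequential pattern mining. A sketch on the laptop → camera → memory card example (customer histories are toy data):

```python
def contains(sequence, pattern):
    """True if pattern occurs in sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)  # `in` advances the iterator

histories = [
    ["laptop", "mouse", "camera", "memory card"],
    ["camera", "laptop", "memory card"],          # wrong order: no match
    ["laptop", "camera", "bag", "memory card"],
]
pattern = ["laptop", "camera", "memory card"]

freq = sum(contains(h, pattern) for h in histories)
print(freq)  # 2
```

Unlike frequent itemsets, order matters here: the second customer bought all three items but not in the pattern's order, so that history does not support the pattern.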
OUTLIER ANALYSIS
Outlier: a data object that does not comply with the general
behavior or model of the data
1) Many DM methods discard outliers as noise or exceptions
2) Rare events may be more interesting than regular objects in
some applications (e.g. fraud detection, risk analysis) (!)

Outlier Analysis

Example:
Uncover fraudulent usage of credit cards; purchases of
unusually large amounts/location/types of purchase/purchase
frequency for a given account number in comparison to regular
charges
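One simple way to flag such unusual charges is a z-score test against an account's regular spending. The amounts and the threshold of 2 standard deviations are illustrative assumptions, not lecture material:

```python
# Toy charge history for one account: nine regular purchases and one
# unusually large amount.
amounts = [25, 40, 32, 28, 35, 30, 27, 33, 29, 2000]

mean = sum(amounts) / len(amounts)
std = (sum((x - mean) ** 2 for x in amounts) / len(amounts)) ** 0.5

# Flag any charge more than 2 standard deviations from the mean.
outliers = [x for x in amounts if abs(x - mean) / std > 2]
print(outliers)  # [2000]
```

Real fraud detection would compare against per-account, per-location, and per-merchant profiles rather than a single global mean, but the principle, deviation from the account's regular behaviour, is the same.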
Data Mining Software
Free software
• R
• Weka
Commercial Software
• SAS Enterprise Miner
• SPSS Clementine
• IBM Intelligent Miner
What you should have learned…
References
• Han, J., Kamber, M., Pei, J. (2012). Data Mining: Concepts and
Techniques. 3rd Edition, Morgan Kaufmann.
• Tan, P.N., Steinbach, M., Kumar, V. (2013). Introduction to Data Mining.
Pearson
• Son, N.H. Introduction to KDD and data mining. Lecture Notes.
• NASA (2018) https://blogs.nasa.gov/disaster-response/ (accessed 8 Jan
2019)
• Reil, T. (2019). Artificial Neural Networks. http://web.cecs.pdx.edu/~mperkows/CAPSTONES/2005/L005.Neural_Networks.ppt (accessed 8 Jan 2019)

• https://www.kdnuggets.com/
Reading List
• Er Kara, M., Oktay Fırat, S.Ü., Ghadge, A. (In Press). A data mining-
based framework for supply chain risk management. Computers &
Industrial Engineering. https://doi.org/10.1016/j.cie.2018.12.017

• Zhao (2013) ‘R and Data Mining’. Available as pdf via HW library.


https://www.sciencedirect.com/book/9780123969637/r-and-data-
mining
Web Sources for DM
• https://www.kdnuggets.com/
• https://newonlinecourses.science.psu.edu/stat505/node/138/
• http://www.rdatamining.com
The next lecture will cover…
• Examples of different algorithms for
association, clustering and classification tasks
What questions do you have?
You can:
• Ask now.
• Check the reading list
• Email me: d.utomo@hw.ac.uk
DATA PREPROCESSING
Noise: a random error or variance in a measured variable

• Basic statistical description techniques (e.g., boxplots and scatter plots) and data visualization methods can be used to identify outliers
• Smoothing:
– Binning methods smooth a sorted data value by consulting its “neighborhood.” Sorted values are distributed into a number of “buckets,” or bins.
• Smoothing by bin means
• Smoothing by bin medians
• Smoothing by bin boundaries
• Regression: Linear regression involves finding the “best” line to fit
two attributes so that one attribute can be used to predict the other.
• Outlier analysis: e.g. clustering
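Smoothing by bin means can be sketched in a few lines; the toy prices below resemble the standard textbook illustration:

```python
# Sorted values are distributed into equal-frequency bins, then each value
# is replaced by its bin's mean.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
k = 3
n = len(prices) // k
bins = [prices[i * n:(i + 1) * n] for i in range(k)]   # equal-frequency bins

smoothed = [[sum(b) / len(b)] * len(b) for b in bins]  # smoothing by bin means
print(bins)      # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smoothed)  # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
```

Smoothing by bin medians works the same way with the median in place of the mean; smoothing by bin boundaries replaces each value with the closer of its bin's minimum and maximum.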
Rule-Based Classification
Conflict resolution strategies for rules
Size ordering strategy:
• Assigns highest priority to the triggering rule that has the highest
antecedent size
• Triggering rule with the most attribute tests is fired.
Rule ordering
• Class-based ordering: the classes are sorted in order of decreasing
“importance” such as by decreasing order of prevalence. That is, all
the rules for the most prevalent (or most frequent) class come first,
the rules for the next prevalent class come next, and so on.
• Rule-based ordering: the rules are organized into one long priority
list, according to some measure of rule quality, such as accuracy,
coverage, or size (number of attribute tests in the rule antecedent),
or based on advice from domain experts.
