Lecture Overview
• What is Data Mining?
Definition, origin, purpose, KDD, applications, data to be mined
• Data preparation
• Classification of DM Tasks
• Classification
Decision tree, rule-based classification
• Clustering
K-means, k-medoids, hierarchical methods
• Association analysis
• Outlier analysis
• Reading list & references
What is Data Mining?
• Process of extracting hidden and useful knowledge or
patterns from large databases
• Automates the information discovery process
_Origins of DM_
DM and Statistics
➢ Statistical models as outcome of DM
Example: In DM tasks such as data characterization and
classification, statistical models of target classes can be built.
➢ DM built upon statistical models
Example: Statistics may be used to model noise and missing
values in data.
➢ Statistics for mining patterns in the data set and
understanding the underlying mechanisms that generate
or affect these patterns
➢ Statistical methods to verify data mining results
Example: The results of a classification or prediction model should
be verified by statistical hypothesis testing
Why Data Mining?
• Statistical analysis deals with structured data to solve structured problems,
but DM is used to solve unstructured business problems
• Automatic analysis of massive data
• DM covers the entire process of data analysis, including data cleaning and
preparation, visualization of the results, and production of
predictions in real time
• DM can handle the following situations:
– Sample size is relatively large
– Mixture of different types of variables
– Existence of outliers and missing values
– Irrelevant and redundant attributes
Knowledge discovery (mining) in databases (KDD)
Data to be mined:
• Data warehouses
• Relational databases
• Transactional databases
• Time series data and temporal data
• Spatial data
• Financial transactions
• Credit card transactions
• Customer complaint calls
• Text data
• Multi-media databases
• Web data
A Multi-Dimensional View of DM Classification
Example DM Applications
Business:
• Sales forecasting
• Investment analysis
• Customer Relationship Management (CRM)
– Customer profiling, customer purchasing trends, customer churn
analysis, customer segmentation for target marketing
• Market basket analysis
• Market segmentation
• Quality control
• Fraud detection (identify credit applicants with high risk, intrusion
detection, etc.)
• Risk analysis (e.g. customer retention, quality control)
• Product benchmarking
• Energy usage analysis
Example DM Applications
Web: Web mining
• Search engines, rank web pages, mining purchasing data for
recommendations
• Social network analysis
• E-commerce
Data preparation
• Data cleaning
• Data integration
• Data reduction
• Data transformation
Data in Real World:
• Lots of potentially incorrect and inconsistent data, e.g., from faulty instruments, human or
computer error, transmission error
– incomplete: Lacking certain attributes of interest, or containing only aggregate data
• e.g., Education=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Job=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“52”, Birthday=“03/07/1986”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
Possible reasons:
• Faulty data collection instrument
• Data entry problems
• Data transmission problems
• Missing data
• Technology limitation, etc.
Reference: (Han et al., 2012)
DATA PREPROCESSING
Data cleaning: Remove noisy data, remove outliers, fill in
missing values, correct inconsistencies in data
Ways to handle missing values:
1) Ignore the tuple
2) Fill in the missing value manually
3) Use a global constant to fill in the missing value
4) Use a measure of central tendency for the attribute (e.g., the
mean or median) to fill in the missing value
5) Use the attribute mean or median for all samples belonging to the
same class as the given tuple
6) Use the most probable value to fill in the missing value
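As a minimal sketch of strategies 4 and 5 above, the following Python fills a missing value first with the attribute mean and then with the mean of the tuple's own class. The records, the class labels, and the function names are hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical records: (class_label, income). None marks a missing value.
records = [
    ("low_risk", 40.0),
    ("low_risk", None),
    ("low_risk", 60.0),
    ("high_risk", 20.0),
    ("high_risk", None),
    ("high_risk", 10.0),
]

def fill_with_global_mean(rows):
    """Strategy 4: replace missing values with the attribute mean."""
    known = [v for _, v in rows if v is not None]
    fill = mean(known)
    return [(c, fill if v is None else v) for c, v in rows]

def fill_with_class_mean(rows):
    """Strategy 5: replace missing values with the mean of the same class."""
    by_class = {}
    for c, v in rows:
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_mean = {c: mean(vs) for c, vs in by_class.items()}
    return [(c, class_mean[c] if v is None else v) for c, v in rows]

global_filled = fill_with_global_mean(records)
class_filled = fill_with_class_mean(records)
print(global_filled[1], class_filled[1])
```

The class-conditional fill keeps the two classes apart, whereas the global mean pulls both missing values towards the same number.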
Data integration: Merging of data from multiple sources into a
coherent data store such as a data warehouse.
Data reduction:
– Reduce data size by aggregating
– Dimension reduction / eliminating unimportant attributes (e.g.
Principal Component Analysis, wavelet transforms)
– Numerosity reduction (e.g. regression, histograms, clustering,
sampling, data cube aggregation)
– Data compression
– Discretization and concept hierarchy generation
– Clustering, etc.
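One of the numerosity reduction techniques listed above, the histogram, can be sketched as follows: the raw values are replaced by a few bucket counts. This is an illustrative equal-width version only. The price list and the function name are assumptions, not taken from the slides.

```python
def equal_width_histogram(values, n_bins):
    """Summarise `values` as counts over n_bins buckets of equal width."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values identical: a single bucket holds everything.
        return [lo, hi], [len(values)]
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so that the maximum value falls into the last bucket.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

# Hypothetical item prices (20 values reduced to 3 bucket counts).
prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 12, 14, 14, 15, 18, 20, 21, 25, 28, 30, 30]
edges, counts = equal_width_histogram(prices, 3)
print(edges, counts)
```

Only the bucket edges and counts need to be stored, which is the sense in which the histogram reduces the data.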
Example:
– Histogram for data reduction
Data Transformation
✓ Min-max normalization
✓ z-score normalization
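The two normalization schemes named under Data Transformation can be sketched as below, using their standard textbook definitions: min-max maps a value v to (v − min) / (max − min) scaled into a new range, and z-score maps v to (v − mean) / standard deviation. The income values are hypothetical, and the sketch assumes the attribute is not constant (otherwise the denominators are zero).

```python
from statistics import mean, pstdev

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score_normalize(values):
    """Centre values on the mean and scale by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)  # assumes sigma > 0
    return [(v - mu) / sigma for v in values]

# Hypothetical income values.
incomes = [12000, 16000, 54000, 73600, 98000]
scaled = min_max_normalize(incomes)
standardized = z_score_normalize(incomes)
print(scaled)
print(standardized)
```

Min-max keeps the values inside a fixed range, while z-score is the more usual choice when the minimum and maximum are unknown or dominated by outliers.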
CLASSIFICATION OF DM TASKS
Characterization and Discrimination
Describe classes or concepts in summarised, concise and
precise terms
Data characterization: summarization of the general characteristics or
features of objects in a target class of data
• Resulting descriptions can be represented as generalized relations or as
characteristic rules
• Outputs of characterisation can be represented in:
– Pie charts
– Bar charts
– Curves
– Multidimensional data cubes
– Multidimensional tables (e.g. Crosstabs).
Example: Summarise the characteristics of customers who spend more
than £1000 in a year in Sainsbury’s (married, 40-50 years old, employed,
occupation?)
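A characterization like the one in this example could be sketched as follows: select the target class (customers spending more than £1000 a year) and summarise how its members are distributed over each attribute. All records, field names and values here are made up for illustration.

```python
from collections import Counter

# Hypothetical customer records; field names and values are illustrative only.
customers = [
    {"annual_spend": 1500, "marital_status": "married", "age_band": "40-50", "employed": True},
    {"annual_spend": 1200, "marital_status": "married", "age_band": "40-50", "employed": True},
    {"annual_spend": 300, "marital_status": "single", "age_band": "20-30", "employed": False},
    {"annual_spend": 2100, "marital_status": "single", "age_band": "30-40", "employed": True},
    {"annual_spend": 1800, "marital_status": "married", "age_band": "40-50", "employed": True},
    {"annual_spend": 450, "marital_status": "married", "age_band": "50-60", "employed": True},
]

# Target class: customers who spend more than 1000 per year.
target = [c for c in customers if c["annual_spend"] > 1000]

def share(rows, key):
    """Return the relative frequency of each value of `key` in `rows`."""
    counts = Counter(r[key] for r in rows)
    return {value: n / len(rows) for value, n in counts.items()}

for attribute in ("marital_status", "age_band", "employed"):
    print(attribute, share(target, attribute))
```

The resulting frequency tables are the kind of summary that a pie chart, bar chart or crosstab would then display.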
Data discrimination: comparison of the general features of the target
class data objects against the general features of objects from one or
multiple contrasting classes
Techniques used for data discrimination are similar to those used for data
characterization with the exception that data discrimination results include
comparative measures.
Reference: https://gerardnico.com/data_mining/simple_regression
CLASSIFICATION
• Extracting models (functions) that map items into predefined classes
• The model is derived from the analysis of a set of training data
(i.e., data objects for which the class labels are known)
• Examples: detecting spam email messages based on title and
content, categorising cancer cells based on MRI scans, classifying
galaxies based on their shapes (NASA)
• Difference between prediction and classification?
1) Learning/training step:
• Classification algorithm builds the classifier by analyzing training
data that describe a predetermined set of data classes or concepts
2) Classification step:
• The classifier is used to predict class labels for new data
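The two steps can be illustrated with a very small rule-based classifier. The sketch below uses the 1R ("one rule") method, chosen here only because it is short. It is not necessarily the algorithm used in the lecture, and the spam attributes and training rows are hypothetical.

```python
from collections import Counter, defaultdict

def train_one_rule(rows, labels):
    """Learning step: pick the single attribute whose value -> majority-class
    rules make the fewest errors on the training data (1R method)."""
    best = None
    for attr in rows[0]:
        by_value = defaultdict(Counter)
        for row, label in zip(rows, labels):
            by_value[row[attr]][label] += 1
        # One rule per attribute value: predict the majority class for that value.
        rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
        if best is None or errors < best[0]:
            best = (errors, attr, rules)
    default = Counter(labels).most_common(1)[0][0]  # fallback for unseen values
    _, attr, rules = best
    return attr, rules, default

def classify(model, row):
    """Classification step: apply the learned rules to a new object."""
    attr, rules, default = model
    return rules.get(row[attr], default)

# Hypothetical training data with known class labels.
train_rows = [
    {"has_link": "yes", "all_caps_title": "yes"},
    {"has_link": "yes", "all_caps_title": "no"},
    {"has_link": "no", "all_caps_title": "yes"},
    {"has_link": "no", "all_caps_title": "no"},
    {"has_link": "yes", "all_caps_title": "yes"},
    {"has_link": "no", "all_caps_title": "no"},
]
train_labels = ["spam", "spam", "ham", "ham", "spam", "ham"]

model = train_one_rule(train_rows, train_labels)
print(model[0], classify(model, {"has_link": "yes", "all_caps_title": "no"}))
```

A decision tree extends the same idea by splitting again on further attributes within each branch.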
K-means method:
• Tries to partition n objects into k clusters in which each object
belongs to the cluster with the nearest mean.
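Partitioning objects around the nearest mean can be sketched with the usual iterative procedure: assign every object to its nearest mean, recompute the means, and repeat until nothing changes. The initialisation (the first k points), the sample points and the function names are assumptions made for this illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, n_iter=100):
    # Assumption: the first k points serve as the initial means.
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        # Assignment step: each object joins the cluster with the nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: recompute each mean (keep the old one if a cluster is empty).
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D objects forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

The result depends on the initial means, so in practice the procedure is often run several times with different starting points.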
Partitioning Methods
K-medoids method:
• Each cluster is represented by one of the objects in the
cluster
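To illustrate what it means for a cluster to be represented by one of its own objects, the sketch below picks the medoid as the member with the smallest total distance to the other members and compares it with the mean. The sample cluster is hypothetical, and this shows only the representative-selection step, not a full k-medoids algorithm.

```python
from math import dist

def medoid(cluster):
    """Return the member with the smallest total distance to all members."""
    return min(cluster, key=lambda c: sum(dist(c, p) for p in cluster))

def centroid(cluster):
    """Return the coordinate-wise mean, which need not be a member."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

# Hypothetical cluster with one extreme object.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (20.0, 20.0)]
print(medoid(cluster))
print(centroid(cluster))
```

The extreme object drags the mean away from the bulk of the cluster, while the medoid remains an actual, typical member. This is a commonly cited reason for preferring k-medoids when outliers are present.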
Hierarchical Methods
• In some situations we may want to partition our data into groups at
different levels such as in a hierarchy.
• A hierarchical clustering method works by grouping data objects
into a hierarchy or “tree” of clusters.
• Useful for data summarization and visualization
• Example:
– Handwritten character recognition: A set of handwriting samples may
be first partitioned into general groups where each group corresponds
to a unique character. Some groups can be further partitioned into
subgroups since a character may be written in multiple substantially
different ways.
– Study of evolution: Hierarchical clustering may group animals
according to their biological features to uncover evolutionary paths,
which are a hierarchy of species.
Hierarchical Methods
Is the decomposition formed in a bottom-up (merging) or
top-down (splitting) approach?
• In the merging step, an agglomerative method finds the two clusters
that are closest to each other (according to some similarity measure),
and combines the two to form one cluster.
• Two clusters are merged per iteration, where each cluster contains
at least one object, therefore an agglomerative method requires at
most n iterations.
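The bottom-up merging loop can be sketched as follows. The similarity measure used here is single-link distance (the distance between the two closest members), which is one common choice and an assumption of this example. The 1-D sample points are hypothetical.

```python
from math import dist

def single_link(a, b):
    """Distance between clusters = distance between their closest members."""
    return min(dist(p, q) for p in a for q in b)

def agglomerative(points):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[p] for p in points]  # start: every object is its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters that are closest to each other.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

# Hypothetical 1-D objects, written as 1-tuples.
points = [(1.0,), (2.0,), (6.0,), (7.0,), (15.0,)]
merges = agglomerative(points)
for left, right in merges:
    print(left, "+", right)
```

Starting from n singleton clusters, the loop performs n − 1 merges before a single cluster remains, which is within the bound of n iterations stated above. The merge history is what a dendrogram visualises.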
Market-Basket transactions
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Example of Association Rules
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk},
Implication means co-occurrence, not causality!
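The strength of an association rule is usually quantified with support (the fraction of transactions containing all items of the rule) and confidence (the fraction of transactions containing the left-hand side that also contain the right-hand side). These two measures are standard but are not defined in the slides themselves. The sketch below computes them for the five market-basket transactions. The function names are illustrative.

```python
# The five market-basket transactions, one set of items per TID.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions)

def support(lhs, rhs):
    """Fraction of all transactions that contain both sides of the rule."""
    return count(lhs | rhs) / len(transactions)

def confidence(lhs, rhs):
    """Among transactions containing lhs, the fraction that also contain rhs."""
    return count(lhs | rhs) / count(lhs)

rules = [
    ({"Diaper"}, {"Beer"}),
    ({"Milk", "Bread"}, {"Eggs", "Coke"}),
    ({"Beer", "Bread"}, {"Milk"}),
]
for lhs, rhs in rules:
    print(sorted(lhs), "->", sorted(rhs), support(lhs, rhs), confidence(lhs, rhs))
```

Both measures only count co-occurrence, which is why a rule with high confidence still says nothing about causality.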
Outlier Analysis
Example:
Uncover fraudulent usage of credit cards by detecting purchases
that are unusual for a given account number in comparison to its
regular charges, e.g., unusually large amounts, or an unusual
location, type of purchase, or purchase frequency
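A very simple way to compare a purchase with an account's regular charges is a z-score test on the amount: flag any new charge that lies more than a chosen number of standard deviations from the account's historical mean. This is only one of many outlier detection approaches. The amounts, the threshold of 3 and the function name are assumptions made for this sketch.

```python
from statistics import mean, stdev

def flag_unusual(history, new_amounts, threshold=3.0):
    """Return the new amounts lying more than `threshold` standard
    deviations away from the mean of the account's regular charges."""
    mu, sigma = mean(history), stdev(history)  # assumes history is not constant
    return [a for a in new_amounts if abs(a - mu) / sigma > threshold]

# Hypothetical regular charges for one account number, then three new purchases.
regular_charges = [25, 40, 32, 28, 35, 30]
new_purchases = [29, 1500, 38]
flagged = flag_unusual(regular_charges, new_purchases)
print(flagged)
```

Location, purchase type and purchase frequency could be screened in a similar way once they are encoded as comparable features.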
Data Mining Software
Free software
• R
• Weka
Commercial Software
• SAS Enterprise Miner
• SPSS Clementine
• IBM Intelligent Miner
What you should have learned…
References
• Han, J., Kamber, M., Pei, J. (2012). Data Mining: Concepts and
Techniques. 3rd Edition, Morgan Kaufmann.
• Tan, P.N., Steinbach, M., Kumar, V. (2013). Introduction to Data Mining.
Pearson
• Son, N.H. Introduction to KDD and data mining. Lecture Notes.
• NASA (2018) https://blogs.nasa.gov/disaster-response/ (accessed 8 Jan
2019)
• Reil, T. (2019). Artificial Neural Networks.
https://www.google.co.uk/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=2ahUKEwiDp-jy0t7fAhWBAWMBHR-FDY8QFjAAegQIBRAC&url=http%3A%2F%2Fweb.cecs.pdx.edu%2F~mperkows%2FCAPSTONES%2F2005%2FL005.Neural_Networks.ppt&usg=AOvVaw0l7xjtC5ZxBdincADnrFAr (accessed 8 Jan 2019)
• https://www.kdnuggets.com/
Reading List
• Er Kara, M., Oktay Fırat, S.Ü., Ghadge, A. (In Press). A data
mining-based framework for supply chain risk management. Computers &
Industrial Engineering. https://doi.org/10.1016/j.cie.2018.12.017