
Data Mining and OLAP
Stages of Data Mining and OLAP
(with thanks to Janet Francis)

CREATE THE DIFFERENCE


Aims
• This lecture aims to cover
– The nature of data mining
– Stages of Data Mining
– OLAP



What is Data Mining
 The term Data Mining is used because mining
for valuable data in a large database is similar
to mining for a valuable ore in a huge
mountain.
◦ In a mining operation large amounts of low grade
materials are sifted through in order to find
something of value.
◦ In its computing counterpart large volumes of data
are searched in an attempt to find something
worthwhile.



A useful Scenario
 The following scenario will be used in this
lecture in order to make the processes
seem more relevant.
◦ Beaconside JAMS PLC (BJP) supplies
jams, cake fillings and sauces to confectioners and
bakers. Customers include large
multinationals and small specialist outlets.
◦ BJP has centres in England, Scotland and Spain.
◦ BJP is not a manufacturing organisation: it is a
retailer, which means that it buys from a
distributor and sells on to customers.



Human vs Data Mining
• Human
– Usually takes the form of hypothesis verification
• The analyst has a theory, for example: we sell more high margin goods pro rata
to specialist outlets than to multinationals, and specialist outlets are
more profitable
• The analyst gathers the necessary data and proves, disproves, or
amends and retests the hypothesis
• Data Mining
– can perform hypothesis verification – on vast quantities of data
• Data Mining allows the user to discover patterns that the user did not
know existed!



Types of Data Mining
 Directed
 Undirected



Directed Data Mining
 A top-down approach, used when there is
some idea of what is being looked for (some
direction for the search) or an idea of what
might be predicted
 The goal is to create a predictive model or
set of models from the existing data which
can then be used to predict future trends.
 For example - which customers are most
likely to be interested in a new type of cake
filling



Undirected Data Mining
 A bottom-up approach: the data itself
determines the relationships, for example using
clustering. If patterns are found, it is for the user
to determine whether the patterns are useful or
not.
 The goal is to find patterns in the existing data.
◦ Human interaction is necessary because only people
can determine what significance, if any, the patterns
have
• This type of data mining is one of the key steps in
Knowledge Discovery in Databases (KDD).
• Necessary to know how the model works and how it
comes up with the answer in order to decide if
patterns are valid
 Example: People who are over 5ft tall with brown
hair like Blackcurrant Jam



Approaches to Data Mining
 Descriptive
◦ Describes the current data in terms of rules or
patterns

 Predictive
◦ Identify a set of rules/model which can be used
to predict currently unknown values



Descriptive Data Mining uses
 Market Basket Analysis
 Clustering
 Classification



Descriptive Data Mining uses: Market
Basket Analysis
 Identifies relationships between data – for
example, patterns in transaction purchases
 One or more rules can be developed. A rule is
supported according to how frequently it
occurs (its support), and a confidence can be
calculated and expressed as a ratio
 This is also known as association rule mining
 For example: People who buy Blackcurrant
Jam also buy Redcurrant Jelly
◦ Beer and Nappies?
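
A minimal sketch of how the support and confidence of a rule could be calculated. The baskets are made-up illustration data and the function names `support` and `confidence` are assumptions for this sketch, not part of any particular tool.

```python
# Hypothetical baskets; product names follow the lecture's example.
baskets = [
    {"Blackcurrant Jam", "Redcurrant Jelly"},
    {"Blackcurrant Jam", "Redcurrant Jelly", "Strawberry Jam"},
    {"Blackcurrant Jam"},
    {"Strawberry Jam"},
]


def support(itemset, baskets):
    """Fraction of all baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


# Rule: people who buy Blackcurrant Jam also buy Redcurrant Jelly.
rule_support = support({"Blackcurrant Jam", "Redcurrant Jelly"}, baskets)
rule_confidence = confidence({"Blackcurrant Jam"}, {"Redcurrant Jelly"}, baskets)
```

Here two of the four baskets contain both products, and two of the three baskets containing Blackcurrant Jam also contain Redcurrant Jelly.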



Example
• BJP analysts discovered that sales of Strawberry Jam
increased:
– When the customer was offered a small pot of Blackcurrant Jam
free with the purchase
– With the height of the person buying the product

• How commercially useful is this information?


• Just because there is a correlation does not mean
it is useful



Descriptive Data Mining uses:
Clustering
• Identifies the natural groupings within data, e.g. customers
may be sorted into groups. This is known as customer
segmentation and is useful in Customer Relationship
Management (CRM)
• Data items within groups should be as similar as possible to
each other and as different as possible to other groups
• Need to determine parameters which will result in realistic
clusters
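
The lecture does not name a clustering algorithm. As one possible illustration, the sketch below uses plain k-means, where the number of clusters and the starting centroids are the parameters that have to be chosen. The customer figures (jam orders, cake filling orders) are hypothetical.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of the points assigned to it."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customers as (jam orders, cake filling orders).
points = [(10, 0), (12, 1), (0, 9), (1, 11), (8, 9), (9, 10)]
start = [(10, 0), (0, 9), (8, 9)]  # assumed starting centroids, k = 3
final_centroids, clusters = kmeans(points, start)
```

With a different k or different starting centroids the groups can come out differently, which is why the parameters need care.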



Example
• BJP has identified clusters of customers who buy
only jam, customers who buy only cake fillings,
customers who buy both

• How would this be commercially useful?



Descriptive Data Mining uses:
Classification
• Data of interest is sorted into predefined classes
• BJP classifies customers as
– Multinational;
– UK based;
– independent chain;
– single outlet



Predictive Data Mining Use
 Customers in the single outlet category
typically order jams and sauces but not
cake fillings
 A new client is placed in the single outlet
category – it is possible to predict likely
ordering patterns
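
A rough sketch of this kind of prediction: the typical products for a class are read off past orders and then assumed for a new client in the same class. The order history, the `typical_products` function and the `min_share` threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical order history: (customer class, product type ordered).
orders = [
    ("single outlet", "jam"),
    ("single outlet", "sauce"),
    ("single outlet", "jam"),
    ("multinational", "cake filling"),
    ("multinational", "jam"),
    ("multinational", "cake filling"),
]


def typical_products(customer_class, orders, min_share=0.25):
    """Products making up at least min_share of the past orders of a class."""
    counts = Counter(product for cls, product in orders if cls == customer_class)
    total = sum(counts.values())
    return sorted(p for p, n in counts.items() if n / total >= min_share)


# A new client placed in the single outlet category is predicted to order:
prediction = typical_products("single outlet", orders)
```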



Stages in Data-Mining
1. Preparation of data
◦ This stage involves selection and preparation of input data from a
variety of sources
 Data integration
 Data cleansing
 Data warehousing (this usually includes the above)

2. Mining stage
◦ This stage involves producing useful predictive models (OLAP)

3. Interpretation and Evaluation – Knowledge Discovery
◦ The final stage involves deploying the models and applying them to
new data in order to generate predictions or new knowledge.



1. Preparation of Data
 Input data must be in or converted to
electronic form. It could come from a
variety of different sources such as:
◦ Operational Databases (sales, finance etc.)
◦ Commercial Databases (demographics)
◦ Internet documents
◦ Spreadsheets or other “office” documents
 The input data must be integrated and
cleansed.
 Note – much of the preparation is


Data Integration
 Data from different sources must be
integrated to provide homogeneity
◦ Involves de-normalisation of databases
◦ Dates and times must be of the same format.
◦ Records must be of the same type
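
As a small illustration of bringing dates into one format, the sketch below converts a few date styles to ISO form. The list of source formats in `KNOWN_FORMATS` is an assumption; a real integration job would list whatever formats its sources actually use.

```python
from datetime import datetime

# Assumed formats found in the different source systems.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d %B %Y"]


def to_iso_date(text):
    """Return the date as YYYY-MM-DD, trying each known format in turn."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")


dates = [to_iso_date(d) for d in ["2011-03-05", "05/03/2011", "5 March 2011"]]
```

Note that "05/03/2011" is read here as day/month/year. An ambiguous format like this has to be settled per source before integration.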



Data Cleansing
 Once integrated, the data must be cleansed to resolve the following
issues
◦ Duplicate data
 Need to delete
◦ Missing values (unrecorded or really missing?)
 Unrecorded: the value might not have been required in one or more of the
contributing data sets. It could be added if it can be derived from other values, e.g. post code.
 Really missing: the absence could itself be meaningful, e.g. an unpaid bill.
 Need to decide how missing values will be represented.
◦ Irrelevant values
 Need expert to identify sets and delete
◦ Inaccurate data
 Anomalies can be identified by using graphs and clusters. Values outside the normal
expected range can be investigated.
◦ Old data
 Need to delete
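
A minimal sketch of three of these cleansing steps: removing duplicates, marking missing values, and setting aside out-of-range values for investigation. The records, the "UNKNOWN" marker and the range limits are hypothetical choices.

```python
# Hypothetical integrated records.
records = [
    {"customer": "A Bakery", "post_code": "ZZ1 1AA", "order_value": 120.0},
    {"customer": "A Bakery", "post_code": "ZZ1 1AA", "order_value": 120.0},  # duplicate
    {"customer": "B Cakes", "post_code": None, "order_value": 95.0},         # missing value
    {"customer": "C Foods", "post_code": "ZZ2 2BB", "order_value": -5000.0}, # out of range
]


def cleanse(records, low=0.0, high=10_000.0):
    """Drop exact duplicates, mark missing post codes, and separate
    records whose order value falls outside the expected range."""
    seen, clean, suspect = set(), [], []
    for record in records:
        key = tuple(sorted(record.items(), key=lambda item: item[0]))
        if key in seen:
            continue  # duplicate data: delete
        seen.add(key)
        record = dict(record)  # copy so the input is left unchanged
        if record["post_code"] is None:
            record["post_code"] = "UNKNOWN"  # agreed representation of missing
        if low <= record["order_value"] <= high:
            clean.append(record)
        else:
            suspect.append(record)  # to be investigated, not silently dropped
    return clean, suspect


clean, suspect = cleanse(records)
```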



What are demographic overlays?
 Most customer databases include post codes.
 Various data is collected via the census and organised by post code, e.g.
◦ Gender Distribution
◦ Age distribution
 Other data is known about areas, e.g.
◦ Proximity to the coast
◦ Major employers
◦ Proximity to National parks
 This data could be used in conjunction with customer data to
predict trends, e.g.
◦ If a product sells well in one area close to the coast with a higher than
average percentage of old ladies, then it might be worth marketing that
product in other such areas.
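
The overlay idea can be sketched as a join on post code. All names, post codes and figures below are hypothetical, and the 25% threshold is just an assumed cut-off for "higher than average".

```python
# Hypothetical customer data and area (demographic) data keyed by post code.
customers = [
    {"name": "Harbour Bakery", "post_code": "ZZ1", "product": "Blackcurrant Jam"},
    {"name": "Pier Cakes", "post_code": "ZZ1", "product": "Blackcurrant Jam"},
    {"name": "Hilltop Bakes", "post_code": "ZZ2", "product": "Strawberry Jam"},
]
area_data = {
    "ZZ1": {"near_coast": True, "pct_over_65": 31},
    "ZZ2": {"near_coast": False, "pct_over_65": 14},
    "ZZ3": {"near_coast": True, "pct_over_65": 29},
}


def overlay(customers, area_data):
    """Attach the area attributes to each customer record by post code."""
    return [{**c, **area_data.get(c["post_code"], {})} for c in customers]


enriched = overlay(customers, area_data)
coastal_sales = sum(
    1 for c in enriched
    if c.get("near_coast") and c["product"] == "Blackcurrant Jam"
)

# Similar areas with no customers yet: candidates for marketing the product.
served = {c["post_code"] for c in customers}
candidates = sorted(
    code for code, area in area_data.items()
    if area["near_coast"] and area["pct_over_65"] >= 25 and code not in served
)
```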



2. Mining stage
A Typical Data Set

[Table: customer names in a certain post code area]

It is known that in this area 75% of the population is considered Rich and 75% is male.



Histograms

[Figure: 1-dimensional and 2-dimensional histograms. Axes: Number and age band (21-30, 31-40, 41-50, 51-60, 61-70); series: Poor/Rich, shown for Male and Female; one panel is scaled 0% to 100%.]



Into the 3rd Dimension
[Figure: three-dimensional table (cube) with axis labels Female (F) / Male (M) and Poor / Rich]

• Even with just two attributes, each with two values, the table is
more difficult to understand.
• What if there were 16 attributes, each with multiple values?
• The number of 2-D histograms which could potentially be
useful would be over 100.
• This structure is known as an OLAP cube.
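
As a quick check of the "over 100" figure: a 2-D histogram needs a pair of attributes, so with 16 attributes the number of possible pairs is 16 choose 2. The variable names are only for illustration.

```python
from math import comb

attributes = 16
pairs = comb(attributes, 2)    # possible 2-D histograms: 16 * 15 / 2 = 120
triples = comb(attributes, 3)  # possible 3-D views: 560
```

So there are 120 candidate 2-D histograms, and the count grows quickly once three attributes are combined.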



On-line Analytical Processing (OLAP)
 OLAP functionality is characterised by dynamic multi-dimensional
analysis of consolidated enterprise data:
◦ Slice: A slice is a subset of a multi-dimensional array corresponding to a
single value for one or more members of the dimensions not in the
subset.
◦ Dice: The dice operation is a slice on two or more dimensions of a
data cube (or two or more consecutive slices).
◦ Drill Down/Up: Drilling down or up is a specific analytical technique
whereby the user navigates among levels of data ranging from the most
summarized (up) to the most detailed (down).
◦ Roll-up: A roll-up involves computing all of the data relationships for one
or more dimensions. To do this, a computational relationship or formula
might be defined.
◦ Pivot: To change the dimensional orientation of a report or page display
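
The operations above can be sketched on a tiny cube held as a Python dictionary. This is only an illustration of the ideas, not how an OLAP server stores data; the cell counts and the function names are hypothetical.

```python
from collections import defaultdict

DIMENSIONS = ("gender", "wealth", "age_band")

# Hypothetical cube: customer counts keyed by (gender, wealth, age_band).
cube = {
    ("Male", "Rich", "21-30"): 4,
    ("Male", "Poor", "21-30"): 1,
    ("Female", "Rich", "21-30"): 2,
    ("Male", "Rich", "31-40"): 5,
    ("Female", "Poor", "31-40"): 3,
}


def slice_cube(cube, dimension, value):
    """Slice: fix one dimension to a single value and drop it from the keys."""
    i = DIMENSIONS.index(dimension)
    return {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}


def dice(cube, allowed):
    """Dice: keep cells whose value on each listed dimension is allowed."""
    return {
        k: v for k, v in cube.items()
        if all(k[DIMENSIONS.index(d)] in values for d, values in allowed.items())
    }


def roll_up(cube, keep):
    """Roll-up: total the counts over every dimension not listed in keep."""
    indexes = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for k, v in cube.items():
        totals[tuple(k[i] for i in indexes)] += v
    return dict(totals)


def pivot(table):
    """Pivot: swap the two axes of a two-dimensional table."""
    return {(b, a): v for (a, b), v in table.items()}


males = slice_cube(cube, "gender", "Male")
rich_males = dice(cube, {"gender": {"Male"}, "wealth": {"Rich"}})
by_gender = roll_up(cube, ("gender",))                    # most summarised
by_gender_wealth = roll_up(cube, ("gender", "wealth"))    # one level of drill down
by_wealth_gender = pivot(by_gender_wealth)
```

Moving from `by_gender` to `by_gender_wealth` is a drill down; going the other way is a drill up.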



OLAP

 Uses various algorithms – examples are:
a. Decision trees
b. K-Nearest neighbour



Decision trees

 The Decision Tree is one of the most popular
classification algorithms in current use in Data
Mining
◦ A decision tree takes as input an object or
situation described by a set of properties, and
outputs a yes/no decision.
◦ The algorithm is recursive partitioning (divide and
conquer).
◦ Internal nodes denote a test.
◦ A branch represents the outcome of a test.


Decision tree example
[Figure: decision tree for the candidate class label "Rabbit". Each internal node is a test with Y/N branches; a branch that fails a test leads to "Not in class!", and the tree ends in a leaf node.]

Test          Rule
Wings?        Rabbit does not have wings
Swims?        Rabbit does not swim
Legs?         Rabbit has legs
Whiskers?     Rabbit has whiskers
Eats Meat?    Rabbit does not eat meat
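
The rabbit example can be sketched as a chain of tests, where any unexpected answer leaves the class. The attribute names and the sample animals are assumptions made for this sketch.

```python
# Tests in the order of the example, with the answer a rabbit must give.
RABBIT_TESTS = [
    ("wings", False),
    ("swims", False),
    ("legs", True),
    ("whiskers", True),
    ("eats_meat", False),
]


def is_rabbit(animal):
    """Walk the tests in order; any unexpected answer means 'not in class'."""
    for attribute, expected in RABBIT_TESTS:
        if animal[attribute] != expected:
            return False  # "Not in class!"
    return True  # reached the leaf node: candidate is in the class


rabbit = {"wings": False, "swims": False, "legs": True,
          "whiskers": True, "eats_meat": False}
cat = {"wings": False, "swims": False, "legs": True,
       "whiskers": True, "eats_meat": True}
```

Here `is_rabbit(cat)` only fails at the last test, which shows why the order of the tests matters for reaching a decision quickly.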



Need to decide
 Which attributes to select in order to
identify the class of the sample as quickly
as possible?
 When to stop?
◦ No remaining attributes to test or when the
class is determined



K-Nearest neighbour
 The k-nearest neighbour algorithm (k-NN) is a popular method for classification
◦ The feature space is a multidimensional space where each pattern sample is
represented as a point whose dimension is determined by the number of features
used to describe the patterns.
◦ Firstly, the training samples and their class labels are plotted in the
multidimensional feature space. The space is then partitioned into regions by class
labels of the training samples. The training phase of the algorithm consists simply of
plotting the points in the feature space.
◦ In the actual classification phase, the same features as before are computed for a
test sample. Distances from the new point to all stored points are computed and the k
closest samples are selected. The test sample is assigned to the class whose label is
the most frequent among the k nearest training samples.
 The algorithm is easy to implement, but it is computationally intensive,
especially when the size of the training set grows.
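
A minimal sketch of the classification phase described above, using Euclidean distance in a two-feature space. The training points, the labels and the function name `knn_classify` are hypothetical.

```python
import math
from collections import Counter


def knn_classify(training, sample, k=3):
    """Assign sample the most frequent label among its k nearest training points.

    Every stored point is compared with the sample, which is why the cost
    grows with the size of the training set."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], sample))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training samples: ((feature 1, feature 2), class label).
training = [
    ((1.0, 1.0), "red"), ((1.5, 2.0), "red"), ((2.0, 1.0), "red"),
    ((6.0, 6.0), "blue"), ((7.0, 5.5), "blue"), ((6.5, 7.0), "blue"),
]

label_a = knn_classify(training, (2.0, 2.0), k=3)
label_b = knn_classify(training, (6.0, 5.0), k=3)
```

An odd k avoids tied votes when there are only two classes.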



K-Nearest neighbour Example
This is simplistic – usually 16 or more attributes are used.
The small coloured dots are the training samples
Each colour represents a different class label
The large black dots are test samples

When k > 5, boundaries are less distinct in most cases



Need to decide
 A value for k

 The best choice of k depends upon the data
◦ Generally, larger values of k reduce the effect of noise on the
classification, but make boundaries between classes less distinct



3. Interpretation and Evaluation
 Uses of Data Mining in business
◦ Market segmentation
 Identify the common characteristics of customers who buy certain products from a
company.
◦ Customer churn
 Predict which customers are likely to leave your company and go to a competitor.
◦ Fraud detection
 Identify which transactions are most likely to be fraudulent.
◦ Direct marketing
 Identify which prospects should be included in a mailing list to obtain the highest
response rate.
◦ Supermarket basket analysis
 Understand what products or services are commonly purchased together.
◦ Trend analysis
 Reveal the difference between a typical customer this month and last. Allows
organisations to map trends and



