
Informatics

Lecture 6: Processing Information


Introduction
We have no shortage of data about almost
anything of interest
A well-designed database can make that data
easy to access
SQL can perform simple interrogations of
the data
A huge amount of useful information lies
hidden, however: hence the need for data mining

Introduction
So in this lecture we will look at the elements
of data mining
We will begin, however, by looking at simple
ways in which our original data may be
processed so that the more complex stages
later on are not compromised

Processing data
Regardless of the source of the data, we can
encounter a number of issues:
Errors: some data is wrong due to a fault or a
simple transcription error
Outliers: some data is very different from the
rest; this can be significant if true
Calibration: the data may need to be
converted to a physical quantity so that it can be checked

Processing data
Test artifact: it is sometimes possible to
include an object in the data collection whose
properties are well known; we can then
check what has been recorded against it
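
As a rough illustration of two of the issues above, the sketch below flags suspect readings and rescales a raw reading using a test artifact. The function names, the sample readings and the choice of the modified z-score (median and median absolute deviation, with the commonly quoted 3.5 cut-off) are assumptions made for this example; the lecture does not prescribe a particular method.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold.

    Uses the median and the median absolute deviation (MAD), which are
    less distorted by the outliers themselves than the mean would be.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread at all, so nothing can be singled out
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


def calibrate(raw, raw_reference, true_reference):
    """Scale a raw reading using a test artifact whose true value is known.

    Assumes the instrument response is proportional (no offset).
    """
    return raw * true_reference / raw_reference


# Hypothetical raw readings containing one suspicious value.
readings = [10.1, 9.8, 10.3, 10.0, 55.0, 9.9]
suspects = flag_outliers(readings)
print(suspects)
```

A flagged value is only a candidate for checking: as noted above, an outlier can be significant if it turns out to be true.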

Processing data
With data that begins as analogue, especially
audio and video, there are a number of
processing methods that can be used to prepare
the data for later stages:
Stretch: if the data can range from 0-100 but
we only record 0-20, we can stretch the data
to use the whole range
Equalise: we can modify a range of 20-60 to
use 0-100
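
Both operations above can be read as the same linear mapping from a recorded range onto the full range. The following is a minimal sketch under that assumption; `rescale` and its parameter names are hypothetical.

```python
def rescale(value, in_min, in_max, out_min=0.0, out_max=100.0):
    """Map value linearly from [in_min, in_max] onto [out_min, out_max]."""
    if in_max == in_min:
        raise ValueError("input range must have non-zero width")
    return out_min + (value - in_min) * (out_max - out_min) / (in_max - in_min)


# Recorded range 0-20 stretched to 0-100.
stretched = [rescale(v, 0, 20) for v in [0, 5, 20]]
# Recorded range 20-60 mapped onto 0-100.
shifted = [rescale(v, 20, 60) for v in [20, 40, 60]]
print(stretched, shifted)
```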

Processing data
Filtering
Low-pass filter: removes hiss and noise
High-pass filter: removes rumble and hum
Band-pass filter: selective filtering

Averaging: to smooth noisy data and prevent
data spikes
Enhancements: a huge range in images for
deblurring, distortion correction and feature extraction
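
Averaging can be sketched as a moving average, which is itself a simple form of low-pass filter. The window size and the sample signal below are assumptions chosen for illustration.

```python
def moving_average(samples, window=3):
    """Return the mean of each run of `window` consecutive samples."""
    if window < 1 or window > len(samples):
        raise ValueError("window must be between 1 and len(samples)")
    return [
        sum(samples[i:i + window]) / window
        for i in range(len(samples) - window + 1)
    ]


# Hypothetical signal with a single spike in the middle.
noisy = [2, 2, 2, 11, 2, 2, 2]
smoothed = moving_average(noisy, window=3)
print(smoothed)
```

The spike is spread over neighbouring samples and reduced in height rather than removed, so a larger window smooths more at the cost of blurring genuine detail.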

Examples

What is data mining?


The non-trivial extraction of implicit,
previously unknown and potentially useful
knowledge from data
KDD: a process of Knowledge Discovery in
Databases
Associated areas are Statistics, SQL, Machine
Learning, AI and Expert Systems

Knowledge is power
Remember the hierarchy that we aspire to work
through:

Data: facts and figures; accuracy is important
Information: organised data for analysis
Knowledge: interpretation to inform action

Application areas

Insurance: claim analysis and risk
Medical: diagnosis and preventative medicine
Banking: identifying fraud
Marketing: new customers and sales
Science: human genome project
Security: identifying behaviours
Business intelligence: trends and threats

Scope of data mining


Data mining can try to use data in a variety of
ways using sophisticated mathematical
techniques:
Classification
Estimation
Clustering
Association

Classification
Use data to predict the category of an object,
e.g. someone to lend money to, someone to
arrest, or someone who will make a
certain kind of purchase
The result of a classification problem can be a
decision tree, which shows how a new object
can be classified on the basis of the existing
data

Classification
Data
age   cartype     risk
23    saloon      low
30    sports      low
36    saloon      low
25    hatchback   high
30    saloon      low
23    hatchback   high
30    hatchback   low
25    sports      high
18    saloon      low

Age
  <= 25: Car Type
      saloon: low risk
      sports, hatchback: high risk
  > 25: low risk
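
The decision tree can be written as a small function and checked against the nine records in the data table. This is only a sketch: `classify` and `RECORDS` are names invented here, and the tree is hand-coded from the slide rather than learned from the data.

```python
# (age, cartype, risk) records copied from the data table.
RECORDS = [
    (23, "saloon", "low"),
    (30, "sports", "low"),
    (36, "saloon", "low"),
    (25, "hatchback", "high"),
    (30, "saloon", "low"),
    (23, "hatchback", "high"),
    (30, "hatchback", "low"),
    (25, "sports", "high"),
    (18, "saloon", "low"),
]


def classify(age, car_type):
    """Follow the decision tree: split on age first, then on car type."""
    if age > 25:
        return "low"
    return "low" if car_type == "saloon" else "high"


# Every existing record should be classified as it is labelled.
mismatches = [r for r in RECORDS if classify(r[0], r[1]) != r[2]]
print(mismatches)

# A new object: a 21-year-old with a sports car.
print(classify(21, "sports"))
```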

Estimation
Similar to classification in that a model is
created
The model allows the output of a continuous
variable to be predicted
The model could be a mathematical function
that predicts a value, or a theorem
that predicts a value or perhaps
even a behaviour
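
One of the simplest estimation models is a straight line fitted by least squares. The sketch below assumes a single input variable and uses made-up data; `fit_line` is a hypothetical helper, not a method named in the lecture.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations that happen to lie on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)

# Use the model to estimate the continuous output for an unseen input.
predicted = slope * 5 + intercept
print(slope, intercept, predicted)
```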

Clustering
Can we analyse the data for a set of objects
and identify sub-groups and their membership?
We may know the sub-groups and some
existing members and want to know what
data helps identify which cluster a new object
will belong to

Clustering

Reproduced from Adriaans and Zantinge

Clustering: K-means example

The general idea of a clustering technique is to divide
the population into partitions
Starts with an initial random selection of K partitions
Then points are moved into each partition using a
centroid calculation and a similarity measure in an
iterative process until the final set of clusters stabilises
The final set is then evaluated
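
The iterative process described above can be sketched as follows, assuming squared Euclidean distance as the similarity measure and K randomly chosen data points as the starting centroids. The function names and sample points are hypothetical, and the final evaluation step is left out.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, max_iter=100):
    """Return (centroids, clusters) once the assignments stop changing."""
    centroids = list(centroids)
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # the clusters have stabilised
        assignment = new_assignment
        # Move each centroid to the mean of the points assigned to it.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    clusters = [
        [p for p, a in zip(points, assignment) if a == c]
        for c in range(len(centroids))
    ]
    return centroids, clusters


# Hypothetical 2-D population with two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
rng = random.Random(0)             # seeded so that a run is repeatable
initial = rng.sample(points, 2)    # initial random selection, K = 2
centroids, clusters = kmeans(points, initial)
print(clusters)
```

Because the start is random, different seeds can give different final clusters on less clearly separated data, which is one reason the final set needs to be evaluated.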

Association
Seeking co-occurrences of groups of data
items in a data set
Association can be in time, i.e. a sequential
pattern
Can be very popular with retailers to target
advertising for related purchases and for store
layouts

Association rules
Rules are of the form X => Y
where X and Y are distinct sets of items

The importance of a rule is described by its
support and its confidence
Support: % of transactions containing both X
and Y
Confidence: % of transactions with X that
also contain Y

Association rules
[Venn diagram: within the set of all transactions, the transactions with X overlap the transactions with Y; the overlap is the transactions with X and Y]

Support of X=>Y = Support of Y=>X = 3/10 = 30%
Confidence of X=>Y = 3/4 = 75%
Confidence of Y=>X = 3/5 = 60%

Association rules example

Transaction   Items bought
1             milk, eggs, tea
2             butter, milk, sugar, tea
3             biscuits, sugar, eggs
4             tea, coffee, eggs
5             coffee, chocolate, sugar

Rule                       Support   Confidence
milk => eggs               20%       50%
eggs => tea                40%       66.7%
sugar => {butter, milk}    20%       33.3%
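
The support and confidence figures for the three rules can be reproduced from the five transactions. The sketch below follows the definitions given earlier; the function and variable names are assumptions for this example.

```python
TRANSACTIONS = [
    {"milk", "eggs", "tea"},
    {"butter", "milk", "sugar", "tea"},
    {"biscuits", "sugar", "eggs"},
    {"tea", "coffee", "eggs"},
    {"coffee", "chocolate", "sugar"},
]


def support(lhs, rhs, transactions):
    """Fraction of all transactions that contain every item of X and Y."""
    both = set(lhs) | set(rhs)
    return sum(both <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Fraction of the transactions containing X that also contain Y."""
    lhs = set(lhs)
    with_lhs = [t for t in transactions if lhs <= t]
    if not with_lhs:
        return 0.0
    both = lhs | set(rhs)
    return sum(both <= t for t in with_lhs) / len(with_lhs)


RULES = [
    ({"milk"}, {"eggs"}),
    ({"eggs"}, {"tea"}),
    ({"sugar"}, {"butter", "milk"}),
]
for lhs, rhs in RULES:
    s = support(lhs, rhs, TRANSACTIONS)
    c = confidence(lhs, rhs, TRANSACTIONS)
    print(sorted(lhs), "=>", sorted(rhs), f"{s:.1%}, {c:.1%}")
```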

Association: issues
The number of rules grows exponentially with the number
of items
The user has to specify
Minimum Support (e.g. 10%) and
Minimum Confidence (e.g. 70%) levels
Which rules are interesting? "Interesting" has to be defined
Negative rules can also be interesting
e.g. 70% of customers buying crisps do not buy cream
but allowing absence implies millions of useless rules!
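
To give a feel for the growth, the sketch below counts the possible rules X => Y for a small number of items, reading "distinct sets" as non-empty sets with no item in common (an assumption). Under that reading each item is in X, in Y or in neither, which gives the closed form 3^n - 2^(n+1) + 1; the code checks it against a direct count.

```python
from itertools import combinations


def count_rules(items):
    """Count rules X => Y with X and Y non-empty and sharing no items."""
    items = list(items)
    total = 0
    for size_x in range(1, len(items)):
        for x in combinations(items, size_x):
            rest = [i for i in items if i not in x]
            # Any non-empty subset of the remaining items can serve as Y.
            total += 2 ** len(rest) - 1
    return total


def rule_count_formula(n):
    """Closed form for the same count with n items."""
    return 3 ** n - 2 ** (n + 1) + 1


counts = [count_rules(range(n)) for n in range(1, 7)]
print(counts)
```

Minimum support and confidence levels are what keep this search manageable in practice, since most candidate rules fall below them.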

Hierarchies
Items are grouped
e.g. pen, pencil are writing tools
Can have different rules for groups than for
individual items
e.g. a strong positive association between
crisps and biscuits, but negative
associations lower in the hierarchy
The hierarchy can be used to define what is interesting
e.g. rules across groups can be more
interesting than rules within groups
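
The effect of grouping can be sketched by replacing each item with its group before counting co-occurrences. Only the pen and pencil grouping comes from the slide; the other items, the "paper" group and the transactions are hypothetical data invented for this example.

```python
# Item -> group mapping; "pen" and "pencil" as writing tools is from the
# slide, the "paper" group is an assumption for illustration.
GROUP_OF = {
    "pen": "writing tools",
    "pencil": "writing tools",
    "notepad": "paper",
    "envelope": "paper",
}


def generalise(transaction, group_of):
    """Replace each item by its group (items without a group are kept)."""
    return {group_of.get(item, item) for item in transaction}


def support(itemset, transactions):
    """Fraction of transactions containing every member of itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


# Hypothetical baskets.
transactions = [
    {"pen", "notepad"},
    {"pencil", "envelope"},
    {"pen", "envelope"},
    {"pencil", "notepad"},
]
grouped = [generalise(t, GROUP_OF) for t in transactions]

item_level = support({"pen", "notepad"}, transactions)
group_level = support({"writing tools", "paper"}, grouped)
print(item_level, group_level)
```

Here the group-level pairing appears in every basket while any one item-level pairing is rare, so a rule can be strong at one level of the hierarchy and weak at another.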

Hierarchies
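[Figure: item hierarchy with Crisps and Biscuits at the top level, showing positive (+ve) and negative (-ve) associations between nodes at different levels]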

Process
Input data from repository
-> Data pre-processing (cleansing, quality)
-> Mining patterns
-> Data post-processing
-> Output patterns

Redrawn from Du, p14

Pre-processing
We need to understand the
data that we are using: its type
and quality
This will inform the mining
technique to be used
Data visualisation can also
inform the mining process

Target

Precise, inaccurate, biased
Precise, accurate, unbiased
Imprecise, inaccurate, biased
Imprecise, accurate, unbiased
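
The four target cases can be told apart numerically: bias is the average error against the true value, and precision is how tightly repeated measurements agree. The sketch below uses the population standard deviation as the spread measure and made-up measurements; both choices are assumptions.

```python
from statistics import mean, pstdev


def bias_and_spread(measurements, true_value):
    """Return (mean error, standard deviation) of repeated measurements.

    A mean error near zero suggests an unbiased (accurate) process;
    a small standard deviation suggests a precise one.
    """
    errors = [m - true_value for m in measurements]
    return mean(errors), pstdev(measurements)


TRUE_VALUE = 10.0  # hypothetical known quantity

# Tightly grouped but off target: precise, inaccurate, biased.
precise_biased = bias_and_spread([12.0, 12.1, 11.9], TRUE_VALUE)
# Scattered around the target: imprecise, accurate, unbiased.
imprecise_unbiased = bias_and_spread([8.0, 12.0, 10.0], TRUE_VALUE)
print(precise_biased, imprecise_unbiased)
```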

DM vs. Query Tools


If you know what you want, use SQL (the database
query language)
SQL finds data under known constraints
SQL cannot readily find hidden knowledge
DM finds hidden nuggets
DM can find interesting patterns, irregularities and
optimal clusters
DM can use repeated SQL queries
DM gives more possibilities
DM requires a good foundation in the data
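
As an illustration of SQL finding data under a known constraint, the sketch below loads the car insurance records from the classification example into an in-memory SQLite database and asks a question whose form is known in advance. The table name `policy` and its column names are assumptions made for this example.

```python
import sqlite3

# (age, cartype, risk) records from the classification example.
ROWS = [
    (23, "saloon", "low"),
    (30, "sports", "low"),
    (36, "saloon", "low"),
    (25, "hatchback", "high"),
    (30, "saloon", "low"),
    (23, "hatchback", "high"),
    (30, "hatchback", "low"),
    (25, "sports", "high"),
    (18, "saloon", "low"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE policy (age INTEGER, cartype TEXT, risk TEXT)")
conn.executemany("INSERT INTO policy VALUES (?, ?, ?)", ROWS)

# Known constraint: we already know we want the high-risk records.
high_risk = conn.execute(
    "SELECT age, cartype FROM policy WHERE risk = 'high' "
    "ORDER BY age, cartype"
).fetchall()
conn.close()
print(high_risk)
```

The query returns the matching rows but does not reveal why they are high risk; finding the age and car type split is the data mining step.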

Reading

Hongbo Du (generally online resource)
Adriaans and Zantinge (a small book)
Witten & Frank (the WEKA software)
Christopher Westphal: Data Mining for Intelligence, Fraud, & Criminal Detection: Advanced Analytics & Information Sharing Technologies
Marcus Maloof: Machine Learning and Data Mining for Computer Security (e-book on Dawsonera)