"The non trivial extraction of implicit, previously unknown, and potentially useful
information from data"
William J Frawley, Gregory Piatetsky-Shapiro and Christopher J Matheus
• Databases
• Statistics
• High Performance Computing
• Machine Learning
• Visualization
• Mathematics
Related Technologies
Data Warehouse
• Data Warehouse: a centralized data repository that can be queried for business
benefit.
• Data Warehousing makes it possible to
o extract archived operational data
o overcome inconsistencies between different legacy data formats
o integrate data throughout an enterprise, regardless of location, format, or
communication requirements
o incorporate additional or expert information
Statistical Analysis
Classification
• A DM system learns from examples in the data how to partition or classify the data, i.e.
it formulates classification rules
• Example - customer database in a bank
o Question - Is a new customer applying for a loan a good investment or not?
o Typical rule formulated:
if STATUS = married and INCOME > 10000 and HOUSE_OWNER = yes
then INVESTMENT_TYPE = good
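Such a rule is essentially a predicate over the customer's attributes. A minimal Python sketch (the attribute names come from the rule above; the applicant record is made up for illustration):

```python
# Hypothetical application of the mined rule; attributes STATUS, INCOME and
# HOUSE_OWNER are taken from the example rule, the data values are invented.
def investment_type(customer):
    """Return 'good' if the mined rule fires, otherwise 'unknown'."""
    if (customer["STATUS"] == "married"
            and customer["INCOME"] > 10000
            and customer["HOUSE_OWNER"] == "yes"):
        return "good"
    return "unknown"

applicant = {"STATUS": "married", "INCOME": 25000, "HOUSE_OWNER": "yes"}
print(investment_type(applicant))  # -> good
```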
Association
Sequence/Temporal
• Data pre-processing
o Heterogeneity resolution
o Data cleansing
o Data transformation
o Data reduction
o Discretization and generating concept hierarchies
• Creating a data model: applying Data Mining tools to extract knowledge from data
• Testing the model: the performance of the model (e.g. accuracy, completeness) is
tested on independent data (not used to create the model)
• Interpretation and evaluation: the user bias can direct DM tools to areas of interest
o Attributes of interest in databases
o Goal of discovery
o Domain knowledge
o Prior knowledge or belief about the domain
Techniques
• Neural Networks
o a trained neural network can be thought of as an "expert" in the category of
information it has been given to analyze
o provides projections given new situations of interest and answers "what if"
questions
o problems include:
▪ the resulting network is viewed as a black box
▪ no explanation of the results is given, i.e. it is difficult for the user to
interpret the results
▪ difficult to incorporate user intervention
▪ slow to train due to their iterative nature
• Decision trees
o used to represent knowledge
o built using a training set of data and can then be used to classify new objects
o problems are:
▪ opaque structure - difficult to understand
▪ missing data can cause performance problems
▪ they become cumbersome for large data sets
• Rules
o probably the most common form of representation
o tend to be simple and intuitive
o unstructured and less rigid
o problems are:
▪ difficult to maintain
▪ inadequate to represent many types of knowledge
o Example format: if X then Y
Applications
• Credit Assessment
• Stock Market Prediction
• Fault Diagnosis in Production Systems
• Medical Discovery
• Fraud Detection
• Hazard Forecasting
• Buying Trends Analysis
• Organizational Restructuring
• Target Mailing
• Knowledge Acquisition
• Scientific Discovery
• Semantics-based Performance Enhancement of DBMS
Assume we have made a record of the weather conditions during a two-week period, along
with the decisions of a tennis player whether or not to play tennis on each particular day.
Thus we have generated tuples (or examples, instances) consisting of values of four
independent variables (outlook, temperature, humidity, windy) and one dependent
variable (play). See the textbook for a detailed description.
DBMS
By querying a DBMS containing the above table we may answer questions like:
• What was the temperature on the sunny days? {85, 80, 72, 69, 75}
• On which days was the humidity less than 75? {6, 7, 9, 11}
• On which days was the temperature greater than 70? {1, 2, 3, 8, 10, 11, 12, 13, 14}
• On which days was the temperature greater than 70 and the humidity less than 75?
The intersection of the two sets above: {11}
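The table is not reproduced here, but the values below follow the standard 14-day weather dataset distributed with Weka (weather.numeric), which is consistent with all of the query answers above. The queries can be sketched in Python:

```python
# The classic 14-day weather dataset shipped with Weka (weather.numeric):
# (day, outlook, temperature, humidity, windy, play)
DATA = [
    (1,  "sunny",    85, 85, False, "no"),
    (2,  "sunny",    80, 90, True,  "no"),
    (3,  "overcast", 83, 86, False, "yes"),
    (4,  "rainy",    70, 96, False, "yes"),
    (5,  "rainy",    68, 80, False, "yes"),
    (6,  "rainy",    65, 70, True,  "no"),
    (7,  "overcast", 64, 65, True,  "yes"),
    (8,  "sunny",    72, 95, False, "no"),
    (9,  "sunny",    69, 70, False, "yes"),
    (10, "rainy",    75, 80, False, "yes"),
    (11, "sunny",    75, 70, True,  "yes"),
    (12, "overcast", 72, 90, True,  "yes"),
    (13, "overcast", 81, 75, False, "yes"),
    (14, "rainy",    71, 91, True,  "no"),
]

# What was the temperature on the sunny days?
sunny_temps = [t for d, o, t, h, w, p in DATA if o == "sunny"]

# On which days was the humidity < 75, and the temperature > 70?
low_humidity = {d for d, o, t, h, w, p in DATA if h < 75}
warm_days    = {d for d, o, t, h, w, p in DATA if t > 70}

print(sunny_temps)                       # [85, 80, 72, 69, 75]
print(sorted(low_humidity))              # [6, 7, 9, 11]
print(sorted(warm_days))                 # [1, 2, 3, 8, 10, 11, 12, 13, 14]
print(sorted(warm_days & low_humidity))  # [11]
```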
OLAP
Using OLAP we can create a multidimensional model of our data (a data cube). For
example, using the dimensions time, outlook and play, we can create the following model.
Here time represents the days grouped into weeks (week 1 - days 1, 2, 3, 4, 5, 6, 7;
week 2 - days 8, 9, 10, 11, 12, 13, 14) along the vertical axis. The outlook is shown along the
horizontal axis, and the third dimension, play, is shown in each individual cell as a pair of
values corresponding to the two values along this dimension - yes / no. Thus in the upper left
corner of the cube we have the total over all weeks and all outlook values.
By observing the data cube we can easily identify some important properties of the data and
find regularities or patterns. For example, the third column clearly shows that if the outlook is
overcast, the play attribute is always yes. This may be put as a rule:
if outlook = overcast then play = yes
The time dimension can be organized in a concept hierarchy:
o week 1
day 1
day 2
day 3
day 4
day 5
day 6
day 7
o week 2
day 8
day 9
day 10
day 11
day 12
day 13
day 14
The drill-down operation is based on climbing down the concept hierarchy, so that we get the
following data cube:
The reverse of drill-down (called roll-up) applied to this data cube results in the previous
cube with two values (week 1 and week 2) along the time dimension.
Data Mining
By applying various Data Mining techniques we can find associations and regularities in our
data, extract knowledge in the forms of rules, decision trees etc., or just predict the value of
the dependent variable (play) in new situations (tuples). Here are some examples (all
produced by Weka):
To find associations in our data we first discretize the numeric attributes (part of the data
pre-processing stage in data mining). Thus we group the temperature values into three
intervals (hot, mild, cool) and the humidity values into two (high, normal), and substitute the
values in the data with the corresponding names. Then we apply the Apriori algorithm and
get the following association rules:
1. humidity=normal windy=false 4 ==> play=yes (4, 1)
2. temperature=cool 4 ==> humidity=normal (4, 1)
3. outlook=overcast 4 ==> play=yes (4, 1)
4. temperature=cool play=yes 3 ==> humidity=normal (3, 1)
5. outlook=rainy windy=false 3 ==> play=yes (3, 1)
6. outlook=rainy play=yes 3 ==> windy=false (3, 1)
7. outlook=sunny humidity=high 3 ==> play=no (3, 1)
8. outlook=sunny play=no 3 ==> humidity=high (3, 1)
9. temperature=cool windy=false 2 ==> humidity=normal play=yes (2, 1)
10. temperature=cool humidity=normal windy=false 2 ==> play=yes (2, 1)
These rules show attribute-value sets (so-called item sets) that appear frequently in
the data. The numbers after each rule show the support (the number of occurrences of the
item set in the data) and the confidence (accuracy) of the rule. Interestingly, rule 3 is the
same as the one we produced by observing the data cube.
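Support and confidence can be checked directly against the discretized table. A sketch, assuming the standard discretization of the Weka weather data (weather.nominal):

```python
# The 14 tuples after discretization (weather.nominal in Weka)
DATA = [
    {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "windy": "false", "play": "no"},
    {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "windy": "true",  "play": "no"},
    {"outlook": "overcast", "temperature": "hot",  "humidity": "high",   "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "temperature": "mild", "humidity": "high",   "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "temperature": "cool", "humidity": "normal", "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "temperature": "cool", "humidity": "normal", "windy": "true",  "play": "no"},
    {"outlook": "overcast", "temperature": "cool", "humidity": "normal", "windy": "true",  "play": "yes"},
    {"outlook": "sunny",    "temperature": "mild", "humidity": "high",   "windy": "false", "play": "no"},
    {"outlook": "sunny",    "temperature": "cool", "humidity": "normal", "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "temperature": "mild", "humidity": "normal", "windy": "false", "play": "yes"},
    {"outlook": "sunny",    "temperature": "mild", "humidity": "normal", "windy": "true",  "play": "yes"},
    {"outlook": "overcast", "temperature": "mild", "humidity": "high",   "windy": "true",  "play": "yes"},
    {"outlook": "overcast", "temperature": "hot",  "humidity": "normal", "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "temperature": "mild", "humidity": "high",   "windy": "true",  "play": "no"},
]

def support(itemset):
    """Number of tuples containing every attribute=value pair of the item set."""
    return sum(all(row[a] == v for a, v in itemset) for row in DATA)

def confidence(body, head):
    """support(body + head) / support(body)."""
    return support(body + head) / support(body)

# Rule 3: outlook=overcast ==> play=yes  (support 4, confidence 1)
body = (("outlook", "overcast"),)
head = (("play", "yes"),)
print(support(body + head), confidence(body, head))  # 4 1.0
```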
Using the ID3 algorithm we can produce the following decision tree (shown as a horizontal
tree):
• outlook = sunny
o humidity = high: no
o humidity = normal: yes
• outlook = overcast: yes
• outlook = rainy
o windy = true: no
o windy = false: yes
The decision tree consists of decision nodes that test the values of their corresponding
attribute. Each value of this attribute leads to a subtree and so on, until the leaves of the tree
are reached. They determine the value of the dependent variable. Using a decision tree we
can classify new tuples (not used to generate the tree). For example, according to the above
tree the tuple {sunny, mild, normal, false} will be classified under play=yes.
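The tree above translates directly into nested tests. A minimal sketch (the function name is arbitrary):

```python
def classify(outlook, humidity, windy):
    """Hand-coded version of the decision tree above. Note that the
    temperature attribute is never tested -- the tree does not use it."""
    if outlook == "sunny":
        return "no" if humidity == "high" else "yes"
    if outlook == "overcast":
        return "yes"
    # outlook == "rainy"
    return "no" if windy else "yes"

# The new tuple {sunny, mild, normal, false}: temperature (mild) is ignored
print(classify("sunny", "normal", False))  # yes
```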
A decision tree can be represented as a set of rules, where each rule represents a path
through the tree from the root to a leaf. Other Data Mining techniques can produce rules
directly; for example, the Prism algorithm available in Weka generates rules of this kind.
Prediction methods
Data Mining also offers techniques to predict the value of the dependent variable directly,
without first generating an explicit model. One of the most popular approaches for this
purpose is based on statistical methods. It uses the Bayes rule to predict the probability of
each value of the dependent variable given the values of the independent variables. For
example, applying Bayes to the new tuple discussed above we get:
P(play=yes | outlook=sunny, temperature=mild, humidity=normal, windy=false)
= 0.8
P(play=no | outlook=sunny, temperature=mild, humidity=normal, windy=false)
= 0.2
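This is the Naive Bayes classifier. A sketch computing the same probabilities on the discretized weather data (values follow Weka's weather.nominal; no Laplace smoothing is applied, which reproduces the 0.8 / 0.2 figures above):

```python
# The 14 discretized weather tuples (weather.nominal in Weka)
DATA = [
    {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "windy": "false", "play": "no"},
    {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "windy": "true",  "play": "no"},
    {"outlook": "overcast", "temperature": "hot",  "humidity": "high",   "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "temperature": "mild", "humidity": "high",   "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "temperature": "cool", "humidity": "normal", "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "temperature": "cool", "humidity": "normal", "windy": "true",  "play": "no"},
    {"outlook": "overcast", "temperature": "cool", "humidity": "normal", "windy": "true",  "play": "yes"},
    {"outlook": "sunny",    "temperature": "mild", "humidity": "high",   "windy": "false", "play": "no"},
    {"outlook": "sunny",    "temperature": "cool", "humidity": "normal", "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "temperature": "mild", "humidity": "normal", "windy": "false", "play": "yes"},
    {"outlook": "sunny",    "temperature": "mild", "humidity": "normal", "windy": "true",  "play": "yes"},
    {"outlook": "overcast", "temperature": "mild", "humidity": "high",   "windy": "true",  "play": "yes"},
    {"outlook": "overcast", "temperature": "hot",  "humidity": "normal", "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "temperature": "mild", "humidity": "high",   "windy": "true",  "play": "no"},
]

def naive_bayes(**new_tuple):
    """P(play = v | new tuple), assuming conditional independence of attributes."""
    scores = {}
    for v in ("yes", "no"):
        rows = [r for r in DATA if r["play"] == v]
        p = len(rows) / len(DATA)  # prior P(play = v)
        for attr, val in new_tuple.items():
            # conditional probability P(attr = val | play = v)
            p *= sum(r[attr] == val for r in rows) / len(rows)
        scores[v] = p
    total = sum(scores.values())
    return {v: s / total for v, s in scores.items()}  # normalize over play values

probs = naive_bayes(outlook="sunny", temperature="mild",
                    humidity="normal", windy="false")
print(round(probs["yes"], 2), round(probs["no"], 2))  # 0.8 0.2
```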
Data Mining: a hot buzzword for a class of database applications that look for hidden
patterns in a group of data. For example, data mining software can help retail companies find
customers with common interests. The term is commonly misused to describe software that
presents data in new ways. True data mining software doesn't just change the presentation,
but actually discovers previously unknown relationships among the data.