
Lecture 1: Introduction to Data Mining

7CCSMDM1 Data Mining

Dr Dimitrios Letsios

Department of Informatics
King’s College London

1 / 72
Lecture Contents

I Section 1: Basic Elements


I Definition, Real-World Applications
I Data Sets, Patterns, General Picture

I Section 2: Conceptual Framework


I Concrete Tasks
I Process
I Relation to Other Fields

I Section 3: Models
I Knowledge Representations
I Evaluation

2 / 72
Definition

Main Data Mining Goal


I Extract information from data, i.e. understand and take
advantage of it.

I Computers allow generating, managing, processing, and


communicating data and information.
I Data consists of raw, unorganized facts that are not a priori useful.
I If data is processed, organized, structured, and meaningfully
presented in some context, then it becomes information.
I Information is hidden in the data and understanding tends to
decrease as the volume of data increases.

3 / 72
Definition (2)

Data Mining Definition [WFH3, Section 1.1]

The process of discovering patterns in data.


I The process must be automated.
I The patterns must be meaningful and useful, i.e. lead to some
benefit and inform future decisions.
I Data mining works on existing data, i.e. data that has already
been generated, by people, machines, processes, etc.

I A pattern can be thought of as a series of data that repeats in a
recognizable way.
I Finding patterns in data involves:
1. Identifying patterns
2. Validating patterns
3. Using patterns for predictions

4 / 72
Real-World Applications

Web Data
I PageRank assigns relevance measures to web pages for ranking
online search query results (Google).
I Email filtering classifies new messages as spam or ham.
I Online advertising based on users with similar purchases.
I Social media identify users with similar preferences.

5 / 72
Real-World Applications (2)

Marketing and Sales


I Identifying customers likely to defect, in order to fight churn.
I Market basket analysis for personalised offers.

Risk
I Statistical calculation of bank loan default risk.
I Anticipating job candidate performance in recruitment.

6 / 72
Real-World Applications (3)

Images
I Oil spill or deforestation detection from satellite images.
I Currency recognition in automated payment machines.
I Face recognition for police surveillance.

Engineering
I Power demand forecasting for electricity suppliers.
I Failure prediction for machine maintenance in manufacturing.

7 / 72
Data Sets
Contact Lens Data Set [WFH3, Table 1.1]

8 / 72
Data Sets (2)

Nominal Weather Data Set [WFH3, Table 1.2]

9 / 72
Data Sets (3)

Numeric Weather Data Set [WFH3, Table 1.3]

10 / 72
Data Sets (4)

CPU Performance Data Set [WFH3, Table 1.5]

11 / 72
Data Sets (5)

Main Data Set Elements


I Attributes
I Instances

I Attributes or Features or Columns:


I Characterise each data set entry, i.e. specify the data set form.
I E.g. types of conditions considered in the weather data set.
I Attributes might depend on each other.
I Instances or Examples or Rows
I A set of values, one for each attribute.
I Typically, instances are considered to be independent.
I However, there might be relationships between instances.

12 / 72
Data Sets (6)
Family Tree [WFH3, Figure 2.1]

13 / 72
Data Sets (7)

Attribute Types:
I Numeric: Continuous or discrete with well-defined distance
between values.
I Nominal: Categorical.
I Dichotomous: Binary or boolean or yes/no.
I Ordinal: Ordered but without well-defined distance, e.g. poor,
reasonable, good and excellent health quality.
I Interval: Ordered, but also measured in fixed units, e.g.
temperature expressed in degrees.

14 / 72
Data Sets (8)
Attribute-Relation File Format (ARFF) [WFH3, Figure 2.2]
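As an illustration of the format, a minimal ARFF file for the nominal weather data might look like the following sketch (following the style of [WFH3, Figure 2.2]; only two example instances are shown):

```
% Nominal weather data in ARFF (sketch)
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {true, false}
@attribute play {yes, no}

@data
sunny,hot,high,false,no
overcast,hot,high,false,yes
```

Each @attribute line declares a column and its allowed values; each line after @data is one instance.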

15 / 72
Data Sets (9)

I Lectures include simple data sets which are appropriate for


learning because they expose different issues and challenges.
I Practicals use a range of data sets from online sources.
I Often, real-world data sets:
I contain thousands or millions of entries,
I are incomplete,
I are noisy,
I incorporate randomness,
I are sparse.

16 / 72
Data Sets (10)

I Data preparation can be a significant part of the data mining


process and may require:
I Assembly
I Integration
I Cleaning
I Transformation

17 / 72
Data Sets (11)

Feature Engineering
The process of transforming raw data by selecting the most
suitable attributes for the data mining problem to be solved.

I Significant part of the data preparation time before modeling.


I Coming up with appropriate attributes can be difficult,
time-consuming and requires experience.
I May significantly affect the performance of data mining methods.

18 / 72
Patterns

I Allow making non-trivial predictions for new data.


I Black box: incomprehensible, with hidden structure.
I Transparent: comprehensible, with visible structure.

I Structural patterns:
I Capture and explain data aspects in an explicit way.
I Can be used for better-informed decisions.
I E.g. rules in the form if-then-else.

19 / 72
Patterns (2)

Nominal Weather Data Set [WFH3, Table 1.2]

20 / 72
Patterns (3)

Classification Rule
Attribute values predict the label.
if (outlook == sunny) and (humidity == high):
    play = no

Patterns can be captured by different types of models:


I Linear equations.
I Clusters, i.e. meaningful groups of data.
I Tree structures.

21 / 72
Patterns (4)

Weather Data Set Elements


I The input is four attributes or features:
I outlook = { sunny, overcast, rainy }
I temperature = { hot, mild, cool }
I humidity = { high, normal }
I windy = { true, false }
I The output is a decision, i.e. one label or class:
I play = { yes, no }

I There are 3 × 3 × 2 × 2 = 36 possible cases, i.e. conditions.


I Only 14 cases are present in the data set.
I A rule may use one or more attributes to make the right
decision (i.e. select the correct label).
I Rules obtained for a data set might not be good.
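The count of possible cases can be checked with a short sketch (Python; the attribute domains are the ones listed above):

```python
from itertools import product

# Attribute domains from the nominal weather data set
outlook = ["sunny", "overcast", "rainy"]
temperature = ["hot", "mild", "cool"]
humidity = ["high", "normal"]
windy = [True, False]

# Every combination of attribute values is one possible case
cases = list(product(outlook, temperature, humidity, windy))
print(len(cases))  # 3 * 3 * 2 * 2 = 36
```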

22 / 72
Patterns (5)

I Data mining aims to construct models from the data.


I If the data set is complete, then the rules produce 100%
correct predictions.
I If the data set is incomplete, then the rules may produce
incorrect predictions because information is missing.
I Real-world data sets are typically incomplete and we aim for
the best possible rules.

23 / 72
General Picture

I Data mining is a process for exploring data to discover


meaningful patterns.
I Given a data set, we aim to construct models expressing these
patterns using some form of knowledge representation.

24 / 72
Concrete Tasks

Examples:
I Classification models relationships between data elements to
predict classes or labels.
I Regression models relationships between data elements to
predict numeric quantities.
I Clustering models relationships of instances to group them so
that instances in the same group are similar.
I Association models relationships between attributes.

26 / 72
Concrete Tasks (2)

Classification:
I The data is classified, e.g. people can be labelled as Covid
positive or negative based on their symptoms.
I Models ways that attributes determine the class of instances.
I Supervised learning task because it uses already classified
instances to make predictions for new instances.

27 / 72
Concrete Tasks (3)

Regression:
I Models ways that attributes determine a numeric value.
I Variant of classification, but without discrete classes.
I Supervised learning task, similarly to classification.
I Often, the produced model is more interesting than predicted
values, e.g. what attributes affect car prices.

28 / 72
Concrete Tasks (4)

Clustering:
I Models similarity between instances and divides them into
groups so that instances in the same group are more similar
than instances in different groups.
I E.g. partition customers into groups.
I By labelling the clusters, we may use them in meaningful ways.
I Unsupervised learning task because the data is not labelled.

29 / 72
Concrete Tasks (5)

Association:
I Models how some attributes determine other attributes.
I No specific class or label.
I May examine any subset of attributes to predict any other
disjoint subset of attributes.
I Usually involve only nominal data.
I E.g. use supermarket data, to identify combinations of
products that occur together in transactions.

30 / 72
Process

I Decomposes the data mining process into a number of steps.


I Allows distinguishing various issues.
I Provides a methodology for implementing data mining tasks.

31 / 72
Process (2)
Data Mining Process

(Flow diagram with elements: Question of Interest, Data containing
Examples, Model, Score, Evaluate, Prediction or Insight.)
32 / 72
Process (3)

Step 1: Objective Specification


I Identify the data mining problem type.

I Supervised learning:
I There is a target attribute.
I If nominal, then classification. E.g. to play or not in the
weather data set.
I If numeric, then prediction. E.g. predict the power value in the CPU
performance data set.
I Unsupervised learning:
I There is no target attribute.
I Cluster instances into groups of similarity.
I Find attribute correlations or associations.
I There exist other data mining tasks for other types of data.

33 / 72
Process (4)

Step 2: Data Exploration


I Visualise the data, e.g. using histograms or scatter plots.
I Confirm that the objective can be achieved with the data set.

I In this module, we first select the methods and then an


appropriate data set.
I In real-world applications, we typically begin with the data
and then select an appropriate method.

34 / 72
Process (5)

Step 3: Data Cleaning


I Fix any problems with the data.

I Confirm there is enough data, i.e. broad and deep.


I With very sparse data, data mining might not be effective.
I Rule of thumb: the more, the better.
I However, very large data sets can be problematic when (i) the
target variable appears in extremely rare patterns, or (ii) model
building is very resource consuming.
I Check whether there are imprecise or missing values.
I Verify that the data set is representative and not biased.

35 / 72
Process (6)

Step 4: Model Building


I Select the most appropriate model for the data.

I The data may contain:


I Discrete or continuous numbers.
I Categorical, numeric, or mixed values.
I Grayscale or coloured images.

36 / 72
Process (7)

Step 5: Model Evaluation


I Assess whether the model achieves the desiderata.

I Measure accuracy, i.e. how well the model performs.


I Use both existing and new data by partitioning the data into:
I Training set for model building.
I Validation set for model selection.
I Test set for model evaluation with unseen data.
I Measure accuracy in all training, validation and test sets.
I Overfitting: the model is very tailored to the training
instances and does not generalize well to new instances.

37 / 72
Process (8)

Step 6: Repeat
I Usually, multiple iterations of the aforementioned steps are
required to build a good enough model.
I Revise the performed steps, adapt and reiterate.

38 / 72
Relation to Other Fields

I Data Mining (DM) is an interdisciplinary field using


approaches and techniques from multiple fields, including:
I Artificial Intelligence (AI) and Machine Learning (ML)
I Statistics (Stats)
I Algorithms and Mathematical Optimization

39 / 72
Relation to Other Fields (2)
I DM is strongly related to ML and Stats.
I These fields share methods, but use them in different ways
and for different reasons.

40 / 72
Relation to Other Fields (3)

When does a machine learn? [WFH3, Section 1.1]

Subjects learn when they change their behaviour in a way that


makes them perform better in the future.

I The above statement emphasises performance rather than
knowledge; performance can be measured by comparing past
behaviour to present and future behaviour.
I There is a difference between learning and adaptation. E.g.
the adaptation of a shoe to the shape of a foot is not learning.
I Learning implies purpose, i.e. intention.
I There are philosophical questions here.

41 / 72
Relation to Other Fields (4)

I The application of ML to DM is not a philosophical question.


I DM involves learning in a practical sense, i.e. finding and
describing well-structured patterns in the data.
I Input: data containing a set of examples.
I Output: explicit knowledge representation.
I DM involves the efficient acquisition of knowledge, together
with the ability to use it.

42 / 72
Relation to Other Fields (5)

I ML and Stats have different histories, but similar methods


have been developed in parallel in the two fields. E.g.
I Generating decision trees from examples.
I Nearest-neighbour classification.
I ML and Stats have different goals:
I ML aims for the most accurate predictions.
I Stats infers variable relationships and tests hypotheses.
I Nowadays, there is significant overlap between the two.
I Many DM techniques require statistical thinking.

43 / 72
Knowledge Representations

I Decision Tables
I Trees
I Rules
I Linear Models
I Instance-Based Representations
I Clusters

45 / 72
Tables
Decision Table
I Concise visual representation for specifying which actions to
perform based on given conditions.
I Contains a set of attributes and a decision label for each
unique set of attribute values.

46 / 72
Trees
Building Blocks
I Nodes: specify decisions to be made.
I Branches from a node represent possible alternatives.
I A branch connects a parent node to one of its child nodes.
I The very top node without a parent is called root.
I The very bottom nodes without a child are called leaves.

47 / 72
Trees (2)

Decision Trees:
I Branches may involve a single or multiple attributes.
I We examine the value of an attribute and branch based on
equality or inequality.
if (temperature < 80):
    branch left
else:
    branch right

I The alternatives of a decision can be:


I two-way such as yes or no,
I three-way such as <, =, or >,
I multi-way.

48 / 72
Trees (3)
Decision Trees:
I A path is a sequence of nodes such that each node is the
child of the previous node in the sequence.
I An attribute can be tested more than once in a path.
I In a classification context:
I A leaf specifies a class.
I Each instance satisfying all decisions of the corresponding path
from the root to the leaf is assigned this class.

49 / 72
Trees (4)

Missing Value Problem


It is unclear which branch should be considered if an attribute
value is missing.

I Possible solutions:
I Ignore all instances with missing values.
I Treat missing as an additional value each attribute may take.
I Set the most popular choice for each missing attribute value.
I Make a probabilistic (weighted) choice for each missing
attribute value, based on the other instances.
I All these solutions propagate errors, especially when the
number of missing values increases.

50 / 72
Trees (5)

Functional Tree
I Computes a function of multiple attribute values in each node.
I Branches based on the value returned by the function.
if (petal_length * petal_width > threshold):
    make decision
else:
    make different decision

51 / 72
Trees (6)
Regression Tree
I Predicts numeric values.
I Each node branches on the value of an attribute or on the
value of a function of the attributes.
I A leaf specifies a predicted value for corresponding instances.

52 / 72
Trees (7)
Model Trees
I Similar to a regression tree, except that a regression
equation predicts the numeric output value in each leaf.
I A regression equation predicts a numeric quantity as a
function of the attributes.
I More sophisticated than linear regression and regression trees.

53 / 72
Rules
Rule
I An expression in if-then format.
I The if part is the pre-condition or antecedent and consists
of a series of tests.
I The then part is the conclusion or consequent and assigns
values to one or more attributes.

The pre-condition may contain multiple clauses in the form of:


I Conjunction, i.e. tests linked by and, meaning that all tests
must be true to fire the rule.
I Disjunction, i.e. tests linked by or, meaning that at least one
test must be true to fire the rule.
I General logic expressions, i.e. tests linked by different
logical operators (and/or).

54 / 72
Rules (2)

Classification rules:
I Predict the class or label of an instance.
I Can be derived from a decision tree.
I One rule can be constructed for each leaf of the tree:
I The pre-condition contains a clause for each decision along the
path from the root to the leaf.
I The conclusion is the class of the leaf.
I Rule sets constructed in this way may contain redundancies,
especially if multiple leaves contain the same class.

55 / 72
Rules (3)
I Transforming a set of rules into a decision tree is also
possible, but not straightforward.
I The difficulty is the order of tests, starting from the root.
I The replicated subtree problem may occur, i.e. no matter
which rule is chosen first, the other is replicated in the tree.
I Sometimes, classification rules can be significantly more
compact than decision trees.

if a and b:
    x
if c and d:
    x

56 / 72
Rules (4)

I A set of rules may fail to classify an instance.


I These situations cannot happen with decision trees.
I Ordered set of rules:
I Should be applied based on the given order (decision list).
I An individual rule, taken out of the list, may be incorrect.
I Unordered sets of rules:
I Each rule represents an independent piece of knowledge.
I Different rules may lead to different classes for one instance.

57 / 72
Rules (5)
Association rules:
I Predict an attribute of an instance.
I Similar to classification rules except that they can predict
combinations of attributes too.
I Express different regularities in the data set.
I Many different association rules, even in tiny data sets.

Interesting association rules:


I High coverage and accuracy.
I Coverage: number of correctly predicted instances.
I Accuracy: proportion of instances to which the rule applies
that it predicts correctly.
I Typically, we seek rules with coverage and accuracy above
prescribed thresholds.
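As a sketch of these measures (the helper below is illustrative, not from the lecture), coverage and accuracy can be computed by treating the rule's pre-condition and conclusion as predicates over instances:

```python
def rule_quality(pre, post, instances):
    """Coverage: number of instances the rule predicts correctly.
    Accuracy: proportion of instances satisfying the pre-condition
    that also satisfy the conclusion."""
    fires = [x for x in instances if pre(x)]       # rule applies
    correct = [x for x in fires if post(x)]        # rule is right
    coverage = len(correct)
    accuracy = len(correct) / len(fires) if fires else 0.0
    return coverage, accuracy

# Toy weather instances (illustrative values)
data = [
    {"outlook": "sunny", "humidity": "high", "play": "no"},
    {"outlook": "sunny", "humidity": "high", "play": "no"},
    {"outlook": "sunny", "humidity": "normal", "play": "yes"},
]
cov, acc = rule_quality(
    lambda x: x["outlook"] == "sunny" and x["humidity"] == "high",
    lambda x: x["play"] == "no",
    data,
)
print(cov, acc)  # 2 1.0
```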

58 / 72
Rules (6)

Learning Rules:
I By adding new rules and refining existing ones as more
instances are added to the training set.
I A refinement may add another conjunctive clause (and) to a
pre-condition.

Rules may:
I contain functions of attribute values, e.g. area(rectangle).
I compare attribute values or functions of them, e.g.
area(rectangle) > width(rectangle).
I recursively concern different data set parts, e.g.
tallerThan(rectangle, triangle).

59 / 72
Linear Models
I A linear model is a weighted sum of attribute values.
I E.g. PRP = 2.47 · CACH + 37.06.
I All attribute values must be numeric.
I Typically visualised as a 2D scatter plot with a regression
line, i.e. a linear function that best represents the data.
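A linear model is just a weighted sum plus an intercept; a minimal sketch using the PRP equation from the slide:

```python
def linear_model(weights, bias, values):
    # Weighted sum of attribute values plus an intercept
    return bias + sum(w * v for w, v in zip(weights, values))

# PRP = 2.47 * CACH + 37.06 (single-attribute model from the slide)
print(linear_model([2.47], 37.06, [100]))  # about 284.06
```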

60 / 72
Linear Models (2)
I Linear models can be applied to classification problems, by
defining decision boundaries separating instances that
belong to different classes.
I E.g. 0.5 · PL + 0.8 · PW = 2.0.
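A linear decision boundary classifies an instance by checking which side of the boundary it falls on. A sketch using the equation from the slide (the class names are placeholders, not from the lecture):

```python
def classify(pl, pw):
    # Boundary from the slide: 0.5 * PL + 0.8 * PW = 2.0
    score = 0.5 * pl + 0.8 * pw
    return "class A" if score > 2.0 else "class B"

print(classify(5.0, 1.5))   # class A (score 3.7)
print(classify(1.5, 0.25))  # class B (score 0.95)
```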

61 / 72
Instance-Based Representations

Instance-Based Learning
I Instead of creating models, memorise actual instances.
I Instances are knowledge representation themselves.
I For a new instance, search for its closest instances in the training set.

I All work is done when classifying new instances, rather than


when processing the training set.
I Instances are compared using a distance metric.
I The closest training instances are referred to as nearest
neighbours.
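A minimal 1-nearest-neighbour sketch (assuming numeric attributes and a Euclidean metric; the names and data are illustrative):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nn_classify(query, training):
    # training: list of (instance, label) pairs; return the label
    # of the training instance closest to the query
    _, label = min(training, key=lambda p: euclidean(p[0], query))
    return label

train = [([1.0, 1.0], "yes"), ([8.0, 9.0], "no")]
print(nn_classify([2.0, 2.0], train))  # yes
```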

62 / 72
Instance-Based Representations (2)

Euclidean Distance
Metric computing the distance between instances i and i′ with
numeric attributes.

d(i, i′) = √( Σ_{j=1}^{n} (x_{i,j} − x_{i′,j})² )

I d(i, i′): distance between instances i and i′.
I n: number of attributes.
I x_{i,j}: value of attribute j for instance i.
I x_{i′,j}: value of attribute j for instance i′.
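The formula translates directly to code; a quick sketch:

```python
import math

def euclidean_distance(x, y):
    # Sum squared differences over the n attributes, then take the root
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```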

63 / 72
Instance-Based Representations (3)

Hamming Distance
Metric computing the distance between instances i and i′ with
nominal attributes.

d(i, i′): number of attributes at which i and i′ differ.

I Attribute j contributes 0 to d(i, i′) if the attribute values
are the same (x_{i,j} = x_{i′,j}), and 1 if they differ
(x_{i,j} ≠ x_{i′,j}).
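The same idea in code; a quick sketch with two nominal weather instances (values illustrative):

```python
def hamming_distance(x, y):
    # Number of attributes at which the two instances differ
    return sum(1 for a, b in zip(x, y) if a != b)

# The instances differ in outlook and windy only
print(hamming_distance(["sunny", "hot", "high", False],
                       ["rainy", "hot", "high", True]))  # 2
```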

64 / 72
Instance-Based Representations (4)
I Often, it is not desirable to store all training instances.
I Deciding which instances to (i) save and (ii) discard is an issue.
I Even though instance-based methods do not learn an explicit
structure, the instances and distance metric specify
boundaries distinguishing different classes.
I Some instance-based methods create rectangular regions
containing instances of the same class.

65 / 72
Clusters

Clustering
Partitions the training set into regions which can be:
I non-overlapping, i.e. each instance is in exactly one cluster,
I overlapping, i.e. an instance may appear in multiple clusters.

66 / 72
Clusters (2)

Dendrogram (Hierarchical Clustering)


I Type of tree diagram showing a hierarchical structure of clusters.
I The top level partitions the space into two or more groups.
I These groups are further partitioned into subgroups and so on.

67 / 72
Model Evaluation

https://xkcd.com/242/

68 / 72
Model Evaluation (2)

I How good is the model?


I How does it perform on known data?
I How well does it predict for new data?
I A score function or error function computes the difference
between the predictions and the actual outcomes.
I Typically, we want to maximize score or minimize error.

69 / 72
Model Evaluation (3)

I Different data mining tasks use different score functions.


I We will cover scoring and evaluation approaches together
with data mining methods.
I The evaluation also depends on the type of the modelled data.
I For example, model evaluation may require:
I measuring quantitative differences,
I counting the number of correct predictions,
I statistical measures, e.g. t-test.

70 / 72
Model Evaluation (4)

I Typically, we divide the data set into:


I the training set for model building,
I the validation set for model selection,
I the test set for model evaluation.
I If the training set or test set is not a representative sample,
then we will not build a strong model.
I A model can only be as good as the data used to construct it.
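A minimal sketch of such a partition (the 60/20/20 proportions and the seed are assumptions for illustration):

```python
import random

def split(data, train_frac=0.6, val_frac=0.2, seed=0):
    # Shuffle, then slice into training, validation, and test sets
    d = list(data)
    random.Random(seed).shuffle(d)
    n_train = int(train_frac * len(d))
    n_val = int(val_frac * len(d))
    return d[:n_train], d[n_train:n_train + n_val], d[n_train + n_val:]

train, val, test = split(range(10))
print(len(train), len(val), len(test))  # 6 2 2
```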

71 / 72
