
Lecture 1: Introduction to Data Mining

7CCSMDM1 Data Mining

Dr Dimitrios Letsios

Department of Informatics
King’s College London

1 / 72
Lecture Contents

I Section 1: Basic Elements


I Definition, Real-World Applications
I Data Sets, Patterns, General Picture

I Section 2: Conceptual Framework


I Concrete Tasks
I Process
I Relation to Other Fields

I Section 3: Models
I Knowledge Representations
I Evaluation

2 / 72
Definition

Main Data Mining Goal


I Extract information from data, i.e. understand and take
advantage of it.

I Computers allow generating, managing, processing, and


communicating data and information.
I Data consists of raw, unorganized facts that are not a priori useful.
I If data is processed, organized, structured, and meaningfully
presented in some context, then it becomes information.
I Information is hidden in the data and understanding tends to
decrease as the volume of data increases.

3 / 72
Definition (2)

Data Mining Definition [WFH3, Section 1.1]

The process of discovering patterns in data.


I The process must be automated.
I The patterns must be meaningful and useful, i.e. lead to some
benefit and inform future decisions.
I Data mining works on existing data, i.e. data that has already
been generated, by people, machines, processes, etc.

I A pattern can be thought of as a series of data that repeats in a
recognizable way.
I Finding patterns in data involves:
1. Identifying patterns
2. Validating patterns
3. Using patterns for predictions

4 / 72
Real-World Applications

Web Data
I PageRank assigns relevance measures to web pages for ranking
online search query results (Google).
I Email filtering classifies new messages as spam or ham.
I Online advertising based on users with similar purchases.
I Social media identify users with similar preferences.

5 / 72
Real-World Applications (2)

Marketing and Sales


I Identifying customers likely to defect, in order to fight churn.
I Market basket analysis for personalised offers.

Risk
I Statistical calculation of bank loan default risk.
I Anticipating job candidate performance in recruitment.

6 / 72
Real-World Applications (3)

Images
I Oil spill or deforestation detection from satellite images.
I Currency recognition in automated payment machines.
I Face recognition for police surveillance.

Engineering
I Power demand forecasting for electricity suppliers.
I Failure prediction for machine maintenance in manufacturing.

7 / 72
Data Sets
Contact Lens Data Set [WFH3, Table 1.1]

8 / 72
Data Sets (2)

Nominal Weather Data Set [WFH3, Table 1.2]

9 / 72
Data Sets (3)

Numeric Weather Data Set [WFH3, Table 1.3]

10 / 72
Data Sets (4)

CPU Performance Data Set [WFH3, Table 1.5]

11 / 72
Data Sets (5)

Main Data Set Elements


I Attributes
I Instances

I Attributes or Features or Columns:


I Characterise each data set entry, i.e. specify the data set form.
I E.g. types of conditions considered in the weather data set.
I Attributes might depend on each other.
I Instances or Examples or Rows
I A set of values, one for each attribute.
I Typically, instances are considered to be independent.
I However, there might be relationships between instances.

12 / 72
Data Sets (6)
Family Tree [WFH3, Figure 2.1]

13 / 72
Data Sets (7)

Attribute Types:
I Numeric: Continuous or discrete with well-defined distance
between values.
I Nominal: Categorical.
I Dichotomous: Binary or boolean or yes/no.
I Ordinal: Ordered but without well-defined distance, e.g. poor,
reasonable, good and excellent health quality.
I Interval: Ordered, but also measured in fixed units, e.g.
temperature expressed in degrees.

14 / 72
Data Sets (8)
Attribute-Relation File Format (ARFF) [WFH3, Figure 2.2]
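As an illustration of the format, a minimal ARFF file for the nominal weather data might look like the following sketch (following the style of [WFH3, Figure 2.2]; only two example instances are shown):

```
% Nominal weather data in ARFF (sketch)
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {true, false}
@attribute play {yes, no}

@data
sunny,hot,high,false,no
overcast,hot,high,false,yes
```

Each @attribute line declares a column and its allowed values; each line after @data is one instance.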

15 / 72
Data Sets (9)

I Lectures include simple data sets which are appropriate for


learning because they expose different issues and challenges.
I Practicals use a range of data sets from online sources.
I Often, real-world data sets:
I contain thousands or millions of entries,
I are incomplete,
I are noisy,
I incorporate randomness,
I are sparse.

16 / 72
Data Sets (10)

I Data preparation can be a significant part of the data mining


process and may require:
I Assembly
I Integration
I Cleaning
I Transformation

17 / 72
Data Sets (11)

Feature Engineering
The process of transforming raw data by selecting the most
suitable attributes for the data mining problem to be solved.

I Significant part of the data preparation time before modeling.


I Coming up with appropriate attributes can be difficult,
time-consuming and requires experience.
I May significantly affect the performance of data mining methods.

18 / 72
Patterns

I Allow making non-trivial predictions for new data.


I Black box: incomprehensible, with hidden structure.
I Transparent: comprehensible, with visible structure.

I Structural patterns:
I Capture and explain data aspects in an explicit way.
I Can be used for better-informed decisions.
I E.g. rules in the form if-then-else.

19 / 72
Patterns (2)

Nominal Weather Data Set [WFH3, Table 1.2]

20 / 72
Patterns (3)

Classification Rule
Attribute values predict the label.
if (outlook == sunny) and (humidity == high):
    play = no

Patterns can be captured by different types of models:


I Linear equations.
I Clusters, i.e. meaningful groups of data.
I Tree structures.

21 / 72
Patterns (4)

Weather Data Set Elements


I The input is four attributes or features:
I outlook = { sunny, overcast, rainy }
I temperature = { hot, mild, cool }
I humidity = { high, normal }
I windy = { true, false }
I The output is a decision, i.e. one label or class:
I play = { yes, no }

I There are 3 × 3 × 2 × 2 = 36 possible cases, i.e. conditions.


I Only 14 cases are present in the data set.
I A rule may use one or more attributes to make the right
decision (i.e. select the correct label).
I Rules obtained for a data set might not be good.
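The count of possible cases can be checked with a short sketch (Python; the attribute domains are the ones listed above):

```python
from itertools import product

# Attribute domains from the nominal weather data set
outlook = ["sunny", "overcast", "rainy"]
temperature = ["hot", "mild", "cool"]
humidity = ["high", "normal"]
windy = [True, False]

# Every combination of attribute values is one possible case
cases = list(product(outlook, temperature, humidity, windy))
print(len(cases))  # 3 * 3 * 2 * 2 = 36
```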

22 / 72
Patterns (5)

I Data mining aims to construct models from the data.


I If the data set is complete, then the rules produce 100%
correct predictions.
I If the data set is incomplete, then the rules may produce
incorrect predictions because information is missing.
I Real-world data sets are typically incomplete and we aim for
the best possible rules.

23 / 72
General Picture

I Data mining is a process for exploring data to discover


meaningful patterns.
I Given a data set, we aim to construct models expressing these
patterns using some form of knowledge representation.

24 / 72
Concrete Tasks

Examples:
I Classification models relationships between data elements to
predict classes or labels.
I Regression models relationships between data elements to
predict numeric quantities.
I Clustering models relationships of instances to group them so
that instances in the same group are similar.
I Association models relationships between attributes.

26 / 72
Concrete Tasks (2)

Classification:
I The data is classified, e.g. people can be labelled as Covid
positive or negative based on their symptoms.
I Models ways that attributes determine the class of instances.
I Supervised learning task because it uses already classified
instances to make predictions for new instances.

27 / 72
Concrete Tasks (3)

Regression:
I Models ways that attributes determine a numeric value.
I Variant of classification, but without discrete classes.
I Supervised learning task, similarly to classification.
I Often, the produced model is more interesting than predicted
values, e.g. what attributes affect car prices.

28 / 72
Concrete Tasks (4)

Clustering:
I Models similarity between instances and divides them into
groups so that instances in the same group are more similar
than instances in different groups.
I E.g. partition customers into groups.
I By labelling the clusters, we may use them in meaningful ways.
I Unsupervised learning task because the data is not labelled.

29 / 72
Concrete Tasks (5)

Association:
I Models how some attributes determine other attributes.
I No specific class or label.
I May examine any subset of attributes to predict any other
disjoint subset of attributes.
I Usually involve only nominal data.
I E.g. use supermarket data, to identify combinations of
products that occur together in transactions.

30 / 72
Process

I Decomposes the data mining process into a number of steps.


I Allows distinguishing various issues.
I Provides a methodology for implementing data mining tasks.

31 / 72
Process (2)
Data Mining Process

(Flow diagram with elements: Question of Interest, Data containing
Examples, Model, Score, Evaluate, Prediction or Insight.)
32 / 72
Process (3)

Step 1: Objective Specification


I Identify the data mining problem type.

I Supervised learning:
I There is a target attribute.
I If nominal, then classification. E.g. to play or not in the
weather data set.
I If numeric, then prediction. E.g. predict the power value in the CPU
performance data set.
I Unsupervised learning:
I There is no target attribute.
I Cluster instances into groups of similarity.
I Find attribute correlations or associations.
I There exist other data mining tasks for other types of data.

33 / 72
Process (4)

Step 2: Data Exploration


I Visualise the data, e.g. using histograms or scatter plots.
I Confirm that the objective can be achieved with the data set.

I In this module, we first select the methods and then an


appropriate data set.
I In real-world applications, we typically begin with the data
and then select an appropriate method.

34 / 72
Process (5)

Step 3: Data Cleaning


I Fix any problems with the data.

I Confirm there is enough data, i.e. broad and deep.


I With very sparse data, data mining might not be effective.
I Rule of thumb: the more, the better.
I However, very large data sets can be problematic when (i) the
target variable appears in extremely rare patterns, or (ii) model
building is very resource consuming.
I Check whether there are imprecise or missing values.
I Verify that the data set is representative and not biased.

35 / 72
Process (6)

Step 4: Model Building


I Select the most appropriate model for the data.

I The data may contain:


I Discrete or continuous numbers.
I Categorical, numeric, or mixed values.
I Grayscale or coloured images.

36 / 72
Process (7)

Step 5: Model Evaluation


I Assess whether the model achieves the desiderata.

I Measure accuracy, i.e. how well the model performs.


I Use both existing and new data by partitioning the data into:
I Training set for model building.
I Validation set for model selection.
I Test set for model evaluation with unseen data.
I Measure accuracy in all training, validation and test sets.
I Overfitting: the model is very tailored to the training
instances and does not generalize well to new instances.

37 / 72
Process (8)

Step 6: Repeat
I Usually, multiple iterations of the aforementioned steps are
required to build a good enough model.
I Revise the performed steps, adapt and reiterate.

38 / 72
Relation to Other Fields

I Data Mining (DM) is an interdisciplinary field using


approaches and techniques from multiple fields, including:
I Artificial Intelligence (AI) and Machine Learning (ML)
I Statistics (Stats)
I Algorithms and Mathematical Optimization

39 / 72
Relation to Other Fields (2)
I DM is strongly related to ML and Stats.
I These fields share methods, but use them in different ways
and for different reasons.

40 / 72
Relation to Other Fields (3)

When does a machine learn? [WFH3, Section 1.1]

Subjects learn when they change their behaviour in a way that


makes them perform better in the future.

I The above statement emphasises performance rather than
knowledge; performance can be measured by comparing past
behaviour to present and future behaviour.
I There is a difference between learning and adaptation. E.g.
the adaptation of a shoe to the shape of a foot is not learning.
I Learning implies purpose, i.e. intention.
I There are philosophical questions here.

41 / 72
Relation to Other Fields (4)

I The application of ML to DM is not a philosophical question.


I DM involves learning in a practical sense, i.e. finding and
describing well-structured patterns in the data.
I Input: data containing a set of examples.
I Output: explicit knowledge representation.
I DM involves the efficient acquisition of knowledge, together
with the ability to use it.

42 / 72
Relation to Other Fields (5)

I ML and Stats have different histories, but similar methods


have been developed in parallel in the two fields. E.g.
I Generating decision trees from examples.
I Nearest-neighbour classification.
I ML and Stats have different goals:
I ML aims for the most accurate predictions.
I Stats infers variable relationships and tests hypotheses.
I Nowadays, there is significant overlap between the two.
I Many DM techniques require statistical thinking.

43 / 72
Knowledge Representations

I Decision Tables
I Trees
I Rules
I Linear Models
I Instance-Based Representations
I Clusters

45 / 72
Tables
Decision Table
I Concise visual representation for specifying which actions to
perform based on given conditions.
I Contains a set of attributes and a decision label for each
unique set of attribute values.

46 / 72
Trees
Building Blocks
I Nodes: specify decisions to be made.
I Branches from a node represent possible alternatives.
I A branch connects a parent node to one of its child nodes.
I The very top node without a parent is called root.
I The very bottom nodes without a child are called leaves.

47 / 72
Trees (2)

Decision Trees:
I Branches may involve a single or multiple attributes.
I We examine the value of an attribute and branch based on
equality or inequality.
if (temperature < 80):
    branch left
else:
    branch right

I The alternatives of a decision can be:


I two-way such as yes or no,
I three-way such as <, =, or >,
I multi-way.

48 / 72
Trees (3)
Decision Trees:
I A path is a sequence of nodes such that each node is the
child of the previous node in the sequence.
I An attribute can be tested more than once in a path.
I In a classification context:
I A leaf specifies a class.
I Each instance satisfying all decisions of the corresponding path
from the root to the leaf is assigned this class.

49 / 72
Trees (4)

Missing Value Problem


It is unclear which branch should be considered if an attribute
value is missing.

I Possible solutions:
I Ignore all instances with missing values.
I Treat missing as an additional value each attribute may take.
I Set the most popular choice for each missing attribute value.
I Make a probabilistic (weighted) choice for each missing
attribute value, based on the other instances.
I All these solutions propagate errors, especially when the
number of missing values increases.

50 / 72
Trees (5)

Functional Tree
I Computes a function of multiple attribute values in each node.
I Branches based on the value returned by the function.
if (petal_length * petal_width > threshold):
    make decision
else:
    make different decision

51 / 72
Trees (6)
Regression Tree
I Predicts numeric values.
I Each node branches on the value of an attribute or on the
value of a function of the attributes.
I A leaf specifies a predicted value for corresponding instances.

52 / 72
Trees (7)
Model Trees
I Similar to a regression tree, except that a regression
equation predicts the numeric output value in each leaf.
I A regression equation predicts a numeric quantity as a
function of the attributes.
I More sophisticated than linear regression and regression trees.

53 / 72
Rules
Rule
I An expression in if-then format.
I The if part is the pre-condition or antecedent and consists
of a series of tests.
I The then part is the conclusion or consequent and assigns
values to one or more attributes.

The pre-condition may contain multiple clauses in the form of:


I Conjunction, i.e. tests linked by and, meaning that all tests
must be true to fire the rule.
I Disjunction, i.e. tests linked by or, meaning that at least one
test must be true to fire the rule.
I General logic expressions, i.e. tests linked by different
logical operators (and/or).

54 / 72
Rules (2)

Classification rules:
I Predict the class or label of an instance.
I Can be derived from a decision tree.
I One rule can be constructed for each leaf of the tree:
I The pre-condition contains a clause for each decision along the
path from the root to the leaf.
I The conclusion is the class of the leaf.
I Rule sets constructed in this way may contain redundancies,
especially if multiple leaves contain the same class.

55 / 72
Rules (3)
I Transforming a set of rules into a decision tree is also
possible, but not straightforward.
I The difficulty is the order of tests, starting from the root.
I The replicated subtree problem may occur, i.e. no matter
which rule is chosen first, the other is replicated in the tree.
I Sometimes, classification rules can be significantly more
compact than decision trees.

if a and b:
    x
if c and d:
    x

56 / 72
Rules (4)

I A set of rules may fail to classify an instance.


I These situations cannot happen with decision trees.
I Ordered set of rules:
I Should be applied based on the given order (decision list).
I An individual rule, taken out of the list, may be incorrect.
I Unordered sets of rules:
I Each rule represents an independent piece of knowledge.
I Different rules may lead to different classes for one instance.

57 / 72
Rules (5)
Association rules:
I Predict an attribute of an instance.
I Similar to classification rules except that they can predict
combinations of attributes too.
I Express different regularities in the data set.
I Many different association rules, even in tiny data sets.

Interesting association rules:


I High coverage and accuracy.
I Coverage: number of correctly predicted instances.
I Accuracy: proportion of instances to which the rule applies
that it predicts correctly.
I Typically, we seek rules with coverage and accuracy above
prescribed thresholds.
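As a sketch of these measures (the helper below is illustrative, not from the lecture), coverage and accuracy can be computed by treating the rule's pre-condition and conclusion as predicates over instances:

```python
def rule_quality(pre, post, instances):
    """Coverage: number of instances the rule predicts correctly.
    Accuracy: proportion of instances satisfying the pre-condition
    that also satisfy the conclusion."""
    fires = [x for x in instances if pre(x)]       # rule applies
    correct = [x for x in fires if post(x)]        # rule is right
    coverage = len(correct)
    accuracy = len(correct) / len(fires) if fires else 0.0
    return coverage, accuracy

# Toy weather instances (illustrative values)
data = [
    {"outlook": "sunny", "humidity": "high", "play": "no"},
    {"outlook": "sunny", "humidity": "high", "play": "no"},
    {"outlook": "sunny", "humidity": "normal", "play": "yes"},
]
cov, acc = rule_quality(
    lambda x: x["outlook"] == "sunny" and x["humidity"] == "high",
    lambda x: x["play"] == "no",
    data,
)
print(cov, acc)  # 2 1.0
```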

58 / 72
Rules (6)

Learning Rules:
I By adding new rules and refining existing ones as more
instances are added to the training set.
I A refinement may add another conjunctive clause (and) to a
pre-condition.

Rules may:
I contain functions of attribute values, e.g. area(rectangle).
I compare attribute values or functions of them, e.g.
area(rectangle) > width(rectangle).
I recursively concern different data set parts, e.g.
tallerThan(rectangle, triangle).

59 / 72
Linear Models
I A linear model is a weighted sum of attribute values.
I E.g. PRP = 2.47 · CACH + 37.06.
I All attribute values must be numeric.
I Typically visualised as a 2D scatter plot with a regression
line, i.e. a linear function that best represents the data.
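A linear model is just a weighted sum plus an intercept; a minimal sketch using the PRP equation from the slide:

```python
def linear_model(weights, bias, values):
    # Weighted sum of attribute values plus an intercept
    return bias + sum(w * v for w, v in zip(weights, values))

# PRP = 2.47 * CACH + 37.06 (single-attribute model from the slide)
print(linear_model([2.47], 37.06, [100]))  # about 284.06
```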

60 / 72
Linear Models (2)
I Linear models can be applied to classification problems, by
defining decision boundaries separating instances that
belong to different classes.
I E.g. 0.5 · PL + 0.8 · PW = 2.0.
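A linear decision boundary classifies an instance by checking which side of the boundary it falls on. A sketch using the equation from the slide (the class names are placeholders, not from the lecture):

```python
def classify(pl, pw):
    # Boundary from the slide: 0.5 * PL + 0.8 * PW = 2.0
    score = 0.5 * pl + 0.8 * pw
    return "class A" if score > 2.0 else "class B"

print(classify(5.0, 1.5))   # class A (score 3.7)
print(classify(1.5, 0.25))  # class B (score 0.95)
```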

61 / 72
Instance-Based Representations

Instance-Based Learning
I Instead of creating models, memorise actual instances.
I Instances are knowledge representation themselves.
I For a new instance, search for its closest instances in the training set.

I All work is done when classifying new instances, rather than


when processing the training set.
I Instances are compared using a distance metric.
I The closest training instances are referred to as nearest
neighbours.
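A minimal 1-nearest-neighbour sketch (assuming numeric attributes and a Euclidean metric; the names and data are illustrative):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nn_classify(query, training):
    # training: list of (instance, label) pairs; return the label
    # of the training instance closest to the query
    _, label = min(training, key=lambda p: euclidean(p[0], query))
    return label

train = [([1.0, 1.0], "yes"), ([8.0, 9.0], "no")]
print(nn_classify([2.0, 2.0], train))  # yes
```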

62 / 72
Instance-Based Representations (2)

Euclidean Distance
Metric computing the distance between instances i and i′ with
numeric attributes.

d(i, i′) = √( Σ_{j=1}^{n} (x_{i,j} − x_{i′,j})² )

I d(i, i′): distance between instances i and i′.
I n: number of attributes.
I x_{i,j}: value of attribute j for instance i.
I x_{i′,j}: value of attribute j for instance i′.
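The formula translates directly to code; a quick sketch:

```python
import math

def euclidean_distance(x, y):
    # Sum squared differences over the n attributes, then take the root
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```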

63 / 72
Instance-Based Representations (3)

Hamming Distance
Metric computing the distance between instances i and i′ with
nominal attributes.

d(i, i′): number of attributes at which i and i′ differ.

I Attribute j contributes 0 to d(i, i′) if the attribute values
are the same (x_{i,j} = x_{i′,j}), and 1 if they differ
(x_{i,j} ≠ x_{i′,j}).
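The same idea in code; a quick sketch with two nominal weather instances (values illustrative):

```python
def hamming_distance(x, y):
    # Number of attributes at which the two instances differ
    return sum(1 for a, b in zip(x, y) if a != b)

# The instances differ in outlook and windy only
print(hamming_distance(["sunny", "hot", "high", False],
                       ["rainy", "hot", "high", True]))  # 2
```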

64 / 72
Instance-Based Representations (4)
I Often, it is not desirable to store all training instances.
I Deciding which instances to (i) save and (ii) discard is an issue.
I Even though instance-based methods do not learn an explicit
structure, the instances and distance metric specify
boundaries distinguishing different classes.
I Some instance-based methods create rectangular regions
containing instances of the same class.

65 / 72
Clusters

Clustering
Partitions the training set into regions which can be:
I non-overlapping, i.e. each instance is in exactly one cluster,
I overlapping, i.e. an instance may appear in multiple clusters.

66 / 72
Clusters (2)

Dendrogram (Hierarchical Clustering)


I Type of tree diagram showing a hierarchical structure of clusters.
I The top level partitions the space into two or more groups.
I These groups are further partitioned into subgroups and so on.

67 / 72
Model Evaluation

https://xkcd.com/242/

68 / 72
Model Evaluation (2)

I How good is the model?


I How does it perform on known data?
I How well does it predict for new data?
I A score function or error function computes the difference
between the predictions and the actual outcomes.
I Typically, we want to maximize score or minimize error.

69 / 72
Model Evaluation (3)

I Different data mining tasks use different score functions.


I We will cover scoring and evaluation approaches together
with data mining methods.
I The evaluation also depends on the type of the modelled data.
I For example, model evaluation may require:
I measuring quantitative differences,
I counting the number of correct predictions,
I statistical measures, e.g. t-test.

70 / 72
Model Evaluation (4)

I Typically, we divide the data set into:


I the training set for model building,
I the validation set for model selection,
I the test set for model evaluation.
I If the training set or test set is not a representative sample,
then we will not build a strong model.
I A model can only be as good as the data used to construct it.
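A minimal sketch of such a partition (the 60/20/20 proportions and the seed are assumptions for illustration):

```python
import random

def split(data, train_frac=0.6, val_frac=0.2, seed=0):
    # Shuffle, then slice into training, validation, and test sets
    d = list(data)
    random.Random(seed).shuffle(d)
    n_train = int(train_frac * len(d))
    n_val = int(val_frac * len(d))
    return d[:n_train], d[n_train:n_train + n_val], d[n_train + n_val:]

train, val, test = split(range(10))
print(len(train), len(val), len(test))  # 6 2 2
```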

71 / 72
