
Frank Klawonn

f.klawonn@fh-wolfenbuettel.de

Department of Computer Science
University of Applied Sciences Braunschweig/Wolfenbuettel
http://public.rz.fh-wolfenbuettel.de/~klawonn

Basic references

M.H. Dunham: Data Mining: Introductory and Advanced Topics. Prentice Hall, Upper Saddle River, NJ (2002).
J. Han, M. Kamber: Data Mining (2nd ed.). Morgan Kaufmann, San Mateo, CA (2005).
D. Hand, H. Mannila, P. Smyth: Principles of Data Mining. MIT Press, Cambridge, MA (2001).
D.T. Larose: Data Mining Methods and Models. Wiley, Chichester (2006).
D.T. Larose: Discovering Knowledge in Data: An Introduction to Data Mining. Wiley, Chichester (2006).
I.H. Witten, E. Frank: Data Mining (2nd ed.). Morgan Kaufmann, San Mateo, CA (2005).

Web references

http://www.cs.waikato.ac.nz/~ml/
Website for the book by Witten/Frank with
- Java software,
- example data sets,
- slides for teaching.

http://www.ics.uci.edu/~mlearn/MLRepository.html
(UCI repository). A collection of example data sets with various properties.

http://www.r-project.org/
A powerful statistics software.

What is data mining?

Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner. (Hand/Mannila/Smyth)

Data Mining

In the context of data mining, it is quite usual that the data to be examined were not necessarily collected for the specific purpose of analysis, in contrast to experimental data, where a suitable experiment is designed to collect the data.

Example: Money transactions within a bank must be documented for safety reasons. Nevertheless, requirements and customer preferences can be deduced from such data, for instance, how much money might be needed daily for a new ATM.

Data Mining

The data mining process should produce a result that can be represented in a form that is understandable for the user of the data, not just for a statistician. In the ideal case, the data mining methods are directly applied by the user of the data. In most cases, this does not work. This is why we are here!

Data Mining

Finding new or interesting relationships or patterns in data is not a trivial task.

Example: Applying a simple rule mining algorithm to hospital data might find the true, but completely uninteresting rule: If pregnant, then woman.

Data Mining

Data sets can be large. This might have severe consequences for the choice of suitable methods. Quadratic complexity in the number of data objects is very often not acceptable.

Characteristics of data

Data
- refer to single instances (single objects, persons, events, points in time, etc.),
- describe individual properties,
- are often available in huge amounts (databases, archives),
- are usually easy to collect or to obtain (e.g. cash registers with scanners in supermarkets, Internet),
- do not allow us to make predictions.

Characteristics of knowledge

Knowledge
- refers to classes of instances (sets of objects, persons, points in time, etc.),
- describes general patterns, structures, laws, principles, etc.,
- consists of as few statements as possible (this is an objective!),
- is usually difficult to find or to obtain (e.g. natural laws, education),
- allows us to make predictions.

Data vs. knowledge

(Customer) data:

ID    Name    Age  Sex     Income  ...
...   ...     ...  ...     ...
2448  Miller  35   Male    6000
2449  Smith   39   Female  7000
...   ...     ...  ...     ...

Knowledge:

80% of our customers are between 30 and 40 years old and earn between 5000$ and 9000$ per month.

Criteria to assess knowledge

Not all statements are equally important, equally substantial, equally useful. Knowledge must be assessed.

Assessment criteria:
- Correctness (probability, success in tests)
- Generality (range of validity, conditions of validity)
- Usefulness (relevance, predictive power)
- Comprehensibility (simplicity, clarity, parsimony)
- Novelty (previously unknown, unexpected)

Examples

A bank has decades of experience in giving loans to customers. Large amounts of data of customers who have or have not returned their loans are available.

Is it possible to derive an algorithm from the data that decides for a new customer desiring a loan whether the loan should be granted or not?

Examples

Problems/questions:
- Are the data complete and correct?
- Is the complete available customer information (age, income, address, ...) needed for the prediction?
- Is all the necessary information about the customer available?
- Are the data representative?
- Can an unambiguous prediction/decision be made?

Examples

A producer of porcelain wants to install an automatic quality check device that sorts out broken parts. The produced parts are stimulated by an acoustic signal and a frequency spectrum is measured. The parts are manually classified as "broken" or "OK".

Is it possible to derive an algorithm from the data that classifies a new part as "broken" or "OK", given the measured frequency spectrum?

Examples

A bank provides credit cards for customers and wants to detect as many fraud transactions as possible.

Problems/questions:
- Are the data complete and correct?
- Is the complete available information (amount, location of the transaction, previous customer history, customer address, ...) needed for the prediction?
- Is all the necessary information about the customer available?
- Are the data representative?
- Can an unambiguous prediction/decision be made?

Examples

Churn detection: Given customer data and history, find the possible candidates for churners.

Examples

Various properties of garbage particles are measured (colour, weight, magnetism, ...). Is it possible to sort the particles automatically into groups like paper, glass, metal, plastic?

(Supervised) Classification

Common property of all previous examples: Based on certain measurements/information, an assignment of an "object" (customer, porcelain part, garbage particle) to one group of a finite number of classes (grant loan "yes/no", "broken/OK", "churner/non-churner", "paper/glass/metal/plastic") is needed.

Such tasks are called (supervised) classification problems.

Examples

Given (selected) stock prices and indices of today, predict a specific stock price or index value for tomorrow.

Given today's weather conditions, predict the (local) temperature for tomorrow.

In both cases, historical data for decades are available.

Regression

These problems are also supervised learning problems. "Supervised" refers to the fact that the outcome/prediction for the given/historical data is known. In contrast to classification problems, the variable to be predicted (stock price, temperature) is continuous: it can take arbitrary real values.

Such tasks are called regression problems.

Examples

Given a customer database: Are the properties of the customers just randomly distributed, or can they be grouped into typical customer segments (poor young students, rich yuppies, OAPs with medium capital, ...)?

) University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. from region Frankfurt to region Munich.Examples Given the train tickets bought in Germany. Are there main connections that are used typically (at certain times/days)? (like from region Berlin to region Hamburg.23/39 .

(Unsupervised) Classification

In the two previous examples, the data should be grouped into classes. Similar data should be put into the same class. The classes are not known in advance.

Such tasks are called unsupervised classification problems, usually solved by cluster analysis; see the sketch below.
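Not on the original slides: a minimal cluster analysis sketch in R, using the built-in kmeans() function on the four numerical attributes of the built-in iris data (the iris data set is introduced later in this course); choosing three clusters is an assumption based on knowing that there are three species.

# k-means cluster analysis on the four numerical iris attributes
data(iris)                       # built-in version of the iris data
cl <- kmeans(iris[, 1:4], centers = 3)
table(cl$cluster, iris$Species)  # compare the found groups with the species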

Examples

Market basket analysis: Are there typical combinations of items that customers tend to buy together? For instance: Customers who buy wallpaper and paint also buy wallpaper paste in 90% of the cases.

Examples

Analysis of molecular structures (for instance for the design of drugs):

[Figure: molecular structures]

Frequent item set mining

Such tasks are called frequent item set mining and association rule mining. One is only interested in substructures of the data set, not in describing or covering the whole data set. A small R sketch follows below.
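As an illustration (not on the slides), association rules can be mined in R with the add-on package arules; the Groceries data set and the threshold values below are merely assumptions for this sketch.

# Association rule mining with the arules package (install it first)
library(arules)
data(Groceries)                   # example transaction data from arules
rules <- apriori(Groceries,
                 parameter = list(support = 0.01, confidence = 0.5))
inspect(head(sort(rules, by = "confidence")))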

Subgroup discovery

Subgroup discovery aims at finding subsets where a given class (e.g. customers who tend to buy certain products) is significantly over- or underrepresented.

Subgroup discovery can be viewed as a partial classification problem: not the whole data set needs to be classified; it is sufficient to find subsets with good classification rates.

Change detection

Are there significant changes in the data over time? E.g.:
- Does the company lose a customer group, or is it winning a new one?
- Is the quality of certain materials or products changing?

Online learning/data mining

Analysing the data online while they arrive in a continuous data stream.
- Incremental learning: It is assumed that the source/model generating the data does not change over time.
- Evolving systems: Changes in the model parameters over time might be possible. The information derived from "old" data might not be applicable to new data.

Examples: Complex data structures
- molecular structures (2D/3D, graphs)
- images (How to retrieve images from an image database?)
- time series (Given the past development of stock market indices, which trend will they follow in the near future?)
- (text) documents (How can documents be clustered into groups automatically?)

CRISP

CRoss-Industry Standard Process for Data Mining

Business understanding

This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.

Data understanding

The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.

Data preparation

The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modelling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modelling tools.

Modelling

In this phase, various modelling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.

Evaluation

At this stage in the project you have built a model (or models) that appears to have high quality from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model and review the steps executed to construct it, to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

Deployment

Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organised and presented in a way that the customer can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process. In many cases it will be the customer, not the data analyst, who will carry out the deployment steps. However, even if the analyst will not carry out the deployment effort, it is important for the customer to understand up front what actions will need to be carried out in order to actually make use of the created models.

Selected CRISP phases
- Data understanding: exploratory data analysis techniques, visualisation
- Data preparation: feature extraction, transformation, outlier detection, treatment of missing values
- Modelling: choice of the "model" (classifier, regression model, cluster analysis technique, ...) and estimation of the model parameters
- Evaluation: model validation and selection. Is the derived model suitable to make reliable predictions for new data?

Statistical reasoning

Definition. Let $A$ and $B$ be two events with $0 < P(B) < 1$.
- $B$ supports $A$, written $B \nearrow A$, if $P(A \mid B) > P(A)$ holds.
- $B$ speaks against $A$, written $B \searrow A$, if $P(A \mid B) < P(A)$ holds.
- $B$ is irrelevant for $A$ (independent of $A$), written $B \perp A$, if $P(A \mid B) = P(A)$ holds.

Statistical reasoning

$\nearrow$ should not be interpreted as some kind of probabilistic implication!

Theorem.
(a) $B \nearrow A \iff B \searrow \bar{A}$
(b) $B \searrow A \iff B \nearrow \bar{A}$
(c) $B \searrow A \iff \bar{B} \nearrow A$
(d) $B \perp A \iff B \perp \bar{A}$

Statistical reasoning

Proof. (only for (a) as an example)
$$P(A \mid B) > P(A) \iff 1 - P(\bar{A} \mid B) > 1 - P(\bar{A}) \iff P(\bar{A} \mid B) < P(\bar{A}) \iff B \searrow \bar{A} \qquad \square$$

Statistical reasoning

Theorem.
(a) $B \nearrow A \iff A \nearrow B$
(b) $B \searrow A \iff A \searrow B$
(c) $B \perp A \iff A \perp B$
(d) $B \nearrow A \iff \bar{B} \nearrow \bar{A}$

Statistical reasoning

Theorem. In general:
(a) $C \nearrow B$ and $B \nearrow A$ do not imply $C \nearrow A$ (no transitivity),
(b) $B \nearrow A$ does not imply $(B \cap C) \nearrow A$,
(c) $B \nearrow A$ does not imply $(B \cup C) \nearrow A$.

Statistical reasoning

Proof.
(a) Choose $\Omega = \{1, 2, 3, 4, 5, 6\}$ with $P(\{\omega\}) = \frac{1}{6}$ for all $\omega \in \Omega$, and $A = \{1, 2\}$, $B = \{2, 3\}$, $C = \{3, 4\}$. Then $P(B \mid C) = \frac{1}{2} > \frac{1}{3} = P(B)$ and $P(A \mid B) = \frac{1}{2} > \frac{1}{3} = P(A)$, i.e. $C \nearrow B$ and $B \nearrow A$, but $P(A \mid C) = 0$.
(b) analogously
(c) $\Omega$ as in (a), $A = \{1, 2, 3, 4\}$, $B = \{1, 2\}$, $C = \{5, 6\}$: $P(A \mid B) = 1 > \frac{2}{3} = P(A)$, but $P(A \mid B \cup C) = \frac{1}{2} < P(A)$. $\qquad \square$

Statistical reasoning

Theorem. In general:
(a) $B \nearrow A$ and $C \nearrow A$ do not imply $(B \cap C) \nearrow A$,
(b) $B \nearrow A$ and $C \nearrow A$ do not imply $(B \cup C) \nearrow A$,
(c) $B \nearrow A$ and $B \nearrow C$ do not imply $B \nearrow (A \cap C)$,
(d) $B \nearrow A$ and $B \nearrow C$ do not imply $B \nearrow (A \cup C)$.

Corresponding theorems are valid for $\searrow$ and $\perp$.

Statistical reasoning

Proof. (only of (d) as an example)
Choose $\Omega = \{1, 2, 3, 4, 5, 6\}$ with $P(\{\omega\}) = \frac{1}{6}$ for all $\omega \in \Omega$, and $A = \{1, 2, 3, 4\}$, $C = \{3, 4, 5, 6\}$, $B = \{3, 4\}$. Then $P(A \mid B) = 1 > \frac{2}{3} = P(A)$ and $P(C \mid B) = 1 > \frac{2}{3} = P(C)$, i.e. $B \nearrow A$ and $B \nearrow C$. But $A \cup C = \Omega$, hence $P(A \cup C \mid B) = 1 = P(A \cup C)$, i.e. $B$ is irrelevant for $A \cup C$. $\qquad \square$

Statistical reasoning

Definition. The event $B$ supports the event $A$ under the general condition (event) $C$, written $(B \nearrow A) \mid C$, if
$$P(A \mid B \cap C) \;>\; P(A \mid C)$$
holds.

Theorem. In general, $B \nearrow A$ neither implies nor is implied by $(B \nearrow A) \mid C$.

ª for Å ½ ¾ ¿ ª. ª ½ ¾ ¿ ½ ¾ .10/106 . . Within : ½ ¾ .Statistical reasoning Proof. ½ È ´ È´ µ µ ½ ¾ È ½ ¾ ´ µ ½ ¿ µ ¾ ¿ È´ £ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. È ´Å µ ½ Å .

Probabilistic scissors, rock & paper

Player one chooses one of the following wheels; then player two will choose a wheel. Then both turn their wheel. The one with the higher number is the winner.

[Figure: three wheels (red, green, blue) with numbers between 0 and 5 on sectors of different sizes]

Probabilistic scissors, rock & paper

How to choose the wheels?

[Figure: the three wheels with their sector probabilities]

$P(\text{blue beats red})$, $P(\text{red beats green})$ and $P(\text{green beats blue})$ are all greater than $\frac{1}{2}$: the "beats" relation is cyclic, so whichever wheel player one picks, player two can pick a wheel that wins more often.

Causality

"A statistical survey has shown that students receiving a grant perform better in their exams."

- Are the grants the reason that the students perform better, since they do not have to earn money for their studies and can spend more time on learning?
- Or do only students with better results in school or early university years receive a grant?

Causality

Two balls are drawn without replacement from a bag with two white and two black balls. What is the probability that the second ball will be white, when the first ball that was drawn is white?

$$P(W_2 \mid W_1) = \frac{1}{3}$$

Causality

The first ball was drawn, but hidden from the observer. What is the probability that the first ball was white, when the second ball that was drawn is white?

Wrong answer: 0.5, since the second ball cannot have any influence on the first ball.

16/106 University of Applied Sciences Braunschweig/Wolfenbuettel .Causality Correct answer: 1 /3 1 /2 1 /2 B W 2 /3 B W B 2 /3 1 /3 W È ´Ï½ Ï¾µ ½ ½·½ ¿ ½ ¿ Data Mining – p.

Causality

Another reason why the answer 0.5 is wrong: Assume that there are two black balls and one white ball in the bag in the beginning.

What is the probability that the first ball is white? Answer: $\frac{1}{3}$.

What is the probability that the first ball was white, when the second ball is white? Answer: 0.

Hidden variables

The University of Fantasia provides grants for students based on an individual selection scheme.

         No. of applications   No. of grants
female   3000                  373
male     3000                  1304
total    6000                  1677

Further investigations have shown that the female students applying for grants have more or less the same school marks as the male applicants. Does this statistic prove that the selection scheme favours male students?

Hidden variables

Looking at the distributions over the subjects, it turns out that natural sciences and engineering students are favoured compared to social sciences students.

No. of applications:
         Natural Sci.   Engineering   Social Sci.
female   400            300           2300
male     1400           1200          400

No. of grants:
         Natural Sci.   Engineering   Social Sci.
female   200            150           23
male     700            600           4

Within each subject, male and female applicants are granted at (almost) the same rate; the apparent overall bias comes from the different distributions over the subjects.

Simpson's paradox

Hospital   Therapy   Patients   No effect   Cured
A          Old       250        70          180 (72%)
A          New       1050       420         630 (60%)
B          Old       1050       630         420 (40%)
B          New       250        180         70 (28%)

Simpson's paradox

Therapy   Patients   No effect   Cured
Old       1300       700         600 (46%)
New       1300       600         700 (54%)

In each hospital the old therapy has the higher cure rate, yet in the pooled data the new therapy appears better.
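The reversal can be reproduced in R directly from the numbers in the two tables above:

# Simpson's paradox: cure rates per hospital vs. pooled cure rates
cured <- matrix(c(180, 630, 420, 70), 2, 2,
                dimnames = list(c("old", "new"), c("A", "B")))
total <- matrix(c(250, 1050, 1050, 250), 2, 2,
                dimnames = list(c("old", "new"), c("A", "B")))
cured / total                    # per hospital: old is better in A and in B
rowSums(cured) / rowSums(total)  # pooled: new appears better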

Simpson's paradox

Market   Who          Total sales   2005   2006
Europe   Competitor   250           70     180 (+157%)
Europe   We           1050          420    630 (+50%)
Asia     Competitor   1050          630    420 (−33%)
Asia     We           250           180    70 (−61%)

The competitor has performed better than our company.

Simpson's paradox

Worldwide    Total sales   2005   2006
We           1300          600    700 (+17%)
Competitor   1300          700    600 (−14%)

We are better than the competitor!?

Example data set: iris data

Collected by E. Anderson in 1935; contains measurements of four real-valued variables: sepal length, sepal width, petal length and petal width of 150 iris flowers of the types Iris Setosa, Iris Versicolor and Iris Virginica (50 each). The fifth attribute is the name of the flower type.

Example data set: iris data

[Photos: iris setosa, iris versicolor and iris virginica]

Example data set: iris data

slength  swidth  plength  pwidth  species
5.1      3.5     1.4      0.2     Iris-setosa
4.9      3.0     1.4      0.2     Iris-setosa
...      ...     ...      ...     ...
7.0      3.2     4.7      1.4     Iris-versicolor
...      ...     ...      ...     ...
6.3      3.3     6.0      2.5     Iris-virginica
...      ...     ...      ...     ...

Example data set: iris data

The header (first line) specifies names for the attributes. The following lines contain the data with the values for the defined attributes separated by blanks or tabs. Therefore, each of the lines with the data must contain 5 values.

Statistics tool R

R uses a type-free command language. Assignments are written in the form
x <- y
(y is assigned to x). Declaration of x is not required. The object y must be defined (generated) before it can be assigned to x.
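For example:

x <- 3 * 7       # the value 21 is assigned to x; no declaration is needed
y <- c(1, 2, 3)  # c() generates a vector, which is then assigned to y
x                # typing an object's name prints its value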

R: Reading a file

iris <- read.table(file.choose(), header=T)

opens a file chooser. The chosen file is assigned to the object named iris. header=T means that the chosen file will contain a header.

R: Accessing a single variable

pw <- iris$pwidth

assigns the column named pwidth of the data set contained in the object iris to the object pw. The command print(pw) prints the corresponding column.

R: Printing on the screen

> print(pw)
  [1] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 0.2 0.2 0.1 0.1 0.2 0.4 0.4 0.3
 [19] ...
[145] 2.5 2.3 1.9 2.0 2.3 1.8

Empirical mean

In the following, we always consider a sample $x_1, \dots, x_n$ with $x_i \in \mathbb{R}$ for all $i \in \{1, \dots, n\}$.

Definition. The sample mean or empirical mean, denoted by $\bar{x}$, is given by
$$\bar{x} \;=\; \frac{1}{n} \sum_{i=1}^{n} x_i$$

Note the difference between the mean of a random variable and the empirical mean of a sample.

(Empirical) median

$x_{(1)} \le \dots \le x_{(n)}$ denotes the sample in ascending order.

Definition. The (sample or empirical) median, denoted by $\tilde{x}$, is given by
$$\tilde{x} \;=\; \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left( x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right) & \text{if } n \text{ is even} \end{cases}$$

(Empirical) median

[Figure: position of the median in the ordered sample, for odd and for even n]

R: Empirical mean & median

The mean and median can be computed in R using the functions mean() and median(), respectively.

> mean(pw)
[1] 1.198667
> median(pw)
[1] 1.3

The mean and median can also be applied to data objects consisting of more than one (numerical) column, yielding a vector of mean/median values.

Empirical variance

Definition. The sample variance or empirical variance is defined by
$$s^2 \;=\; \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \;=\; \frac{1}{n-1} \sum_{i=1}^{n} x_i^2 \;-\; \frac{1}{n(n-1)} \left( \sum_{i=1}^{n} x_i \right)^2$$

$s = \sqrt{s^2}$ is called sample standard deviation or empirical standard deviation.
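The two forms of the definition can be checked in R against the built-in var() function (shown on the next slide); the small sample x is just an illustration:

x <- c(2, 4, 4, 7, 9)                    # a small example sample
n <- length(x)
sum((x - mean(x))^2) / (n - 1)           # empirical variance by definition
sum(x^2)/(n - 1) - sum(x)^2/(n*(n - 1))  # the equivalent second form
var(x)                                   # R's built-in function agrees (7.7)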

R: Empirical variance

The function var() yields the empirical variance in R.
> var(pw)
[1] 0.5824143

The function sd() yields the empirical standard deviation.
> sd(pw)
[1] 0.7631607

R: min and max

The functions min() and max() compute the minimum and the maximum in a data set.
> min(pw)
[1] 0.1
> max(pw)
[1] 2.5

$x_{(n)} - x_{(1)}$ is called the span.

Interquartile range

The empirical variance and the standard deviation as well as the span are measures of dispersion. The span is extremely sensitive to outliers, but the empirical variance is also sensitive to outliers. Therefore, the interquartile range $\tilde{x}_{0.75} - \tilde{x}_{0.25}$ is often used as a measure of dispersion.

Interquartile range in R (for the data in pw): IQR(pw)
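Equivalently, the interquartile range can be computed from the two quartiles (a sketch; R knows several quantile conventions, which may lead to slightly different values):

q <- quantile(pw, c(0.25, 0.75))  # 25%- and 75%-quantiles of pw
q[2] - q[1]                       # the interquartile range, cf. IQR(pw)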

Visualisation

John W. Tukey: "There is no excuse for failing to plot and look."

Bar charts

A bar chart shows the relative or absolute frequencies for the values of an attribute with a finite domain. In R, the package lattice is required.

Installation of R-packages

Once the computer is connected to the Internet, packages (e.g. lattice) can be downloaded and installed by the command
install.packages()
After choosing the mirror site for download, available packages are shown in alphabetical order. (Packages need to be downloaded only once, but they might need to be installed again.)

Bar charts example:

cl <- iris$species
barchart(cl)

[Figure: horizontal bar chart of the frequencies (Freq) of Iris-setosa, Iris-versicolor and Iris-virginica, 50 each]

Histograms

A histogram shows the absolute number of data or the relative frequency of data in different classes. For numerical samples, bins (intervals) representing the classes must be defined. In most cases, intervals of the same length are chosen. The area of each rectangle is proportional to the number of data in the corresponding range.

R: Histograms

The function
hist(pw, breaks=6, prob=T, main="petal width")
generates a histogram for the (numerical) data in pw, partitioning the domain of pw using breaks=6 (5 intervals of the same length), showing relative frequencies (prob=T) with the caption "petal width".

Using the command
postscript("outputfile.eps")
the generated graphics will not be shown; it is stored in the PostScript file "outputfile.eps".

R: Histograms

[Figure: histogram of petal width with the density on the vertical axis]

Empirical cdf

Definition. The empirical cumulative distribution function is given by
$$F(x) \;=\; \frac{\left| \{\, i : x_i \le x \,\} \right|}{n}$$

plot.ecdf(pw, main="petal width")
generates the empirical cdf for the data set pw, including the title "petal width" in the graphics.

Empirical cdf

[Figure: empirical cumulative distribution function Fn(x) of petal width]

Boxplots

[Figures, over five slides: the anatomy of a boxplot, annotated step by step: the median; the box spanning the interquartile range (about 50% of the data); the whiskers extending up to 1.5 times the interquartile range; points beyond the whiskers marked as outliers]

Boxplots

Boxplots or box and whiskers plots summarise important characteristics of a sample. A boxplot for the data set stored in irisslbyclass can be generated by
bxp(boxplot(irisslbyclass))
Usually, boxplots for different samples are compared. (Here: the sepal length for setosa, versicolor and virginica.)

Boxplots

1. Determine the median and the 25%- and 75%-quartiles $q_1$ and $q_3$ of the sample.
2. Draw a box limited by these two quartiles; the other dimension of the box can be chosen arbitrarily. Draw a thick line at the position of the median. $\mathrm{iqr} = q_3 - q_1$ is the interquartile range.
3. The inner fence is defined by the two values $f_1 = q_1 - 1.5 \cdot \mathrm{iqr}$ and $f_3 = q_3 + 1.5 \cdot \mathrm{iqr}$.

Boxplots

4. Find the smallest data point greater than $f_1$ and the largest data point smaller than $f_3$.
5. Add "whiskers" to the box extending to these two data points.
6. Data points lying outside the box and the whiskers are called outliers. Enter these data points in the diagram, for instance by circles. Sometimes, extreme outliers (out of the outer fence defined by $F_1 = q_1 - 3 \cdot \mathrm{iqr}$ and $F_3 = q_3 + 3 \cdot \mathrm{iqr}$) are drawn in a different way than mild outliers outside the whiskers but within the inner fence.
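The construction can be retraced in R for the data in pw (a sketch; R's own boxplot() uses slightly different quartile conventions, so minor deviations are possible):

q      <- quantile(pw, c(0.25, 0.75))        # step 1: quartiles
iqr    <- q[2] - q[1]                        # interquartile range
fences <- c(q[1] - 1.5*iqr, q[2] + 1.5*iqr)  # step 3: inner fence
pw[pw < fences[1] | pw > fences[2]]          # step 6: the outliers
boxplot(pw)                                  # compare with R's boxplot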

Scatterplots with R

plot(iris)

[Figure: matrix of pairwise scatterplots of slength, swidth, plength and pwidth]

Scatterplots with R

species <- which(colnames(iris)=="species")
super.sym <- trellis.par.get("superpose.symbol")
splom(~iris[1:4], data = iris, groups = species, panel = panel.superpose)

Scatterplots of the numerical attributes using different symbols for the classes.

Scatterplots with R

[Figure: scatter plot matrix of slength, swidth, plength and pwidth with different symbols per class]

Scatterplots with R

splom(~iris[1:4], data = iris, groups = species, panel = panel.superpose,
      key = list(title = "Three Varieties of Iris", columns = 3,
                 points = list(pch = super.sym$pch[1:3],
                               col = super.sym$col[1:3]),
                 text = list(c("Setosa", "Versicolor", "Virginica"))))

Note: This needs the additional R package lattice which has to be installed and loaded first.

Scatterplots with R

[Figure: the scatter plot matrix with a legend for the three iris varieties]

3D scatterplots with R

scatterplot3d(iris2$pw, iris2$sw, iris2$pl, pch=iris2$species)

[Figure: three-dimensional scatter plot of pw, sw and pl]

Enriched 3D scatterplots

[Figure: three-dimensional scatter plot enriched with additional graphical attributes]

Principal component analysis (PCA)

PCA is a technique for dimension reduction, which can also be used for visualisation. PCA aims at finding a linear mapping of the data to a lower-dimensional space that maintains as much of the variance of the original data as possible, without stretching during the projection.

PCA

A projection of the two-dimensional data to the (one-dimensional) line (principal component) leads to a dimension reduction with little loss of variance.

[Figure: two-dimensional data points and their projection onto the first principal component]

PCA

Projection of $p$-dimensional data to a $q$-dimensional space where $q < p$. First, the data are centred in the origin of the coordinate system; $\bar{x}$ denotes the mean value:
$$\bar{x} \;=\; \frac{1}{n} \sum_{i=1}^{n} x_i$$
Then a $q \times p$ matrix $M$ is needed for the projection:
$$y_i \;=\; M \cdot (x_i - \bar{x})$$

Covariance

The empirical covariance of two attributes $x$ and $y$ is defined as
$$s_{xy} \;=\; \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
In case $x$ and $y$ are independent, $s_{xy} \approx 0$ should hold.

Covariance

The empirical correlation coefficient of $x$ and $y$ is defined as the normalised covariance:
$$r_{xy} \;=\; \frac{s_{xy}}{s_x \, s_y}$$
$-1 \le r_{xy} \le 1$ always holds. The larger $|r_{xy}|$, the more $x$ and $y$ correlate.
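In R (a sketch, using the column names of the iris file read in earlier):

sl <- iris$slength
pl <- iris$plength
sum((sl - mean(sl)) * (pl - mean(pl))) / (length(sl) - 1)  # s_xy by definition
cov(sl, pl)  # built-in covariance, same value
cor(sl, pl)  # normalised covariance: the correlation coefficient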

PCA

The projection matrix for PCA is given by $M = (v_1, \dots, v_q)^{\top}$, where $v_1, \dots, v_q$ are the normalised eigenvectors of the covariance matrix of the data
$$C \;=\; \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\top}$$
for the $q$ largest eigenvalues $\lambda_1 \ge \dots \ge \lambda_q$.

PCA

The sum of the variances of the projected data is the sum of these eigenvalues: $\lambda_1 + \dots + \lambda_q$. When PCA is used for dimension reduction, it is important to know how much of the original variance is covered by the projection to $q$ dimensions:
$$\frac{\lambda_1 + \dots + \lambda_q}{\lambda_1 + \dots + \lambda_p} \cdot 100\%$$
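A sketch of these formulas in R (species indexes the class column, as defined on the next slide; prcomp() below performs the same computation internally):

X  <- scale(iris[, -species])    # centre (and here also scale) the data
C  <- cov(X)                     # covariance matrix of the centred data
ev <- eigen(C)                   # eigenvalues in decreasing order
M  <- t(ev$vectors[, 1:2])       # projection matrix for q = 2
Y  <- X %*% t(M)                 # the projected data
sum(ev$values[1:2]) / sum(ev$values)  # share of the preserved variance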

PCA with R

species <- which(colnames(iris)=="species")
iris_pca <- prcomp(iris[,-species], center=T, scale=T)

The nominal attribute species must be omitted for PCA. If all attributes of a data set dset are numerical, then
dset_pca <- prcomp(dset, center=T, scale=T)
is sufficient. Centring and scaling are carried out here as well.

PCA with R

center: a logical value indicating whether the variables should be shifted to be zero centred. Alternately, a vector of length equal the number of columns of the data set can be supplied. The value is passed to scale.

scale: a logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place. Alternatively, a vector of length equal the number of columns of the data set can be supplied. The default is FALSE, but in general scaling is advisable.

PCA with R

> print(iris_pca)
Standard deviations:
[1] 1.7061120 0.9598025 0.3838662 0.1435538

Rotation:
                PC1          PC2         PC3         PC4
slength   0.5223716  -0.37231836   0.7210168   0.2619956
swidth   -0.2633549  -0.92555649  -0.2420329  -0.1241348
plength   0.5812540  -0.02109478  -0.1408923  -0.8011543
pwidth    0.5656110  -0.06541577  -0.6338014   0.5235463

PCA with R

> summary(iris_pca)
Importance of components:
                          PC1    PC2     PC3      PC4
Standard deviation      1.706  0.960  0.3839  0.14355
Proportion of Variance  0.728  0.230  0.0368  0.00515
Cumulative Proportion   0.728  0.958  0.9949  1.00000

> iris_pca$center
 slength   swidth  plength   pwidth
5.843333 3.054000 3.758667 1.198667

PCA with R

plot(predict(iris_pca))

[Figure: scatter plot of the data projected onto the first two principal components (PC1, PC2)]

Digression: Regression

- Given: data (for example: two real-valued attributes)
- Choose a "model" (for example: $y = a \cdot x + b$)
- Define an error or goodness of fit function (for example: the mean squared error $E(a, b) = \frac{1}{n} \sum_{i=1}^{n} (a x_i + b - y_i)^2$)
- Find the optimum of the fitting function (for example: standard least squares regression)
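For illustration (not on the slides), a standard least squares fit in R, here for two iris attributes:

x <- iris$plength
y <- iris$pwidth
fit <- lm(y ~ x)         # minimises the (mean) squared error
coef(fit)                # estimated intercept b and slope a
plot(x, y); abline(fit)  # the data together with the fitted line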

Digression: Regression

[Figure: example data set for regression: scatter plot of y against x]

Digression: Regression

[Figure: the error function plotted as a surface over the two model parameters]

Multidimensional scaling (MDS)

Multidimensional scaling tries to map the high-dimensional data to a low- (2- or 3-)dimensional space with the aim to preserve the distances between the data as much as possible. The definition of a suitable error measure is required.

MDS

Typical error measures (with $d_{ij} = \|x_i - x_j\|$ and $d'_{ij} = \|y_i - y_j\|$):

$$E_1 \;=\; \frac{1}{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} d_{ij}^2} \; \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left( d'_{ij} - d_{ij} \right)^2$$

$$E_2 \;=\; \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left( \frac{d'_{ij} - d_{ij}}{d_{ij}} \right)^2$$

$$E_3 \;=\; \frac{1}{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} d_{ij}} \; \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{\left( d'_{ij} - d_{ij} \right)^2}{d_{ij}}$$

MDS

The error is often called stress. $E_3$ is the error measure of the Sammon method. For none of these error functions an analytical solution for the minimum is known. Therefore a gradient descent is applied for minimising the error function. To minimise or maximise a function $f(x_1, \dots, x_p)$ in multiple variables, gradient techniques are one possibility.

Gradient method

The gradient (the vector given by the partial derivatives $\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_p}$) points in the direction of the steepest ascent of $f$. In order to maximise $f$, one starts in an arbitrary point, computes the gradient, goes into the direction of the gradient, computes the gradient in the new point and continues until convergence. In order to minimise a function, one simply goes into the opposite direction of the gradient.

A gradient method can only find local extrema at best!
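A minimal gradient descent sketch in R; the example function and the step width are assumptions for illustration only:

# f(x1,x2) = (x1-1)^2 + 2*(x2+0.5)^2, gradient (2(x1-1), 4(x2+0.5))
grad  <- function(x) c(2*(x[1] - 1), 4*(x[2] + 0.5))
x     <- c(0, 0)  # arbitrary starting point
alpha <- 0.1      # step width
for (i in 1:100) x <- x - alpha * grad(x)  # step against the gradient
x                 # close to the minimum (1, -0.5), here even the global one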

MDS

Determining the gradient for error function $E_1$ (only the pairs involving $y_k$ contribute, all other terms vanish):

$$\frac{\partial E_1}{\partial y_k} \;=\; \frac{2}{\sum_{i<j} d_{ij}^2} \sum_{j \ne k} \left( d'_{kj} - d_{kj} \right) \frac{y_k - y_j}{d'_{kj}}$$

MDS algorithm

1. Given: data set $x_1, \dots, x_n \in \mathbb{R}^p$, projection dimension $q < p$, step width $\alpha$, threshold value $\varepsilon$, maximum number of iteration steps $s$.
2. Initialise $y_1, \dots, y_n \in \mathbb{R}^q$ randomly (or with a PCA projection).
3. Compute the distances $d_{ij}$ ($i, j = 1, \dots, n$).
4. Compute the distances $d'_{ij}$ ($i, j = 1, \dots, n$).
5. Compute the gradients $\nabla_{y_k} E$ ($k = 1, \dots, n$).
6. Update $y_k^{\mathrm{new}} = y_k^{\mathrm{old}} - \alpha \cdot \nabla_{y_k} E$ ($k = 1, \dots, n$).
7. Repeat from step 4 if $\frac{1}{n} \sum_{k=1}^{n} \| y_k^{\mathrm{new}} - y_k^{\mathrm{old}} \|^2 > \varepsilon$ and the maximum number of iteration steps is not reached.

R: MDS (Sammon mapping)

> d.iris <- dist(iris[,-species])
> iris.sammon <- sammon(d.iris, k=2)
Initial stress        : 0.00691
stress after  10 iters: 0.00492, magic = 0.092
stress after  20 iters: 0.00447, magic = 0.030
stress after  30 iters: 0.00408, magic = 0.213
stress after  40 iters: 0.00405, magic = 0.500
> plot(iris.sammon$points)

sammon() is provided by the MASS package. Note that duplicate tuples must be removed in advance; otherwise zero distances lead to division by zero.

MDS

[Figure: Sammon projection of the iris data onto two dimensions (iris.sammon$points)]

Visualisation of high-dimensional data

Scatterplots

[Figure: scatterplot matrix of the attributes X1, X2, X3]

PCA

[Figure: the same data projected onto the first two principal components (PC1, PC2)]

MDS

[Figure: the same data mapped with the Sammon method (sampo[,1] vs. sampo[,2])]

Parallel coordinates

parallel(iris)

[Figure: parallel coordinates plot of slength, swidth, plength, pwidth and species, each axis scaled from Min to Max]

Properties of data sets

ID   Age  Sex     Married  Education    Income
248  54   Male    Yes      High school  10000
249  ?    Female  Yes      High school  12000
250  29   Male    Yes      College      23000
251  9    Male    No       Child        0
252  85   Female  No       High school  19798
253  40   Male    Yes      High school  40100
254  38   Female  No       Ph.D.        2691
255  7    Male    ?        Child        0
256  49   Male    Yes      College      30000
257  76   Male    Yes      Ph.D.        30686

Properties of data sets

- Such a data table is called a data set.
- A single line refers to an individual, case, instance, object, datum, record or entity.
- The columns are called attributes, variables or features.

Types of attributes

Nominal (categorical) attributes have a finite domain. Examples:
- Sex (male/female) (binary attribute)
- Major subjects (statistics, databases, ...)
- Nationality (German/English/French/...)

Types of attributes

Ordinal attributes have a finite domain endowed with a linear ordering. Examples:
- German school types (Hauptschule/Realschule/Fachgymnasium/Gymnasium)
- Employee status (worker/department head/CEO)

Types of attributes

Interval attributes are numbers measured in the same unit; however, there is no absolute definition of zero. Differences can be calculated, but sums or products are meaningless. Examples:
- Date (year) (arbitrary definition of the year 0)
- Temperature

Types of attributes

Ratio attributes have a well-defined location of zero, in contrast to interval attributes. Quotients and sums are meaningful. Examples:
- Distance
- Number of children

Types of attributes

Integer attributes can be interval or ratio attributes. Examples:
- Date (year) (interval attribute)
- Number of children (ratio attribute)

Types of attributes

Continuous attributes can be interval or ratio attributes. Examples:
- Temperature (interval attribute)
- Distance (ratio attribute)

There are special attributes like angles that behave only locally as an ordinal or interval attribute.

Missing values

For some instances, values of single attributes might be missing. Causes for missing values:
- broken sensors
- refusal to answer a question
- irrelevant attribute for the corresponding object (pregnant (yes/no) for men)

Treatment of missing values

- When there are only few missing values: remove the objects with missing values.
- Imputation of missing values (mean value, median, most frequent value, or estimation based on other attributes); see the sketch below.
- Treatment of missing values before or during the application of data mining algorithms (depending on the problem and the algorithm).
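A minimal R sketch of the imputation option, replacing missing values (NA) of one attribute by the mean or the median of the observed values; as discussed below, this is only sensible under the MCAR/MAR assumptions:

x <- c(5.1, NA, 4.9, 6.0, NA, 5.5)            # a sample with missing values
ifelse(is.na(x), mean(x, na.rm = TRUE), x)    # mean imputation
ifelse(is.na(x), median(x, na.rm = TRUE), x)  # median imputation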

Types of missing values

Consider an attribute $X$; a missing value is denoted by "?". Let $X_{\mathrm{true}}$ be the true value of the considered attribute and $X_{\mathrm{obs}}$ the observed value, i.e. $X_{\mathrm{obs}} = X_{\mathrm{true}}$ if $X_{\mathrm{obs}} \neq\; ?$. Let $Y$ be the (multivariate) (random) variable denoting the other attributes apart from $X$.

Types of missing values

Missing completely at random (MCAR): The probability that a value for $X$ is missing depends neither on the true value of $X$ nor on the other variables:
$$P(X_{\mathrm{obs}} =\; ? \mid X_{\mathrm{true}}, Y) \;=\; P(X_{\mathrm{obs}} =\; ?)$$
Example: The maintenance staff sometimes forgets to change the batteries of a sensor, so that the sensor sometimes does not provide any measurements.

MCAR is also called Observed At Random (OAR).

Types of missing values

Missing at random (MAR): The probability that a value for $X$ is missing does not depend on the true value of $X$ (but may depend on the other variables):
$$P(X_{\mathrm{obs}} =\; ? \mid X_{\mathrm{true}}, Y) \;=\; P(X_{\mathrm{obs}} =\; ? \mid Y)$$
Example: The maintenance staff does not change the batteries of a sensor when it is raining, so that the sensor does not always provide measurements when it is raining.

Types of missing values

Nonignorable: The probability that a value for $X$ is missing depends on the true value of $X$.

Example: A sensor for the temperature will not work when there is frost. In the extreme case of this temperature sensor, it is impossible to provide any statement concerning temperatures below 0 °C.

In the cases of MCAR and MAR, the missing values can be estimated, at least in principle, when the data set is large enough, based on the values of the other attributes. (The cause for the missing values is ignorable.)

Missing values

- In the case of MCAR, it can be assumed that the missing values follow the same distribution as the observed values of $X$.
- In the case of MAR, the missing values might not follow the distribution of $X$. But by taking the other attributes into account, it is possible to derive reasonable imputations for the missing values.
- In the case of nonignorable missing values it is impossible to provide sensible estimations for the missing values.

A classification problem

[Figures, over four slides: a two-dimensional two-class data set and possible ways to separate the classes]

Darts results

professional, hobby player, beginner

[Figures, over two slides: dart positions on the target for the three types of players]

1D darts distributions

[Figure: one-dimensional distributions of the dart positions for professional, hobby player and beginner]

Objectives for classification

Minimise the misclassification rate or, more generally, minimise the expected loss.

Cost matrix for misclassifications: $L(\hat{c} \mid c)$ denotes the resulting costs when predicting class $\hat{c}$ instead of the correct class $c$.

General assumption: $L(c \mid c) = 0$, i.e. correct predictions cause no costs.

Cost matrix example

Production of items (cups, plates, ...). Broken items should be sorted out.

Cost matrix:
                     predicted class
true class      OK         broken
OK              0          $c_P$
broken          $c_R$      0

payment for compensation. 1 1 0 Data Mining – p.Cost matrix example ¢ £¡ : production costs for one item ¤ ¡ : posting costs for resending the item.10/12 University of Applied Sciences Braunschweig/Wolfenbuettel ... . . loss of reputation. £ 1 1 1 . cost matrix for misclassiﬁcation rate predicted class £ £ ¢ true class ¢ £ 0 1 .

Expected loss

The expected loss given evidence $x$ and predicting class $\hat{c}$ is
$$\mathrm{loss}(\hat{c} \mid x) \;=\; \sum_{c} L(\hat{c} \mid c) \, P(c \mid x)$$

$$\text{predicted class} \;=\; \operatorname*{argmin}_{\hat{c}} \; \mathrm{loss}(\hat{c} \mid x)$$

Expected loss

$P(x)$ is a constant factor independent of the class. It is sufficient to consider the likelihoods (unnormalised probabilities)
$$P(x \mid c) \, P(c)$$
and the relative expected losses
$$\mathrm{rloss}(\hat{c} \mid x) \;=\; \sum_{c} L(\hat{c} \mid c) \, P(x \mid c) \, P(c)$$
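A numerical sketch in R for the production example; the concrete costs (c_P = 1, c_R = 3) and the likelihood values are assumptions for illustration:

# L[chat, c]: costs of predicting class chat when c is the true class
L <- matrix(c(0, 3,    # predict OK:     0 if OK, c_R = 3 if broken
              1, 0),   # predict broken: c_P = 1 if OK, 0 if broken
            2, 2, byrow = TRUE,
            dimnames = list(c("OK", "broken"), c("OK", "broken")))
lik   <- c(OK = 0.2, broken = 0.05)  # assumed values of P(x|c)*P(c)
rloss <- L %*% lik                   # relative expected loss per prediction
rownames(rloss)[which.min(rloss)]    # predicted class: the argmin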

A very simple decision tree

Assignment of a drug to a patient:

[Figure: decision tree that assigns Drug A or Drug B based on tests of the patient's attributes]

Classification with decision trees

Recursive descent:
- Start at the root node.
- If the current node is an inner node: test the attribute associated with the node, follow the branch labeled with the outcome of the test, and apply the algorithm recursively.
- If the current node is a leaf node: return the class assigned to the node.

Intuitively: follow the path corresponding to the case to be classified.

Classification with decision trees

Assignment of a drug to a 30 year old patient with normal blood pressure:

[Figures, over three slides: the path through the decision tree is followed step by step, from the root test to the leaf with the assigned drug]

Induction of decision trees

- Top-down approach: build the decision tree from top to bottom (from the root to the leaves).
- Greedy selection of a test attribute: compute an evaluation measure for all attributes; select the attribute with the best evaluation.
- Divide and conquer / recursive descent: divide the example cases according to the values of the test attribute; apply the procedure recursively to the subsets; terminate the recursion if
  - all cases belong to the same class, or
  - no more test attributes are available.
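Not on the slides: essentially this procedure is implemented in the R package rpart (which by default uses the Gini index rather than information gain as evaluation measure); minsplit is lowered because the example data set below is tiny.

library(rpart)
patients <- data.frame(
  sex = factor(c("m","f","f","m","f","m","f","m","m","f","f","m")),
  age = c(20,73,37,33,48,29,52,42,61,30,26,54),
  bp  = factor(c("normal","normal","high","low","high","normal",
                 "normal","low","normal","normal","low","high")),
  drug = factor(c("A","B","A","B","A","A","B","B","B","A","B","A")))
tree <- rpart(drug ~ sex + age + bp, data = patients,
              method = "class", minsplit = 2)
print(tree)  # tests on blood pressure and age, as derived below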

Decision tree induction: Example

Patient database: 12 example cases, 3 descriptive attributes, 1 class attribute.

No   Sex     Age   Blood pr.   Drug
1    male    20    normal      A
2    female  73    normal      B
3    female  37    high        A
4    male    33    low         B
5    female  48    high        A
6    male    29    normal      A
7    female  52    normal      B
8    male    42    low         B
9    male    61    normal      B
10   female  30    normal      A
11   female  26    low         B
12   male    54    high        A

Assignment of drug (without patient attributes): always drug A or always drug B: 50% correct (in 6 of 12 cases).

Decision tree induction: Example

Sex of the patient: division w.r.t. male/female.

male:    No 1, 6, 12 with drug A;  No 4, 8, 9 with drug B
female:  No 3, 5, 10 with drug A;  No 2, 7, 11 with drug B

Assignment of drug:
male:    50% correct (in 3 of 6 cases)
female:  50% correct (in 3 of 6 cases)
total:   50% correct (in 6 of 12 cases)

Decision tree induction: Example

Age of the patient: sort according to age; find the best age split, here: ca. 40 years.

Age ≤ 40: No 1, 11, 6, 10, 4, 3 (drugs A, B, A, A, B, A)
Age > 40: No 8, 5, 7, 12, 9, 2 (drugs B, A, B, A, B, B)

Assignment of drug:
≤ 40: A, 67% correct (in 4 of 6 cases)
> 40: B, 67% correct (in 4 of 6 cases)
total: 67% correct (in 8 of 12 cases)

Decision tree induction: Example

Blood pressure of the patient: division w.r.t. high/normal/low.

high:    No 3, 5, 12: all drug A
normal:  No 1, 6, 10 with drug A;  No 2, 7, 9 with drug B
low:     No 4, 8, 11: all drug B

Assignment of drug:
high:    A, 100% correct (in 3 of 3 cases)
normal:  50% correct (in 3 of 6 cases)
low:     B, 100% correct (in 3 of 3 cases)
total:   75% correct (in 9 of 12 cases)

Decision tree induction: Example

Current decision tree:

[Figure: root node tests blood pressure; high leads to Drug A, low leads to Drug B, normal is still undecided]

Decision tree induction: Example

Blood pressure and sex: only patients with normal blood pressure, divided w.r.t. male/female.

male:    No 1, 6 with drug A;  No 9 with drug B
female:  No 10 with drug A;  No 2, 7 with drug B

Assignment of drug:
male:    A, 67% correct (2 of 3)
female:  B, 67% correct (2 of 3)
total:   67% correct (4 of 6)

Decision tree induction: Example

Blood pressure and age: only patients with normal blood pressure; sort according to age, find the best age split, here: ca. 40 years.

Age ≤ 40: No 1, 6, 10: all drug A
Age > 40: No 7, 9, 2: all drug B

Assignment of drug:
≤ 40: A, 100% correct (3 of 3)
> 40: B, 100% correct (3 of 3)
total: 100% correct (6 of 6)

Decision tree induction: Example

Resulting decision tree:

[Figure: blood pressure: high leads to Drug A; low leads to Drug B; normal leads to a test on age: ≤ 40 gives Drug A, > 40 gives Drug B]

Decision tree induction: Notation

$S$: a set of case or object descriptions
$C$: the class attribute
$A^{(1)}, \dots, A^{(m)}$: other attributes (index dropped in the following)
$\mathrm{dom}(C) = \{c_1, \dots, c_{n_C}\}$, $n_C$: number of classes
$\mathrm{dom}(A) = \{a_1, \dots, a_{n_A}\}$, $n_A$: number of attribute values
$N_{..}$: total number of case or object descriptions, i.e. $N_{..} = |S|$
$N_{i.}$: absolute frequency of the class $c_i$
$N_{.j}$: absolute frequency of the attribute value $a_j$
$N_{ij}$: absolute frequency of the combination of the class $c_i$ and the attribute value $a_j$; $N_{i.} = \sum_{j=1}^{n_A} N_{ij}$ and $N_{.j} = \sum_{i=1}^{n_C} N_{ij}$
$p_{i.} = \frac{N_{i.}}{N_{..}}$: relative frequency of the class $c_i$
$p_{.j} = \frac{N_{.j}}{N_{..}}$: relative frequency of the attribute value $a_j$
$p_{ij} = \frac{N_{ij}}{N_{..}}$: relative frequency of the combination of class $c_i$ and attribute value $a_j$
$p_{i|j} = \frac{N_{ij}}{N_{.j}} = \frac{p_{ij}}{p_{.j}}$: relative frequency of the class $c_i$ in cases having attribute value $a_j$

Principle of decision tree induction

function grow_tree(S : set of cases) : node;
begin
  best_v := WORTHLESS;
  for all untested attributes A do
    compute frequencies N_ij, N_i., N.j for 1 ≤ i ≤ n_C and 1 ≤ j ≤ n_A;
    compute value v of an evaluation measure using N_ij, N_i., N.j;
    if v > best_v then best_v := v; best_A := A; end;
  end;
  if best_v = WORTHLESS
  then create leaf node x;
       assign majority class of S to x;
  else create test node x;
       assign test on attribute best_A to x;
       for all a in dom(best_A) do
         x.child[a] := grow_tree(S restricted to best_A = a);
       end;
  end;
  return x;
end;

Evaluation measures

- Evaluation measure used in the above example: rate of correctly classified example cases.
- Advantage: simple to compute, easy to understand.
- Disadvantage: works well only for two classes.
- If there are more than two classes, the rate of misclassified example cases neglects a significant amount of the available information.
- Only the majority class (that is, the class occurring most often in (a subset of) the example cases) is really considered.
- The distribution of the other classes has no influence. However, a good choice here can be important for deeper levels of the decision tree.

18/79 . University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. Here: ¯ Information gain ¯ ¾ measure (well-known in statistics). and its various normalisations.Evaluation measures ¯ Therefore: Study also other evaluation measures.

An Information-theoretic Evaluation Measure Information Gain (Kullback/Leibler 1951. Quinlan 1986) Ò Based on Shannon Entropy À ½ Ô ÐÓ ¾Ô µ Á Ò´ µ Þ À´ Ò µ ß Þ À´ Ô Ò ½ Ô ÐÓ ¾Ô Ò ß ÐÓ ½ ½ Ô ¾Ô À´ À´ À´ µ µ µ À´ µ Entropy of the class distribution ( : class attribute) Expected entropy of the class distribution if the value of the attribute becomes known Expected entropy reduction or information gain Data Mining – p.19/79 University of Applied Sciences Braunschweig/Wolfenbuettel .

20/79 .Example ID 1 2 3 4 5 6 7 8 9 10 Height m s t s t s s m m t Weight Long hair Sex n n m l y f h n m n y f n y f l n f h n m n n f l y f n n m University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

Example ID 1 2 3 4 5 6 7 8 9 10 Height ÀHeight ½¼ ¿ m s t s t s s m m t ÐÓ ¾ ¿¡ Weight Long hair Sex n n m l y f h n m n y f n y f l n f h n m n n f l y f n n m · ÐÓ ½ ¾ ½ ¡¡ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.21/79 .

22/79 .Example ID 1 2 3 4 5 6 7 8 9 10 Height ÀHeight m s t s t s s m m t ¿ ½¼ ¾ ¿ Weight Long hair Sex n n m l y f h n m n y f n y f l n f h n m n n f l y f n n m ÐÓ ¾ ¿ ¾¡ · ½ ¿ ÐÓ ¾ ¿ ½ ¡¡ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

Example ID 1 2 3 4 5 6 7 8 9 10 Height ÀHeight m s t s t s s m m t ¿ ½¼ ½ ¿ Weight Long hair Sex n n m l y f h n m n y f n y f l n f h n m n n f l y f n n m ÐÓ ¾ ¿ ½¡ · ¾ ¿ ÐÓ ¾ ¿ ¾ ¡¡ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.23/79 .

24/79 .Example ÀHeight ½¼ ¿ ½¼ ¿ ½¼ ¼ ¿ ÐÓ ¾ ¾ ¾ ¿ ¾ ¿ ½ ¿ · ½ ÐÓ ¾ ¾ ¾ ½ ½ ¿ ¾ ¿ ¾ ÐÓ ¿ ½ ÐÓ ¿ ½ · ÐÓ ¿ ¾ · ÐÓ ¿ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

25/79 .Example ID 1 2 3 4 5 6 7 8 9 10 Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m ÀWeight ¿ ½¼ ¡ ¼ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

Example ID 1 2 3 4 5 6 7 8 9 10 Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m ¿ ÀWeight ½¼ ÐÓ ¾ ¿¡ · ÐÓ ¾ ¾ ¾ ¡¡ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.26/79 .

Example ID 1 2 3 4 5 6 7 8 9 10 Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m ÀWeight ¾ ½¼ ¡ ¼ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.27/79 .

28/79 .Example ÀWeight ¿ ½¼ ¡ ¼ ¿ ½¼ ÐÓ ¾ ½¼ ¡ ¼ ¼ ¾ ¿ · ¾ ÐÓ ¾ ¾ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

29/79 .Example ID 1 2 3 4 5 6 7 8 9 10 Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m Àlong_hair ½¼ ¡ ¼ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

Example ID 1 2 3 4 5 6 7 8 9 10 Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m Àlong_hair ½¼ ¾ ÐÓ ¾ ¾¡ · ÐÓ ¾ ¡¡ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.30/79 .

31/79 .Example Àlong_hair ½¼ ¡ ¼ ½¼ ¼ ½¼ ¾ ÐÓ ¾ ¾ · ÐÓ ¾ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

32/79 . ¨¨ Ð ¨¨ ¨ Weight Ò ÀÀ ÀÀ À Ñ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Example The attribute Weight yields the largest reduction of entropy.

Example The remaining data table to be considered in the node ¡¡¡ : ID 1 4 5 8 10 Height Long hair Sex m n m s y f t y f m n f t n m University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.33/79 .

34/79 .Example ID 1 4 5 8 10 Height m s t m t Long hair Sex n m y f y f n f n m ÀHeight ½ ¡ ¼ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

Example ID 1 4 5 8 10 Height m s t m t ½ ¾ Long hair Sex n m y f y f n f n m · ½ ¾ ÀHeight ¾ ÐÓ ¾ ¾ ½¡ ÐÓ ¾ ¾ ½ ¡¡ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.35/79 .

Example ID 1 4 5 8 10 Height m s t m t ½ ¾ Long hair Sex n m y f y f n f n m · ½ ¾ ÀHeight ¾ ÐÓ ¾ ¾ ½¡ ÐÓ ¾ ¾ ½ ¡¡ ¼ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.36/79 .

Example ID 1 4 5 8 10 Height Long hair Sex m n m s y f t y f m n f t n m Àlong_hair ¾ ¡ ¼ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.37/79 .

Example ID 1 4 5 8 10 Height Long hair Sex m n m s y f t y f m n f t n m ¿ Àlong_hair ¼ ½¼ ½ ¿ ÐÓ ¾ ¿ ½¡ · ¾ ¿ ÐÓ ¾ ¿ ¾ ¡¡ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.38/79 .

Example The attribute long hair yields the largest reduction of entropy.39/79 . ¨¨ Ð ¨¨ ¨ Ý Weight Ò ÀÀ Long hair ÀÀ À Ò Ñ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

40/79 . the resulting decision tree is: University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. only the attribute Height is left with the remaining data table: ID 1 8 10 Height Sex m m m f t m Therefore.Example For the remaining node.

Example ¨¨ Ð¨ ¨ ¨ Ý Weight Ò ÀÀÀ Long hair À À Ñ Ò ¨¨ ×¨ ¨ ¨ Height ÀÀ Ñ ÀÀ Ø À Ñ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.41/79 .

42/79 University of Applied Sciences Braunschweig/Wolfenbuettel .A complex tree long Hair ¨¨ ×¨ ¨ ¨ Ò Ð Height ÀÀ Ñ Weight ÀÀ Ø À Weight Ý Weight Ò ? ? Ð Ò ? Ð Ò ? Ñ Û Long hair Ò Ñ Data Mining – p.

½ È ´× µ ¯ Shannon Entropy: À ´Ë µ ¯ Ò ½ È ´× µ ÐÓ ¾ È ´× µ Intuitively: Expected number of yes/no questions that have to be asked in order to determine the obtaining alternative.Interpretation of Shannon entropy ¯ Let Ë ×½ ×Ò be a ﬁnite set of alternatives ½ Ò. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. having positive probabilities È ´× µ.43/79 . ÈÒ satisfying ½.

which knows the obtaining alternative.44/79 University of Applied Sciences Braunschweig/Wolfenbuettel . number of Data Mining – p. but responds only if the question can be answered with “yes” or “no”. A better question scheme than asking for one alternative after the other can easily be found: Divide the set into two subsets of about equal size. Ask for containment in an arbitrarily chosen subset.Interpretation of Shannon entropy Æ Æ Æ Æ Suppose there is an oracle. Apply this scheme recursively questions bounded by ÐÓ ¾ Ò .

16 0.19 0.24 bit/symbol Code efﬁciency: 0.40 ×½ ×¾ ×¿ × × 1 2 3 4 ×½ ×¾ ×¿ × × ×½ ×¾ ×½ 2 0.664 Code length: 2.15 4 ×¾ ×¿ × 2 2 0.75 0.59 × × × 3 0.Question/Coding Schemes È ´×½µ ¼ ½¼ È ´×¾ µ ¼ ½ Shannon entropy: Linear Traversal È È ××¿ È ´ µ ¼½ È ´× µ ¼ ½ È ´× µ ´ µ ÐÓ ¾ È ´× µ ¾ ½ bit/symbol Equal Size Subsets ¼ ¼ ×½ ×¾ ×¿ × × ×¾ ×¿ × × ×¿ × × × × 0.40 3 Code length: 3.45/79 .59 bit/symbol Code efﬁciency: 0.19 0.16 0.830 University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.15 0.10 0.25 ×¿ × × 0.10 0.

t.Question/Coding Schemes ¯ ¯ ¯ Splitting into subsets of about equal size can lead to a bad arrangement of the alternatives into subsets high expected number of questions. Sort the alternatives w. Good question schemes take the probability of the alternatives into account.r. Split the set so that the subsets have about equal probability (splits must respect the probability order of the alternatives). Data Mining – p. their probabilities. Shannon-Fano Coding (1948) Æ Æ Æ Build the question/coding scheme top-down.46/79 University of Applied Sciences Braunschweig/Wolfenbuettel .

Always combine those two sets that have the smallest probabilities. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Question/Coding Schemes ¯ Huffman Coding (1952) Æ Æ Æ Build the question/coding scheme bottom-up. Start with one element sets.47/79 .

15 2 2 ×¾ ×¿ 3 3 0.10 0.19 0.Question/Coding Schemes È ´×½µ ¼ ½¼ È ´×¾ µ ¼ ½ Shannon entropy: Shannon–Fano Coding È ´×¿µ ÈÈ× ¼½ ´ µ ÐÓ È ´× µ ¾ È ´× µ ¼½ È ´× µ ¼ ¼ ¾ ½ bit/symbol (1948) Huffman Coding (1952) ×½ ×¾ ×¿ × × ×½ ×¾ ×¿ ×½ ×¾ ×½ 3 0.19 0.35 ×¿ × × × 3 1 ×¾ × ¿ × 3 2 0.977 Data Mining – p.60 × × × ×½ 3 ×½ ×¾ 0.10 0.25 0.15 0.955 University of Applied Sciences Braunschweig/Wolfenbuettel Code length: 2.40 0.20 bit/symbol Code efﬁciency: 0.48/79 .41 0.59 ×½ ×¾ ×¿ × × ×½ ×¾ ×¿ × 0.40 Code length: 2.16 0.16 0.25 0.25 bit/symbol Code efﬁciency: 0.

) Only if the obtaining alternative has to be determined in a sequence of (independent) situations.Question/Coding Schemes ¯ It can be shown that Huffman coding is optimal if we have to determine the obtaining alternative in a single instance. (No question/coding scheme has a smaller expected number of questions. but combine two. Data Mining – p.49/79 ¯ ¯ University of Applied Sciences Braunschweig/Wolfenbuettel . three or more consecutive instances and ask directly for the obtaining combination of alternatives. this scheme can be improved. Idea: Process the sequence not instance by instance.

Shannon showed that there is a lower bound. namely the Shannon entropy.Question/Coding Schemes ¯ Although this enlarges the question/coding scheme.50/79 . the expected number of questions per identiﬁcation cannot be made arbitrarily small. However. the expected number of questions per identiﬁcation is reduced (because each interrogation identiﬁes the obtaining alternative for several situations). ¯ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

Interpretation of Shannon Entropy È ´×½µ Shannon entropy: ½ ¾ È ´×¾µ ½ È ´×¿µ ½ È ´× µ È ´× µ ÐÓ ¾ È ´× µ È ½ ½ ½ È ´× µ bit/symbol ½ ½ If the probability distribution allows for a perfect Huffman code (code efﬁciency 1).875 bit/symbol Code efﬁciency: 1 Expected number yes/no questions. the Shannon entropy can easily be interpreted as follows: Perfect Question Scheme È ´× µ ÐÓ ¾ È ´× µ È ´× µ occurrence path length probability in tree ßÞ ¡ ÐÓ ¾ ßÞ ½ È ´× µ ×½ ×¾ ×¿ × × ×¾ ×¿ × × ×¿ × × × × ×½ ×¾ ×¿ × × ½ ¾ ½ ½ ½ ½ ½ ½ 1 2 3 4 4 Code length: 1.51/79 . of needed University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

(Quinlan 1986 / 1993) Ò´ Information Gain Ratio Á 1991) Ö´ µ Á À µ ÈÒ Á ½Ô Ò´ µ ÐÓ ¾Ô Symmetric Information Gain Ratio (López de Mántaras Á×´½µ ´ Ö µ Á À Ò´ µ or Á×´¾µ ´ Ö µ Á À Ò´ ·À µ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.52/79 .Other evaluation measures from information theory Normalized Information Gain ¯ ¯ Information gain is biased towards many-valued attributes. Normalization removes / reduces this bias.

i.53/79 . of two attributes having about the same information content it tends to select the one having more values. ¯ The reasons are quantization effects caused by the ﬁnite number of example cases (due to which only a ﬁnite number of different probabilities can result in estimations) in connection with the following theorem: University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Bias of information gain ¯ Information gain is biased towards many-valued attributes..e.

i. and be three attributes with ﬁnite domains and let their joint probability distribution be strictly positive.Bias of information gain Let .e.54/79 . . if È´ µ È´ µ. Then Theorem: Á Ò´ µ Á Ò´ µ with equality holding only if the attributes conditionally independent given ... and are University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.e. i. ¾ ÓÑ´ µ ¾ ÓÑ´ µ ¾ ÓÑ´ µ È ´ µ ¼.

Á Ò´ µ Ò ½ Ò ½ Ô ÐÓ Ô ¾ Ô Ô Data Mining – p. ¾ ´ µ Ò ½ Ò ½ Æ ´Ô Ô Ô Ô Ô µ¾ ¯ Side remark: Information gain can also be interpreted as a difference measure.¾ ¯ ¯ ¯ -measure Compares the actual joint distribution with a hypothetical independent distribution. Can be interpreted as a difference measure.55/79 University of Applied Sciences Braunschweig/Wolfenbuettel . Uses absolute comparison.

. . ÜÖ marginal of ÔÖ Ô¯ The random variable can take the values Ü½ the random variable the values Ý½ ÝÕ . ÝÕ marginal of Ô½Õ Ô½¯ . . . . . . . . . . . . . . . . . .Contingency tables Ü½ . Ò Ý½ Ô½½ . . . . . Ý Ô½ . . . . . . . . . . .56/79 . . . Ô is the (absolute) frequency of occurrences of the observation ´Ü Ý µ. . . . University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. Ü Ô½ ÔÖ½ Ô¯½ Ô ÔÕ ÔÖÕ Ô¯Õ Ô¯ ÔÖ ¯ Ò ÜÖ .

then the expected absolute frequencies are for all ¾ ½ Ö and all ¾ Ô ¯ Ô¯ Ò ½ Õ .Contingency tables Ô¯ Õ ½ Ô and Ô¯ Ö ½ Ô are the marginal (absolute) frequencies.57/79 . If and are independent. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

pol.58/79 .¾ independence test 1000 people were asked which political party they voted for in order to ﬁnd out whether the choice of the party and the sex of the voter are independent. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. partyÒ sex female male sum SPD 200 170 370 CDU/CSU 200 200 400 Grüne 45 35 80 FDP 25 35 70 PDS 20 30 50 Others 22 5 27 No answer 8 5 13 sum 520 480 1000 Example.

8 19. partyÒ sex SPD CDU/CSU Grüne FDP PDS O/NA female male 192.59/79 .0 24.4 26.4 31.4 177.¾ independence test Expected frequencies: pol.0 192.0 20.6 38.6 208.2 University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.0 41.2 28.

female ¼¼ ½¼¼¼ female male 200 170 200 200 45 35 25 35 20 30 30 10 520 480 ¾¼ ¡ ½¼¼¼ ¡ ½¼¼¼ sum 370 400 80 70 50 40 1000 ¾¼ ¼ Data Mining – p. partyÒ sex SPD CDU/CSU Grüne FDP PDS O/NA sum For instance: CDU/CSU.60/79 University of Applied Sciences Braunschweig/Wolfenbuettel .¾ independence test pol.

¯ Preprocessing II / Multisplits during tree construction University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. Flatten the tree to obtain a multi-interval discretization.Treatment of numerical attributes General Approach: Discretization ¯ Preprocessing I Æ Æ Æ Form equally sized or equally populated intervals. Build a decision tree using only the numeric attribute.61/79 .

Compute the evaluation measure for these binary attributes.62/79 . Possible improvements: Add a penalty depending on the number of splits. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Treatment of numerical attributes ¯ During the tree construction Æ Æ Æ Æ Sort the example cases according to the attribute’s values. Construct a binary symbolic attribute for every possible split (values: “ threshold” and “ threshold”).

University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. ½ Ø) of the Ò data objects fall Assume that Ò ( into the interval between Ì ½ and Ì for the considered attribute .63/79 . whose domain Ø ½ cut points Ì½ ÌØ ½ are needed. domian of the attribute Ì¼ and ÌØ denote the left and right boundary of the .Treatment of numerical attributes Consider a numerical attribute should be split into Ø intervals. These cutpoints are chosen in such a way that the entropy induced by this partition is minimised.

Treatment of numerical attributes denotes the number of data objects among the Ò objects belonging to class . Then the entropy in the interval between Ì ½ and Ì is Ò ½ Ø ¡ ÐÓ ¾ Ò The entropy induced by the partition into he corresponding intervals is Ò Ò ½ ¡ which should be minimised by a suitable choice of the points Ì . University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.64/79 .

University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. it is sufﬁcient to consider boundary points only.Treatment of numerical attributes A value Ì in the domain of the attribute is boundary point if the following holds for the sequence of sorted values for attribute: There are two objects Ü and Ý belonging to different Ì Ý and there is no object Þ classes satisfying Ü with Ü Þ Ý. For the minimisation of the entropy.65/79 .

Treatment of numerical attributes The boundary points are marked by lines. ¬ Value: ½ ¾ ¬ ¿ ¿ ¬ ¬ ¬ Class: ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ½¼ ¬ ½½ ½½ ½¾ ¬ ¬ ¬ ¬ For binary splits (only one cut point) all boundary points are considered and the one with the smallest entropy is chosen. For multiple splits a recursive procedure is applied. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.66/79 .

Breiman et al.Treatment of missing values Induction ¯ ¯ ¯ Weight the evaluation measure with the fraction of cases with known values.67/79 University of Applied Sciences Braunschweig/Wolfenbuettel . Æ Idea: The attribute provides information only if it is known.5. weighted in each branch with the relative frequency of the corresponding attribute value (C4. Quinlan 1993). Try to ﬁnd a surrogate test attribute with similar properties (CART. 1984) Assign the case to all branches. Data Mining – p.

aggregate the class distributions of all leaves reached.68/79 . and assign the majority class of the aggregated class distribution. weighted with their relative number of cases.Treatment of missing values Classiﬁcation ¯ ¯ Use the surrogate test attribute found during induction. Follow all branches of the test attribute. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

Reduced error pruning Pessimistic pruning Conﬁdence level pruning Minimum description length pruning Data Mining – p. to avoid overﬁtting (improve generalization). Replace “bad” branches (subtrees) by leaves.69/79 Basic ideas: ¯ ¯ Common approaches: ¯ ¯ ¯ ¯ University of Applied Sciences Braunschweig/Wolfenbuettel . Replace a subtree by its largest branch if it is better.Pruning decision trees Pruning serves the purpose ¯ ¯ to simplify the tree (improve interpretability).

Reduced error pruning ¯ ¯ ¯ ¯ ¯ ¯ Classify a set of new example cases with the decision tree. If a subtree has been replaced. If such a leaf leads to the same or fewer errors than the subtree. Determine the number of errors for leaves that replace subtrees. Data Mining – p. recompute the number of errors of the subtrees it is part of.70/79 University of Applied Sciences Braunschweig/Wolfenbuettel . replace the subtree by the leaf. (These cases must not have been used for the induction!) Determine the number of errors for all leaves. The number of errors of a subtree is the sum of the errors of all of its leaves.

Additional example cases needed.71/79 .Reduced error pruning Advantage: Very good pruning. Disadvantage: University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. effective avoidance of overﬁtting. Number of cases in a leaf has no inﬂuence.

user-speciﬁed amount Ö. Determine the number of errors for leaves that replace subtrees (also increased by Ö). If such a leaf leads to the same or fewer errors than the subtree. (These cases may or may not have been used for the induction.) Determine the number of errors for all leaves and increase this number by a ﬁxed. replace the subtree by the leaf and recompute subtree errors.72/79 University of Applied Sciences Braunschweig/Wolfenbuettel .Pessimistic pruning ¯ ¯ ¯ ¯ ¯ Classify a set of example cases with the decision tree. The number of errors of a subtree is the sum of the errors of all of its leaves. Data Mining – p.

University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Pessimistic pruning Advantage: Disadvantage: No additional example cases needed.73/79 . Number of cases in a leaf has no inﬂuence.

**Conﬁdence level pruning
**

¯

Like pessimistic pruning, but the number of errors is computed as follows: Æ See classiﬁcation in a leaf as a Bernoulli experiment (error/no error). Æ Estimate an interval for the error probability based on a user-speciﬁed conﬁdence level «. (use approximation of the binomial distribution by a normal distribution) Æ Increase error number to the upper level of the conﬁdence interval times the number of cases assigned to the leaf. Æ Formal problem: Classiﬁcation is not a random experiment.

Data Mining – p.74/79

University of Applied Sciences Braunschweig/Wolfenbuettel

Conﬁdence level pruning

Advantage:

No additional example cases needed, good pruning. Statistically dubious foundation.

Disadvantage:

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining – p.75/79

Decision tree pruning: An example

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining – p.76/79

Decision tree pruning: An example

A decision tree for the Iris data

(induced with information gain ratio, unpruned)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining – p.77/79

**Decision tree pruning: An example
**

(pruned with conﬁdence level pruning, « pessimistic pruning, Ö ¾)

A decision tree for the Iris data

¼ , and

¯ ¯ ¯

Left: 7 instead of 11 nodes, 4 instead of 2 misclassiﬁcations. Right: 5 instead of 11 nodes, 6 instead of 2 misclassiﬁcations. The right tree is “minimal” for the three classes.

Data Mining – p.78/79

University of Applied Sciences Braunschweig/Wolfenbuettel

Predictive vs. descriptive tasks

Predictive tasks:

The decision tree (or more generally, the classiﬁer) is constructed in order to apply it to new unclassiﬁed data.

Decriptive tasks:

The purpose of the tree construction is to understand, how classiﬁcation has been carried out so far.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining – p.79/79

Bayes’ theroem

È ´À

Proof:

µ

È ´ À µ ¡ È ´À µ È´ µ

È ´ À µ ¡ È ´À µ È´ µ

È ´ Àµ È ´À µ

¡

È ´À µ

µ

È´

È ´À

µ

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining – p.1/48

Bayes’ theroem

Interpretation: The probability that a hypothesis À is true given event has occured, can be derived from the probability

¯ ¯ ¯

of the hypothesis itself, of the event and the conditional probability of the event given the hypothesis.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining – p.2/48

Bayes classiﬁers

Principle of Bayes classiﬁers: The value of the nominal attribute À should be predicted based on the values of the attributes ´ ½ ½ Ñ , i.e. the attribute vector Ñ µ. If is one of the possible values of attribute À and the other attribute have taken the values ½ ½ Ñ Ñ , then Bayes’ theorem yields the probaility for À given ½ ½ Ñ Ñ:

È ´À È´

´

´

½

½

Ñ µµ

µ¡

È´

Ñµ À

´

½

È ´À

µ

Ñ µµ

Data Mining – p.3/48

University of Applied Sciences Braunschweig/Wolfenbuettel

Bayes classiﬁers

Compute this probability for all possible values (classes) of the nominal attribute À and choose the class with the highest probability. (A cost matrix can also be incorporated.) Since the denominator is independent of , it does not have any inﬂuence on the decision for the class. Therefore, usually only the likelihoods

È´

are considered.

´

½

Ñµ À

µ¡

È ´À

µ

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining – p.4/48

Bayes classiﬁers

The probability È ´À µ can be estimated easily based on a given data:

È ´À

µ

**no. of data from class no. of data
**

½

´ In principle, the probability È ´ could be determined analogously:

Ñµ À

µ

½

µ

È´

´

½

Ñµ À

**no. of data from class with values ´ no. of data from class
**

University of Applied Sciences Braunschweig/Wolfenbuettel

Ñµ

Data Mining – p.5/48

we would need ¿½¼ ¼ data objects to have at least one example per combination.e. the computation is carried out under the (naïve. È´ È´ ½ ½ ´ ½ Ñµ À µ¡ ¡ µ À È´ Ñ Ñ À µ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. i.Bayes classiﬁers For Ò ½¼ nominal attributes ½ ½¼ . Therefore. unrealistic) asumption that the attributes ½ Ñ are independent given the class.6/48 . each having three possible values.

Bayes classiﬁers È´ À µ can be computed easily: È´ À µ no.7/48 . of data from class University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. of data from class with no.

Based on the values ½ Ñ of the attributes ½ Ñ a prediciton for the value of the attribute À should be derived.8/48 .Naïve Bayes classiﬁer given: A data set with only nominal attributes. For each class (each value in the domain of À ) compute the likelihood Ä´À È´ ½ ½ ½ ½ ¡ Ñ Ñ ½ Ñµ À µ¡ È´ Ñ À µ¡ È ´À µ under the assumption that the independent given the class. University of Applied Sciences Braunschweig/Wolfenbuettel Ñ are Data Mining – p.

the classiﬁer often yields good results. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Naïve Bayes classiﬁer Assign ´ ½ likelihood. Ñ µ to the class with the highest ı This Bayes classiﬁers is called wird als na¨ve because of the (conditional) independence assumption for the attributes ½ Ñ. Although this assumption is unrealistic in most cases. when not too many attributes are correlated.9/48 .

Example How does a naïve Bayes classiﬁer classify the object ´Ø Ð Ý µ? We need to calculate Ä´Sex Ñ Height Ø Weight Ð long_hair Ýµ È ´Height Ø Sex Ñµ¡ È ´Weight Ð Sex Ñµ¡ È ´long_hair Ý Sex Ñµ¡ È ´Sex Ñµ and University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.10/48 .

11/48 .Example Ä´Sex Height Weight Ø Ð Long_hair Ýµ µ¡ È ´Height Ø Sex È ´Weight Ð Sex µ¡ È ´Long_hair Ý Sex µ È ´Sex µ¡ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

Example È ´Height ID 1 2 3 4 5 6 7 8 9 10 Ø Sex Ñµ Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m Data Mining – p.12/48 University of Applied Sciences Braunschweig/Wolfenbuettel .

Example È ´Height ID 1 2 3 4 5 6 7 8 9 10 Ø Sex Ñµ Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m Data Mining – p.13/48 University of Applied Sciences Braunschweig/Wolfenbuettel .

14/48 University of Applied Sciences Braunschweig/Wolfenbuettel .Example È ´Height ID 1 2 3 4 5 6 7 8 9 10 Ø Sex Ñµ ¾ ½ ¾ Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m Data Mining – p.

Example È ´Weight ID 1 2 3 4 5 6 7 8 9 10 Ð Sex Ñµ ¼ ¼ Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m Data Mining – p.15/48 University of Applied Sciences Braunschweig/Wolfenbuettel .

Example È ´Long_hair ID 1 2 3 4 5 6 7 8 9 10 Ý Sex Ñµ ¼ ¼ Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m Data Mining – p.16/48 University of Applied Sciences Braunschweig/Wolfenbuettel .

Example È ´Sex Ñµ ID 1 2 3 4 5 6 7 8 9 10 ½¼ ¾ Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m Data Mining – p.17/48 University of Applied Sciences Braunschweig/Wolfenbuettel .

Example Ä´Sex Ñ Height ½ ¾ Ø Weight Ð Long_hair ¡ Ýµ ¼¡¼¡ ¾ ¼ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.18/48 .

Example È ´Height ID 1 2 3 4 5 6 7 8 9 10 Ø Sex µ Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m Data Mining – p.19/48 University of Applied Sciences Braunschweig/Wolfenbuettel .

Example È ´Height ID 1 2 3 4 5 6 7 8 9 10 Ø Sex µ Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m Data Mining – p.20/48 University of Applied Sciences Braunschweig/Wolfenbuettel .

21/48 University of Applied Sciences Braunschweig/Wolfenbuettel .Example È ´Height ID 1 2 3 4 5 6 7 8 9 10 Ø Sex µ ½ Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m Data Mining – p.

Example È ´Weight ID 1 2 3 4 5 6 7 8 9 10 Ð Sex µ ¿ ½ ¾ Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f g n n m Data Mining – p.22/48 University of Applied Sciences Braunschweig/Wolfenbuettel .

23/48 University of Applied Sciences Braunschweig/Wolfenbuettel .Example È ´Long_hair ID 1 2 3 4 5 6 7 8 9 10 Ý Sex µ ¾ ¿ Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m Data Mining – p.

24/48 University of Applied Sciences Braunschweig/Wolfenbuettel .Example È ´Sex µ ½¼ ¿ ID 1 2 3 4 5 6 7 8 9 10 Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m Data Mining – p.

Example Ä´Sex Height Weight ½ ¡ Ø Ð Long_hair ¡ Ýµ ½ ¾ ¡ ¾ ¿ ¿ ½ ¿¼ ¼ Ä´Sex Height Weight Ø Ð Long_hair Ýµ Classiﬁcation of ´Ø Ð Ý µ: female (f) University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.25/48 .

The data set does not contain any object with this combination of values.26/48 .Example The object ´Ø Ð Ý µ was classiﬁed by the urde durch den naïven Bayes-Klassiﬁkator klassiﬁziert. A full Bayes classiﬁer would not be able to classify this object. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

The main impact comes from the attribut Long hair = Ò. having probability 1 in class Ñ. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. but a low probability in class .More examples Input Ø Ýµ ´Ñ Ò Òµ ´Ø Òµ ´ Ä´Ñ ¾ ½ ¾ ¡ ¡ ¡ µ ¡ Ä´ ¼ ½ ¾¼ ½ ½¼ ½ ¾ ½ ¡ ¡ ¡ µ ¡ ¾ ¾ ¾ ¡ ¡ ¡ ¼ ½¼ ½¼ ½¼ ¼ ¿ ¼ ¡ ¡ ¡ ¡ ¡ ¾ ¾ ½¼ ½¼ ½¼ ¼ ½ ¿¼ ¡ ¡ ¼ Class ? m m The object ´Ñ Ò Òµ is classiﬁed by the naïve Bayes classiﬁer as Ñ although the data sets contains two such objects. one from class Ñ and one from class .27/48 .

28/48 . ½ Common choices: ½ or ¾ . ¼: Maximum likelihood estimation. then the overall likelihood is zero automtaically. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Laplace correction If a single likelihood is zero. Therefore: Laplace correction: È´ µ ´ ´ µ· Ò µ· is called Laplace correction. even the when the other likelihoods are high.

Laplace correction Laplace correction for È ´Height Sex Ñµ with ½ Height s m t # Laplace È ÈLaplace 1 2 1/4 2/7 1 2 1/4 2/7 2 3 2/4 3/7 University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.29/48 .

When the naïve Bayes classiﬁer is applied to new data.30/48 . The probablity dsitribution for the single attributes should be stored in a table.Naïve Bayes classiﬁer: Implementation The counting of the frequencies should be carried out once when the naïve Bayes classiﬁer is constructed. only corresponding values in the table need to be multiplied. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

During classiﬁcation: Only the probabilities (likelihoods) of those attributes are multiplied for which a value is available.Treatment of missing values During learning: The missing values are simply not counted for the frequencies of the corresponding attribute.31/48 . University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

Numerical attributes Estimation of probabilities: ¯ Numerical attributes: ´ Assume a normal distribution.32/48 . Ô Ü µ ½ ¾ ´ µ ÜÔ ´ Ü ´ µµ¾ ¾ ¾ ´ µ ¯ Estimation of the mean value ´ µ ½ ´ µ ´ µ Ü ´ µ ½ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

Numerical attributes ¯ Estimation of the variance ¾ ´ µ ½ ´ µ ´ ½ Ü ´ µ ´ µµ ¾ ´ ´ : Maximum likelihood estimation µ ½: Unbiased estimation µ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.33/48 .

Example ¯ ¯ ¯ ¯ 100 data points.34/48 . 2 classes Small squares: mean values Inner ellipses: one standard deviation Outer ellipses: two standard deviations Classes overlap: classiﬁcation is not perfect Na¨ve Bayes classiﬁer ı ¯ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

Example ¯ ¯ ¯ ¯ ¯ 20 data points.35/48 . 2 classes Small squares: mean values Inner ellipses: one standard deviation Outer ellipses: two standard deviations Attributes are not conditionally independent given the class Na¨ve Bayes classiﬁer ı University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

36/48 .) Comparison to the product rule È´ È´ µ È´ µ µ¡ È´ µ shows that this is equivalent to È´ µ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Conditional independence ¯ Reminder: stochastic independence (unconditional) È´ µ È´ µ¡ È´ µ ¯ (Joint probability is the product of the individual probabilities.

e.37/48 .Conditional independence ¯ The same formulae hold conditionally. i. È´ È´ µ µ È´ È´ µ¡ µ È´ µ and University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

38/48 University of Applied Sciences Braunschweig/Wolfenbuettel .Conditional independence: Example Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕ Õ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕ Õ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕ Õ Õ ÕÕ Õ Õ Õ Õ ÕÕ Õ ÕÕ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕ Õ Õ Õ Õ Õ Õ Õ ÕÕ ÕÕ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕ ÕÕ ÕÕ Õ Õ Õ Õ Õ Õ Õ ÕÕ Õ Õ ÕÕ Õ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕÕ Õ Õ Õ ÕÕ Õ Õ Õ Õ Õ Õ Õ ÕÕ Õ Õ Õ Õ Õ ÕÕ Õ Õ ÕÕ Õ Õ Õ Õ Õ ÕÕ Õ Õ Õ ÕÕ Õ Õ ÕÕÕÕ Õ Õ ÕÕ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕ Õ ÕÕ Õ Õ ÕÕ Õ Õ Õ ÕÕ Õ Õ Õ Õ ÕÕ Õ ÕÕ Õ Õ Õ Õ ÕÕ ÕÕÕ Õ Õ Õ Õ ÕÕ Õ Õ Õ Õ ÕÕ Õ Õ Õ ÕÕ ÕÕ Õ ÕÕ Õ Õ ÕÕ Õ Õ Õ Õ Õ ÕÕ ÕÕ Õ Õ Õ Õ Õ Õ Õ ÕÕ Õ Õ Õ ÕÕ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕ Õ Õ Õ Õ Õ ÕÕ Õ ÕÕÕ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕ Õ Õ Õ ÕÕ Õ Õ Õ Õ Õ Õ ÕÕ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕ Õ ÕÕ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕÕÕ ÕÕÕ Õ Õ Õ Õ ÕÕ Õ ÕÕ ÕÕ Õ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕ Õ Õ ÕÕ ÕÕ Õ Õ Õ Õ Õ Õ ÕÕ Õ Õ Õ Õ Õ Õ Õ Õ Õ Group 1 Group 2 ¹ Data Mining – p.

Conditional independence: Example Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕ Õ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕ Õ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕ Õ Õ ÕÕ Õ Õ Õ Õ ÕÕ Õ ÕÕ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕ Õ Õ Õ Õ Õ Õ Õ ÕÕ ÕÕ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕ ÕÕ ÕÕ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕ Õ Õ ÕÕÕ Õ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕ Õ ÕÕ Õ Õ Õ Õ ÕÕ Õ Õ ÕÕ Õ ÕÕÕÕ Õ Õ ÕÕ Õ ÕÕ Õ Õ Õ Õ Õ Õ Õ ÕÕ Õ Õ Õ ÕÕ ÕÕ Õ ÕÕ ÕÕ Õ ÕÕ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Group 1 ¹ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.39/48 .

Conditional independence: Example Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕ ÕÕÕ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕ ÕÕ ÕÕ Õ Õ ÕÕ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕ Õ ÕÕ Õ ÕÕ Õ Õ ÕÕ Õ Õ Õ Õ Õ Õ ÕÕ Õ Õ ÕÕÕÕÕÕ Õ Õ Õ ÕÕ Õ ÕÕ Õ ÕÕ Õ Õ Õ Õ Õ ÕÕ Õ ÕÕ Õ Õ Õ Õ Õ Õ ÕÕ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕ Õ ÕÕ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕÕÕ ÕÕÕ Õ Õ Õ Õ ÕÕ Õ ÕÕ ÕÕ Õ Õ Õ Õ Õ Õ Õ Õ Õ ÕÕ Õ Õ ÕÕ ÕÕ Õ Õ Õ Õ Õ ÕÕ Õ Õ Õ Õ Õ Õ Õ Õ Group 2 ¹ Data Mining – p.40/48 University of Applied Sciences Braunschweig/Wolfenbuettel .

3 classes Iris setosa (red) Iris versicolor (green) Iris virginica (blue) ¯ Shown: 2 out of 4 attributes sepal length sepal width petal length (horizontal) petal width (vertical) 6 misclassiﬁcations on the training data (with all 4 attributes) Na¨ve Bayes classiﬁer ı ¯ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Example: Iris data ¯ 150 data points.41/48 .

Full Bayes classiﬁers ¯ Restricted to metric/numeric attributes (only the class is nominal/symbolic). ´ ½ ½ ¯ Simplifying Assumption: Ñ ½ µÑ Ñ ÜÔ µ ½ ¾ ´ Ô ´¾ ¦ µ ¦ ´ ½ µ : mean value vector for class ¦ : covariance matrix for class University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.42/48 . Each class can be described by a multivariate normal distribution.

43/48 .Full Bayes classiﬁers ¯ ¯ Intuitively: Each class has a bell-shaped probability density. Naive Bayes classiﬁers: Covariance matrices are diagonal matrices. (Details about this relation are given below.) University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

Full Bayes classiﬁers Estimation of Probabilities: ¯ Estimation of the mean value vector ½ ´ µ ´ µ ´ µ ½ ¯ Estimation of the covariance matrix ¦ ´ ´ µ ½ ´ µ ´ ´ µ ½ µ´ ´ µ µ : Maximum likelihood estimation µ ½: Unbiased estimation Data Mining – p.44/48 University of Applied Sciences Braunschweig/Wolfenbuettel .

45/48 .Naïve vs. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. full Bayes Classiﬁers ´ ½ ½ Ñ ½ Ñ ¡ µ ½ ¾ ´ µ ½ ¾ Ô Õ ´¾ µÑ ¦ ½ ÜÔ ¦ ½ ´ µ ´¾ µ Ñ ÉÑ ½ ½ ¡ ¾ ¡ ÜÔ ´ µ ¾ ½ µ¾ ¾ Ñ ´ µ ÉÑ ½ Õ Ñ ½ ¾ ¾ ÜÔ ½ ¾ Ñ ½ ´ ¾ µ¾ Õ ½ ¾ ¾ ¡ ÜÔ ´ Ñ ´ µ ¾ ¾ ½ where ´ µ are the density functions used by a naïve Bayes classiﬁer.

University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. full Bayes Classiﬁers Naïve Bayes classiﬁers for numerical data are equivalent to full Bayes classiﬁers with diagonal covariance matrices.Naïve vs.46/48 .

full Bayes Classiﬁers Na¨ve Bayes classiﬁer ı Full Bayes classiﬁer University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Naïve vs.47/48 .

3 classes Iris setosa (red) Iris versicolor (green) Iris virginica (blue) ¯ Shown: 2 out of 4 attributes sepal length sepal width petal length (horizontal) petal width (vertical) 2 misclassiﬁcations on the training data (with all 4 attributes) Full Bayes classiﬁer ¯ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.48/48 .Full Bayes classiﬁer: Iris data ¯ 150 data points.

-Nearest neighbour classiﬁers Use the given data set as set of example cases ´ µ ´ µ ´ ½ µ( ½ Ò).1/26 . Ô Deﬁne a suitable distance measure To classify a new objekct ´ distances ´´ ½ ½ Ôµ on ½ ¢ ¢ Ô . University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. . compute the ´ µ Ô Ôµ ´ ´ µ ½ µµ to all example cases.

among Assign ´ ½ Ô µ to the most frequent class the closest example cases.-Nearest neighbour classiﬁers Find the ´ ´ ½µ ½ closest example cases ´ ½µ Ô ½µ ´ ´ ½µ ½ ´ ½µ Ô µ.2/26 . University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

For ordinal attributes the distance should increase depending on the distances of the ranks in the corresponding lienar order.Distance measures ¯ For a nominal attribute two objects can only have the same value (distance=0) or different values (usually distance=1). ¯ ¯ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.3/26 . For numerical attributes the absolute or the squared difference between values is very common.

Distance measures Unless the importance or weight of each attribute for the distance is known. the distances for the single attributes should yield similar values. For numerical attributes. the distance depends on the measurement unit. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.4/26 .

Measuring the weight in g and the height in m leads to almost neglectable differences for the height and to very large differences for the weight. Measuring the weight in kg and the height in cm leads to distances (differences) approximately in the same range.Distance measures Example: Weight and height of persons.5/26 . University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

6/26 . all attributes contribute roughly in the same way to the overall distance. Normalisation techniques for numerical attributes ¯ ¯ ¯ Ü Ü Ü Ñ Ò (extremely sensitive to outliers) Ñ Ü Ñ Ò Ü (sensitive to outliers) Ü × Ü interquartile_range median (robust against ouliers) University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Distance measures Unless attributes are known to have different importance.

7/26 .Distance measures distance normalisation for arbitrary attributes given distance measure norm ´Ü : and data Ü½ Ò´Ò ½µ Ò ÜÒ ´Ü Ýµ È È Ò ¾ ´Ü ½ ·½ Ü µ Ýµ In this way: Average distance is 1 for all attributes. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

8/26 . Adapt the distance and/or select a suitable subset of cases from the sample database. Adaptive nearest neighbour classiﬁers: University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.-Nearest neighbour classiﬁers Treatment of missing values: Training: Ignore the corresponding attributes for the computation of the distances. very fast slow for large sample databases Classiﬁcation: Interpretability: Justiﬁcation of the classiﬁcation based on similar known cases.

) How to ﬁnd the best separating hyperplane? University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Support vector machines (SVM) Consider a two-class linearly separable classiﬁcation problem. (The two classes can be separated by a (hyper-)plane.9/26 .

10/26 .SVM training data set: ´Ü½ ½ µ ´ÜÒ Ò µµ where ¾ ½ indicating to which of the two classes Ü belongs. separating hyperplane given in the form ½ Û Ü ¯ ¯ ¼ Û: (non-normalised) normal vector of the hyperplane the normal vector Û : offset of the hyperplane from the origin along University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

For this.SVM Choose Û and such that the separating hyperplane yields a maximum margin.11/26 . ¾ Distance between the margin hyperplanes: Û University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. i. such that the distance to the closest points from the each of the two classes is as large as possible.e. introduce the constraints Û Ü Û Ü Û Ü ½ ½ if if ½ ½ ½ and Û Ü ½ are then the margin hyperplanes parallel to the separating hyperplanes through the closest points of the two classes.

12/26 .SVM University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

Û involves a square root. Can be solved by standard quadratic programming techniques. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. Therefore.SVM Optimization problem: Choose Û and to minimise Û under the constraints Û Ü for all ½ ¾ ½ Ò .13/26 . instead of Û minimise ½ ¾ Û ¾ .

Data Mining – p.14/26 University of Applied Sciences Braunschweig/Wolfenbuettel .SVM Dual form of the quadratic programming problem: Maximise Ò « ½ Ò ½ ¾ Ò «« ½ Ü Ü under the constraints « ½ ¼ and for all ¾ « ½ ¼ Ò .

15/26 . The separating hyperplane depends only on the support vectors Ü on the margin hyperplanes.SVM Û « ¼ Ò « ½ Ü only for those Ü that lie on one of the two margin hyperplanes. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

16/26 . ½ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.SVM Relaxation of the restriction to linearly separable classiﬁcation problems: Allow for misclassiﬁed objects between the two margin hyperplanes. Introduce slack variables measuring the degree of misclassiﬁcation of object Ü : Û Ü Penalise nonzero .

linear penalty function: Minimise ½ ¾ Û Ò ¾ · ½ subject to Û Ü for all ¼ ½ ¾ ½ Ò .17/26 . University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.SVM For example. is a constant specifying how strong misclassiﬁcations are penalised.

SVM: Kernel trick Rpelace the dot product Ü Ü¼ by a suitable kernel function Ã ´Ü Ü¼ µ.18/26 . University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. This corresponds to a (possibly nonlinear) transformation of the data to another space (of possibly inﬁnite dimension) where the separation of the two classes might be carried out easier.

19/26 . ¡ ¾ ¾ ½ . ÜÔ also called Gaussian kernel for Ø Ò ¼ ¡ ¯ sigmoid kernel: Ã ´Ü Ü¼ µ (for suitable choices of Ü Ü¼ · ¼).SVM: Kernel trick Common kernel functions: ¯ ¯ (homogeneous) polynomial: Ã ´Ü Ü¼ µ (inhomogeneous) polynomial: Ã ´Ü Ü¼ µ Ü Ü¼¡ Ü Ü¼ ¾ Ü Ü¼ · ½ ¡ ¯ radial basis function: Ã ´Ü Ü¼ µ ( ¼). and University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

Assign an object to the class with the highest output (largest distance to the (virtual) separating hyperplane). ¯ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.20/26 . Assign an object to the class which has won the most “competitions” against other classes. Construct an SVM for each pair of classes.SVM: Multiclass problems SVM for multiclass problems with more than two classes: ¯ Construct an SVM for each class against all other classes.

Strongly dependent on the choice of a suitable kernel.21/26 . University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. Training has very high computational costs. especially for large data sets and for multiclass problems.SVM ¯ ¯ ¯ Often a very good classiﬁer.

Desired: A simple description of the function Ô´Üµ. Approach: Describe Ô by a logistic function: Ô´Üµ ½ ½· ¼· Ü ½ · ÜÔ ½ Ñ ¼ · ½ Ü University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Logistic regression ¯ ¯ ¯ ¯ : class attribute. ¾ Given: A set of data points Ü½ ÜÒ each of which belongs to one of the two classes ½ and ¾ .22/26 . ½ È´ Üµ ½ Ô´Üµ. ÓÑ´ µ ½ ¾ Ñ-dimensional random vector È´ Üµ Ô´Üµ.

23/26 .Classiﬁcation: Logistic regression Apply logit transformation to Ô´Üµ: ÐÒ ½ Ô´Üµ Ô´Üµ ¼ · Ü Ñ ¼ · ½ Ü The values Ô´Ü µ may be obtained by kernel estimation. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

which describes how strongly a data point inﬂuences the probability estimate for neighboring points.24/26 .Kernel estimation ¯ Idea: Deﬁne an “inﬂuence function” (kernel). ½ ´¾ ¯ Gaussian kernel Ã ´Ü Ýµ ¯ ¾µ ¾ Ñ ÜÔ Ýµ ¾ ´Ü ´Ü ¾ Ýµ Kernel estimate of probability density given a data set Ü½ ÜÒ : ´Üµ ½ Ò Ò Ã ´Ü Ü µ ½ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

) ½ and ´Ü µ ¼ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Kernel estimation ¯ Kernel estimation applied to a two class problem: Ô´Üµ È È Ò ½ Ò ´Ü µÃ ´Ü ½ Ã ´Ü Ü µ Üµ ( ´Ü µ ½ if Ü belongs to class otherwise.25/26 .

Classiﬁcation: Logistic regression University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.26/26 .

is a nominal attribute.1/95 .Regression Supervised learning: dom´ ½ µ ¢ Find a function minimizing the error ´ ´Ü½ Ü Ýµ. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. is a numerical attribute. objects ´Ü½ Classiﬁcation: Regression: Given a data set with attributes ¢ dom Ü Ý for all given data ´ µ µ µ ½ .

Regression line given: A data set for two continuous attributes Ü and Ý . It is assumed that there is an approximate linear dependency between Ü and Ý : Ý Ü· Find a regression line (i. What is a good ﬁt? University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.e. determine the parameters and ) such the line ﬁts the data as good as possible.2/95 .

3/95 .Regression Ý-distance Euclidean distance Usually. It is equivalent to minimize the sum of squared errors in Ý -direction. the mean square error in Ý -direction is chosen as error measure (to be minimized). University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

Regression Other reasonable error measures: ¯ ¯ ¯ mean absolute distance in Ý -direction mean Euclidean distance maximum absolute distance in Ý -direction (or equivalently: the maximum squared distance in Ý-direction) maximum Euclidean distance ¯ ¯ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.4/95 .

5/95 .Regression line Given data ´Ü function is Ý ´ µ ( ½ Ò). the least squares error Ò ´´ µ ½ Ü · µ Ý µ ¾ (If at least two different Ü-values exist. ¼ and ¼ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.) and are uniquely determined by the the necessary conditions for a minimum.

( ¾ independent of Ü.) University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.6/95 . same dispersion of Ý for all Ü.Least squares and MLE A regression line can be interpreted as a maximum likelihood estimator (MLE): Assumption: The data generation process can be described well by the model Ý · Ü · where is normally distributed with mean 0 and (unknown) variance ¾ .e. i.

´ Ý Ü µ µ Ô ½ ¾ ¾ ¡ ÜÔ Ý ´ ´ ¾ · ¾ Ü µµ ¾ leading to the likelihood function Ä Ü½ Ý½ ´´ Ò ´ Ò ½ ÜÒ ÝÒ Ü Ý Ü ´ µ µ ´ ¾ µ µ ´ ½ Ü ¡Ô µ ½ ¾ ¡ ¾ ÜÔ Ý ´ ´ ¾ · ¾ Ü µµ ¾ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.7/95 .Least squares and MLE Therefore.

Least squares and MLE To simplify the computation of derivatives for ﬁnding the maximum. we compute the logarithm: ÐÒ Ä Ü½ Ý½ ´´ µ ´ ÜÒ ÝÒ ½ ¾ µ ¾ µ Ò ÐÒ ´ ½ Ü ¡Ô µ ¾ ¡ ÐÒ ÜÔ Ý ´ ´ ¾ · ¾ Ü ´ µµ ¾ Ò ÐÒ ´ ½ Ü Ò µ · ½ Ô ½ ¾ ¾ ½ ¾ Ò ¾ ½ Ý ´ · Ü µµ ¾ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.8/95 .

. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. and ¾ ) maximizing the likelihood function is equivalent to minimizing Ò ´ µ ´ ½ Ý ´ · Ü µµ ¾ Interpreting the method of least squares as a maximum likelihood estimator works also for the generalisations to polynomials and multilinear functions discussed later on.9/95 .Least squares and MLE From this expression it becomes clear that (provided ´Üµ is independent of .

r. the parameters does not work in the other examples of error functions.Multilinear regression Minimization of the error function based on partial derivatives w. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. since ¯ ¯ the absolute value and the maximum are not everywhere differentiable and the distance in the case of the Euclidean distance leads to system of nonlinear equation for which no analytical solution is known.10/95 .t.

) Ò ´ µ Ü ½ Ý Ü ¾ Ò ¾ ½ Ý Ý Ü Ò ¾ Ü ½ Ü Ü University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. (unlimited) Example: Ý growth of bacteria.11/95 .Nonlinear regression For nonlinear dependencies (in the parameters) taking partial derivatives leads to nonlinear equations: Ü (radioactive decay.

can also be derived from University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. when the regression function is linear in the coefﬁcients (parameters): Ý ´ Üµ ½ Ü ½ ½ Ü · · Ü Note that the attributes other attributes.12/95 .Linear regression The least squared approach leads to an analytical solution.

¿ . Ý : distance): Ý i. acceleration: Ü: time. constant Ü¾ · Ü · .Examples ¯ quadratic dependency (for instance.13/95 . ¾ . ½ . two (Ü¾ ). Ü¾ Ü. electricity consumption of a suburban area based on the number of ﬂats with one (Ü½ ).e. Ü½ Ü¾ . three (Ü¿ ) and four or more persons (Ü ) living in them): Ý Ü· ½ Ü½ · ¾ Ü¾ · ¿ Ü¿ · University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. Ü¿ ½ ¯ linear dependency for different variables (for instance.

14/95 .Linear regression linear regression function Ý ´ Üµ ½ Ü ½ ½ Ü · · Ü Ò ´ ½ µ ´ ´ Ò ½ Ü µ Ý µ¾ ½ ´µ ½ ½ Ü · · Ü Ý ´µ ¾ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

15/95 .Linear regression Ò Ø Ø Ò ¾ ¼ ½ ½ ½ ½ ´µ ½ ½ Ü · · ½ Ü´ µ Ý Ü´Ø µ Ü´ µÜ´Ø µ Ò ¾ ½ Ü Ý ´µ ¾ Ò ¾ ½ Ý Ü´Ø µ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

Linear regression

Ø

¼

implies

Ò

½ ½

Ü Ü

´µ ´µ Ø

Ò

½

ÝÜ

´µ Ø

for Ø

½

.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining – p.16/95

Linear regression

Ò

½ ½

Ü´½ µÜ´½ µ

Ò

· ·

½

Ü´ µÜ´½ µ

Ò

½

Ý Ü´½ µ

. . .

Ò

½ ½

. . .

. . .

Ò

·

Ü´½ µÜ´ µ

·

½

Ü´ µ Ü´ µ

Ò

½

Ý Ü´ µ

Data Mining – p.17/95

University of Applied Sciences Braunschweig/Wolfenbuettel

Linear regression

normal equation:

Ò

½

¼ ½

ÜÜ

. . .

½

Ò

½

ÝÜ

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining – p.18/95

Linear regression

The coefﬁcients are then given by

¼ ½

. . .

½

Ò

½

ÜÜ

½

Ò

½

ÝÜ

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining – p.19/95

**Model vs. black box
**

When the principal functional dependency between the predictor variables and the predictor variables Ü½ ÜÔ is known, an explicit parameterised (possibly nonlinear) regression function can be speciﬁed. If such a (model) is not known, one can still try to construct a suitable regression function.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining – p.20/95

**Model vs. black box
**

When the functional dependency between the predictor variables ½ is not known, one can try a

¯ ¯

linear

Ý

¼

¼

·

½ ½

Ü

·

·

Ü

¡Ü ¡ Ü¾

·

quadratic

Ý

¯

¡ Ü¾ ·½ ½ ¾ ·½ ¡ Ü½ Ü¾

·

·

½

¡ Ü½

· · ·

·

¾

·

·

¾ · ´

½µ

¾

¡ Ü ½Ü

Data Mining – p.21/95

or cubic approach.

University of Applied Sciences Braunschweig/Wolfenbuettel

**Model vs. black box
**

The coefﬁcients can be interpreted as weighting factors, at least when the predictor variables ½ have been normalised. They also provide information of a positive or negative correlation of the predictor variables with the dependent variable . Usually, complex regression functions yield black box models, which might provide a good approximation of the data, but do not admit a useful interpretation (of the coefﬁcients).

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining – p.22/95

Generalisation

Considering a data set as a collection of examples, describing the dependency between the predictor variables and the dependent variable, the regression function should “learn” this dependency from the data and to generalise it to new data to make correct predictions. To achieve this, the regression function must be universal (ﬂexible) enough to be able to learn the dependency. This does not mean that a more complex regression function with more parameters leads to better results than a simple one.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining – p.23/95

Overﬁtting

Complex regression functions can lead to overﬁtting:

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining – p.24/95

Overﬁtting

Complex regression functions can lead to overﬁtting:

The regression function “learns” a description of the data, not of the structure inherent in the data. Predicition can be worse than for a simpler regression function.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining – p.25/95

Approximation vs. extrapolation

Distinction between

¯ approximation ¯ extrapolation,

of the data and

corresponding to a prediction in regions where no data points are available.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining – p.26/95

University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. of the data and corresponding to a prediction in regions where no data points are available.Approximation vs.27/95 . extrapolation Distinction between ¯ approximation ¯ extrapolation.

Approximation vs. extrapolation Distinction between ¯ approximation ¯ extrapolation. of the data and corresponding to a prediction in regions where no data points are available. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.28/95 .

University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.29/95 . extrapolation Distinction between ¯ approximation ¯ extrapolation. of the data and corresponding to a prediction in regions where no data points are available.Approximation vs.

30/95 .Robust regression linear model: Ý computed model: « ¬½ Ü ½ Ü ¬ · · · · · ¬Ü Ü · Ý objective function: Ò ½ Ü Ü½ · · · · Ò ´ µ ´ ½ ½ Ý Ü µ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

if University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. . ¯ ¯ ¯ ¯ ´ µ ´¼µ ´ µ ´ µ ¼ ´ µ ¾. . should satisfy at least .31/95 .M-estimators least squares method: For other choices of . ´ . ´ ¼ µ µ .

Ò ½ ´ Ý Ü leads to Ò ½ ´ Ý Ü µ ¡ ¡Ü Ò ½ Û ¡ Ý Ü ´ µ ¡Ü ¼ Solution of this system of equations is the same as for the weighted least squares problem Ò ½ Û ¾ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. Û´ È µ ´ µ and Û µ Û ´ µ .32/95 .M-estimators Deﬁne Computing derivatives of Ò ½ ´ µ È ¼ .

and Solution strategy: Alternating optimisation.33/95 .M-estimators Problem: ¯ ¯ The weights Û depend on the errors the errors depend on the weights Û . University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

Robust regression Method least squares Huber Tukey ´ µ ´ ¾ ½ ¾ ¾ ¾ ¾ ½ ¾ ¾ ½ ½ if if ¡ ¾ ¿ ¾ if íf University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.34/95 .

8 25 0.6 rho 20 w 0.35/95 15 10 0 2 4 6 University of Applied Sciences Braunschweig/Wolfenbuettel .2 5 0 -6 -4 -2 0 e 2 4 6 -6 -4 -2 0 e Data Mining – p.M-estimators: Least squares 40 1 35 30 0.4 0.

4 0.2 1 0 -6 -4 -2 0 e 2 4 6 -6 -4 -2 0 e Data Mining – p.36/95 3 2 0 2 4 6 University of Applied Sciences Braunschweig/Wolfenbuettel .M-estimators: Huber 8 1 7 0.8 6 5 0.6 rho 4 w 0.

8 2.37/95 2 4 6 University of Applied Sciences Braunschweig/Wolfenbuettel .5 0 -6 -4 -2 0 e 2 4 6 0 -6 -4 -2 0 e Data Mining – p.6 0.2 0.5 0.4 1 0.5 2 rho w 1.5 1 3 0.M-estimators: Tukey 3.

38/95 . robust regression Y −2 −4 −1 0 1 2 −2 0 X 2 4 6 least squares and robust regression University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Least squares vs.

6 0.0 1 2 3 Index 4 5 6 University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.4 0.8 1.2 0.Robust regression: Weights Huber weight 0.39/95 .

) > summary(reg.rlm(y ˜ x1 + x2 + .huber) Plotting the weights and enable clicking interesting points (here: size of the data set = 100): > plot(reg.40/95 ....Robust regression with R > library(MASS) > reg.huber$w) University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. data=. ylab="Huber Weight") > identify(1:100..huber$w.huber <. reg..

..... method=’MM’) . University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.... Otherwise.Robust regression with R Tukey’s approach requires the package lqs.rlm(y ˜ x1 + x2 + .41/95 .. analogous to Huber’s approach: ..bisq <. data=. > reg.

given that the data set is sufﬁciently large and covers all combinations. a regression function can be constructed for each possible combination of the values of the nominal attributes.42/95 . University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Regression & nominal attributes If most of the predictor variables are numerical and the few nominal attributes have small domains.

Yes). Possible solution: Construct four separate regression functions for (F.Yes).(M.Regression & nominal attributes Attribute Type/Domain Sex F/M Vegetarian Yes/No Age numerical Height numerical Weight numerical Example: Task: Predict the weight based on the other attributes.(F.No). University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.(M.No) using only age and height as predictor variables.43/95 .

unless the nominal attribute is actually ordinal. Do not encode nominal attributes with more than two values in one numerical attribute.Regression & nominal attributes Alternative approach: Encode the nominal attributes as numerical attributes. ¯ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. a 0/1 or ½ ½ numerical attribute should be introduced for each possible value of the nominal attribute.44/95 . ¯ ¯ Binary attributes can be encoded as 0/1 or ½ ½ For nominal attributes with more than two values.

(blue line) University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. but a numeric quantity.Regression trees Like decision trees. ¯ Simple regression trees: Predict constant values in leaves. but target variable is not a class.45/95 .

Regression trees ¯ More complex regression trees: Predict linear functions in leaves. (red line) University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.46/95 .

47/95 University of Applied Sciences Braunschweig/Wolfenbuettel .Regression trees: Attribute selection ¯ ¯ The variance/standard deviation is compared to the variance/standard deviation in the branches. The attribute that yields the highest reduction is selected. Data Mining – p.

Regression Trees: An Example A regression tree for the Iris data (petal width) (induced with reduction of sum of squared errors) University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.48/95 .

A neuron receives input signals from excitatory (positive) synapses (connections to other neuron). being either active or inactive.49/95 . A neuron receives input signals from inhibitory (negative) synapses (connections to other neuron).Neural networks McCulloch-Pitts model of a neuron (1943) ¯ ¯ ¯ ¯ A neuron is a binary switch. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. Each neuron has a ﬁxed threshold value.

50/95 . University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.McCulloch-Pitts model ¯ The inputs of a neuron are accumulated (integrated) for a certain time. Aim of the McCulloch-Pitts model: neurobiological modelling and simulation to understand very elementary functions of neurons and the brain. When the threshold value of the neuron is exceeded. the neuron becomes active and sends signals to its neighbouring neurons via its synapses.

otherwise 0. The stimulus is passed on to an output neuron via a weighted connection (synapse).The simple perceptron The perceptron was introduced by Frank Rosenblatt for modelling pattern recognition abilities in 1958. Aim: Automatic learning of the weights and the threshold to classify objects shown on the retina correctly. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. When the threshold of the output neuron is exceeded. the output is 1.51/95 . ¯ ¯ ¯ A simpliﬁed retina is equipped with receptors (input neurons) that are activated by an optical stimulus.

The simple perceptron A perceptron for identifying the letter F. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. Two positive and one negative example.52/95 .

53/95 .The simple perceptron schematic model of a perceptron w e ig h ts ´ o u tp u t n e u ro n in p u t la y e r University of Applied Sciences Braunschweig/Wolfenbuettel ½ ¼ if otherwise ÈÛ Data Mining – p.

For multiclass problems use one perceptron per class. output class (0 or 1).The simple perceptron Perceptron: classiﬁcation for two-class problems. (Numerical) input attributes . Two parallel perceptrons University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.54/95 .

55/95 . Each time. the perceptron predicts the wrong class. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. adjust the weights and the threshold value. For each data object in the training data set. check whether the perceptron predicts the correct class. Repeat this until no changes occur.Perceptron learning algorithm ¯ ¯ ¯ ¯ Initialise the weights and the threshold value randomly.

The delta rule When the perceptrons makes a wrong classiﬁcation. the threshold is not exceeded.56/95 ¯ University of Applied Sciences Braunschweig/Wolfenbuettel . increase the threshold and adjust the weights depending on the sign and magnitude of the inputs. although it should not be. although it should be. ¯ If the desired output is 1 and the perceptron’s output is 0. If the desired output is 0 and the perceptron’s output is 0. lower the threshold and adjust the weights depending on the sign and magnitude of the inputs. change the weights and the threshold at least in the correct direction. Therefore. Data Mining – p. the threshold is exceeded. Therefore.

The delta rule ¯ ¯ ¯ ¯ ¯ ¯ ´ Û : A weight of the perceptron : The threshold value of the output neuron Ø: desired output for input ½ Ò ØÔ: The perceptron’s output for input ´ ¼ ½ Ò µ: An input µ ´ ½ Òµ : Learning rate University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.57/95 .

58/95 .The delta rule The delta rule recommends to adjust the weights and the threshold value according to: Ûnew new Ûold · ¡ Û old · ¡ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

The delta rule if ØÔ if ØÔ if ØÔ if ØÔ if ØÔ if ØÔ ¼ ¡ Û Ø ¼ ½ · ¼ and Ø and Ø ½ ¼ ¡ · Ø ¼ ½ and Ø and Ø ½ ¼ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.59/95 .

Epoch 0 0 01 10 11 Ø ØÔ 0 0 0 0 0 0 1 0 Û¾ ½ ¼ ¡ Û½ 0 0 0 1 ¡ Û¾ 0 0 0 1 ¡ Û½ Û¾ 0 0 0 1 0 0 0 1 0 0 0 ½ 0 0 0 ½ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.60/95 . Training data: ´´¼ ¼µ ¼µ ´´¼ ½µ ¼µ ´´½ ¼µ ¼µ ´´½ ½µ ½µ Learning rate: Initialisation: Û½ 1.Example Learning of the logical operator AND.

Epoch 0 0 01 10 11 3.61/95 . Epoch 0 0 01 10 11 0 0 0 1 0 0 0 1 1 1 0 0 0 1 1 0 ¡ Û½ 0 0 0 1 0 0 ½ ¡ Û¾ 0 ½ ¡ Û½ Û¾ 1 1 1 2 2 2 1 2 1 0 0 1 1 0 0 1 0 1 1 0 0 1 2 1 0 1 0 ½ 1 1 0 ½ 1 0 1 0 1 1 ½ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Example Ø ØÔ 2.

Epoch 0 0 01 10 11 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 1 ¡ Û½ 0 0 ½ ¡ Û¾ 0 0 0 1 0 ½ ¡ Û½ Û¾ 2 2 1 2 2 2 2 2 1 1 1 2 2 1 1 1 1 1 2 1 1 2 2 2 1 0 0 0 0 0 0 1 ½ 0 0 0 1 0 0 University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. Epoch 0 0 01 10 11 5.Example Ø ØÔ 4.62/95 .

Example Ø ØÔ 6.63/95 . Epoch 0 0 01 10 11 0 0 0 0 0 0 1 1 ¡ Û½ 0 0 0 0 ¡ Û¾ 0 0 0 0 ¡ Û½ Û¾ 2 2 2 2 1 1 1 1 2 2 2 2 0 0 0 0 University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

there exists a perceptron that can classify all patterns correctly.64/95 . then the delta rule will adjust the weights and the threshold after a ﬁnite number of steps in such way that all patterns are classiﬁed correctly. Which kind of classiﬁcation problems can be solved by a perceptron? University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. for a given data set with two classes.Perceptron convergence theorem If.

Let ØÔ be the output of the perceptron for input ´ ½ ¾ µ.65/95 University of Applied Sciences Braunschweig/Wolfenbuettel .Linear separability Consider a perceptron with two input neurons. Then ØÔ ½ ´µ Û½ ¡ ´µ ¾ Û½ Û ¾ ½ · Û¾ ¡ ½ · ¾ Û¾ The perceptron’s output is 1 if and only if the input pattern ´ ½ ¾ µ is above the line Ý Û½ Ü Û Û¾ ¾ · Data Mining – p.

University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.66/95 .w1 x + θ w2 Æ class 0 ¯ class 1 Klasse 1 Klasse 0 The parameters Û½ Û¾ determine a line. All input patterns above this line are assigned to class. the input patterns below the line to class 0.Linear separability y = .

if there exists a hyperplane separating the two classes.Linear separability More generally: A perceptron with Ò input neurons can classify all examples from a data set with Ò input variables and two classes correctly. Such classiﬁcation problems are called linearly separable.67/95 . University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

68/95 .Linear separability Example: The exclusive OR (XOR) deﬁnes a classiﬁcation task which is not linearly separable. 1 0 1 University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

5 1 1 A perceptron with a hidden layer of neurons: The hidden layer carries out a transformation. 1 1 University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.XOR with hidden layer 1 1 1 1 1 .69/95 . The output neuron can solve the linearly separable problem in the transformed space.5 1 0 .5 -1 0 .

70/95 .Learning algorithm? How to adjust the weights (and thresholds) for the neurons the hidden layer? Problem: Solution: Multilayer perceptrons with gradient descent Does not work with binary (non-differentiable) threshold function as activation function for the neurons. Must be replaced by a differentiable function. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

6 0.5 0.9 0.71/95 .8 0.5*x)) 1/(1+exp(-2*(x-3))) University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Sigmoidal activation function sigmoidal functions and steepness « with bias ¼: ½ · ÜÔ´ Ù ´netÙ µ « netÙ ´ ½ µµ 1 0.2 0.1 0 -10 -5 0 5 10 1/(1+exp(-0.4 0.3 0.7 0.

Activation functions for neurons are usually sigmoidal functions. Connections exist only between neurons from one layer to the next layer.Multilayer perceptrons A multilayer perceptron is a neural network with an input layer.72/95 . one or more hidden layers and an output layer. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

73/95 .Multilayer perceptrons in p u t la y e r h id d e n la y e r h id d e n la y e r o u tp u t la y e r University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

Error function (mean) squared error: ½ ¾ Ô Ú ¾Íout ´ Ø ´Ôµ Ú ´Ôµ ¾ Ú µ ¯ ¯ ¯ Íout: the set of output neurons ´Ôµ ØÚ : target output of output neuron Ú for input ´Ôµ ´Ôµ Ú : output (activation) of output neuron Ú for input ´Ôµ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.74/95 .

Learning rule Activation of a neuron: ´Ôµ Ú ´ net´Ôµ µ Ú Input for a neuron including bias values: net´Ôµ Ú Ù¾Í ÏÙÚ ´ µ ´Ôµ Ù Ú Í : set of neurons in the layer before neuron Ú University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.75/95 .

i.e. Adjust the weights based on a gradient descent technique. ¡Ô ÏÙÚ ´ µ ÏÙÚ ´ ´Ôµ µ ¼ : learning rate ÏÙÚ ´ ´Ôµ ´Ôµ µ net´Ôµ Ú ¡ ÏÙÚ ´ net´Ôµ Ú µ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.76/95 . proportional to the gradient of the gradient of the error function.Learning rule Consider the input/output pattern Ô.

77/95 .Learning rule ÏÙÚ ´ net´Ôµ Ú µ ÏÙÚ ´ µ Ù¼ ¾Í Ï ´Ù¼ Ú µ ´Ôµ Ù¼ Ú ´Ôµ Ù Deﬁne the error signal ´ ÆÚÔµ ´Ôµ net´Ôµ Ú University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

¡Ô ÏÙÚ ´ µ ´ ¡ ÆÚÔµ ¡ ´Ôµ Ù ´ ÆÚÔµ ´Ôµ net´Ôµ Ú ´Ôµ ´Ôµ Ú ¡ ´Ôµ Ú net´Ôµ Ú University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Learning rule Therefore: ÏÙÚ ´ ´Ôµ µ ´ ÆÚÔµ ¡ ´Ôµ Ù i.78/95 .e.

Learning rule ´Ôµ Ú net´Ôµ Ú ´Ôµ ´Ôµ Ú ¼ ´net´Ôµ µ Ú To compute ¯ . ¼ ´net´Ôµ µ´Ø´Ôµ Ú Ú ´Ôµ Ú µ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.79/95 . i. consider the two cases: ´Ôµ ´Ôµ Ú Ú is an output neuron: ´ ÆÚÔµ Ø ´ ´Ôµ Ú ´Ôµ Ú µ.e.

Learning rule ¯ Ú is a hidden neuron in layer ´Ôµ netÚ ¡ ´Ôµ net ´Ôµ : ´Ôµ ´Ôµ Ú Ú ¾Í ·½ Ú ´Ôµ Ú ´Ôµ Ú ¾Í ·½ Ú ¾Í ·½ netÚ´Ôµ ´Ôµ ¡ ´ ´Ôµ Ú Ù¼ ¾Í µ Ï Ù¼ Ú ´ µ ´Ôµ Ù¼ Ú net´Ôµ Ú ¡Ï Ú Ú ´ µ Ú ¾Í ·½ ´ ÆÚÔµÏ Ú Ú University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.80/95 .

¡Ô ÏÙÚ ´ µ ´ ÆÚÔµ ´Ôµ Ù University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Backpropagation Leading to: ´ ÆÚÔµ ¼ ´net´Ôµ µ Ú Ú ¾Í ·½ ´ ÆÚÔµÏ Ú Ú ´ µ Result: Recursive equation for updating the weights: Update the weights to the neuron in the output layer ﬁrst and then go back layer by layer and update the corresponding weights.81/95 .

Backpropagation can also be applied in case there are connections between from layers that are not neighboured as long as the neural network represent a directed acyclic graph.Backpropagation where Æ ´Ôµ Ú ¼ ´net´Ôµ µ´Ø´Ôµ Ú Ú Ú ¼ ´net´Ôµ µ È¾ ´Ôµ Ú µ if Ú ¾ Íout ´ µ Ú Í ·½ Æ ÏÚÚ ´Ôµ Ú if Ú ¾ Í This learning rule is called Backpropagation or generalised delta rule.82/95 . These networks are also called feedforward networks. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

83/95 .Training the bias values The bias values can be considered as special weights when an artiﬁcial input neuron Ù with constant input 1 is introduced: u 1 u 0 u 2 G G G University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

84/95 .Training the bias values The bias values can be learned in the same manner as the weights based on the following considerations: ÏÙÚ ÔÏ Ù Ú ´ µ ¡ ´ µ ÏÙ Ú ÔÏ Ù Ú ´ µ ¡ ´ Ú µ ½ ¡Ô Ú ´Ôµ Ù ´Ôµ Ù University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

e. ¯ ¯ ¯ A very large leads to skipping minima or oscillation.85/95 University of Applied Sciences Braunschweig/Wolfenbuettel . Training the networks with different random initialisations will usually lead to different results. slow convergence or even convergence before the (local) minimum is reached.Backpropagation ¯ Backpropagation as a gradient descent technique can only ﬁnd a local minimum. Data Mining – p. A very small leads to starving. i. The learning rate deﬁnes the stepwidth of the gradient descent technique.

Typical choices for and ¬ : ´ µ ÏÙÚ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. otherwise it decreases.86/95 . the weight change increases. the previous weight change is taken into account: ¡Ô ¡Ô ÏÙÚ ´ µ ´ ÆÚÔµ ´Ôµ Ù · ¬ ÕÏ Ù Ú ¡ ´ µ is the weight change in the previous step of the gradient descent algorithm. ¼ ¾. If weight is changed continuously in the same direction.Backpropagation ¯ Introduce a momentum term: For the weight change. ¬ ¼ .

Data Mining – p. how to choose the number of hidden layers and the size of the hidden layers.e. but after a whole epoch. There is no general rule. weights are not updated after the presentation of each input pattern. after all patterns have been presented once. Small neural networks might not be ﬂexible enough to ﬁt the data. The steepness of the activation function is usually ﬁxed and is not adjusted. otherwise the derivative is almost zero. i. Large neural networks tend to overﬁtting. The multilayer perceptron learns only in those regions where the activation function is not close to zero or one.Backpropagation ¯ ¯ ¯ ¯ Usually.87/95 University of Applied Sciences Braunschweig/Wolfenbuettel .

Approximate the error function locally by a quadratic curve and ﬁnd in each step the minimum of the quadratic curve. pushing all weights in the direction of zero. Only those weights will “survive” that are really needed.88/95 . ¯ ¯ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Other learning algorithms ¯ Weight decay is sometimes included.

the neural network should learn the identity function. After training.89/95 ¯ ¯ ¯ University of Applied Sciences Braunschweig/Wolfenbuettel .Nonlinear PCA Dimension reduction with multilayer perceptrons: ¯ Input and output are identical. Data Mining – p. input the data into the network and use the outputs of the bottleneck neurons for the graphical representation. (Autoassociative network) Introduce a hidden layer with only two neurons (representing the two dimensions for the graphical representation of the dimension reduction). i.e. Train the neural network with the data. the bottleneck.

Autoassociative bottleneck neural networks University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.90/95 .

Mathematical approximation method for multivariate functions. have little inﬂuence on the regression function. usually with Gaussian (unimodal) activation functions.91/95 . ¯ Support vector regression (SVR): Similar idea as SVMs.Other regression models ¯ Radial basis function networks (RBF networks) have only one hidden layer. Those points that are approximated well. ¯ Multivariate adaptive regression splines (MARS): ¯ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

5 as a cut-off value. The regression function will usually not yield exact outputs 0 and 1. the mean squared error). but the classiﬁcation decision can be made by considering 0. but not misclassiﬁcations. Problem: The objective functions aims at minimizing the function approximation error (for example.92/95 .Classiﬁcation as regression A two-class classiﬁcation problem (with classes 0 and 1) can be viewed as regression problem. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

except for 9 data objects where it yields 1 instead of 0 and vice versa.Classiﬁcation as regression Example: 1000 data objects. Regression function always yields the exact and correct values 0 and 1.1 for all data from class 0 and 0.93/95 . 500 belonging to class 0. ¯ ¯ Regression function yields 0. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.9 for all data from class 1. 500 to class 1.

01 9 0. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.94/95 .Classiﬁcation as regression Regression function Misclassiﬁcations MSQE 0 0.009 From the viewpoint of regression is better than . from the viewpoint of misclassiﬁcations should be preferred.

For example: A data object between class 1 and 3 might be classiﬁed as class 2 by the regression function. This leads to interpolation errors. Train a classiﬁer (regression function) for each class against all other classes.Classiﬁcation as regression For multiclass problems do not enumerate the classes and learn a single regression function.95/95 . University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

Model validation ¯ ¯ ¯ ¯ Does the model ﬁt the data at all? What is the most appropriate model? Are the ﬁndings from the model signiﬁcant at all? How will the model performance be for new data? University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.1/25 .

5 University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.0 pwidth 1.0 2.5 1.2/25 .Model ﬁtting Fitting a linear model to a sample 7 plength 1 2 3 4 5 6 0.5 2.

0 University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.3/25 .5 7.0 5.5 5.0 2.0 7.0 slength 4.5 6.0 6.5 2.0 swidth 3.Model ﬁtting Fitting a linear model to a sample with (almost) no correlation 8.5 4.5 3.

Model ﬁtting Fitting a linear model to a sample with nonlinear dependency 100 y 0 20 40 60 80 0 2 4 x 6 8 10 University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.4/25 .

(The regression line does not really reﬂect any meaningful relation between the variables.) Therefore. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. (A regression line can always be found for two-dimensional data. model validation is needed.Model validation Fitting a model to the data is (almost) always possible.5/25 .) It does not mean that the model that ﬁts best to the data ﬁts to the data at all.

complex models How complex should a model be? A complex model like ¯ ¯ a decision tree with many nodes or a nonlinear regression function will usually ﬁt the data better than a simple model.6/25 .Simple vs. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. they tend to overﬁtting. Are complex models always better? No.

) Ý Ý Ü· Ü¾ · ½ Ü · ¾ ¼. ½ Ý ¼. complex model Example: Assume that there is an unknown noisy functional dependency between attributes and . is the best one? University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.7/25 . Ý Which of the models ¯ ¯ ¯ ´Üµ · Ü ( Ü is an unknown random noise.Simple vs. ¼.

For Ò data points the error will be zero for Ý Ò ½ÜÒ ½ · ½Ü · ¼. One way for choosing the model: crossvalidation (will be discussed later on) Alternative: Minimum description length principle (MDL) University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Simple vs. Tends to overﬁtting. complex model The error on the training data set will decrease with increasing model complexity (here: higher degree of the polynomial).8/25 .

9/25 . Use as few bit as possible. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. Similar problem as for data compression: Find a coding which is as compact as possible.MDL Basic idea of the minimum description length principle: The data should be send over a channel or should be stored in a compressed form in a ﬁle.

But a complex compression scheme will need larger memory itself. The more complex the rule for decompression. the more the data can be compressed.10/25 . University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.MDL The compressed ﬁle consists of the compressed data and a rule how to decompress the data (the model).

MDL Goal: Find a compression for which compressed data + description of the (de-)compression rule is minimal.11/25 . For the regression example: University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

MDL Store (or transmit) the data ´Ü½ Ý½ µ precision of one decimal place.12/25 . University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. For a model of the form ´ÜÒ ÝÒµ with Ý ´Üµ Ü · · ½Ü · ¼ the values Ü and the errors ¡Ý (instead of the values Ý ) must be transmitted.

13/25 . the model parameters must be transmitted. ½¼¿ ½µ ´½¼ ¼ ½¾½ ¾µ ´½¾¼ ¼ ½ ¼ µ µ ) ½ ´´½ ¼ ¼ ½µ ´½¼ ¼ ¼ ¾µ ´¾¼ ¼ ¼ ¾µ In addition. University of Applied Sciences Braunschweig/Wolfenbuettel ¾ und ¼ ½¼½ Data Mining – p.MDL ´´Ü½ Ý½µ ´½ ¼ ½¼¿ ½µ ´Ü¾ Ý¾µ ´½¼ ¼ ½¾½ ¾µ ´Ü¿ Ý¿µ ´¾¼ ¼ ½ ¼ µ µ for example: For the model Ý Instead of ´´½ ¼ transmit ¾Ü · ½¼½.

14/25 .MDL Minimum description length principle: Choose the model for the data for which length of the model description + length of the data (model error) description becomes under the considered model becomes minimal. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.15/25 .MDL Similar for decision trees: ¯ ¯ Complex trees need a longer description. But less classiﬁcation errors need to be transmitted for complex trees.

Estimation of the error is based on the test data.16/25 .Training and test data The error on the training data set is called resubstitution error. The data should be split into training and test data. Only the data from the training set willbe used to construct the model. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. Training data for learning models to choose the best model Validation data Test data for estimation of the predcition error.Validation data Sometimes the data set is partitioned into three data set.17/25 .

a naïve Bayes classiﬁer and a nearest neighbour classiﬁer are constructed. validation and test data. Estimate the prediction error of the chosen classiﬁer based on the test data set. The data set is partioned into training.Validation data Example: Construction of a classiﬁer. Choose the classiﬁer which is best on the validation data. Based on the training data a decision tree.18/25 . University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

validation and test data. The validation data is used to compute the error during learning without inﬂuencing the training. backpropagation) is carried out with the training data. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. The data set is partioned into training. The learning algorithm (for instance.19/25 .Validation data Example: A neural network (multilayer perceptron) should be trained for a regression or classiﬁcation problem.

20/25 . E rro r The prediction error of the neural network is calculated based on the test data. when the error on the validation data has dropped to a minimum and is increasing again.Validation data Training of the neural network is stopped. v a lid a tio n d a ta tra in in g d a ta s to p tra in in g L e a rn e p o c h s University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

one of the subsets is left for testing. while the others are used for traing. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. Usually the mean of th vlues is taken as the prediction error. This is called -fold crossvalidation. In this way. Each time.21/25 .Crossvalidation In order to have a more robust estimation of the prediction error. the data set is split into disjoint subset of approimately the same size and the model is trained times. a prediction error can be computed times.

22/25 .Leave-One-Out For small data sets with Ò data objects. Only one data object is left for testing each time- University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. the Leave-One-Out or jackknife method is applied meaning Ò-fold crossvalidation.

some data objects are in the training set with multiple copies. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. other might not be in the training set. 0.Bootstrapping For the Bootstrap method the training data set is a drawn as a sample with replacement from the whole data set.632 bootstrap draws a sample of Ò data objects for training from a data set of size Ò.23/25 . Since sampling is carried out with replacement.

24/25 . Ò Ò ½ ¼¿ This means that the training set consists in average of 63.2% of the original data.Bootstrapping The probability that a data object is not chosen in a single draw is ½ ½ Ò The probability that a data set is not selected in the sample for the training data is ½ ½ for large Ò. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

estimated overall error ¼ ¿¾¡ error on the test data set · ¼ ¿ ¡ error on the trainig data set University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. The error is a weighted sum of the resubstitution error and the error on the test data set.25/25 .Bootstrapping Testing for bottstrapping is carried with the training and the testing data.

1/37 .Bayesian networks Basic idea of Bayesian networks: Representation of a probability distribution over a high dimensional space. Reasoning: How does the change of a single or some marginal distributions change the other marginal distributions? University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.

. .2/37 .Bayesian networks: Example Customer questionnaires: ¯ Each question has a ﬁnite number of answers like: Are you satisﬁed with ¯ ¯ ¯ ¯ ¯ ? 1: very satisﬁed 2: satisﬁed . 6: very dissatisﬁed 7: not applicable/don’t now University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.

3/37 . The set of questionnaires ﬁlled in by customers induces a probability distribution on the product space of all attributes (questions) with dependencies between the questions.Bayesian networks: Example ¯ ¯ Each question represents a nominal attribute. How do these dependencies look like? How can this be used to improve customer satisfaction? ¯ ¯ University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.

University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.Bayesian networks: Example Assume. How should this money be spent? Answers to target questions like ¯ ¯ Are you satisﬁed with our service technicians? Are you satisﬁed with our hotline? cannot be inﬂuenced directly. the management provides 1 million Euros to improve customer satisfaction.4/37 .

5/37 .Bayesian networks: Example Answers to other questions with correlations to the target questions can be inﬂuenced. For instance: ¯ ¯ ¯ ¯ Did the service technician show up on time? Was the service technician able to solve your problem immediately? How long did you have to wait at the hotline? Could the staff at the hotline answer your questions? University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.

6/37 . suitable actions can be taken to improve the answers to the target questions like ¯ ¯ employ more technicians/staff at the hotline provide better training for the technicians/staff at the hotline University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.Bayesian networks: Example Knowing the inﬂuence of such questions on the target questions.

7/37 .Bayesian networks: Example University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.

8/37 . University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.High-dimensional data spaces Number of possible combinations for 200 binary attributes: ¾¾¼¼ ½ ¡ ½¼ ¼ ¿ ¡ ½¼ ¼ Gigabyte Most of the combinations will not be admissible or relevant.

Decomposition 6 5 4 Ý Ý Ý Ý Ý Ý 1 2 3 Ý Ý 4 5 ¹ University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.9/37 .

Decomposition 6 5 4 1 Ý Ý Ý Ý Ý Ý ¹ 2 3 4 5 University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.10/37 .

Decomposition 6 5 4 1 ¾ Ý Ý Ý Ý Ý Ý and 2 3 4 5 ¹ University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.11/37 .

Decomposition University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.12/37 .

13/37 .Decomposition University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.

Decomposition University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.14/37 .

Decomposition University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.15/37 .

16/37 .Decomposition Table size Ò ¡ Ò University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.

Decomposition Table size Ò ¡ ¡ Ò Ò Ò University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.17/37 .

18/37 .Decomposition Table size Ò ¡ ¡ Ò Ò Ò Ò ¡ Ò ·Ò ¡ Ò Ò ¡ Ò ¡ Ò (in general) University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.

Decomposition of probability distributions Bayes’ theorem: È ´ Ü Ý µ È ´ Ü µ ¡ È´ Ý Ü µ University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.19/37 .

20/37 .Decomposition of probability distributions Bayes’ theorem: È ´ Ü Ý µ È ´ Ü µ ¡ È´ Ý Ü µ iterative application: È ´ Ý Ü Ý Þ µ Þ Ü Ý È ´ Ü µ ¡ È´ Ü µ ¡ È´ µ University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.

Decomposition of probability distributions Bayes’ theorem: È ´ Ü Ý µ È ´ Ü µ ¡ È´ Ý Ü µ iterative application: È ´ Ý Ü Ý Þ µ Þ Ü Ý È ´ Ü µ ¡ È´ Ü µ ¡ È´ µ in case of conditional independence: È ´ Ü Ý Þ µ Þ Ý È ´ Ü µ ¡ È´ Ý Ü µ ¡ È´ µ Data Mining – p.21/37 University of Applied Sciences Braunschweig/Wolfenbuttel ¨ .

22/37 . University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.Conditional independence and are conditional independent given È if ´ µ È ´ µ holds.

Conditional independence Example: Age Work experience Salary University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.23/37 .

24/37 .Conditional independence Example: Age Work experience Salary All three variables are correlated and pairwise dependent. University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.

25/37 .Conditional independence Example: Age Work experience Salary All three variables are correlated and pairwise dependent. But: ´Salary Age. Work experienceµ È È ´Salary Work experienceµ University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.

26/37 .Conditional independence A g e W o r k e x p e r ie n c e S a la r y University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.

27/37 .Conditional independence A g e A g e W o r k e x p e r ie n c e S a la r y W o r k e x p e r ie n c e S a la r y industry University of Applied Sciences Braunschweig/Wolfenbuttel ¨ public service Data Mining – p.

28/37 .Directed acyclic graph (DAG) È ´ ´ È´ È´ È µ µ ¡ È´ µ ¡ È´ µ ¡ È´ µ ¡ È´ A B C µ µ µ D E F G University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.

29/37 .Moral graph A S T L B E X D University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.

30/37 .Moral graph A S A S T L B T L B E E X D X D University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.

31/37 .Moral graph A S T L B S B L A T T L E B L E E X D X E D B E University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.

Hypergraph representation University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.32/37 .

33/37 . ¯ University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.Learning from data ¯ Estimation of the distributions: corresponding relative frequencies Learning structure: High computational costs.

for instance: Strategy 1: Maximum likelihood estimator: Maximize Á Ë ´ Ë µ ÐÒ´È ´ Ë Ë µµ ¯ ¯ Ë : Structure of the network network (estimated by the relative frequencies of the data for a given structure) ¯ Ë : Probability distributions in the nodes of the : Data University of Applied Sciences Braunschweig/Wolfenbuttel ¨ Data Mining – p.34/37 .Structure learning Deﬁne a measure for the quality of approximation of the data.

Structure learning

Leads to combinatorial explosion. Number of possible network structure increases exponentially with the number of attributes. Therefore: Use greedy strategies

¯

Start with network with no edges (all attributes are assumed to be independent) and add edges step by step. Add the edge that increases Á ´Ë Ë µ (maximum likelihood, K2, ) most. Or:

Start with fully connected network and remove the edge that leads to the smallest decrease of University´Ë ¨ Á of Applied Sciences Braunschweig/Wolfenbuttel Ë µ.

¯

Data Mining – p.35/37

Structure learning

Apply (conditional) independence test (for instance ¾ ) in order to decide which edges should be included in in the Bayesian network.

Strategy 2:

Compute the strength of the dependencies of pairs of attributes. Attributes with high dependency are usually neighbouring nodes in a Bayesian network. Apply heuristic search strategies (like genetic algorithms or tabu search).

Strategy 3:

In all cases: Find a compromise between a simple structure with a larger error and a complex structure with overﬁtting (use criteria like AIC, BIC oder MDL).

University of Applied Sciences Braunschweig/Wolfenbuttel ¨

Data Mining – p.36/37

Propagation

S B L A T T L E B L E

X E

D B E

In a Bayesian network, arbitrary attributes can instantiated (with a single value or a probability distribution). The computation of the (marginal) distributions of the other attributes is carried out based on message exchange algorithm.

University of Applied Sciences Braunschweig/Wolfenbuttel ¨

Data Mining – p.37/37

Cluster analysis

Aim:

Find groups (clusters) of similar objects in a data

set. Objects within the same cluster should be similar. Objects from different clusters should be dissimilar.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining – p.1/76

**Cluster analysis: Applciations
**

¯ ¯ ¯

Customer segmentation: Find groups of customers Gene clustering: Find groups of genes with similar properties (for instance, expression proﬁles) Clustering of solar stars: Find groups of starts based on attributes like size, characteristic of the spectrum, Identiﬁcation of social or economic groups based on attributes like income, age, education level,

¯ ¯

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining – p.2/76

Unsupervised classiﬁcation

Cluster analysis is an unsupervised classifcation technique. In contrast to supervised classifcation, the classes (clusters) are not known in advance in the data set.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining – p.3/76

Distance measures

Cluster analysis requires a distance or (dis)similarity measure for the grouping of the data. The choice of a suitable distance measures has a strong inﬂuence on the cluster structure. For continuous attributes, a normalisation technique should be applied in order to balance the inﬂuence of the single attributes on the overall distance. (See also nearest neighbour classiﬁers.)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining – p.4/76

Inﬂuence of scaling

m illi u n its

k ilo u n its

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining – p.5/76

Discrete attributes

When discrete attributes are involved in the distance measure, clusters tend to be split based on the discrete attributes, since they automatically lead to well-separated clusters. Therefore, most clustering algorithm focus on continuous attributes.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining – p.6/76

**Hierarchical agglomerative clustering
**

¯

Start with every data point in its own cluster. (i.e., start with so-called singletons: single element clusters) In each step merge those two clusters that are closest to each other. Keep on merging clusters until all data points are contained in one cluster. The result is a hierarchy of clusters that can be visualized in a tree structure, a so-called dendrogram.

¯ ¯ ¯

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining – p.7/76

**Measuring the distances
**

¯

The distance between singletons is simply the distance between the (single) data points contained in them. However: How do we compute the distance between clusters that contain more than one data point?

¯

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining – p.8/76

¯ Average Linkage Average distance between two points of the two clusters.Measuring the distance between clusters ¯ Centroid (red) Distance between the centroids (mean value vectors) of the two clusters. ¯ Single Linkage (green) Distance between the two closest points of the two clusters.9/76 ¯ Complete Linkage University of Applied Sciences Braunschweig/Wolfenbuettel . Data Mining – p. (blue) Distance between the two farthest points of the two clusters.

10/76 .Measuring the distance between clusters University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

Complete linkage leads to very compact clusters. Average linkage also tends clearly towards compact clusters. ¯ ¯ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Measuring the distance between clusters ¯ Single linkage can “follow chains” in the data (may be desirable in certain applications).11/76 .

12/76 .Measuring the distance between clusters Single linkage Complete linkage University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

Draw the data tuples at the bottom or on the left (equally spaced if they are multi-dimensional).13/76 . ¯ ¯ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Dendrograms ¯ The cluster merging process arranges the data points in a binary tree. with the distance to the data points representing the distance between the clusters. Draw a connection between clusters that are merged.

14/76 .Dendrograms distance between clusters data tuples University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

Agglomerative clustering ¯ ¯ Example: Clustering of the 1-dimensional data set ¾ ½¾ ½ ¾ ¾ . University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. All three approaches to measure the distance between clusters lead to different dendrograms.15/76 .

Agglomerative clustering Centroid Single linkage Complete linkage University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.16/76 .

Heatmaps University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.17/76 .

Heatmaps ¯ ¯ ¯ One axis: Attributes Other axis: Data objects Colours (colour intensities): Attribute values University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.18/76 .

) In each step the rows and columns corresponding to the two clusters that are closest to each other are deleted. (The data points themselves are actually not needed.Implementation aspects ¯ Hierarchical agglomerative clustering can be implemented by processing the matrix ´ µ½ Ò containing the pairwise distances of the data points. Data Mining – p. A new row and column corresponding to the cluster formed by merging these clusters is added to the matrix.19/76 ¯ ¯ University of Applied Sciences Braunschweig/Wolfenbuettel .

20/76 University of Applied Sciences Braunschweig/Wolfenbuettel .) Data Mining – p.Implementation aspects ¯ The elements of this new row/column are computed according to £ £ « ·« ·¬ · £ « « ¬ indices of the two clusters that are merged indices of the old clusters that are not merged index of the new cluster (result of merger) parameters specifying the method (single linkage etc.

Implementation aspects ¯ The parameters deﬁning the different methods are (Ò Ò Ò are the numbers of data points in the clusters): method centroid method median method single linkage complete linkage average linkage Ward’s method Ò Ò « Ò « Ò Ò ¬ Ò ÒÒ·Ò 0 0 ·Ò ½ ¾ ½ ¾ ½ ¾ ·Ò ½ ¾ ½ ¾ ½ ¾ ½ 0 0 0 0 ½ ¾ ½ ·¾ 0 0 ·Ò Ò ·Ò Ò ·Ò ·Ò Ò ·Ò Ò ·Ò Ò ·Ò · Ò Ò Ò University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.21/76 .

Æ Draw the dendrogram and ﬁnd a good cut level.Choosing the clusters ¯ Simplest Approach: Æ Specify a minimum desired distance between clusters. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. ¯ Visual Approach: Æ Merge clusters until all data points are combined into one cluster.22/76 . Æ Stop merging clusters if the closest two clusters are farther apart than this distance. Æ Advantage: Cut need not be strictly horizontal.

Æ Several heuristic criteria exist for this step selection. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. Æ Try to ﬁnd a step in which the distance between the two clusters merged is considerably larger than the distance of the previous step.Choosing the clusters ¯ More Sophisticated Approaches: Æ Analyze the sequence of distances in the merging process.23/76 .

which is not acceptable for larger data sets. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.24/76 .Agglomerative clustering: Complexity The computational complexity of agglomerative clustering is quadratic in the number of data.

Initialize the cluster centres randomly (for instance. (Intuitively: centre of gravity if each data point has unit weight. closer than any other cluster center). by randomly selecting data points). Assign each data point to the cluster centrethat is closest to it (i.) University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.e.-Means clustering ¯ ¯ Choose a number of clusters to be found (user input). ¯ Data point assignment: ¯ Cluster centre update: Compute new cluster centres as the mean vectors of the assigned data points.25/76 .

. ¯ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. i.e. It can be shown that this scheme must converge.26/76 .-Means clustering ¯ Repeat these two steps (data point assignment and cluster centre update) until the clusters centres do not change anymore. the update of the cluster centres cannot go on forever.

-Means clustering Aim: Minimize the objective function Ò Ù ½ ½ under the constraints Ù ¾ ¼ ½ and Ù ½ for all ½ Ò ½ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.27/76 .

Assuming the assignments to the clusters to be ﬁxed. ¯ This is again a greedy algorithm. each cluster centre should be chosen as the mean vector of the data objects assigned to the cluster in order to minimize the objective function.Alternating optimization ¯ Assuming the cluster centres to be ﬁxed. Ù ½ should be chosen for the cluster to which data object Ü has the smallest distance in order to minimize the objective function.28/76 . University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

29/76 University of Applied Sciences Braunschweig/Wolfenbuettel .-Means clustering: Example Data set to be clustered. Choose ¿ clusters. Randomly selected data points. (From visual inspection.) Initial position of cluster centres. Data Mining – p. can be difﬁcult to determine in general.

Left: Delaunay Triangulation (The circle through the corners of a triangle does not contain another point. Data Mining – p.30/76 University of Applied Sciences Braunschweig/Wolfenbuettel .Delaunay triangulations and Voronoi diagrams ¯ ¯ ¯ Dots represent cluster centres (quantization vectors).) Right: Voronoi Diagram (Midperpendiculars of the Delaunay triangulation: boundaries of the regions of points that are closest to the enclosed cluster centre (Voronoi cells)).

31/76 .Delaunay triangulations and Voronoi diagrams ¯ Delaunay Triangulation: simple triangle (shown in grey on the left) ¯ Voronoi Diagram: midperpendiculars of the triangle’s edges (shown in blue on the left. in grey on the right) University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

32/76 .-Means clustering: Example University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

Convergence is achieved after only 5 steps.-Means clustering: Local minima ¯ Clustering is successful in this example: The clusters found are those that would have been formed intuitively.33/76 . With a bad initialisation clustering may fail (the alternating update process gets stuck in a local minimum).) However: The clustering result is fairly sensitive to the initial positions of the cluster centers. (This is typical: convergence is usually very fast. ¯ ¯ ¯ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

-Means clustering: Local minima University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.34/76 .

For each training pattern ﬁnd the closest reference vector. For classiﬁed data the class may be taken into account. (reference vectors are assigned to classes) University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.35/76 .Learning vector quantization (LVQ) Adaptation of reference vectors/codebook vectors ¯ ¯ ¯ ¯ Like “online” -means clustering (update after each data point). Adapt only this reference vector (winner neuron).

Learning vector quantization Attraction rule (data point and reference vector have ´Ò Ûµ ´ÓÐ µ same class) Ö Ö · ´Ô Ö ´ÓÐ µ µ (data point and reference vector have different class) Repulsion rule Ö ´Ò Ûµ Ö ´ÓÐ µ ´Ô Ö ´ÓÐ µ µ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.36/76 .

37/76 University of Applied Sciences Braunschweig/Wolfenbuettel . Ö : reference vector ¼ (learning rate) Data Mining – p.Learning vector quantization Adaptation of reference vectors/codebook vectors attraction rule ¯ ¯ Ô repulsion rule : data point.

38/76 .LVQ: Learning rate decay Problem: ﬁxed learning rate can lead to oscillations Solution: time dependent learning rate ´Øµ ¼« Ø ¼ « ½ or ´Øµ ¼Ø ¼ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

39/76 University of Applied Sciences Braunschweig/Wolfenbuettel .LVQ: Example Adaptation of reference vectors / codebook vectors ¯ ¯ Left: Online training with learning rate Right: Batch training with learning rate ¼ ½. Data Mining – p. ¼¼ .

the neighbourhood (and the learning rate) become smaller over time. ¯ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. but also the reference vectors in the neighbourhood. not only the closest reference vector is updated in each step.Self-organizing maps (SOM) Self-organizing maps (also called Kohonen map) is an LVQ model ¯ ¯ where a topological structure is assumed on the reference vectors (for instance. a grid in the plane).40/76 .

41/76 .Unfolding SOM University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

42/76 .Fuzzy clustering Minimize the objective function Ò Ù Ñ ½ ½ under the constraints Ù ½ for all ½ Ò ½ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

called fuzziﬁer.Parameters ¯ Ù Ü ¾ ¼ ½ is the membership degree of data object to the th cluster.43/76 . controls how much clusters may overlap. ¯ is some distance measure specifying the distance between data object Ü and cluster ¯ Ñ ½. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

Parameters to be optimized ¯ ¯ the membership degrees Ù the cluster parameters (not given explicitly here.44/76 . University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. but hidden in the distances ).

but hidden in the distances ). University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. Leads to a non-linear optimization problem.45/76 .Parameters to be optimized ¯ ¯ the membership degrees Ù the cluster parameters (not given explicitly here.

Fuzzy c-means algorithm Ò Ù Ñ Ü Ú ¾ Ù ½ ½ È ½ ½ Ñ ½½ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.46/76 .

47/76 .Fuzzy c-means algorithm Ò Ù Ñ Ü Ú ¾ Ù ½ ½ È ½ ½ Ñ ½½ Ò Ñ Ù Ü Ú ¾ Ú ½ ½ È ½ È ½ Ò Ò Ù Ñ Ü ÙÑ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

48/76 .Other cluster shapes ¯ ¯ ¯ ¯ ¯ ¯ ellipsoidal clusters (Gustafson/Kessel 1979) clusters as lines/planes/hyperplanes (Bock 1979. Bezdek 1981) cluster as shells of circles (Davé 1990. Krishnapuram/Nasraoui/Frigui 1992) clusters in the form of arbitrary quadrics (Krishnapuram/Frigui/Nasraoui 1991-1995) adaptable cluster volumes (Keller/Klawonn 1999) University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

49/76 .Example University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

3 0.8 0.5 0.Gaussians mixture models 0.1 0 -3 -2 -1 0 1 2 3 4 Two normal distributions University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.7 0.2 0.50/76 .4 0.6 0.

1 0.51/76 .Gaussians mixture models 0.35 0.3 0.15 0.45 0.2 0.25 0.4 0.05 0 -3 -2 -1 0 1 2 3 4 Mixture model (both normal distrubutions contribute 50%) University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.

2 0.Gaussians mixture models 0.05 0 -3 -2 -1 0 1 2 3 4 Mixture model (one normal distrubutions contributes 10%.45 0.15 0.4 0.3 0.1 0.52/76 .35 0. the other 90%) University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.25 0.

(The probability density is a mixture of Gaussian distributions.53/76 .) ¯ Formally: We assume that the probability density can be described as ´Ü µ Ý ´Ü ½ Ý µ Ý Ô ´Ý µ¡ ´Ü Ý µ ½ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Gaussians mixture models ¯ Assumption: Data was generated by sampling a set of normal distributions.

e.Gaussians mixture models is the set of cluster parameters is a random vector that has the data space as its domain is a random variable that has the cluster indices as possible values (i. ÓÑ´ µ Ê Ñ and ÓÑ´ µ ½ ) Ô ´Ý µ is the probability that a data point belongs to (is generated by) the Ý -th component of the mixture µ ´Ü Ý is the conditional probability density function of a data point given the cluster (speciﬁed by the cluster index Ý ) Data Mining – p..54/76 University of Applied Sciences Braunschweig/Wolfenbuettel .

the maximum likelihood estimation of the parameters of a normal distribution).Expectation maximization ¯ Basic idea: ¯ Problem: Ä Do a maximum likelihood estimation of the cluster parameters. Ò Ò ´ µ ½ ´Ü µ ½Ý ½ Ô ´Ý µ¡ ´Ü Ý µ is difﬁcult to optimize. The likelihood function. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. even if one takes the natural logarithm (cf.55/76 . because Ò ÐÒ Ä´ µ ½ ÐÒ Ý Ô ´Ý µ¡ ´Ü Ý µ ½ contains the natural logarithms of complex sums.

Since the their values.Expectation maximization ¯ Approach: Assume that there are “hidden” variables stating the clusters that generated the data points Ü . we do not know ¯ Problem: University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. so that the sums reduce to one term. are hidden.56/76 .

57/76 . where Ý ´Ý½ ÝÒ µ combines the values of the variables . Ä´ Ý µ Ò ½ ´Ü Ý µ Ò ½ Ô ´Ý µ¡ ´Ü Ý µ ¯ Problem: Since the unknown are hidden.Expectation maximization ¯ Formally: Maximize the likelihood of the “completed” data set ´ Ýµ. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. That is. the values Ý are µ cannot be (and thus the factors Ô ´Ý computed).

University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. Try to maximize the expected value of Ä´ Ý µ or ÐÒ Ä´ Ý µ (hence the name expectation maximization).Expectation maximization ¯ Approach to ﬁnd a solution nevertheless: ¯ See the as random variables (the values Ý ¯ ¯ are not ﬁxed) and consider a probability distribution over the possible values. As a consequence Ä´ Ý µ becomes a random variable. even for a ﬁxed data set and ﬁxed cluster parameters .58/76 .

59/76 . alternatively. maximize the expected log-likelihood ´ÐÒ Ä´ Ý µ µ Ý ¾½ Ò Ô ´Ý µ¡ Ò ½ ÐÒ ´Ü Ý µ University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.Expectation maximization ¯ Formally: Find the cluster parameters as Ö Ñ Ü ´ ÐÒ Ä´ Ý µ µ that is. maximize the expected likelihood ´Ä´ Ý µ µ Ý ¾½ Ò Ô ´Ý µ¡ Ò ½ ´Ü Ý µ or.

Use the equation as an iterative scheme.Expectation maximization ¯ Unfortunately. these functionals are still difﬁcult to optimize directly. ﬁxing in some terms (Make sure that the iteration scheme converges – at least to a local maximum.) ¯ Solution: University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.60/76 .

e. Data Mining – p..Expectation maximization ¯ Iterative scheme for expectation maximization: Choose some initial set and then compute Ö Ñ Ü ´ÐÒ Ä´ Ý µ ·½ ¼ of cluster parameters µ ´Ý µ Ð Ð Ò ÖÑÜ Ý ¾½ ¾½ Ò Ò Ô Ò ½ ÐÒ µ Ò ´Ü Ý ½ µ ´Ü Ý µ ÖÑÜ Ý Ò Ð ½ Ô ´ÝÐ ÜÐ µ ¡ ÐÒ ÐÒ µ ÖÑÜ ¯ It can be shown that each EM iteration increases the likelihood of the data and that the algorithm converges to a local maximum of the likelihood function (i. EM is a safe way to maximize the likelihood function).61/76 ½ ½ Ô ´Ü ´Ü University of Applied Sciences Braunschweig/Wolfenbuettel .

Expectation maximization Justiﬁcation of the last step on the previous slide: Ò Ý ¾½ Ý½ Ò Ð ½ Ô Ð Ð ´ÝÐ ÜÐ Ò µ Ð Ò ½ ÐÒ µ Ò ´Ü Ý ½ ½ ½ Æ Ý Ð µ Ý ½ ¡¡¡ Ò ÝÒ ½ Ð ½ Ô Ð ´ÝÐ ÜÐ Ý½ Æ Ò ÐÒ Ð Ð ´Ü ´ÝÐ ÜÐ µ µ ½ ½ Ò ÐÒ Ô ´Ü ´Ü ½ ½ Ý Ð µ ½ ¡¡¡ ÝÒ ½ Ô ½ ½ Ý½ µ ¡ ÐÒ ½ Ô Ð ´Ü Ò µ Ô Ð Ð ½ ¡¡¡ Ð ÉÒ Ý ½ È ·½ ½ ¡¡¡ ÝÐ ½Ð ½ Ð ´ÝÐ ÜÐ µ Ð ÝÒ ßÞ ´ÝÐ ÜÐ Ð µ ½ Data Mining – p.62/76 ÉÒ Ð ½ ½ University of Applied Sciences Braunschweig/Wolfenbuettel .

University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p. as the relative probability densities of the different clusters (as speciﬁed by the cluster parameters) at the location of the data points Ü .Expectation maximization ¯ The probabilities Ô Ô ´ Ü µ are computed as ´Ü µ ´Ü ´Ü µ µ È ´Ü Ð ½ µ¡Ô ´ µ ´Ü Ð µ ¡ Ô ´Ð µ that is. ¯ The Ô ´ Ü µ are the posterior probabilities of the clusters given the data point Ü and a set of cluster parameters .63/76 .

¯ Distribute the unit weight of the data point Ü according to the above probabilities.64/76 . ½ . assign to ´Ü µ the weight Ô ´ Ü µ. University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.e.Expectation maximization ¯ They can be seen as case weights of a “completed” data set: ¯ Split each data point Ü into data points ´Ü µ.. ½ . i.

EM: Cookbook recipe Core Iteration Formula Ò ·½ Ö Ñ Ü ½ ½ Ô ´ Ü µ ¡ ÐÒ ´Ü µ Expectation Step ¯ For all data points Ü : Compute for each normal distribution the ´ Ü µ probability Ô that the data point was generated from it (ratio of probability densities at the location of the data point).65/76 University of Applied Sciences Braunschweig/Wolfenbuettel . Data Mining – p. “weight” of the data point for the estimation.

EM: Cookbook recipe Maximization Step ¯ For all normal distributions: Estimate the parameters by standard maximum likelihood estimation using the probabilities (“weights”) assigned to the data points w. the distribution in the expectation step.66/76 . University of Applied Sciences Braunschweig/Wolfenbuettel Data Mining – p.r.t.

EM: Gaussian mixtures

Expectation step: Use Bayes' rule to compute

  $p(i \mid x_j; \theta) = \frac{p(i \mid \theta) \cdot f(x_j; \theta_i)}{\sum_{l=1}^{c} p(l \mid \theta) \cdot f(x_j; \theta_l)}$,

the "weight" of the data point $x_j$ for the estimation.

EM: Gaussian mixtures

Maximization step: Use maximum likelihood estimation with the weights $p(i \mid x_j; \theta^{(t)})$ to compute the prior probabilities, mean vectors and covariance matrices:

  $p_i^{(t+1)} = \frac{1}{n} \sum_{j=1}^{n} p(i \mid x_j; \theta^{(t)})$

  $\mu_i^{(t+1)} = \frac{\sum_{j=1}^{n} p(i \mid x_j; \theta^{(t)}) \, x_j}{\sum_{j=1}^{n} p(i \mid x_j; \theta^{(t)})}$

  $\Sigma_i^{(t+1)} = \frac{\sum_{j=1}^{n} p(i \mid x_j; \theta^{(t)}) \, (x_j - \mu_i^{(t+1)})(x_j - \mu_i^{(t+1)})^{\top}}{\sum_{j=1}^{n} p(i \mid x_j; \theta^{(t)})}$

Iterate until convergence (checked, e.g., by the change of the mean vectors). A code sketch of both steps follows below.
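
To make the recipe concrete, here is a minimal NumPy sketch of the two steps for a Gaussian mixture. The function name, the initialization strategy, the small regularization term on the covariance diagonal, and the convergence test are choices of this sketch, not prescribed by the slides:

```python
import numpy as np

def em_gmm(X, c, n_iter=100, tol=1e-6, seed=0):
    """EM for a mixture of c Gaussians; X has shape (n, d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialization: random data points as means, global covariance, equal priors.
    mu = X[rng.choice(n, size=c, replace=False)].copy()
    cov = np.stack([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(c)])
    prior = np.full(c, 1.0 / c)
    for _ in range(n_iter):
        # Expectation step: posterior weights p(i | x_j) via Bayes' rule.
        dens = np.empty((n, c))
        for i in range(c):
            diff = X - mu[i]
            maha = np.einsum("nd,de,ne->n", diff, np.linalg.inv(cov[i]), diff)
            norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov[i]))
            dens[:, i] = prior[i] * np.exp(-0.5 * maha) / norm
        w = dens / dens.sum(axis=1, keepdims=True)
        # Maximization step: weighted maximum likelihood estimates.
        old_mu = mu.copy()
        nk = w.sum(axis=0)                      # effective cluster sizes
        prior = nk / n
        mu = (w.T @ X) / nk[:, None]
        for i in range(c):
            diff = X - mu[i]
            cov[i] = (w[:, i, None] * diff).T @ diff / nk[i] + 1e-6 * np.eye(d)
        if np.linalg.norm(mu - old_mu) < tol:   # convergence: change of the means
            break
    return prior, mu, cov, w
```

The additive term on the covariance diagonal anticipates one of the countermeasures discussed below: it keeps covariance matrices from collapsing onto single data points.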

EM: Technical problems

• If a fully general mixture of Gaussian distributions is used, the likelihood function is truly optimized if
  ◦ all normal distributions except one are contracted to single data points and
  ◦ the remaining normal distribution is the maximum likelihood estimate for the remaining data points.
• This undesired result is rare, because the algorithm gets stuck in a local optimum.

EM: Technical problems

• Nevertheless it is recommended to take countermeasures, which consist mainly in reducing the degrees of freedom, like:
  ◦ Fix the determinants of the covariance matrices to equal values.
  ◦ Use a diagonal matrix instead of a general covariance matrix.
  ◦ Use an isotropic variance instead of a covariance matrix.
  ◦ Fix the prior probabilities of the clusters to equal values.

Other clustering approaches

• Density-based clustering: DBSCAN, DENCLUE.
• Grid-based clustering (division of the space into a finite number of cells): STING, WaveCluster.
• Clustering with nominal attributes: ROCK.

Determining the number of clusters

Global validity measures:
• Define a measure that evaluates clustering results.
• Cluster the data with different numbers of clusters.
• Choose the result (number of clusters) with the best validity measure.

Separation index

To be maximized:

  $D = \frac{\min_{i \neq j} \min \{\, d(x, y) : x \in \text{cluster}_i,\; y \in \text{cluster}_j \,\}}{\max_k \operatorname{diam}(\text{cluster}_k)}$

where

  $\operatorname{diam}(\text{cluster}_k) = \max \{\, d(x, y) : x, y \in \text{cluster}_k \,\}$.
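
A small NumPy sketch of this index, assuming Euclidean distance and a label vector per data point; the function name is my own:

```python
import numpy as np

def separation_index(X, labels):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    clusters = [X[labels == k] for k in np.unique(labels)]
    # Smallest distance between points of different clusters.
    min_between = min(
        np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2).min()
        for m, a in enumerate(clusters) for b in clusters[m + 1:]
    )
    # Largest within-cluster diameter.
    max_diam = max(
        np.linalg.norm(cl[:, None, :] - cl[None, :, :], axis=2).max()
        for cl in clusters
    )
    return min_between / max_diam
```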

Partition coefficient

(for fuzzy clustering with membership degrees $u_{ij} \in [0, 1]$, to be maximized)

  $PC = \frac{1}{n} \sum_{j=1}^{n} \sum_{i=1}^{c} u_{ij}^2$

• The largest value 1 is assumed when the partition is not fuzzy at all.
• The smallest value $1/c$ is assumed when all data are assigned with the same membership degree to all clusters.

Partition entropy

(for fuzzy clustering, to be minimized)

  $PE = -\frac{1}{n} \sum_{j=1}^{n} \sum_{i=1}^{c} u_{ij} \ln(u_{ij})$
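
Both fuzzy validity measures are one-liners over the membership matrix. A sketch, assuming a $c \times n$ NumPy array `U` of membership degrees; the epsilon guard against $\ln 0$ is my addition:

```python
import numpy as np

def partition_coefficient(U):
    """(1/n) * sum of squared memberships; 1 for a crisp partition, 1/c for uniform."""
    return float((U ** 2).sum() / U.shape[1])

def partition_entropy(U, eps=1e-12):
    """-(1/n) * sum of u * ln(u); to be minimized. eps avoids ln(0) for crisp entries."""
    return float(-(U * np.log(U + eps)).sum() / U.shape[1])
```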

Crossvalidation

• Partition the data into $k$ subsets of approximately the same size.
• Apply clustering $k$ times, each time leaving out one of the subsets.
• In the ideal case, each pair of data objects should always be either assigned to the same cluster or always to different clusters (coherence).
• This provides a measure of how stable (how acceptable) a clustering result is.
• Apply this scheme with different numbers of clusters and choose the one where the coherence is best.

Association rule mining

• Association rule induction: Originally designed for market basket analysis.
• Aims at finding patterns in the shopping behaviour of customers of supermarkets, mail-order companies, on-line shops etc.
• More specifically: Find sets of products that are frequently bought together.
• Example of an association rule: If a customer buys bread and wine, then she/he will probably also buy cheese.

Association rule mining

• Possible applications of found association rules:
  ◦ Improve the arrangement of products on shelves or on a catalog's pages.
  ◦ Support of cross-selling (suggestion of other products), product bundling.
  ◦ Fraud detection, technical dependence analysis.
  ◦ Finding business rules and detection of data quality problems.

Association rules

• Assessing the quality of association rules (for a rule $A \to B$; see the code sketch below):
  ◦ Support of an item set: Fraction of transactions (shopping baskets/carts) that contain the item set.
  ◦ Support of an association rule $A \to B$: Either the support of $A \cup B$ (more common: rule is correct) or the support of $A$ (more plausible: rule is applicable).
  ◦ Confidence of an association rule $A \to B$: Support of $A \cup B$ divided by the support of $A$ (an estimate of $P(B \mid A)$).
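
These definitions translate directly into code. A small sketch in which the basket data is made up for illustration, and the support of a rule is taken as supp(A ∪ B):

```python
def support(itemset, transactions):
    """Fraction of transactions that contain all items of the item set."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """supp(A | B) / supp(A), an estimate of P(B | A)."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

baskets = [{"bread", "wine", "cheese"}, {"bread", "cheese"},
           {"bread", "wine"}, {"milk"}]
print(confidence({"bread", "wine"}, {"cheese"}, baskets))  # 0.5
```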

Association rules

• Two-step implementation of the search for association rules:
  ◦ Find the frequent item sets (also called large item sets), i.e., the item sets that have at least a user-defined minimum support.
  ◦ Form rules using the frequent item sets found and select those that have at least a user-defined minimum confidence.

Finding frequent item sets

(Figure: subset lattice and a prefix tree for five items.)

• It is not possible to determine the support of all possible item sets, because their number grows exponentially with the number of items.
• Efficient methods to search the subset lattice are needed.

Item set trees

(Figure: a full item set tree for the five items.)

• Based on a global order of the items.
• The item sets counted in a node consist of all items labeling the edges to the node (common prefix) and one item following the last edge label.

Item set tree pruning

In applications item set trees tend to get very large, so pruning is needed.

• Structural pruning:
  ◦ Make sure that there is only one counter for each possible item set.
  ◦ Explains the unbalanced structure of the full item set tree.
• Size-based pruning:
  ◦ Prune the tree if a certain depth (a certain size of the item sets) is reached.
  ◦ Idea: Rules with too many items are difficult to interpret.

Item set tree pruning

• Support-based pruning:
  ◦ No superset of an infrequent item set can be frequent.
  ◦ No counters for item sets having an infrequent subset are needed.

Searching the subset lattice

(Figure: boundary between frequent (blue) and infrequent (white) item sets.)

• Apriori: Breadth-first search (item sets of the same size).
• Eclat: Depth-first search (item sets with the same prefix).

Apriori: Breadth-first search

(Figure: example transaction database with 5 items and 10 transactions.)

• Minimum support: 30%, i.e., at least 3 transactions must contain the item set.
• All one-item sets are frequent, so the full second level of the item set tree is needed.

Apriori: Breadth-first search

• Determining the support of item sets:
  ◦ For each item set, traverse the database and count the transactions that contain it (highly inefficient).
  ◦ Better: Traverse the tree for each transaction and find the item sets it contains (efficient: can be implemented as a simple doubly recursive procedure).

Apriori: Breadth-first search

• Minimum support: 30%, i.e., at least 3 transactions must contain the item set.
• Some of the two-item sets turn out to be infrequent (see the example figure).
• The subtrees starting at these item sets can be pruned.

Apriori: Breadth-first search

• Generate candidate item sets with 3 items (parents must be frequent).

Apriori: Breadth-first search

• Before counting, check whether the candidates contain an infrequent item set:
  ◦ An item set with $k$ items has $k$ subsets of size $k - 1$.
  ◦ The parent is only one of these subsets.

Apriori: Breadth-first search

• Two of the candidate item sets can be pruned, because each of them contains an infrequent two-item subset (see the example figure).

Apriori: Breadth-first search

• Only the remaining four item sets of size 3 are evaluated.

Apriori: Breadth-first search

• Minimum support: 30%, i.e., at least 3 transactions must contain the item set.
• One of the three-item sets turns out to be infrequent (see the example figure).

Apriori: Breadth-first search

• Generate candidate item sets with 4 items (parents must be frequent).
• Before counting, check whether the candidates contain an infrequent item set.

Apriori: Breadth-first search

• The remaining candidate can be pruned, because it contains the infrequent three-item set found in the previous step.
• Consequence: No candidate item sets with four items.
• A fourth access to the transaction database is not necessary.
• The complete level-wise scheme is condensed in the code sketch below.
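
The following Python sketch condenses the level-wise scheme (candidate generation, subset check, support-based pruning). It is illustrative only: it counts supports by scanning the transactions rather than with an item set tree, and all names are my own:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search for frequent item sets; returns {frozenset: support}."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    def supp(itemset):
        # Naive counting by a full database scan (an item set tree would be faster).
        return sum(itemset <= t for t in transactions) / n

    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]          # all one-item candidates
    frequent = {}
    while level:
        # Count the candidates and keep only the frequent ones.
        current = {c: s for c in level if (s := supp(c)) >= min_support}
        frequent.update(current)
        # Candidate generation: join frequent k-sets that differ in one item,
        # then prune candidates with an infrequent subset (Apriori property).
        keys = sorted(current, key=sorted)
        level = set()
        for a, b in combinations(keys, 2):
            cand = a | b
            if len(cand) == len(a) + 1:
                if all(frozenset(sub) in current
                       for sub in combinations(cand, len(a))):
                    level.add(cand)
        level = list(level)
    return frequent
```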

Eclat: Depth-first search

(Figure: example transaction database with 10 transactions and the bit vectors of the single items; grey: item is contained in the transaction, white: item is not contained in the transaction.)

• Form a transaction list for each item, here in a bit vector representation.
• The transaction database is needed only once (for the single-item transaction lists).

Eclat: Depth-first search

• Intersect the transaction list for the first item with the transaction lists of all other items.
• Count the number of set bits (the containing transactions).
• One of the resulting item sets is infrequent and can be pruned (see the example figure).

Eclat: Depth-first search

• Intersect the transaction list of the current prefix with the transaction lists of the items that follow it in the item order.
• Result: Transaction lists for two further item sets (see the example figure).

Eclat: Depth-first search

• Intersect the two transaction lists obtained in the previous step.
• Result: The transaction list for a three-item set.
• With Apriori this item set could be pruned before counting, because one of its subsets was known to be infrequent.

Eclat: Depth-first search

• Backtrack to the second level of the search tree and intersect the corresponding transaction lists.
• Result: The transaction list for a further item set.

Eclat: Depth-first search

• Backtrack to the first level of the search tree and intersect the transaction list of the next item with the transaction lists of the items following it.
• Result: Transaction lists for three further item sets.
• Only one of these item sets has sufficient support, so all other subtrees are pruned.

Eclat: Depth-first search

• Backtrack to the first level of the search tree and intersect the transaction list of the next item with the transaction lists of the remaining items.
• Result: Transaction lists for two further item sets.

Eclat: Depth-first search

• Intersect the two transaction lists obtained in the previous step.
• Result: The transaction list for a three-item set, which turns out to be infrequent.

Eclat: Depth-first search

• Backtrack to the first level of the search tree and intersect the transaction list of the last-but-one item with the transaction list of the last item.
• Result: The transaction list for the final two-item set.
• With this step the search is finished. The whole traversal is summarized in the code sketch below.
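
A compact sketch of this depth-first scheme, representing each transaction list as a Python integer used as a bit vector; the names and the exact representation are choices of this sketch:

```python
def eclat(transactions, min_support):
    """Depth-first frequent item set search with bit-vector transaction lists."""
    n = len(transactions)
    min_count = min_support * n
    # One bit per transaction: bit j is set if transaction j contains the item.
    lists = {}
    for j, t in enumerate(transactions):
        for item in t:
            lists[item] = lists.get(item, 0) | (1 << j)

    frequent = {}

    def recurse(prefix, prefix_bits, items):
        for idx, (item, bits) in enumerate(items):
            inter = prefix_bits & bits          # intersect the transaction lists
            count = bin(inter).count("1")       # number of set bits
            if count >= min_count:              # infrequent extensions are pruned
                itemset = prefix + (item,)
                frequent[itemset] = count / n
                # Extend only with items that follow in the global item order.
                recurse(itemset, inter, items[idx + 1:])

    all_bits = (1 << n) - 1                     # the empty prefix covers everything
    recurse((), all_bits, sorted(lists.items()))
    return frequent
```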

Frequent item sets

(Table: the frequent one-, two-, and three-item sets of the example database with their supports, ranging from 70% down to the minimum support of 30%; the item labels are not reproduced here. Closed item sets are marked with +, maximal item sets with *.)

Types of frequent item sets

• Free item set: Any frequent item set (support is higher than the minimal support).
• Closed item set (marked with +): A frequent item set is called closed if no superset has the same support.
• Maximal item set (marked with *): A frequent item set is called maximal if no superset is frequent.
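
Given a complete table of frequent item sets with their supports (e.g. the output of the Apriori sketch above, keyed by frozensets), the closed and maximal sets can be read off directly. A small sketch:

```python
def closed_and_maximal(frequent):
    """Split {itemset: support} into closed and maximal frequent item sets."""
    closed, maximal = [], []
    for s, supp in frequent.items():
        supersets = [t for t in frequent if s < t]   # proper frequent supersets
        if all(frequent[t] != supp for t in supersets):
            closed.append(s)                         # no superset with equal support
        if not supersets:
            maximal.append(s)                        # no frequent superset at all
    return closed, maximal
```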

Generating association rules

For each frequent item set $S$ (see the code sketch below):

• Consider all pairs of sets $R$ and $S \setminus R$ with $\emptyset \neq R \subset S$. Common restriction: only one item in the consequent (then-part).
• Form the association rule $R \to S \setminus R$ and compute its confidence:

  $\operatorname{conf}(R \to S \setminus R) = \frac{\operatorname{supp}(S)}{\operatorname{supp}(R)}$

• Report rules with a confidence higher than the minimum confidence.
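
With the common one-item-consequent restriction, rule generation is a simple loop over the frequent item sets. A sketch that expects the {frozenset: support} output of the Apriori sketch above:

```python
def generate_rules(frequent, min_confidence):
    """Yield rules (antecedent, consequent_item, confidence) above the threshold."""
    for s, supp_s in frequent.items():
        if len(s) < 2:
            continue
        for item in s:
            antecedent = s - {item}
            conf = supp_s / frequent[antecedent]  # supp(S) / supp(R)
            if conf >= min_confidence:
                yield antecedent, item, conf
```

Because every subset of a frequent item set is itself frequent, the antecedent's support is guaranteed to be in the table.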

Generating association rules

Further rule filtering can rely on:
• Require a minimum difference between rule confidence and consequent support.
• Compute the information gain or the $\chi^2$ measure for antecedent (if-part) and consequent.

Generating association rules

Example for the frequent item sets $S$ of the example database (rule items as in the example figure), minimum confidence: 80%:

  association rule   support of all items   support of antecedent   confidence
  (rule 1)           30%                    30%                     100%
  (rule 2)           50%                    60%                     83.3%
  (rule 3)           60%                    70%                     85.7%
  (rule 4)           60%                    70%                     85.7%
  (rule 5)           40%                    40%                     100%
  (rule 6)           40%                    50%                     80%

with $\operatorname{conf}(R \to S \setminus R) = \operatorname{supp}(S) / \operatorname{supp}(R)$.

Summary association rules

• Association rule induction is a two-step process:
  ◦ Find the frequent item sets (minimum support).
  ◦ Form the relevant association rules (minimum confidence).
• Finding the frequent item sets:
  ◦ Top-down search in the subset lattice / item set tree.
  ◦ Apriori: Breadth-first search. Eclat: Depth-first search.
  ◦ Search tree pruning: No superset of an infrequent item set can be frequent.
  ◦ Other algorithms: FP-growth, H-Mine, LCM, Mafia, Relim etc.

Summary association rules

• Generating the association rules:
  ◦ Form all possible association rules from the frequent item sets.
  ◦ Filter "interesting" association rules.

Structured item sets

• Sometimes, an additional structure is imposed on the "item sets".
• The additional structure leads to a different tree structure, but the principal algorithm remains the same.
• The "item sets" can be sequences of events:
  ◦ For instance: customer contacts (buying, complaint, questionnaire, ...).
  ◦ Association rules then have the form: If A and B have happened, then probably C happens next.
• The "item sets" can be molecules: Find frequent substructures.

Finding frequent molecule structures

(Figure: example of searching for frequent substructures in molecules.)

Other applications

• Finding business rules and detection of data quality problems:
  ◦ Association rules with confidence close to 100% could be business rules.
  ◦ Exceptions might be caused by data quality problems.
• Construction of partial classifiers:
  ◦ Search for association rules with a given conclusion part.
  ◦ If the antecedent holds, then the customer probably buys the product.

Subgroup discovery

• Classification: Find a global model (classifier) that assigns the correct class to all objects (or at least to as many as possible).
• Subgroup discovery: Find "interesting" subgroups in the data set.
• A subgroup is usually described by a few attribute values.
• A subgroup is "interesting" w.r.t. a (binary) target attribute if the distribution of the values of the target attribute differs significantly from the distribution in the whole population.

Subgroup discovery

Example: For a marketing campaign, find subgroups of customers with a high(er) chance to buy a certain product.

Target attribute: buys insurance = YES.

Possible result of the subgroup discovery process (similar to association rules with predefined consequent parts of the rules):
• buys insurance = YES in the whole population: 5%
• buys insurance = YES in the subgroup Age = YOUNG & marital status = MARRIED: 15%

Measures of interest for subgroups

• Binomial test:

  $q_{BT} = \frac{p - p_0}{\sqrt{p_0 (1 - p_0)}} \cdot \sqrt{n} \cdot \sqrt{\frac{N}{N - n}}$

  ◦ $p_0$: relative frequency of the target value in the whole population
  ◦ $p$: relative frequency of the target value in the subgroup
  ◦ $N$: size of the whole population
  ◦ $n$: size of the subgroup

  (Corresponds to the value of the $\chi^2$ test for independence w.r.t. the subgroup and the target variable.)

Measures of interest for subgroups

• Weighted relative accuracy:

  $q_{WRAcc} = \frac{n}{N} \, (p - p_0)$

• True positives vs. false positives:

  $q_{TP} = p \, n - c \, (1 - p) \, n$, where $c \geq 0$ is a weighting parameter

Measures of interest for subgroups

• Relative gain (a code sketch of all three measures follows below):

  $q_{RG} = \begin{cases} \dfrac{p - p_0}{p_0 (1 - p_0)} & \text{if } n \geq \text{MC} \\[1ex] 0 & \text{otherwise} \end{cases}$

  MC: minimum coverage (support) of the subgroup
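
The quality measures are plain formulas over $(p, p_0, n, N)$. A sketch; the sizes in the final example line are invented:

```python
import math

def binomial_test(p, p0, n, N):
    """q_BT: deviation of the subgroup frequency p from the population frequency p0."""
    return (p - p0) / math.sqrt(p0 * (1 - p0)) * math.sqrt(n) * math.sqrt(N / (N - n))

def wracc(p, p0, n, N):
    """Weighted relative accuracy: trades off subgroup size and unusualness."""
    return n / N * (p - p0)

def relative_gain(p, p0, n, min_coverage):
    """Relative gain with a minimum-coverage cutoff."""
    return (p - p0) / (p0 * (1 - p0)) if n >= min_coverage else 0.0

# Frequencies from the insurance example above; n and N are made-up sizes.
print(wracc(p=0.15, p0=0.05, n=200, N=10000))
```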

Subgroup discovery algorithms

The problem of subgroup discovery is similar to finding association rules. Therefore, algorithms for frequent item set and association rule mining are often adapted to subgroup discovery, like
• Apriori-SD
• FP-growth

Feature selection

Very often, some attributes (features) can be
• irrelevant or
• redundant.

Depending on the method or model, such attributes can lead to bad results in the data mining process:
• Naïve Bayes classifiers will be strongly affected when attributes are highly dependent.
• Decision trees and especially nearest neighbour classifiers are sensitive to irrelevant attributes.

Feature selection

Two basic strategies for feature selection:
• Filtering refers to preselecting attributes before the corresponding model is built.
• Wrapping refers to feature selection methods that select attributes in combination with the construction of the model.

Feature selection

Filtering methods:
• Independence tests
  ◦ between arbitrary attributes (to find dependent attributes),
  ◦ between the target attribute and other attributes (to find irrelevant attributes).
• Construct a (strictly pruned) decision tree and use only those attributes occurring in the decision tree.

Feature selection

Wrapping methods:
• Exhaustive search (try all combinations of attributes).
• Greedy search: add (or delete) attributes step by step; choose in each step the one that leads to the best increase (or least decrease) of the performance. A sketch of the forward variant follows below.
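
A sketch of the greedy forward variant. The `evaluate` callback is an assumption of this sketch: it is expected to build the model on a feature subset and return a performance score (higher is better):

```python
def greedy_forward_selection(features, evaluate):
    """Add one attribute per step until no addition improves the performance."""
    selected, best_score = [], float("-inf")
    remaining = list(features)
    while remaining:
        # Score every candidate extension and keep the best one.
        score, best_f = max((evaluate(selected + [f]), f) for f in remaining)
        if score <= best_score:
            break                      # no attribute improves the model: stop
        selected.append(best_f)
        remaining.remove(best_f)
        best_score = score
    return selected, best_score
```

The backward (deletion) variant works analogously, starting from the full attribute set and removing the attribute whose deletion hurts the performance least.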

Other topics

• Change detection.
• Evolving systems (online adaptive learning for data streams).
• Active learning: Classification (or regression) problems where labeling the data objects is expensive or complicated and most of the data objects are not labeled. The active learning model tries to select those data objects for labeling which best increase the performance of the model.
