
Data Mining & Knowledge Discovery

Prof. Dr. Nizamettin AYDIN
naydin@yildiz.edu.tr
http://www3.yildiz.edu.tr/~naydin

Course Details
• Course Code: MUH472
• Course Name: Data Mining & Knowledge Discovery (Bilgi Çıkarımı ve Veri Madenciliği)
• Nature of the course: Lecture
• Instructors: Nizamettin AYDIN
• Email: naydin@yildiz.edu.tr

Some Recommended Books
• Introduction to Data Mining, Tan, Steinbach & Kumar
• Data Mining: The Textbook, Charu C. Aggarwal
• Data Mining: Concepts, Models, Methods, and Algorithms, Mehmed Kantardzic
• Principles of Data Mining, Max Bramer
• Data Mining Techniques, Michael Berry and Gordon Linoff
• Introduction to Algorithms for Data Mining and Machine Learning, Xin-She Yang

Introduction
• Motivation: Why data mining?
• What is data mining?
• Data Mining: On what kind of data?
• Data mining functionality
• Are all the patterns interesting?
• Classification of data mining systems
• Data Mining Task Primitives
• Integration of a data mining system with a DB and DW system
• Major issues in data mining

Large-scale Data is Everywhere!
• There has been enormous data growth in both commercial and scientific databases due to advances in data generation and collection technologies.
• New mantra
  – Gather whatever data you can, whenever and wherever possible.
• Expectations
  – Gathered data will have value either for the purpose collected or for a purpose not envisioned.
• [Figure: example sources — Cyber Security, Traffic Patterns, Social Networking (Twitter), Sensor Networks, Computational Simulations]

Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes
  – Data collection and data availability
    • Automated data collection tools, database systems, Web, e-commerce, computerized society
  – Major sources of abundant data
    • Business: Web, e-commerce, transactions, stocks, …
    • Science: remote sensing, bioinformatics, scientific simulation, …
    • Society and everyone: news, digital cameras, …
• We are drowning in data but starving for knowledge!
• “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets.

Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems.
• Traditional techniques may be unsuitable because the data is
  – Large-scale
  – High dimensional
  – Heterogeneous
  – Complex
  – Distributed
• A key component of the emerging field of data science and data-driven discovery.

Evolution of Database Technology
• 1960s:
  – Data collection, database creation, IMS and network DBMS
• 1970s:
  – Relational data model, relational DBMS implementation
• 1980s:
  – RDBMS, advanced data models (extended-relational, deductive, etc.)
  – Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
  – Data mining, data warehousing, multimedia databases, and Web databases
• 2000s:
  – Stream data management and mining
  – Data mining and its applications
  – Web technology (XML, data integration) and global information systems

Why Data Mining? Commercial Viewpoint
• Lots of data is being collected and warehoused
  – Web data
    • Google has petabytes of web data
    • Facebook has billions of active users
  – Purchases at department/grocery stores, e-commerce
    • Amazon handles millions of visits/day
  – Bank/credit card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
  – Provide better, customized services for an edge (e.g., in Customer Relationship Management)

Why Data Mining? Scientific Viewpoint
• Data collected and stored at enormous speeds
  – Remote sensors on a satellite
    • NASA EOSDIS archives over petabytes of earth science data per year
  – Telescopes scanning the skies
    • Sky survey data
  – High-throughput biological data
  – Scientific simulations
    • terabytes of data generated in a few hours
• Data mining helps scientists
  – in automated analysis of massive datasets
  – in hypothesis formation
• [Figure: fMRI data from the brain, sky survey data, surface temperature of the Earth, gene expression data]

Great opportunities to improve productivity in all walks of life

Great Opportunities to Solve Society’s Major Problems
• Improving health care and reducing costs
• Predicting the impact of climate change
• Finding alternative/green energy sources
• Reducing hunger and poverty by increasing agricultural production

What Is Data Mining?
• Data mining (knowledge discovery from data)
  – Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
• Alternative names
  – Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”?
  – Simple search and query processing
  – (Deductive) expert systems

What is Data Mining?
• Many definitions
  – Non-trivial extraction of implicit, previously unknown and potentially useful information from data
  – Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
• [Figure: the process of knowledge discovery in databases (KDD)]

Potential Applications
• Data analysis and decision support
  – Market analysis and management
    • Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  – Risk analysis and management
    • Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  – Fraud detection and detection of unusual patterns (outliers)
• Other applications
  – Text mining (news groups, email, documents) and Web mining
  – Stream data mining
  – Bioinformatics and bio-data analysis

Data Mining Tasks…
• Prediction Methods
  – Use some variables to predict unknown or future values of other variables.
    • The attribute to be predicted is known as the target or dependent variable.
    • The attributes used for making the prediction are known as the explanatory or independent variables.
• Description Methods
  – Find human-interpretable patterns (correlations, trends, clusters, trajectories, and anomalies) that describe the data.
    • Exploratory in nature and frequently require postprocessing techniques to validate and explain the results.

…Data Mining Tasks
• Four of the core data mining tasks
• [Figure: a sample data table (Tid, Refund, Marital Status, Taxable Income, Cheat) surrounded by the four core tasks, such as predictive modeling and clustering]

Predictive Modeling
• Refers to the task of building a model for the target variable as a function of the explanatory variables.
• Two types of predictive modeling tasks:
  – Classification
    • used for discrete target variables
    • For example, predicting whether a web user will make a purchase at an online bookstore
  – Regression
    • used for continuous target variables
    • For example, forecasting the future price of a stock
• The goal of both tasks is to learn a model that minimizes the error between the predicted and true values of the target variable.
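To make the distinction between the two tasks concrete, here is a minimal sketch (not from the lecture) that fits a classifier for a discrete target and a regressor for a continuous target. It assumes scikit-learn is installed, and the attribute names and values are invented purely for illustration.

```python
# Classification (discrete target) vs. regression (continuous target).
# The data below is made up for demonstration only.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Explanatory variables: [pages_viewed, minutes_on_site]
X = [[3, 2], [25, 30], [1, 1], [40, 55], [5, 8], [33, 41]]

# Classification: discrete target (did the web user purchase? 0 = no, 1 = yes)
y_class = [0, 1, 0, 1, 0, 1]
clf = DecisionTreeClassifier().fit(X, y_class)
print(clf.predict([[28, 35]]))   # predicted class label

# Regression: continuous target (e.g. amount spent)
y_reg = [0.0, 42.5, 0.0, 60.0, 5.0, 51.0]
reg = LinearRegression().fit(X, y_reg)
print(reg.predict([[28, 35]]))   # predicted numeric value
```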

Predictive Modeling: Classification
• Find a model for the class attribute as a function of the values of the other attributes.
• [Figure: a model for predicting credit worthiness learned from records with attributes Tid, Employed, Level of Education, # years at present address, and class Credit Worthy; the decision tree first tests Employed, then Education {High school, Undergrad, Graduate}, then number of years at present address (> 3 yr / < 3 yr, > 7 yrs / < 7 yrs) to predict Yes/No]

Classification Example
• [Figure: a labeled training set (Tid, Employed, Level of Education, # years at present address, Credit Worthy) is used to learn a model with a classifier; the model is then applied to a test set whose Credit Worthy values are unknown (“?”)]

Examples of Classification Tasks
• Classifying credit card transactions as legitimate or fraudulent
• Classifying land covers (water bodies, urban areas, forests, etc.) using satellite data
• Categorizing news stories as finance, weather, entertainment, sports, etc.
• Identifying intruders in cyberspace
• Predicting tumor cells as benign or malignant
• Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil

Classification: Application 1
• Fraud Detection
  – Goal:
    • Predict fraudulent cases in credit card transactions.
  – Approach:
    • Use credit card transactions and the information on the account holder as attributes.
      – When does a customer buy, what does (s)he buy, how often does (s)he pay on time, etc.
    • Label past transactions as fraudulent or fair transactions.
      – This forms the class attribute.
    • Learn a model for the class of the transactions.
    • Use this model to detect fraud by observing credit card transactions on an account.

Classification: Application 2
• Churn prediction for telephone customers
  – Goal:
    • To predict whether a customer is likely to be lost to a competitor.
  – Approach:
    • Use detailed records of transactions with each of the past and present customers to find attributes.
      – How often the customer calls, where he calls, what time of day he calls most, his financial status, marital status, etc.
    • Label the customers as loyal or disloyal.
    • Find a model for loyalty.
  From [Berry & Linoff] Data Mining Techniques, 1997

Classification: Application 3
• Sky Survey Cataloging
  – Goal:
    • To predict the class (star or galaxy) of sky objects, especially visually faint ones, based on telescopic survey images (from the Palomar Observatory).
      – 3000 images with 23,040 x 23,040 pixels per image.
  – Approach:
    • Segment the image.
    • Measure image attributes (features), 40 of them per object.
    • Model the class based on these features.
    • Success story: could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!
  From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996

Classifying Galaxies
(Courtesy: http://aps.umn.edu)
• Class:
  – Stages of formation: early, intermediate, late
• Attributes:
  – Image features
  – Characteristics of light waves received, etc.
• Data size:
  – 72 million stars, 20 million galaxies
  – Object catalog: 9 GB
  – Image database: 150 GB

Regression
• Predict a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
• Extensively studied in statistics and in the neural network field.
• Examples:
  – Predicting sales amounts of a new product based on advertising expenditure.
  – Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  – Time series prediction of stock market indices.
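As a rough sketch of the first regression example above (not part of the slides), the snippet below fits a least-squares line relating advertising expenditure to sales; all numbers are invented for illustration.

```python
import numpy as np

# Hypothetical data: advertising expenditure (k$) vs. sales amount (k$)
ad_spend = np.array([10, 20, 30, 40, 50], dtype=float)
sales    = np.array([25, 41, 62, 79, 103], dtype=float)

# Fit a linear model  sales ~ a * ad_spend + b  by least squares
a, b = np.polyfit(ad_spend, sales, deg=1)
print(f"sales ~ {a:.2f} * ad_spend + {b:.2f}")

# Predict sales for a new advertising budget
new_budget = 60.0
print("predicted sales:", a * new_budget + b)
```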

Clustering
• Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
  – Intra-cluster distances are minimized
  – Inter-cluster distances are maximized

Applications of Cluster Analysis
• Understanding
  – Custom profiling for targeted marketing
  – Group related documents for browsing
  – Group genes and proteins that have similar functionality
  – Group stocks with similar price fluctuations
• Summarization
  – Reduce the size of large data sets
• [Figure: “Clusters for Raw SST and Raw NPP”: a latitude/longitude map produced with K-means, partitioning Sea Surface Temperature (SST) and Net Primary Production (NPP) into clusters (Land Cluster 1/2, Sea Cluster 1/2, Ice or No NPP) that reflect the Northern and Southern Hemispheres]
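A minimal K-means sketch (assuming scikit-learn is available); the two-dimensional points are invented and simply stand in for object attributes such as the SST/NPP values mentioned above.

```python
# Minimal K-means clustering sketch; the points are made up for illustration.
from sklearn.cluster import KMeans

X = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # one group of similar objects
     [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]]   # another group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment of each object
print(km.cluster_centers_)  # cluster centroids
```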

Clustering: Application 1
• Market Segmentation
  – Goal:
    • Subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
  – Approach:
    • Collect different attributes of customers based on their geographical and lifestyle-related information.
    • Find clusters of similar customers.
    • Measure the clustering quality by observing the buying patterns of customers in the same cluster vs. those from different clusters.

Clustering: Application 2
• Document Clustering
  – Goal:
    • To find groups of documents that are similar to each other based on the important terms appearing in them.
  – Approach:
    • Identify frequently occurring terms in each document.
    • Form a similarity measure based on the frequencies of different terms; use it to cluster.
  – Example: the Enron email dataset

Association Rule Discovery: Definition
• Given a set of records, each of which contains some number of items from a given collection,
  – produce dependency rules that will predict the occurrence of an item based on occurrences of other items.

  TID  Items
  1    Bread, Coke, Milk
  2    Beer, Bread
  3    Beer, Coke, Diaper, Milk
  4    Beer, Bread, Diaper, Milk
  5    Coke, Diaper, Milk

  Rules discovered:
  {Milk} --> {Coke}
  {Diaper, Milk} --> {Beer}

Association Analysis: Applications
• Market-basket analysis
  – Rules are used for sales promotion, shelf management, and inventory management
• Telecommunication alarm diagnosis
  – Rules are used to find combinations of alarms that occur together frequently in the same time period
• Medical informatics
  – Rules are used to find combinations of patient symptoms and test results associated with certain diseases
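To illustrate the idea, here is a small sketch (not from the lecture) that counts the support and confidence of the rule {Diaper, Milk} --> {Beer} in the transaction table above; a full algorithm such as Apriori would generate candidate itemsets systematically, which is omitted here.

```python
# Count how often itemsets and a rule occur in the transactions above.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({"Diaper", "Milk"}))           # 0.6
print(support({"Diaper", "Milk", "Beer"}))   # 0.4
# Confidence of the rule = support(antecedent + consequent) / support(antecedent)
print(support({"Diaper", "Milk", "Beer"}) / support({"Diaper", "Milk"}))  # ~0.67
```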

Association Analysis: Applications
• An example subspace differential coexpression pattern from a lung cancer dataset
  – Three lung cancer datasets [Bhattacharjee et al. 2001], [Stearman et al. 2005], [Su et al. 2007]
  – Enriched with the TNF/NFκB signaling pathway, which is well known to be related to lung cancer
  – P-value: 1.4 x 10^-5 (6/10 overlap with the pathway)
  [Fang et al., PSB 2010]

Deviation/Anomaly/Change Detection
• Detect significant deviations from normal behavior
• Applications:
  – Credit card fraud detection
  – Network intrusion detection
  – Identify anomalous behavior from sensor networks for monitoring and surveillance
  – Detecting changes in the global forest cover

Motivating Challenges
• Scalability
  – Because of advances in data generation and collection, data sets with sizes of terabytes, petabytes, or even exabytes are becoming common.
    • If data mining algorithms are to handle these massive data sets, they must be scalable.
• High Dimensionality
  – It is now common to encounter data sets with hundreds or thousands of attributes instead of the handful common a few decades ago.
  – Data sets with temporal or spatial components also tend to have high dimensionality.

Motivating Challenges
• Heterogeneous and Complex Data
  – Traditional data analysis methods often deal with data sets containing attributes of the same type, either continuous or categorical.
  – As the role of data mining in business, science, medicine, and other fields has grown, so has the need for techniques that can handle heterogeneous attributes.
  – Recent years have also seen the emergence of more complex data objects.

Motivating Challenges
• Data Ownership and Distribution
  – Sometimes, the data needed for an analysis is not stored in one location or owned by one organization.
    • Instead, the data is geographically distributed among resources belonging to multiple entities.
  – This requires the development of distributed data mining techniques.
    • The key challenges faced by distributed data mining algorithms include the following:
      – how to reduce the amount of communication needed to perform the distributed computation,
      – how to effectively consolidate the data mining results obtained from multiple sources,
      – how to address data security and privacy issues.

Motivating Challenges
• Non-traditional Analysis
  – The traditional statistical approach is based on a hypothesize-and-test paradigm.
    • An experiment is designed to gather the data, and then the data is analyzed with respect to the hypothesis.
  – Current data analysis tasks often require the generation and evaluation of thousands of hypotheses, and consequently, the development of some data mining techniques has been motivated by the desire to automate the process of hypothesis generation and evaluation.

Data Mining & Knowledge Discovery

Prof. Dr. Nizamettin AYDIN
naydin@yildiz.edu.tr
http://www3.yildiz.edu.tr/~naydin

Information Systems: Fundamentals

Informatics
• The term informatics broadly describes the study and practice of
  – creating,
  – storing,
  – finding,
  – manipulating, and
  – sharing
  information.

Informatics - Etymology
• In 1956 the German computer scientist Karl Steinbuch coined the word Informatik
  – [Informatik: Automatische Informationsverarbeitung ("Informatics: Automatic Information Processing")]
• The French term informatique was coined in 1962 by Philippe Dreyfus
  – [Dreyfus, Philippe. L’informatique. Gestion, Paris, June 1962, pp. 240–41]
• The term was coined as a combination of information and automatic to describe the science of automating information interactions.

Informatics - Etymology
• The morphology (informat-ion + -ics) uses
  – the accepted form for names of sciences,
    • as conics, linguistics, optics,
  – or matters of practice,
    • as economics, politics, tactics
• Linguistically, the meaning extends easily to encompass both
  – the science of information
  – the practice of information processing.

Data - Information - Knowledge
• Data
  – unprocessed facts and figures without any added interpretation or analysis.
    • {The price of crude oil is $80 per barrel.}
• Information
  – data that has been interpreted so that it has meaning for the user.
    • {The price of crude oil has risen from $70 to $80 per barrel}
      – [gives meaning to the data and so is said to be information to someone who tracks oil prices.]

Data - Information - Knowledge
• Knowledge
  – a combination of information, experience and insight that may benefit the individual or the organisation.
    • {When crude oil prices go up by $10 per barrel, it's likely that petrol prices will rise by 2p per litre.}
      – [This is knowledge]
    • [insight: the capacity to gain an accurate and deep understanding of someone or something]

Converting data into information
• Data becomes information when it is applied to some purpose and adds value for the recipient.
  – For example, a set of raw sales figures is data.
    • For the Sales Manager tasked with solving a problem of poor sales in one region, or deciding the future focus of a sales drive, the raw data needs to be processed into a sales report.
  – It is the sales report that provides information.

Converting data into information
• Collecting data is expensive
  – You need to be very clear about why you need it and how you plan to use it.
  – One of the main reasons that organisations collect data is to monitor and improve performance.
    • If you are to have the information you need for control and performance improvement, you need to:
      – collect data on the indicators that really do affect performance
      – collect data reliably and regularly
      – be able to convert data into the information you need.

Converting data into information
• To be useful, data must satisfy a number of conditions. It must be:
  – relevant to the specific purpose
  – complete
  – accurate
  – timely
    • data that arrives after you have made your decision is of no value

Converting data into information
  – in the right format
    • information can only be analysed using a spreadsheet if all the data can be entered into the computer system
  – available at a suitable price
    • the benefits of the data must merit the cost of collecting or buying it.
• The same criteria apply to information.
  – It is important
    • to get the right information
    • to get the information right

Converting information to knowledge
• Ultimately the tremendous amount of information that is generated is only useful if it can be applied to create knowledge within the organisation.
• There is considerable blurring and confusion between the terms information and knowledge.

Converting information to knowledge
• Think of knowledge as being of two types:
  – Formal, explicit or generally available knowledge.
    • This is knowledge that has been captured and used to develop policies and operating procedures, for example.
  – Instinctive, subconscious, tacit or hidden knowledge.
    • Within the organisation there are certain people who hold specific knowledge or have the 'know how'
      – {"I did something very similar to that last year and this happened….."}

Converting information to knowledge
• Clearly, both types of knowledge are essential for the organisation.
• Information on its own will not create a knowledge-based organisation
  – but it is a key building block.
• The right information fuels the development of intellectual capital
  – which in turn drives innovation and performance improvement.

Analysis
• The terms analysis and synthesis come from Greek
  – they mean respectively "to take apart" and "to put together".
  – These terms are used in scientific disciplines from mathematics and logic to economics and psychology to denote similar investigative procedures.
• Analysis is defined as the procedure by which we break down an intellectual or substantial whole into parts.
• Synthesis is defined as the procedure by which we combine separate elements or components in order to form a coherent whole.

Definition(s) of system
• A system can be broadly defined as an integrated set of elements that accomplish a defined objective.
• People from different engineering disciplines have different perspectives of what a "system" is.
• For example,
  – software engineers often refer to an integrated set of computer programs as a "system"
  – electrical engineers might refer to complex integrated circuits or an integrated set of electrical units as a "system"
• As can be seen, "system" depends on one’s perspective, and the “integrated set of elements that accomplish a defined objective” is an appropriate definition.

Definition(s) of system
• A system is an assembly of parts where:
  – The parts or components are connected together in an organized way
  – The parts or components are affected by being in the system (and are changed by leaving it)
  – The assembly does something
  – The assembly has been identified by a person as being of special interest
• Any arrangement which involves the handling, processing or manipulation of resources of whatever type can be represented as a system.
• Some definitions from online dictionaries
  – http://en.wikipedia.org/wiki/System
  – http://dictionary.reference.com/browse/systems
  – http://www.businessdictionary.com/definition/system.html

Definition(s) of system
• A system is defined as multiple parts working together for a common purpose or goal.
• Systems can be large and complex
  – such as the air traffic control system or our global telecommunication network.
• Small devices can also be considered as systems
  – such as a pocket calculator, alarm clock, or 10-speed bicycle.

Definition(s) of system
• Systems have inputs, processes, and outputs.
• When feedback (direct or indirect) is involved, that component is also important to the operation of the system.
• To explain all this, systems are usually explained using a model.
• A model helps to illustrate the major elements and their relationships, as illustrated in the next slide.

A systems model
• [Figure: a systems model showing inputs, processes, outputs, and feedback]

Information Systems
• The ways that organizations
  – store
  – move
  – organize
  – process
  their information

Information Technology
• Components that implement information systems:
  – Hardware
    • physical tools: computer and network hardware, but also low-tech things like pens and paper
  – Software
    • (changeable) instructions for the hardware
  – People
  – Procedures
    • instructions for the people
  – Data/databases

Digital System
• Takes a set of discrete information (inputs) and discrete internal information (system state) and generates a set of discrete information (outputs).
• [Figure: discrete inputs feed a discrete information processing system with a system state, which produces discrete outputs]

A Digital Computer Example
• [Figure: a digital computer consisting of Memory, a CPU (control unit and datapath), and Input/Output; example inputs: keyboard, mouse, modem, microphone; example outputs: CRT, LCD, modem, speakers]
• Synchronous or asynchronous?

Signal
• An information variable represented by a physical quantity.
• For digital systems, the variable takes on discrete values.
• Two-level, or binary, values are the most prevalent values in digital systems.
• Binary values are represented abstractly by:
  – digits 0 and 1
  – words (symbols) False (F) and True (T)
  – words (symbols) Low (L) and High (H)
  – and words On and Off
• Binary values are represented by values or ranges of values of physical quantities.

A typical measurement system
• [Figure: block diagram of a typical measurement system]

Transducers
• A “transducer” is a device that converts energy from one form to another.
• In signal processing applications, the purpose of energy conversion is to transfer information, not to transform energy.
• In physiological measurement systems, transducers may be
  – input transducers (or sensors)
    • they convert a non-electrical energy into an electrical signal.
    • for example, a microphone.
  – output transducers (or actuators)
    • they convert an electrical signal into a non-electrical energy.
    • for example, a speaker.

The analogue signal
• A continuous variable defined with infinite precision is converted to a discrete sequence of measured values which are represented digitally.
• Information is lost in converting from analogue to digital, due to:
  – inaccuracies in the measurement
  – uncertainty in timing
  – limits on the duration of the measurement
• These effects are called quantisation errors.

Signal Encoding: Analog-to-Digital Conversion
• The continuous analogue signal has to be held before it can be sampled.
  – Otherwise, the signal would be changing during the measurement.
• Only after it has been held can the signal be measured, and the measurement converted to a digital value.
• [Figure: digitization of a continuous (analog) signal x(t) = f(t) into a discrete signal x[n] = x[1], x[2], x[3], ... x[n] by analog-to-digital conversion]

Analog-to-Digital Conversion
• ADC consists of four steps to digitize an analog signal:
  1. Filtering
  2. Sampling
  3. Quantization
  4. Binary encoding
• Before we sample, we have to filter the signal to limit the maximum frequency of the signal, as it affects the sampling rate.
• Filtering should ensure that we do not distort the signal, i.e. remove high frequency components that affect the signal shape.

Sampling
• The sampling results in a discrete set of digital numbers that represent measurements of the signal
  – usually taken at equal intervals of time.
• Sampling takes place after the hold
  – The hold circuit must be fast enough that the signal is not changing during the time the circuit is acquiring the signal value.
• We don't know what we don't measure.
• In the process of measuring the signal, some information is lost.

Sampling
• The analog signal is sampled every Ts seconds.
• Ts is referred to as the sampling interval.
• fs = 1/Ts is called the sampling rate or sampling frequency.
• There are 3 sampling methods:
  – Ideal: an impulse at each sampling instant
  – Natural: a pulse of short width with varying amplitude
  – Flattop: sample and hold, like natural but with a single amplitude value
• The process is referred to as pulse amplitude modulation (PAM) and the outcome is a signal with analog (non-integer) values.

Recovery of a sampled sine wave for different sampling rates
• [Figures: a sine wave sampled and reconstructed at different sampling rates]

Sampling Theorem
• Fs ≥ 2·fm
• According to the Nyquist theorem, the sampling rate must be at least 2 times the highest frequency contained in the signal.

Nyquist sampling rate for low-pass and bandpass signals
• [Figure: Nyquist sampling rates for low-pass and bandpass signals]
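As a small numerical check of the theorem (not part of the slides), the snippet below samples a 5 Hz cosine at 8 Hz, which is below the Nyquist rate of 10 Hz; the samples are indistinguishable from those of a 3 Hz cosine, i.e. the undersampled signal aliases.

```python
import numpy as np

fs = 8.0                      # sampling rate (Hz), below the Nyquist rate 2*5 = 10 Hz
n = np.arange(16)             # sample indices
t = n / fs                    # sampling instants

x_5hz = np.cos(2 * np.pi * 5 * t)   # 5 Hz cosine, undersampled
x_3hz = np.cos(2 * np.pi * 3 * t)   # 3 Hz cosine (the alias: 8 - 5 = 3 Hz)

print(np.allclose(x_5hz, x_3hz))    # True: the two signals give identical samples
```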

Quantization
• Sampling results in a series of pulses of varying amplitude values ranging between two limits: a min and a max.
• The amplitude values are infinite between the two limits.
• We need to map the infinite amplitude values onto a finite set of known values.
• This is achieved by dividing the distance between min and max into L zones, each of height Δ:
  Δ = (max - min)/L

Quantization Levels
• The midpoint of each zone is assigned a value from 0 to L-1 (resulting in L values).
• Each sample falling in a zone is then approximated to the value of the midpoint.

Quantization Zones
• Assume we have a voltage signal with amplitudes Vmin = -20 V and Vmax = +20 V.
• We want to use L = 8 quantization levels.
• Zone width Δ = (20 - (-20))/8 = 5
• The 8 zones are: -20 to -15, -15 to -10, -10 to -5, -5 to 0, 0 to +5, +5 to +10, +10 to +15, +15 to +20
• The midpoints are: -17.5, -12.5, -7.5, -2.5, 2.5, 7.5, 12.5, 17.5

Assigning Codes to Zones
• Each zone is then assigned a binary code.
• The number of bits required to encode the zones, or the number of bits per sample as it is commonly referred to, is obtained as follows:
  nb = log2 L
• Given our example, nb = 3
• The 8 zone (or level) codes are therefore: 000, 001, 010, 011, 100, 101, 110, and 111
• Assigning codes to zones:
  – 000 will refer to zone -20 to -15
  – 001 to zone -15 to -10, etc.
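The following sketch (not from the slides) reproduces the example above in code: it computes the zone width, midpoints, and codes for Vmin = -20 V, Vmax = +20 V, L = 8, and quantizes a couple of sample values.

```python
import math

v_min, v_max, L = -20.0, 20.0, 8
delta = (v_max - v_min) / L                 # zone width: 5.0
n_bits = int(math.log2(L))                  # bits per sample: 3

# Midpoint and binary code of every zone
midpoints = [v_min + delta * (i + 0.5) for i in range(L)]
codes = [format(i, f"0{n_bits}b") for i in range(L)]
print(midpoints)   # [-17.5, -12.5, -7.5, -2.5, 2.5, 7.5, 12.5, 17.5]
print(codes)       # ['000', '001', ..., '111']

def quantize(v):
    """Return (code, midpoint, quantization error) for a sample value v."""
    zone = min(int((v - v_min) / delta), L - 1)   # clamp v_max into the top zone
    return codes[zone], midpoints[zone], v - midpoints[zone]

print(quantize(-6.1))   # ('010', -7.5, error of about 1.4)
print(quantize(19.7))   # ('111', 17.5, error of about 2.2)
```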

Quantization and encoding of a sampled signal
• [Figure: a sampled signal with its quantized values and the corresponding binary codes]

Quantization Error
• When a signal is quantized, we introduce an error
  – the coded signal is an approximation of the actual amplitude value.
• The difference between the actual and the coded value (midpoint) is referred to as the quantization error.
• The more zones, the smaller Δ
  – which results in smaller errors.
• BUT, the more zones, the more bits required to encode the samples
  – higher bit rate

Analog-to-digital Conversion
• Example: A 12-bit analog-to-digital converter (ADC) advertises an accuracy of ± the least significant bit (LSB).
  If the input range of the ADC is 0 to 10 volts, what is the accuracy of the ADC in analog volts?
• Solution: If the input range is 10 volts, then the analog voltage represented by the LSB would be:
  V_LSB = Vmax / 2^(number of bits) = 10 / 2^12 = 10 / 4096 = 0.0024 volts
  Hence the accuracy would be ± 0.0024 volts.

Sampling related concepts
• Over/exact/under sampling
• Regular/irregular sampling
• Linear/logarithmic sampling
• Aliasing
• Anti-aliasing filter
• Image
• Anti-image filter

Steps for digitization/reconstruction of a signal
• Basic steps for A/D conversion:
  – Band limiting (LPF)
  – Sampling / Holding
  – Quantization
  – Coding
• Basic steps for reconstructing a sampled digital signal:
  – D/A converter
  – Sampling / Holding
  – Image rejection

Digital data: end product of A/D conversion and related concepts
• Bit: least digital information, binary 1 or 0
• Nibble: 4 bits
• Byte: 8 bits, 2 nibbles
• Word: 16 bits, 2 bytes, 4 nibbles
• Some jargon:
  – integer, signed integer, long integer, 2s complement, hexadecimal, octal, floating point, etc.

Measures of capacity and speed in Computers
• Special powers of 10 and 2:
  – Kilo- (K) = 1 thousand = 10^3 and 2^10
  – Mega- (M) = 1 million = 10^6 and 2^20
  – Giga- (G) = 1 billion = 10^9 and 2^30
  – Tera- (T) = 1 trillion = 10^12 and 2^40
  – Peta- (P) = 1 quadrillion = 10^15 and 2^50
• Whether a metric refers to a power of ten or a power of two typically depends upon what is being measured.

Example
• Hertz = clock cycles per second (frequency)
  – 1 MHz = 1,000,000 Hz
  – Processor speeds are measured in MHz or GHz.
• Byte = a unit of storage
  – 1 KB = 2^10 = 1024 Bytes
  – 1 MB = 2^20 = 1,048,576 Bytes
  – Main memory (RAM) is measured in MB
  – Disk storage is measured in GB for small systems, TB for large systems.

Measures of time and space
• Milli- (m) = 1 thousandth = 10^-3
• Micro- (µ) = 1 millionth = 10^-6
• Nano- (n) = 1 billionth = 10^-9
• Pico- (p) = 1 trillionth = 10^-12
• Femto- (f) = 1 quadrillionth = 10^-15

Data types
• Our first requirement is to find a way to represent information (data) in a form that is mutually comprehensible by human and machine.
  – Ultimately, we need to develop schemes for representing all conceivable types of information: language, images, actions, etc.
  – Specifically, the devices that make up a computer are switches that can be on or off, i.e. at high or low voltage.
  – Thus they naturally provide us with two symbols to work with:
    • we can call them on and off, or 0 and 1.
• Data type:
  – representation and operations within the computer

What kinds of data do we need to represent?
• Numbers: signed, unsigned, integers, floating point, complex, rational, irrational, …
• Text: characters, strings, …
• Images: pixels, colors, shapes, …
• Sound
• Logical: true, false
• Instructions

Number Systems – Representation
• Positive radix, positional number systems
• A number with radix r is represented by a string of digits:
  A(n-1) A(n-2) … A1 A0 . A(-1) A(-2) … A(-m+1) A(-m)
  in which 0 ≤ Ai < r and . is the radix point.
• The string of digits represents the power series:
  (Number)r = Σ(i=0 to n-1) Ai·r^i  +  Σ(j=-m to -1) Aj·r^j
              (integer portion)        (fraction portion)

Decimal Numbers
• “decimal” means that we have ten digits to use in our representation
  – the symbols 0 through 9
• What is 3546?
  – it is three thousands plus five hundreds plus four tens plus six ones.
  – i.e. 3546 = 3×10^3 + 5×10^2 + 4×10^1 + 6×10^0
• How about negative numbers?
  – we use two more symbols to distinguish positive and negative: + and -
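As a small illustration of the power series above (not from the slides), this hypothetical helper evaluates a digit string in an arbitrary radix, including a fractional part:

```python
def radix_value(digits, r, radix_point=None):
    """Evaluate a positional number given as a list of digits in radix r.

    radix_point is the number of integer digits; if None, all digits are
    treated as the integer portion.
    """
    if radix_point is None:
        radix_point = len(digits)
    value = 0.0
    for k, d in enumerate(digits):
        power = radix_point - 1 - k        # exponent of r for this digit
        value += d * (r ** power)
    return value

print(radix_value([3, 5, 4, 6], 10))                 # 3546.0
print(radix_value([1, 1, 0, 1], 2))                  # 13.0
print(radix_value([2, 5, 7, 5], 10, radix_point=2))  # 25.75
```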

Unsigned Binary Integers
• Y = “abc” = a·2^2 + b·2^1 + c·2^0
  (where the digits a, b, c can each take on the values of 0 or 1 only)
• With N bits, the range is: 0 ≤ i ≤ 2^N - 1

  Value  3-bits  5-bits  8-bits
  0      000     00000   00000000
  1      001     00001   00000001
  2      010     00010   00000010
  3      011     00011   00000011
  4      100     00100   00000100

• Problem:
  – How do we represent negative numbers?

Signed Binary Integers (2s complement representation)
• Transformation
  – To transform a into -a, invert all bits in a and add 1 to the result.
• With N bits, the range is: -2^(N-1) ≤ i ≤ 2^(N-1) - 1

  -16  10000
  ...  ...
  -3   11101
  -2   11110
  -1   11111
   0   00000
  +1   00001
  +2   00010
  +3   00011
  ...  ...
  +15  01111

• Advantages:
  – Operations need not check the sign
  – Only one representation for zero
  – Efficient use of all the bits

Limitations of integer representations
• Most numbers are not integers!
  – Even with integers, there are two other considerations:
• Range:
  – The magnitude of the numbers we can represent is determined by how many bits we use:
    • e.g. with 32 bits the largest number we can represent is about +/- 2 billion, far too small for many purposes.
• Precision:
  – The exactness with which we can specify a number:
    • e.g. a 32 bit number gives us 31 bits of precision, or roughly 9 figure precision in decimal representation.
• We need another data type!
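A small sketch (not from the slides) of the transformation described above, i.e. negating a number by inverting the bits and adding 1, for a 5-bit word matching the table:

```python
N = 5  # word length in bits, matching the table above

def to_bits(value, n=N):
    """Two's complement bit string of an integer in an n-bit word."""
    return format(value & ((1 << n) - 1), f"0{n}b")

def negate(value, n=N):
    """Negate by inverting all bits and adding 1 (modulo 2^n)."""
    inverted = value ^ ((1 << n) - 1)
    return (inverted + 1) & ((1 << n) - 1)

a = 3
print(to_bits(a))          # 00011  (+3)
print(to_bits(negate(a)))  # 11101  (-3)
print(to_bits(-16))        # 10000  (most negative value in 5 bits)
```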

Real numbers
• Our decimal system handles non-integer real numbers by adding yet another symbol, the decimal point (.), to make a fixed point notation:
  – e.g. 3456.78 = 3×10^3 + 4×10^2 + 5×10^1 + 6×10^0 + 7×10^-1 + 8×10^-2
• The floating point, or scientific, notation allows us to represent very large and very small numbers (integer or real), with as much or as little precision as needed:
  – Unit of electric charge e = 1.602 176 462 x 10^-19 Coulomb
  – Volume of universe = 1 x 10^85 cm^3
  – the two components of these numbers are called the mantissa and the exponent

Real numbers in binary
• We mimic the decimal floating point notation to create a “hybrid” binary floating point number:
  – We first use a “binary point” to separate whole numbers from fractional numbers to make a fixed point notation:
    • e.g. 00011001.110 = 1×2^4 + 1×2^3 + 1×2^0 + 1×2^-1 + 1×2^-2 => 25.75
      (2^-1 = 0.5 and 2^-2 = 0.25, etc.)
  – We then “float” the binary point:
    • 00011001.110 => 1.1001110 x 2^4
      mantissa = 1.1001110, exponent = 4
  – Now we have to express this without the extra symbols ( x, 2, . )
    • by convention, we divide the available bits into three fields:
      sign, mantissa, exponent

IEEE-754 fp numbers - 1
• 32 bits: s (1 bit) | biased exp. (8 bits) | fraction (23 bits)
  N = (-1)^s x 1.fraction x 2^(biased exp. - 127)
• Sign: 1 bit
• Mantissa: 23 bits
  – We “normalize” the mantissa by dropping the leading 1 and recording only its fractional part (why?)
• Exponent: 8 bits
  – In order to handle both +ve and -ve exponents, we add 127 to the actual exponent to create a “biased exponent”:
    • 2^-127 => biased exponent = 0000 0000 (= 0)
    • 2^0 => biased exponent = 0111 1111 (= 127)
    • 2^+127 => biased exponent = 1111 1110 (= 254)

IEEE-754 fp numbers - 2
• Example: Find the corresponding fp representation of 25.75
  – 25.75 => 00011001.110 => 1.1001110 x 2^4
  – sign bit = 0 (+ve)
  – normalized mantissa (fraction) = 100 1110 0000 0000 0000 0000
  – biased exponent = 4 + 127 = 131 => 1000 0011
  – so 25.75 => 0 1000 0011 100 1110 0000 0000 0000 0000 => x41CE0000
• Values represented by convention:
  – Infinity (+ and -): exponent = 255 (1111 1111) and fraction = 0
  – NaN (not a number): exponent = 255 and fraction ≠ 0
  – Zero (0): exponent = 0 and fraction = 0
    • note: exponent = 0 => fraction is de-normalized, i.e. no hidden 1
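The worked example above can be checked in a few lines (not part of the slides), using Python's struct module to obtain the raw single-precision bit pattern of 25.75 and then splitting it into the sign, biased-exponent, and fraction fields:

```python
import struct

# Raw 32-bit pattern of 25.75 as an IEEE-754 single-precision number
bits = int.from_bytes(struct.pack(">f", 25.75), "big")
print(hex(bits))                    # 0x41ce0000, matching x41CE0000 above

sign     = bits >> 31
biased_e = (bits >> 23) & 0xFF      # 8-bit biased exponent
fraction = bits & 0x7FFFFF          # 23-bit fraction (mantissa without hidden 1)

print(sign, biased_e, bin(fraction))        # 0 131 0b10011100000000000000000
value = (-1) ** sign * (1 + fraction / 2**23) * 2 ** (biased_e - 127)
print(value)                                # 25.75
```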

IEEE-754 fp numbers - 3
• Double precision (64 bit) floating point
  – 64 bits: s (1 bit) | biased exp. (11 bits) | fraction (52 bits)
  N = (-1)^s x 1.fraction x 2^(biased exp. - 1023)
• Range & Precision:
  – 32 bit:
    • mantissa of 23 bits + 1 => approx. 7 digits decimal
    • 2^+/-127 => approx. 10^+/-38
  – 64 bit:
    • mantissa of 52 bits + 1 => approx. 15 digits decimal
    • 2^+/-1023 => approx. 10^+/-306

Binary Numbers and Binary Coding
• Flexibility of representation
  – Within constraints below, can assign any binary combination (called a code word) to any data as long as data is uniquely encoded.
• Information Types
  – Numeric
    • Must represent range of data needed
    • Very desirable to represent data such that simple, straightforward computation for common arithmetic operations is permitted
    • Tight relation to binary numbers
  – Non-numeric
    • Greater flexibility since arithmetic operations are not applied.
    • Not tied to binary numbers

Non-numeric Binary Codes
• Given n binary digits (called bits), a binary code is a mapping from a set of represented elements to a subset of the 2^n binary numbers.
• Example: a binary code for the seven colors of the rainbow

  Color    Binary Number
  Red      000
  Orange   001
  Yellow   010
  Green    011
  Blue     101
  Indigo   110
  Violet   111

  – Code 100 is not used

Number of Bits Required
• Given M elements to be represented by a binary code, the minimum number of bits, n, needed satisfies the following relationships:
  2^n ≥ M > 2^(n-1)
  n = ⌈log2 M⌉, where ⌈x⌉, called the ceiling function, is the smallest integer greater than or equal to x.
• Example: How many bits are required to represent decimal digits with a binary code?
  – 4 bits are required (n = ⌈log2 10⌉ = 4)

Number of Elements Represented
• Given n digits in radix r, there are r^n distinct elements that can be represented.
• But, you can represent m elements, m < r^n
• Examples:
  – You can represent 4 elements in radix r = 2 with n = 2 digits: (00, 01, 10, 11).
  – You can represent 4 elements in radix r = 2 with n = 4 digits: (0001, 0010, 0100, 1000).

Binary Coded Decimal (BCD)
• In the 8421 Binary Coded Decimal (BCD) representation each decimal digit is converted to its 4-bit pure binary equivalent.
• This code is the simplest, most intuitive binary code for decimal digits and uses the same powers of 2 as a binary number,
  – but only encodes the first ten values from 0 to 9.
• For example: (57)dec ➔ (?)bcd
  (5 7)dec = (0101 0111)bcd
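A minimal sketch (not from the slides) of 8421 BCD encoding, reproducing the (57)dec example above:

```python
def to_bcd(number):
    """Encode a non-negative decimal integer digit-by-digit in 8421 BCD."""
    return " ".join(format(int(digit), "04b") for digit in str(number))

print(to_bcd(57))    # 0101 0111
print(to_bcd(2024))  # 0010 0000 0010 0100
```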

Error-Detection Codes
• Redundancy (e.g. extra information), in the form of extra bits, can be incorporated into binary code words to detect and correct errors.
• A simple form of redundancy is parity, an extra bit appended onto the code word to make the number of 1’s odd or even.
  – Parity can detect all single-bit errors and some multiple-bit errors.
• A code word has even parity if the number of 1’s in the code word is even.
• A code word has odd parity if the number of 1’s in the code word is odd.

4-Bit Parity Code Example
• Fill in the even and odd parity bits:

  Even Parity          Odd Parity
  Message - Parity     Message - Parity
  000 -                000 -
  001 -                001 -
  010 -                010 -
  011 -                011 -
  100 -                100 -
  101 -                101 -
  110 -                110 -
  111 -                111 -

• The codeword "1111" has even parity and the codeword "1110" has odd parity. Both can be used to represent 3-bit data.
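As a sketch of the exercise above (not from the slides), the following computes the even- and odd-parity bit for each 3-bit message:

```python
def parity_bits(message):
    """Return (even_parity_bit, odd_parity_bit) for a bit-string message."""
    ones = message.count("1")
    even = ones % 2          # makes the total number of 1's even
    odd = 1 - even           # makes the total number of 1's odd
    return even, odd

for m in ["000", "001", "010", "011", "100", "101", "110", "111"]:
    even, odd = parity_bits(m)
    print(f"{m} -> even parity: {m}{even}, odd parity: {m}{odd}")
# e.g. "111" -> even-parity codeword 1111 and odd-parity codeword 1110,
# matching the example in the slide.
```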

ASCII Character Codes
• American Standard Code for Information Interchange
• This code is a popular code used to represent information sent as character-based data.
• It uses 7 bits to represent
  – 94 graphic printing characters
  – 34 non-printing characters
• Some non-printing characters are used for text format
  – e.g. BS = backspace, CR = carriage return
• Other non-printing characters are used for record marking and flow control
  – e.g. STX = start text area, ETX = end text area

ASCII Properties
• ASCII has some interesting properties:
  – Digits 0 to 9 span hexadecimal values 30₁₆ to 39₁₆
  – Upper case A-Z span 41₁₆ to 5A₁₆
  – Lower case a-z span 61₁₆ to 7A₁₆
    • Lower to upper case translation (and vice versa) occurs by flipping bit 6
  – Delete (DEL) is all bits set,
    • a carryover from when punched paper tape was used to store messages

UNICODE
• UNICODE extends ASCII to 65,536 universal character codes
  – For encoding characters in world languages
  – Available in many modern applications
  – 2 byte (16-bit) code words

Warning: Conversion or Coding?
• Do NOT mix up "conversion of a decimal number to a binary number" with "coding a decimal number with a binary code".
• 13₁₀ = 1101₂
  – This is conversion
• 13 ⇔ 0001 0011 (BCD)
  – This is coding

Another use for bits: Logic
• Beyond numbers
  – logical variables can be true or false, on or off, etc., and so are readily represented by the binary system.
  – A logical variable A can take the values false = 0 or true = 1 only.
  – The manipulation of logical variables is known as Boolean Algebra, and has its own set of operations
    • which are not to be confused with the arithmetical operations.
  – Some basic operations: NOT, AND, OR, XOR

Basic Logic Operations
• Truth tables of basic operations

  NOT            AND              OR
  A  A'          A  B  A.B        A  B  A+B
  0  1           0  0  0          0  0  0
  1  0           0  1  0          0  1  1
                 1  0  0          1  0  1
                 1  1  1          1  1  1

• Equivalent notations
  – not A = A' = Ā
  – A and B = A.B = AB = A ∩ B
  – A or B = A + B = A ∪ B

More Logic Operations

  XOR               XNOR
  A  B  A⊕B         A  B  (A⊕B)'
  0  0  0           0  0  1
  0  1  1           0  1  0
  1  0  1           1  0  0
  1  1  0           1  1  1

• Exclusive OR (XOR): either A or B is 1, not both
  – A⊕B = A.B' + A'.B
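A quick sketch (not from the slides) that reproduces the truth tables above and checks the identity A⊕B = A.B' + A'.B for all input combinations:

```python
from itertools import product

print("A B  AND OR XOR XNOR")
for a, b in product([0, 1], repeat=2):
    xor = a ^ b
    print(a, b, " ", a & b, " ", a | b, " ", xor, " ", 1 - xor)

# Verify the identity A XOR B == A.B' + A'.B
assert all((a ^ b) == ((a & (1 - b)) | ((1 - a) & b))
           for a, b in product([0, 1], repeat=2))
```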

Data Mining & Knowledge Discovery

Prof. Dr. Nizamettin AYDIN
naydin@yildiz.edu.tr
http://www3.yildiz.edu.tr/~naydin

Types of Data
• Outline
  – Data set
  – Attributes and Objects
  – Types of Data
  – Data Quality
  – Similarity and Distance
  – Data Preprocessing

What is Data?
• A data set
  – a collection of data objects.
    • Object is AKA record, point, case, sample, entity, or instance
• Data objects are described by a number of attributes that capture the characteristics of an object
  – Examples
    • eye color of a person, temperature, the mass of a physical object, the time at which an event occurred, etc.

Example
• A sample data set containing student information
  – Each row corresponds to a student
  – Each column is an attribute that describes some aspect of a student,
    • such as GPA or ID.

What is an Attribute?
• An attribute
  – a property or characteristic of an object that can vary, either from one object to another or from one time to another.
  – For example,
    • eye color varies from person to person,
      – a symbolic attribute with a small number of possible values {brown, black, blue, green, hazel, etc.}
    • the temperature of an object varies over time.
      – a numerical attribute with a potentially unlimited number of values
  – Attribute is AKA variable, field, characteristic, dimension, or feature
• [Figure: a sample table (Tid, Refund, Marital Status, Taxable Income, Cheat) with columns labeled as attributes and rows labeled as objects]

Attribute Values
• Attribute values are numbers or symbols assigned to an attribute for a particular object
• Distinction between attributes and attribute values
  – Same attribute can be mapped to different attribute values
    • Example: height can be measured in feet or meters
  – Different attributes can be mapped to the same set of values
    • Example: attribute values for ID and age are integers
  – But properties of an attribute can be different than the properties of the values used to represent the attribute
• To assign numbers or symbols to objects in a well-defined way, we need a measurement scale.

Measurement Scale
• A measurement scale is a rule (function) that associates a numerical or symbolic value with an attribute of an object.
• The process of measurement is the application of a measurement scale to associate a value with a particular attribute of a specific object.
  – For example,
    • we step on a bathroom scale to determine our weight,
    • we classify someone as male or female,
    • we count the number of chairs in a room to see if there will be enough to seat all the people coming to a meeting
  – In all these cases, the physical value of an attribute of an object is mapped to a numerical or symbolic value.

The Type of an Attribute
• It is common to refer to the type of an attribute as the type of a measurement scale.
• The values used to represent an attribute can have properties that are not properties of the attribute itself, and vice versa.
  – For example, two attributes that might be associated with an employee are ID and age (in years).
    • Both of these attributes can be represented as integers.
    • However, while it is reasonable to talk about the average age of an employee, it makes no sense to talk about the average employee ID.

The Type of an Attribute
• [Figure: the measurement of the length of line segments A–E on two different scales of measurement; one scale (values 1–5) preserves only the ordering property of length, the other (values 5, 7, 8, 10, 15) preserves the ordering and additivity properties of length]

Types of Attributes
• There are four types of attributes
  – Categorical-Nominal (Qualitative)
    • The values of a nominal attribute are just different names; i.e., nominal values provide only enough information to distinguish one object from another. (=, ≠)
    • Operations: mode, entropy, contingency correlation, χ2 test
      – Examples: ID numbers, eye color, zip codes
  – Categorical-Ordinal (Qualitative)
    • The values of an ordinal attribute provide enough information to order objects. (<, >)
    • Operations: median, percentiles, rank correlation, run tests, sign tests
      – Examples: rankings (e.g., taste of food on a scale from 1-10), grades, height {tall, medium, short}

Types of Attributes
  – Numeric-Interval (Quantitative)
    • For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −)
    • Operations: mean, standard deviation, Pearson’s correlation, t and F tests
      – Examples: calendar dates, temperatures in Celsius or Fahrenheit.
  – Numeric-Ratio (Quantitative)
    • For ratio variables, both differences and ratios are meaningful. (×, /)
    • Operations: geometric mean, harmonic mean, percent variation
      – Examples: temperature in Kelvin, length, counts, elapsed time (e.g., time to run a race)

Properties of Attribute Values
• The type of an attribute depends on which of the following properties/operations it possesses:
  – Distinctness: =, ≠
  – Order: <, ≤, >, ≥
  – Differences are meaningful: +, -
  – Ratios are meaningful: ×, /
• Nominal attribute: distinctness
• Ordinal attribute: distinctness & order
• Interval attribute: distinctness, order & meaningful differences
• Ratio attribute: all 4 properties/operations

Difference Between Ratio and Interval
• Is it physically meaningful to say that a temperature of 10° is twice that of 5° on
  – the Celsius scale?
  – the Fahrenheit scale?
  – the Kelvin scale?
• Consider measuring the height above average
  – If Ali’s height is 3 cm above average and Veli’s height is 6 cm above average, then would we say that Veli is twice as tall as Ali?
  – Is this situation analogous to that of temperature?

Categorization of Attributes

  Attribute Type               | Description                                  | Examples                                 | Operations
  Categorical    | Nominal     | Nominal attribute values only distinguish.   | zip codes, employee ID numbers, eye      | mode, entropy, contingency
  (Qualitative)  |             | (=, ≠)                                       | color, sex: {male, female}               | correlation, χ2 test
                 | Ordinal     | Ordinal attribute values also order objects. | hardness of minerals, {good, better,     | median, percentiles, rank
                 |             | (<, >)                                       | best}, grades, street numbers            | correlation, run tests, sign tests
  Numeric        | Interval    | For interval attributes, differences between | calendar dates, temperature in Celsius   | mean, standard deviation,
  (Quantitative) |             | values are meaningful. (+, -)                | or Fahrenheit                            | Pearson's correlation, t and F tests
                 | Ratio       | For ratio variables, both differences and    | temperature in Kelvin, monetary          | geometric mean, harmonic mean,
                 |             | ratios are meaningful. (*, /)                | quantities, counts, age, mass, length,   | percent variation
                 |             |                                              | current                                  |

Categorization of Attributes
• The types of attributes can also be described in terms of transformations that do not change the meaning of an attribute
  – For example, the meaning of a length attribute is unchanged if it is measured in meters instead of feet.
• The statistical operations that make sense for a particular type of attribute are those that will yield the same results when the attribute is transformed by using a transformation that preserves the attribute’s meaning

Transformations that define attribute levels

  Attribute Type               | Transformation                               | Comments
  Categorical    | Nominal     | Any permutation of values                    | If all employee ID numbers were reassigned, would it make any difference?
  (Qualitative)  | Ordinal     | An order preserving change of values, i.e.,  | An attribute encompassing the notion of good, better, best can be represented
                 |             | new_value = f(old_value) where f is a        | equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
                 |             | monotonic function                           |
  Numeric        | Interval    | new_value = a * old_value + b                | Thus, the Fahrenheit and Celsius temperature scales differ in terms of where
  (Quantitative) |             | where a and b are constants                  | their zero value is and the size of a unit (degree).
                 | Ratio       | new_value = a * old_value                    | Length can be measured in meters or feet.

Describing Attributes by the Number of Values
• Another way of distinguishing between attributes is by the number of values they can take.
• Discrete Attribute
  – Has only a finite or countably infinite set of values
    • Examples: zip codes (categorical), ID numbers (categorical), counts (numeric)
  – Often represented as integer variables.
  – Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
  – Has real numbers as attribute values
    • Examples: temperature, height, or weight.
  – Practically, real values can only be measured and represented using a finite number of digits.
  – Continuous attributes are typically represented as floating-point variables.

Asymmetric Attributes
• For asymmetric attributes, only presence (a non-zero attribute value) is regarded as important
  – Words present in documents
  – Items present in customer transactions
• If we met a friend in the grocery store, would we ever say the following?
  “I see our purchases are very similar since we didn’t buy most of the same things.”
• Binary attributes where only non-zero values are important are called asymmetric binary attributes
  – which are particularly important for association analysis

Critiques of the attribute categorization
• Incomplete
  – Asymmetric binary
  – Cyclical
  – Multivariate
  – Partially ordered
  – Partial membership
  – Relationships between the data
• Real data is approximate and noisy
  – This can complicate recognition of the proper attribute type
  – Treating one attribute type as another may be approximately correct

Key Messages for Attribute Types
• The types of operations you choose should be meaningful for the type of data you have
  – Distinctness, order, meaningful intervals, and meaningful ratios are only four (among many possible) properties of data
  – The data type you see (often numbers or strings) may not capture all the properties or may suggest properties that are not present
  – Analysis may depend on these other properties of the data
    • Many statistical analyses depend only on the distribution
  – In the end, what is meaningful can be specific to the domain

Types of Data Sets
• There are many types of data sets, and as the field of data mining develops and matures, a greater variety of data sets become available for analysis.
• Types of data sets:
  – record data,
  – graph-based data,
  – ordered data
• These categories do not cover all possibilities and other groupings are certainly possible.

Important Characteristics of Data
• Dimensionality (number of attributes)
  – The number of attributes that the objects in the data set possess
  – High dimensional data brings a number of challenges (curse of dimensionality)
    • an important motivation in preprocessing the data is dimensionality reduction
• Distribution (Sparsity)
  – the frequency of occurrence of various values or sets of values for the attributes comprising data objects
    • Equivalently, the distribution of a data set can be considered as a description of the concentration of objects in various regions of the data space

Important Characteristics of Data
• Resolution
  – Data can be obtained at different levels of resolution, and often the properties of the data are different at different resolutions
  – Patterns depend on the scale
    • if the resolution is too fine, a pattern may not be visible or may be buried in noise
    • if the resolution is too coarse, the pattern can disappear
• Record Data (Size)
  – Much data mining work assumes that the data set is a collection of records (data objects), each of which consists of a fixed set of data fields (attributes).
  – stored either in flat files or in relational databases
  – Type of analysis may depend on the size of the data

Different variations of record data
• [Figure: examples of different variations of record data]

Types of data sets
• Record
  – Data Matrix
  – Document Data
  – Transaction Data
• Graph
  – World Wide Web
  – Molecular Structures
• Ordered
  – Spatial Data
  – Temporal Data
  – Sequential Data
  – Genetic Sequence Data

Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of attributes

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

Transaction or Market Basket Data
• Transaction data is a special type of record data, where each record (transaction) involves a set of items
• Consider a grocery store.
– The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
– This type of data is called market basket data because the items in each record are the products in a person’s market basket.

Transaction Data
• Transaction data is a collection of sets of items, but it can be viewed as a set of records whose fields are asymmetric attributes
– Can represent transaction data as record data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
market basket.
5 Coke, Diaper, Milk
27 28

27 28

The Data Matrix The Data Matrix


• All the data objects in a collection of data have • A data matrix is a variation of record data, but because it
the same fixed set of numeric attributes. consists of numeric attributes, standard matrix operation
can be applied to transform and manipulate the data.
– data objects are points (vectors) in a multidimensional
space, where each dimension represents a distinct
attribute describing the object.
– A set of such data objects can be interpreted as an m
by n matrix, where there are m rows, one for each
object, and n columns, one for each attribute.
– This matrix is called a data matrix or a pattern matrix.

29 30

29 30

The Sparse Data Matrix
• special case of a data matrix where the attributes are of the same type and are asymmetric;
– i.e., only non-zero values are important.
– Transaction data is an example of a sparse data matrix that has only 0–1 entries.
– Another common example is document data.
• If the order of the terms (words) in a document is ignored—the “bag of words” approach—then a document can be represented as a term vector, where each term is a component (attribute) of the vector and the value of each component is the number of times the corresponding term occurs in the document

Document Data
• Each document becomes a ‘term’ vector
– Each term is a component (attribute) of the vector
– The value of each component is the number of times the corresponding term occurs in the document.
• Example document-term matrix over the terms team, coach, play, ball, score, game, win, lost, timeout, and season:
Document 1: 3 0 5 0 2 6 0 2 0 2
Document 2: 0 7 0 2 1 0 0 3 0 0
Document 3: 0 1 0 0 1 2 2 0 3 0
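A minimal Python sketch of the “bag of words” construction described above; the toy documents here are made up for illustration and are not the documents behind the table:

```python
from collections import Counter

# Hypothetical toy documents (any text would do).
docs = [
    "the team won the game after a timeout",
    "the coach called a play and the team lost",
    "a long season with one lost game",
]

# Vocabulary: the set of terms, with word order ignored.
vocab = sorted({word for doc in docs for word in doc.split()})

# Each document becomes a term vector of counts over the vocabulary.
for doc in docs:
    counts = Counter(doc.split())
    print([counts[term] for term in vocab])
```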

31 32

31 32

Graph-Based Data
• A graph can sometimes be a convenient and powerful representation for data.
• We consider two specific cases
– The graph captures relationships among data objects
• The relationships among objects frequently convey important information.
– In such cases, the data is often represented as a graph.
– The data objects themselves are represented as graphs.
• If objects have structure, that is, the objects contain subobjects that have relationships, then such objects are frequently represented as graphs.

Graph-Based Data
• Examples: Generic graph, a molecule, and webpages
[Figures: a generic graph; Benzene Molecule C6H6 (ball-and-stick diagram); linked web pages]
33 34

33 34

Ordered Data Ordered Data


• Attributes have relationships that involve order in
time or space Sequences of transactions
• Sequential transaction data can be thought of as
an extension of transaction data, where each
transaction has a time associated with it.
– Consider a retail transaction data set that also stores
the time at which the transaction took place.
• This time information makes it possible to find patterns
such as “candy sales peak before Halloween.”
– A time can also be associated with each attribute.

35 36

35 36

Sequential Transaction Data
• five different times:
– t1, t2, t3, t4, and t5
• three different customers:
– C1, C2, and C3
• five different items:
– A, B, C, D, and E

Time Series Data
• a special type of ordered data where each record is a time series,
– i.e., a series of measurements taken over time
• When working with temporal data, such as time series, it is important to consider temporal autocorrelation;
– i.e., if two measurements are close in time, then the values of those measurements are often very similar.

37 38

37 38

Sequence Data Spatial and Spatio-Temporal Data


• Sequence data consists of a data set that is a sequence of • Some objects have spatial attributes, such as positions or
individual entities, such as a sequence of words or areas, in addition to other types of attributes.
letters. – An example of spatial data
– It is quite similar to is weather data
(precipitation, temperature,
sequential data, except
pressure) that is collected
that there are no time for a variety of
stamps; geographical locations.
– instead, there are
positions in an ordered
sequence.
• The genetic information of plants and animals can be
• Average Monthly Temperature of land and ocean
represented in the form of sequences of nucleotides.
39 40

39 40

Spatial and Spatio-Temporal Data Handling Non-Record Data


• An important aspect of spatial data is spatial • Most data mining algorithms are designed for record
autocorrelation; i.e., objects that are physically data or its variations.
close tend to be similar in other ways as well. • Record-oriented techniques can be applied to non-
– Thus, two points on the Earth that are close to each record data by extracting features from data objects
other usually have similar values for temperature and and using these features to create a record
rainfall. corresponding to each object.
• Note that spatial autocorrelation is analogous to temporal – Consider the chemical structure data that was described
autocorrelation. earlier.
– Important examples of spatial and spatio-temporal • Given a set of common substructures, each compound can be
data are the science and engineering data sets that are represented as a record with binary attributes that indicate
whether a compound contains a specific substructure.
the result of measurements or model output taken at
• Such a representation is actually a transaction data set, where the
regularly or irregularly distributed points on a two- or transactions are the compounds, and the items are the
three dimensional grid or mesh. substructures.
41 42

41 42

Data Quality Data Quality
• Data mining algorithms are often applied to data that • Data is not perfect.
was collected for another purpose, or for future, but – There may be problems due to
unspecified applications. • human error,
– For that reason, data mining cannot usually take • limitations of measuring devices,
advantage of the significant benefits of “addressing • flaws in the data collection process.
quality issues at the source.”
– Values or even entire data objects can be missing.
• Data mining focuses on – There can be spurious or duplicate objects;
– the detection and correction of data quality problems • i.e., multiple data objects that all correspond to a single
– the use of algorithms that can tolerate poor data quality. “real” object
– For example, there might be two different records for a person who
• The first step, detection and correction, is often has recently lived at two different addresses.
called data cleaning.
43 44

43 44

Data Quality Measurement and Data Collection Errors

• Poor data quality negatively affects many data • The measurement error refers to any problem
processing efforts resulting from the measurement process.
– Data mining example: – A common problem is that the value recorded differs
• a classification model for detecting people who are loan risks is from the true value to some extent.
built using poor data
– Some credit-worthy candidates are denied loans
– For continuous attributes, the numerical difference of
the measured and true value is called the error.
• What kinds of data quality problems?
• The data collection error refers to errors such as
• How can we detect problems with the data?
– omitting data objects or attribute values,
• What can we do about these problems?
– inappropriately including a data object.
• Examples of data quality problems:
• Both measurement errors and data collection
– Noise and outliers, Wrong data, Fake data, Missing
values, Duplicate data errors can be either systematic or random.
45 46

45 46

Noise and Artifacts Noise and Artifacts


• Noise is the random component of a measurement error. • The term noise is often used in connection with data that
– It typically involves the distortion of a value or the addition of has a spatial or temporal component.
spurious objects Noise in a
• Examples: distortion of a person’s voice when talking on a poor phone
and “snow” on television screen
spatial
context
• The figures below show two sine waves of the same
magnitude and different frequencies, the waves
combined, and the two sine waves with random noise
• Data errors can be the result of a more deterministic
phenomenon, such as a streak in the same place on a set
of photographs.
– Such deterministic distortions of the data are often referred to
as artifacts.
47 48

47 48

Precision, Bias, and Accuracy Precision, Bias, and Accuracy
• In statistics and experimental science, the quality • Example:
of the measurement process and the resulting – Suppose that we have a standard laboratory weight
data are measured by precision and bias with a mass of 1g and want to assess the precision
– Precision: and bias of our new laboratory scale.
• The closeness of repeated measurements (of the same – We weigh the mass five times, and obtain the
quantity) to one another. following five values:
– often measured by the standard deviation of a set of values • {1.015, 0.990, 1.013, 1.001, 0.986}.
– Bias: – The mean of these values is 1.001, and
• A systematic variation of measurements from the quantity • hence, the bias is 0.001.
being measured.
– measured by taking the difference between the mean of the set of
– The precision, as measured by the standard deviation,
values and the known value of the quantity being measured is 0.013.
49 50

49 50

Precision, Bias, and Accuracy Precision, Bias, and Accuracy


• It is common to use the more general term, accuracy
, to refer to the degree of measurement error in data.
– Accuracy
• The closeness of measurements to the true value of the quantity
being measured.
– Accuracy depends on precision and bias, but there is no
specific formula for accuracy in terms of these two
quantities.
– One important aspect of accuracy is the use of significant
digits.
• The goal is to use only as many digits to represent the result of a
measurement or calculation as are justified by the precision of
the data.

51 52

51 52

Precision, Bias, and Accuracy Outliers


• data objects that, in some sense, have
characteristics that are different from most of the
other data objects in the data set
• values of an attribute that are unusual with
respect to the typical values for that attribute
• can be referred to as anomalous objects or values
– Unlike noise, outliers can be legitimate data objects
or values that we are interested in detecting
• For instance, in fraud and network intrusion detection, the
goal is to find unusual objects or events from among a large
number of normal ones
53 54

53 54

Outliers Missing Values
• Case 1: Outliers are noise that interferes • Reasons for missing values
with data analysis – Information is not collected
(e.g., people decline to give their age and weight)
• Case 2: Outliers are – Attributes may not be applicable to all cases
the goal of our (e.g., annual income is not applicable to children)
analysis • Handling missing values
– Credit card fraud – Eliminate data objects or variables
– Intrusion detection – Estimate missing values
• Example: time series of temperature
• Example: census results
– Ignore the missing value during analysis
55 56

55 56

Inconsistent Values Inconsistent Values


• Data can contain inconsistent values • Example (Inconsistent Sea Surface Temperature)
– Consider an address field, where both a zip code and – SST data was
originally collected
city are listed, but the specified zip code area is not using ocean-based
contained in that city. It is possible that the individual measurements from
entering this information transposed two digits ships or buoys, but
more recently,
• Some types of inconsistences are easy to detect. satellites have been
– For instance, a person’s height should not be used to gather the
negative. data.
• Once an inconsistency has been detected, it is – To create a long-term
data set, both sources
sometimes possible to correct the data. of data must be used.
– The correction of an inconsistency requires additional – However, because the data comes from different sources, the
or redundant information. two parts of the data are subtly different.
57 58

57 58

Duplicate Data Duplicate Data


• Data set may include data objects that are • Examples:
duplicates, or almost duplicates of one another – Same person with multiple email addresses
– Major issue when merging data from heterogeneous – In some cases, two or more objects are identical with
sources
respect to the attributes measured by the database, but
• if there are two objects that actually represent a single
object, then one or more values of corresponding attributes they still represent different objects.
are usually different, and these inconsistent values must be • Here, the duplicates are legitimate, but can still cause
resolved. problems for some algorithms if the possibility of identical
• care needs to be taken to avoid accidentally combining data objects is not specifically accounted for in their design.
objects that are similar, but not duplicates, such as two
distinct people with identical names. • Data cleaning
– The term deduplication is often used to refer to the – Process of dealing with duplicate data issues
process of dealing with these issues. • When should duplicate data not be removed?
59 60

59 60

Duplicate Data Issues Related to Applications
• Data set may include data objects that are duplicates, • Data quality issues can also be considered from
or almost duplicates of one another
– Major issue when merging data from heterogeneous
an application viewpoint as expressed by the
sources statement “data is of high quality if it is suitable
for its intended use.”
• Examples: – relevance to the specific purpose
– Same person with multiple email addresses
– completeness
• Data cleaning – accuracy
– Process of dealing with duplicate data issues – timeliness
– format
• When should duplicate data not be removed?
– cost
62

61 62

Data Preprocessing Aggregation


• Data preprocessing is a broad area and fall into two • Combining two or more attributes (or objects) into a single
categories: attribute (or object)
– selecting data objects and attributes for the analysis • Purpose
– creating/changing the attributes. – Data reduction
• reduce the number of attributes or objects
• The goal is to improve the data mining analysis with – Change of scale
respect to time, cost, and quality through: • Cities aggregated into regions, states, countries, etc.
– Aggregation • Days aggregated into weeks, months, or years
– Sampling • More stable data
– Discretization and Binarization – aggregated data tends to have less variability
Data set containing information about customer purchases
– Attribute Transformation
– Dimensionality Reduction
– Feature subset selection
– Feature creation
63 64

63 64

Example: Precipitation in Australia Example: Precipitation in Australia


• This example is based on precipitation in Australia • Variation of Precipitation in Australia
from the period 1982 to 1993.
The next slide shows
– A histogram for the standard deviation of average
monthly precipitation for 3,030 0.5◦ by 0.5◦ grid cells in
Australia, and
– A histogram for the standard deviation of the average
yearly precipitation for the same locations.
• The average yearly precipitation has less variability
than the average monthly precipitation.
• All precipitation measurements (and their standard Standard Deviation of Average Monthly Standard Deviation of Average
deviations) are in centimeters. Precipitation Yearly Precipitation

65 66

65 66

Sampling Sampling
• Sampling is the main technique employed for • The key principle for effective sampling is the
data reduction. following:
– It is often used for both the preliminary investigation
of the data and the final data analysis. – Using a sample will work almost as well as using the
entire data set, if the sample is representative
• Statisticians often sample because obtaining the – A sample is representative if it has approximately the
entire set of data of interest is too expensive or same properties (of interest) as the original set of data
time consuming. 8000 points 2000 Points 500 Points

• Sampling is typically used in data mining because


processing the entire set of data of interest is too
expensive or time consuming.
67 68

67 68

Types of Sampling
• Simple Random Sampling
– There is an equal probability of selecting any particular item
– Sampling without replacement
• As each item is selected, it is removed from the population
– Sampling with replacement
• Objects are not removed from the population as they are selected for the sample.
• In sampling with replacement, the same object can be picked up more than once
• Stratified sampling
– Split the data into several partitions;
• then draw random samples from each partition

Types of Sampling
• Progressive Sampling
– The proper sample size can be difficult to determine, so adaptive or progressive sampling schemes are sometimes used.
• These approaches start with a small sample, and then increase the sample size until a sample of sufficient size has been obtained.
– Finding representative points from 10 groups.
[Figures: ten groups of points; probability that a sample contains points of all 10 groups]
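A small Python sketch of the sampling schemes listed above (synthetic data; the group labels are invented for the stratified case):

```python
import random

random.seed(0)
data = list(range(1000))                 # stand-in data set
groups = [i % 10 for i in data]          # stand-in group labels (10 groups)

# Simple random sampling without replacement: each object picked at most once.
without_repl = random.sample(data, k=100)

# Simple random sampling with replacement: the same object can be picked twice.
with_repl = random.choices(data, k=100)

# Stratified sampling: split into partitions (here, the 10 groups),
# then draw a random sample from each partition.
stratified = []
for g in set(groups):
    stratum = [x for x, lab in zip(data, groups) if lab == g]
    stratified.extend(random.sample(stratum, k=10))

print(len(without_repl), len(with_repl), len(stratified))
```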

69 70

69 70

Dimensionality Reduction
• Data sets can have a large number of features
• There are a variety of benefits to dimensionality reduction.
• Many data mining algorithms work better if the dimensionality—the number of attributes in the data—is lower.
– This is partly because dimensionality reduction can eliminate irrelevant features and reduce noise and partly because of the curse of dimensionality

• When dimensionality increases, data becomes increasingly sparse in the space that it occupies
• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
• Randomly generate 500 points
• Compute difference between max and min distance between any pair of points

71 72

71 72

Dimensionality Reduction
• Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise
• Techniques
– Principal Components Analysis (PCA)
– Singular Value Decomposition
– Others: supervised and non-linear techniques

Principal Components Analysis (PCA)
• a linear algebra technique for continuous attributes that finds new attributes (principal components) that
– are linear combinations of the original attributes,
– are orthogonal (perpendicular) to each other,
– capture the maximum amount of variation in the data.
• For example, the first two principal components capture as much of the variation in the data as is possible with two orthogonal attributes that are linear combinations of the original attributes.
• Singular Value Decomposition (SVD) is a linear algebra technique that is related to PCA and is also commonly used for dimensionality reduction.
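A hedged sketch of PCA via the eigenvectors of the covariance matrix (synthetic data; a real analysis would typically call a library routine):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))            # rows = objects, columns = attributes

Xc = X - X.mean(axis=0)                  # center each attribute
cov = np.cov(Xc, rowvar=False)           # covariance matrix of the attributes

# Principal components = eigenvectors of the covariance matrix,
# ordered by the amount of variance (eigenvalue) they capture.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]       # keep the first two components

X_reduced = Xc @ components              # project onto the new attributes
print(X_reduced.shape)                   # (200, 2)
```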
73 74

73 74

Principal Components Analysis (PCA) Feature Subset Selection


• Goal is to find a projection that captures the • Another way to reduce dimensionality of data
largest amount of variation in data • Redundant features
– Duplicate much or all of the information contained in one
x2 or more other attributes
– Example: purchase price of a product and the amount of
sales tax paid
e • Irrelevant features
– Contain no information that is useful for the data mining
task at hand
– Example: students' ID is often irrelevant to the task of
predicting students' GPA
• Many techniques developed, especially for
x1
classification
75 76

75 76

Feature Creation Mapping Data to a New Space


• Create new attributes that can capture the • Fourier and wavelet transform
important information in a data set much more
efficiently than the original attributes

• Three general methodologies:


– Feature extraction
• Example: extracting edges from images
– Feature construction
• Example: dividing mass by volume to get density
– Mapping data to new space
Two Sine Waves + Noise Frequency
• Example: Fourier and wavelet analysis
77 78

77 78

Mapping Data to a New Space Discretization
• Fourier and wavelet transform • Discretization is the process of converting a
continuous attribute into an ordinal attribute

• A potentially infinite number of values are


mapped into a small number of categories

• Discretization is used in both unsupervised and


supervised settings

79 80

79 80

Unsupervised Discretization Unsupervised Discretization

• Data consists of four groups of points and two outliers.
• Data is one-dimensional, but a random y component is added to reduce overlap.

• Equal interval width approach used to obtain 4 values.
81 82

81 82

Unsupervised Discretization Unsupervised Discretization

• Equal frequency approach used to obtain 4 values.
• K-means approach used to obtain 4 values.
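A small sketch of the equal interval width and equal frequency approaches on synthetic one-dimensional data (the group locations are invented; the k-means variant would use a clustering routine instead of fixed bin edges):

```python
import numpy as np

rng = np.random.default_rng(2)
# Four groups of points plus two outliers, roughly as in the example above.
x = np.concatenate([rng.normal(c, 0.2, 50) for c in (2, 4, 6, 8)] + [[0.0, 10.0]])

# Equal interval width: 4 bins of identical width between min and max.
width_edges = np.linspace(x.min(), x.max(), 5)
width_bins = np.digitize(x, width_edges[1:-1])

# Equal frequency: 4 bins holding roughly the same number of points.
freq_bins = np.digitize(x, np.quantile(x, [0.25, 0.5, 0.75]))

print(np.bincount(width_bins))   # very unequal counts per bin
print(np.bincount(freq_bins))    # roughly equal counts per bin
```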

83 84

83 84

Discretization in Supervised Settings
• Many classification algorithms work best if both the independent and dependent variables have only a few values
• We give an illustration of the usefulness of discretization using the following example.
• Discretizing x and y attributes for four groups (classes) of points:

Binarization
• Binarization maps a continuous or categorical attribute into one or more binary variables
• Conversion of a categorical attribute to three binary attributes
• Conversion of a categorical attribute to five asymmetric binary attributes
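A sketch of both conversions mentioned above for a categorical attribute with five values (the value names below are assumptions used only to illustrate the two encodings):

```python
# Categorical attribute with five values -> integer codes 0..4.
values = ["awful", "poor", "OK", "good", "great"]     # assumed example values
codes = {v: i for i, v in enumerate(values)}

def three_binary(v):
    # Integer code written with 3 bits -> three binary attributes.
    i = codes[v]
    return [(i >> b) & 1 for b in (2, 1, 0)]

def five_asymmetric(v):
    # One asymmetric binary attribute per category (one-hot).
    return [1 if v == u else 0 for u in values]

for v in values:
    print(v, three_binary(v), five_asymmetric(v))
```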

85 86

85 86

Attribute Transformation
• An attribute transform is a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
– Simple functions: x^k, log(x), e^x, |x|
– Normalization
• Refers to various techniques to adjust to differences among attributes in terms of frequency of occurrence, mean, variance, range
• Take out unwanted, common signal, e.g., seasonality
– In statistics, standardization refers to subtracting off the means and dividing by the standard deviation

Example: Sample Time Series of Plant Growth (Minneapolis)
• Net Primary Production (NPP) is a measure of plant growth used by ecosystem scientists.
• Correlations between time series:
              Minneapolis  Atlanta  Sao Paolo
Minneapolis   1.0000       0.7591   -0.7581
Atlanta       0.7591       1.0000   -0.5739
Sao Paolo     -0.7581      -0.5739  1.0000
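A minimal sketch of the monthly z-score standardization described above, applied to a synthetic monthly series (the seasonal signal here is invented):

```python
import numpy as np

rng = np.random.default_rng(3)
years, months = 10, 12
seasonal = np.sin(np.arange(months) / months * 2 * np.pi)   # common seasonal signal
series = seasonal + 0.3 * rng.normal(size=(years, months))  # shape: years x months

# Subtract the monthly mean and divide by the monthly standard deviation.
z = (series - series.mean(axis=0)) / series.std(axis=0, ddof=1)

print(np.round(z.mean(axis=0), 2))   # ~0 for every month: seasonality removed
```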

87 88

87 88

Seasonality Accounts for Much Correlation


Minneapolis
Normalized
using monthly Z
Score:
Subtract off
monthly mean
and divide by
monthly
standard
deviation

Correlations between time series


Minneapolis Atlanta Sao Paolo
Minneapolis 1.0000 0.0492 0.0906
Atlanta 0.0492 1.0000 -0.0154
Sao Paolo 0.0906 -0.0154 1.0000
89

89

Data Mining & Knowledge Discovery Data Mining & Knowledge Discovery

Prof. Dr. Nizamettin AYDIN


Similarity and Dissimilarity
Measures
naydin@yildiz.edu.tr • Outline
– Similarity and Dissimilarity between Simple Attributes
– Dissimilarities between Data Objects
http://www3.yildiz.edu.tr/~naydin – Similarities between Data Objects
– Examples of Proximity
– Mutual Information
– Issues in Proximity
– Selecting the Right Proximity Measure
1 2

1 2

Similarity and Dissimilarity Measures Similarity and Dissimilarity Measures


• Similarity and dissimilarity are important • Similarity measure
because they are used by a number of data – Numerical measure of how alike two data objects are.
mining techniques, such as clustering, nearest – Is higher when objects are more alike.
neighbor classification, and anomaly detection. – Often falls in the range [0,1]
• In many cases, the initial data set is not needed • Dissimilarity measure
once these similarities or dissimilarities have – Numerical measure of how different two data objects are
been computed. – Lower when objects are more alike
– Minimum dissimilarity is often 0, upper limit varies
• Such approaches can be viewed as transforming
– The term distance is used as a synonym for dissimilarity
the data to a similarity (dissimilarity) space and
then performing the analysis. • Proximity refers to a similarity or dissimilarity
3 4

3 4

Transformations
• often applied to convert a similarity to a dissimilarity, or vice versa, or to transform a proximity measure to fall within a particular range, such as [0,1].
– For instance, we may have similarities that range from 1 to 10, but the particular algorithm or software package that we want to use may be designed to work only with dissimilarities, or it may work only with similarities in the interval [0,1]
• Frequently, proximity measures, especially similarities, are defined or transformed to have values in the interval [0,1].
5 6

5 6

Transformations
• Example:
– If the similarities between objects range from 1 (not at all similar) to 10 (completely similar), we can make them fall within the range [0, 1] by using the transformation s′ = (s − 1)/9, where s and s′ are the original and new similarity values, respectively.
• The transformation of similarities and dissimilarities to the interval [0, 1]
– s′ = (s − smin)/(smax − smin), where smax and smin are the maximum and minimum similarity values.
– d′ = (d − dmin)/(dmax − dmin), where dmax and dmin are the maximum and minimum dissimilarity values.

Transformations
• However, there can be complications in mapping proximity measures to the interval [0, 1] using a linear transformation.
– If, for example, the proximity measure originally takes values in the interval [0, ∞], then dmax is not defined and a nonlinear transformation is needed.
– Values will not have the same relationship to one another on the new scale.
• Consider the transformation d′ = d/(1 + d) for a dissimilarity measure that ranges from 0 to ∞.
– Given dissimilarities 0, 0.5, 2, 10, 100, 1000
– Transformed dissimilarities 0, 0.33, 0.67, 0.90, 0.99, 0.999.
• Larger values on the original dissimilarity scale are compressed into the range of values near 1, but whether this is desirable depends on the application.
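The two transformations above in a few lines of Python (illustration only):

```python
# Linear map of similarities on a 1..10 scale into [0, 1]: s' = (s - 1) / 9.
s_values = [1, 4, 7, 10]
print([(s - 1) / 9 for s in s_values])            # [0.0, 0.333..., 0.666..., 1.0]

# Nonlinear map of dissimilarities in [0, inf) into [0, 1): d' = d / (1 + d).
d_values = [0, 0.5, 2, 10, 100, 1000]
print([round(d / (1 + d), 3) for d in d_values])  # [0.0, 0.333, 0.667, 0.909, 0.99, 0.999]
```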
7 8

7 8

Similarity/Dissimilarity for Simple Attributes
• The following table shows the similarity and dissimilarity between two objects, x and y, with respect to a single, simple attribute.
• Next, we consider more complicated measures of proximity between objects that involve multiple attributes:
– dissimilarities between data objects
– similarities between data objects.

Distances - Euclidean Distance
• The Euclidean distance, d, between two points, x and y, in one-, two-, three-, or higher-dimensional space, is given by
d(x, y) = √( Σ_{k=1}^{n} (xk − yk)² )
– where n is the number of dimensions (attributes) and xk and yk are, respectively, the kth attributes (components) of data objects x and y.
• Standardization is necessary, if scales differ.
9 10

9 10

Distances - Euclidean Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Euclidean distance matrix
     p1     p2     p3     p4
p1   0      2.828  3.162  5.099
p2   2.828  0      1.414  3.162
p3   3.162  1.414  0      2
p4   5.099  3.162  2      0

Distances - Minkowski Distance
• Minkowski Distance is a generalization of Euclidean Distance, and is given by
dist(x, y) = ( Σ_{k=1}^{n} |xk − yk|^r )^(1/r)
– where r is a parameter, n is the number of dimensions (attributes) and xk and yk are, respectively, the kth attributes (components) of data objects x and y.
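A sketch that reproduces the distance matrix above and the Minkowski generalization for other values of r (illustration only):

```python
import numpy as np

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)   # p1..p4

def minkowski_matrix(X, r):
    # dist(x, y) = (sum_k |x_k - y_k|^r)^(1/r)
    diff = np.abs(X[:, None, :] - X[None, :, :])
    return (diff ** r).sum(axis=-1) ** (1.0 / r)

def supremum_matrix(X):
    # r = infinity: maximum difference between any component of the vectors
    return np.abs(X[:, None, :] - X[None, :, :]).max(axis=-1)

print(minkowski_matrix(points, 1))            # L1 (city block) matrix
print(minkowski_matrix(points, 2).round(3))   # L2 (Euclidean) matrix, as above
print(supremum_matrix(points))                # L-infinity matrix
```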
11 12

11 12

Distances - Minkowski Distance
• The following are the three most common examples of Minkowski distances.
– r = 1, City block (Manhattan, taxicab, L1 norm) distance.
• A common example of this for binary vectors is the Hamming distance, which is just the number of bits that are different between two binary vectors
– r = 2, Euclidean distance (L2 norm)
– r = ∞, Supremum (Lmax norm, L∞ norm) distance.
• This is the maximum difference between any component of the vectors
• Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.

Distances - Minkowski Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance matrices:
L1   p1  p2  p3  p4
p1   0   4   4   6
p2   4   0   2   4
p3   4   2   0   2
p4   6   4   2   0

L2   p1     p2     p3     p4
p1   0      2.828  3.162  5.099
p2   2.828  0      1.414  3.162
p3   3.162  1.414  0      2
p4   5.099  3.162  2      0

L∞   p1  p2  p3  p4
p1   0   2   3   5
p2   2   0   1   3
p3   3   1   0   2
p4   5   3   2   0
13 14

13 14

Distances - Mahalanobis Distance
• Mahalanobis distance is the distance between a point and a distribution (not between two distinct points).
– It is effectively a multivariate equivalent of the Euclidean distance.
• It transforms the columns into uncorrelated variables
• Scale the columns to make their variance equal to 1
• Finally, it calculates the Euclidean distance.
• It is defined as
mahalanobis(x, y) = (x − y) Σ⁻¹ (x − y)ᵀ
– where Σ⁻¹ is the inverse of the covariance matrix of the data.

Distances - Mahalanobis Distance
• In the Figure, there are 1000 points, whose x and y attributes have a correlation of 0.6.
– The Euclidean distance between the two large points at the opposite ends of the long axis of the ellipse is 14.7, but Mahalanobis distance is only 6.
• This is because the Mahalanobis distance gives less emphasis to the direction of largest variance.

15 16

15 16

Distances - Mahalanobis Distance
• Covariance Matrix:
Σ = [ 0.3  0.2
      0.2  0.3 ]
• A: (0.5, 0.5)
• B: (0, 1)
• C: (1.5, 1.5)
Mahal(A,B) = 5
Mahal(A,C) = 4

Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well-known properties.
• If d(x, y) is the distance between two points, x and y, then the following properties hold.
– Positivity
• d(x, y) ≥ 0 for all x and y
• d(x, y) = 0 only if x = y
– Symmetry
• d(x, y) = d(y, x) for all x and y
– Triangle Inequality
• d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z
• Measures that satisfy all three properties are known as metrics
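A quick numeric check of the example above, using the quadratic-form definition from the previous slide (no square root):

```python
import numpy as np

sigma = np.array([[0.3, 0.2],
                  [0.2, 0.3]])
sigma_inv = np.linalg.inv(sigma)

def mahalanobis(x, y):
    # (x - y) Sigma^-1 (x - y)^T
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(d @ sigma_inv @ d)

A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
print(mahalanobis(A, B))   # ~5.0
print(mahalanobis(A, C))   # ~4.0
```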
17 18

17 18

Common Properties of a Similarity A Non-symmetric Similarity Measure Example

• If s(x, y) is the similarity between points x and y, • Consider an experiment in which people are
then the typical properties of similarities are the asked to classify a small set of characters as they
following: flash on a screen.
– Positivity – The confusion matrix for this experiment records
• s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1) how often each character is classified as itself, and
– Symmetry how often each is classified as another character.
• s(x, y) = s(y, x) for all x and y – Using the confusion matrix, we can define a
• For similarities, the triangle inequality typically similarity measure between a character x and a
character y as the number of times that x is
does not hold
misclassified as y,
– However, a similarity measure can be converted to a • but note that this measure is not symmetric.
metric distance
19 20

19 20

A Non-symmetric Similarity Measure Example Similarity Measures for Binary Data


• For example, suppose that “0” appeared 200 • Similarity measures between objects that contain
times and was classified as a “0” 160 times, but only binary attributes are called similarity
as an “o” 40 times. coefficients, and typically have values between 0
• Likewise, suppose that “o” appeared 200 times and 1.
and was classified as an “o” 170 times, but as “0” • Let x and y be two objects that consist of n
only 30 times. binary attributes.
– Then, s(0,o) = 40, but s(o, 0) = 30. – The comparison of two binary vectors, leads to the
• In such situations, the similarity measure can be following quantities (frequencies):
• f00 = the number of attributes where x is 0 and y is 0
made symmetric by setting
• f01 = the number of attributes where x is 0 and y is 1
– s′(x, y) = s′(y, x) = (s(x, y)+s(y, x))/2, • f10 = the number of attributes where x is 1 and y is 0
• where s indicates the new similarity measure. • f11 = the number of attributes where x is 1 and y is 1
21 22

21 22

Similarity Measures for Binary Data
• Simple Matching Coefficient (SMC)
– One commonly used similarity coefficient
SMC = (f11 + f00) / (f00 + f01 + f10 + f11)
– This measure counts both presences and absences equally.
• Consequently, the SMC could be used to find students who had answered questions similarly on a test that consisted only of true/false questions.

Similarity Measures for Binary Data
• Jaccard Similarity Coefficient
– frequently used to handle objects consisting of asymmetric binary attributes
J = f11 / (f01 + f10 + f11)
– Unlike the SMC, this measure ignores 0–0 matches: it is the number of matching presences divided by the number of attributes not involved in any 0–0 match.

23 24

23 24

SMC versus Jaccard: Example Cosine Similarity
• Calculate SMC and J for the binary vectors, • Cosine Similarity is one of the most common
x = (1 0 0 0 0 0 0 0 0 0) measures of document similarity
y = (0 0 0 0 0 0 1 0 0 1)
• If x and y are two document vectors, then
f01 = 2 (the number of attributes where x was 0 and y was 1)
f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
– where ′ indicates vector or matrix transpose and x,y
f11 = 0 (the number of attributes where x was 1 and y was 1) indicates the inner product of the two vectors,
• and 𝑥 is the length of vector x,
SMC = (f11 + f00) / (f01 + f10 + f11 + f00)
= (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = (f11) / (f01 + f10 + f11)
= 0 / (2 + 1 + 0) =0
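The same computation as a small Python function (a sketch, not a library call):

```python
def smc_and_jaccard(x, y):
    f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    f00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
    f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
    f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))
    smc = (f11 + f00) / (f00 + f01 + f10 + f11)
    jaccard = f11 / (f01 + f10 + f11) if (f01 + f10 + f11) else 0.0
    return smc, jaccard

x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
print(smc_and_jaccard(x, y))   # (0.7, 0.0)
```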
25 26

25 26

Cosine Similarity
• Cosine similarity really is a measure of the (cosine of the) angle between x and y.
– Thus, if the cosine similarity is 1, the angle between x and y is 0◦, and x and y are the same except for length.
– If the cosine similarity is 0, then the angle between x and y is 90◦, and they do not share any terms (words).
• It can also be written as
cos(x, y) = (x / ‖x‖) · (y / ‖y‖)

Cosine Similarity - Example
• Cosine Similarity between two document vectors
• This example calculates the cosine similarity for the following two data objects, which might represent document vectors:
x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
⟨x, y⟩ = 3×1 + 2×0 + 0×0 + 5×0 + 0×0 + 0×0 + 0×0 + 2×1 + 0×0 + 0×2 = 5
‖x‖ = √(3² + 2² + 0² + 5² + 0² + 0² + 0² + 2² + 0² + 0²) = 6.48
‖y‖ = √(1² + 0² + 0² + 0² + 0² + 0² + 0² + 1² + 0² + 2²) = 2.45
cos(x, y) = ⟨x, y⟩ / (‖x‖ × ‖y‖) = 5 / (6.48 × 2.45) = 0.31
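The same example computed in Python (illustration only):

```python
import numpy as np

x = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], dtype=float)
y = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], dtype=float)

# cos(x, y) = <x, y> / (||x|| * ||y||)
cos_xy = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(round(float(cos_xy), 2))   # ~0.31
```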

27 28

27 28

Extended Jaccard Coefficient
• Also known as Tanimoto Coefficient
• The extended Jaccard coefficient can be used for document data and reduces to the Jaccard coefficient in the case of binary attributes.
• This coefficient, which we shall represent as EJ, is defined by the following equation:
EJ(x, y) = ⟨x, y⟩ / (‖x‖² + ‖y‖² − ⟨x, y⟩)

Correlation
• used to measure the linear relationship between two sets of values that are observed together.
– Thus, correlation can measure the relationship between two variables (height and weight) or between two objects (a pair of temperature time series).
• Correlation is used much more frequently to measure the similarity between attributes
– since the values in two data objects come from different attributes, which can have very different attribute types and scales.
• There are many types of correlation
29 30

29 30

Correlation - Pearson’s correlation
• between two sets of numerical values, i.e., two vectors, x and y, is defined by:
corr(x, y) = covariance(x, y) / (standard_deviation(x) × standard_deviation(y)) = s_xy / (s_x s_y)
– where the following standard statistical notation and definitions are used:
s_xy = (1/(n−1)) Σ_{k=1}^{n} (xk − x̄)(yk − ȳ)
s_x = √( (1/(n−1)) Σ_{k=1}^{n} (xk − x̄)² ),  s_y = √( (1/(n−1)) Σ_{k=1}^{n} (yk − ȳ)² )
– x̄ and ȳ are the means of x and y.

Correlation – Example (Perfect Correlation)
• Correlation is always in the range −1 to 1.
– A correlation of 1 (−1) means that x and y have a perfect positive (negative) linear relationship;
• that is, xk = ayk + b, where a and b are constants.
• The following two vectors x and y illustrate cases where the correlation is −1 and +1, respectively.
x = (−3, 6, 0, 3, −6)        x = (3, 6, 0, 3, 6)
y = (1, −2, 0, −1, 2)        y = (1, 2, 0, 1, 2)
corr(x, y) = −1 (xk = −3yk)  corr(x, y) = 1 (xk = 3yk)

31 32

31 32

Correlation – Example (Nonlinear Relationships)
• If the correlation is 0, then there is no linear relationship between the two sets of values.
– However, nonlinear relationships can still exist.
• In the following example, yk = xk², but their correlation is 0.
x = (−3, −2, −1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)
mean(x) = 0, mean(y) = 4
std(x) = 2.16, std(y) = 3.74
corr = [(−3)(5) + (−2)(0) + (−1)(−3) + (0)(−4) + (1)(−3) + (2)(0) + (3)(5)] / (6 × 2.16 × 3.74) = 0

Visually Evaluating Correlation
• Scatter plots showing the similarity from –1 to 1.
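A short sketch that reproduces both examples (the perfect linear relationships and the nonlinear one with correlation 0):

```python
import numpy as np

def pearson(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    # covariance / (std_x * std_y); the (n - 1) factors cancel.
    return (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

print(pearson([-3, 6, 0, 3, -6], [1, -2, 0, -1, 2]))   # -1.0 (x_k = -3 y_k)
print(pearson([3, 6, 0, 3, 6], [1, 2, 0, 1, 2]))       #  1.0 (x_k = 3 y_k)
print(pearson([-3, -2, -1, 0, 1, 2, 3],
              [9, 4, 1, 0, 1, 4, 9]))                  #  0.0 (y_k = x_k^2)
```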
33 34

33 34

Correlation vs Cosine vs Euclidean Distance
• Compare the three proximity measures according to their behavior under variable transformation
– scaling: multiplication by a value
– translation: adding a constant

Property                               Cosine  Correlation  Euclidean Distance
Invariant to scaling (multiplication)  Yes     Yes          No
Invariant to translation (addition)    No      Yes          No

• Consider the example
– x = (1, 2, 4, 3, 0, 0, 0), y = (1, 2, 3, 4, 0, 0, 0)
– ys = y × 2 = (2, 4, 6, 8, 0, 0, 0), yt = y + 5 = (6, 7, 8, 9, 5, 5, 5)

Measure             (x, y)   (x, ys)  (x, yt)
Cosine              0.9667   0.9667   0.7940
Correlation         0.9429   0.9429   0.9429
Euclidean Distance  1.4142   5.8310   14.2127

Correlation vs cosine vs Euclidean distance
• Choice of the right proximity measure depends on the domain
• What is the correct choice of proximity measure for the following situations?
– Comparing documents using the frequencies of words
• Documents are considered similar if the word frequencies are similar
– Comparing the temperature in Celsius of two locations
• Two locations are considered similar if the temperatures are similar in magnitude
– Comparing two time series of temperature measured in Celsius
• Two time series are considered similar if their shape is similar,
– i.e., they vary in the same way over time, achieving minimums and maximums at similar times, etc.
35 36

35 36

Comparison of Proximity Measures
• Domain of application
– Similarity measures tend to be specific to the type of attribute and data
– Record data, images, graphs, sequences, 3D-protein structure, etc. tend to have different measures
• However, one can talk about various properties that you would like a proximity measure to have
– Symmetry is a common one
– Tolerance to noise and outliers is another
– Ability to find more types of patterns?
– Many others possible
• The measure must be applicable to the data and produce results that agree with domain knowledge

Information Based Measures
• Information theory is a well-developed and fundamental discipline with broad applications
• Some similarity measures are based on information theory
– Mutual information in various versions
– Maximal Information Coefficient (MIC) and related measures
– General and can handle non-linear relationships
– Can be complicated and time intensive to compute
37 38

37 38

Entropy
• Information relates to possible outcomes of an event
– transmission of a message, flip of a coin, or measurement of a piece of data
• The more certain an outcome, the less information that it contains and vice-versa
– For example, if a coin has two heads, then an outcome of heads provides no information
– More quantitatively, the information is related to the probability of an outcome
• The smaller the probability of an outcome, the more information it provides and vice-versa
– Entropy is the commonly used measure

• For
– a variable (event), X,
– with n possible values (outcomes), x1, x2, …, xn
– each outcome having probability, p1, p2, …, pn
– the entropy of X, H(X), is given by
H(X) = − Σ_{i=1}^{n} p_i log2 p_i
• Entropy is between 0 and log2 n and is measured in bits
– Thus, entropy is a measure of how many bits it takes to represent an observation of X on average
39 40

39 40

Entropy Examples
• For a coin with probability p of heads and probability q = 1 − p of tails
H = −p log2 p − q log2 q
– For p = 0.5, q = 0.5 (fair coin), H = 1
– For p = 1 or q = 1, H = 0
• What is the entropy of a fair four-sided die?

Entropy for Sample Data: Example
Hair Color  Count  p     −p log2 p
Black       75     0.75  0.3113
Brown       15     0.15  0.4105
Blond       5      0.05  0.2161
Red         0      0.00  0
Other       5      0.05  0.2161
Total       100    1.0   1.1540
• Maximum entropy is log2 5 = 2.3219
• Maximum entropy is log25 = 2.3219


41 42

41 42

Entropy for Sample Data Mutual Information
• Suppose we have • used as a measure of similarity between two sets of
paired values that is sometimes used as an
– a number of observations (m) of some attribute, X, e.g., alternative to correlation, particularly when a
the hair color of students in the class, nonlinear relationship is suspected between the pairs
– where there are n different possible values of values.
– And the number of observation in the ith category is mi – This measure comes from information theory, which is
the study of how to formally define and quantify
– Then, for this sample information.
𝑛
𝑚𝑖 𝑚𝑖 – It is a measure of how much information one set of values
𝐻 𝑋 = −෍ log 2 provides about another, given that the values come in
𝑚 𝑚 pairs, e.g., height and weight.
𝑖=1
• If the two sets of values are independent, i.e., the value of one
tells us nothing about the other, then their mutual information is
0.
• For continuous data, the calculation is harder
43 44

43 44

Mutual Information Mutual Information Example


• Information one variable provides about another • Evaluating Nonlinear Relationships with Mutual Information
– Recall Example where y𝑘 = x𝑘2 , but their correlation was 0.
Formally, 𝐼 𝑋, 𝑌 = 𝐻 𝑋 + 𝐻 𝑌 − 𝐻(𝑋, 𝑌),
where H(X,Y) is the joint entropy of X and Y, x = (−3,−2,−1, 0, 1, 2, 3) y = ( 9, 4, 1, 0, 1, 4, 9)
I(x, y) = H(x) + H(y) − H(x, y) = 1.9502 Entropy for y
𝐻 𝑋, 𝑌 = − ෍ ෍ 𝑝𝑖𝑗log 2 𝑝𝑖𝑗
𝑖 𝑗
where pij is the probability that the ith value of X and
the jth value of Y occur together
Joint entropy for x and y
• For discrete variables, this is easy to compute Entropy for x
• Maximum mutual information for discrete variables
is log2(min( nX, nY ), where nX (nY) is the number of
values of X (Y)
45 46

45 46

Mutual Information Example
Student Status  Count  p     −p log2 p
Undergrad       45     0.45  0.5184
Grad            55     0.55  0.4744
Total           100    1.00  0.9928

Grade  Count  p     −p log2 p
A      35     0.35  0.5301
B      50     0.50  0.5000
C      15     0.15  0.4105
Total  100    1.00  1.4406

Student Status  Grade  Count  p     −p log2 p
Undergrad       A      5      0.05  0.2161
Undergrad       B      30     0.30  0.5211
Undergrad       C      10     0.10  0.3322
Grad            A      30     0.30  0.5211
Grad            B      20     0.20  0.4644
Grad            C      5      0.05  0.2161
Total                  100    1.00  2.2710

• Mutual information of Student Status and Grade = 0.9928 + 1.4406 − 2.2710 = 0.1624

Maximal Information Coefficient
• Applies mutual information to two continuous variables
• Consider the possible binnings of the variables into discrete categories
– nX × nY ≤ N^0.6 where
• nX is the number of values of X
• nY is the number of values of Y
• N is the number of samples (observations, data objects)
• Compute the mutual information
– Normalized by log2(min(nX, nY))
• Take the highest value
• Reshef, David N., Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, and Pardis C. Sabeti. "Detecting novel associations in large data sets." Science 334, no. 6062 (2011): 1518-1524.
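A numeric check of the Student Status / Grade example above, computed from the joint counts (a sketch):

```python
import numpy as np

# Joint counts: rows = (Undergrad, Grad), columns = grades (A, B, C).
joint = np.array([[5, 30, 10],
                  [30, 20, 5]], dtype=float)
p = joint / joint.sum()

def entropy(probs):
    probs = probs[probs > 0]
    return float(-(probs * np.log2(probs)).sum())

H_status = entropy(p.sum(axis=1))   # ~0.9928
H_grade = entropy(p.sum(axis=0))    # ~1.4406
H_joint = entropy(p.ravel())        # ~2.2710
print(round(H_status + H_grade - H_joint, 4))   # mutual information ~0.1624
```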
47 48

47 48

General Approach for Combining Similarities
• Sometimes attributes are of many different types, but an overall similarity is needed.
– For the kth attribute, compute a similarity, sk(x, y), in the range [0, 1].
– Define an indicator variable, δk, for the kth attribute as follows:
• δk = 0 if the kth attribute is an asymmetric attribute and both objects have a value of 0, or if one of the objects has a missing value for the kth attribute
• δk = 1 otherwise
– Compute the overall similarity as
similarity(x, y) = Σ_{k=1}^{n} δk sk(x, y) / Σ_{k=1}^{n} δk

Using Weights to Combine Similarities
• May not want to treat all attributes the same.
– Use non-negative weights ωk
similarity(x, y) = Σ_{k=1}^{n} ωk δk sk(x, y) / Σ_{k=1}^{n} ωk δk
• Can also define a weighted form of distance

49 50

49 50

Data Mining & Knowledge Discovery Data Mining & Knowledge Discovery

Prof. Dr. Nizamettin AYDIN


Similarity and Dissimilarity
Measures
naydin@yildiz.edu.tr • Outline
– Similarity and Dissimilarity between Simple Attributes
– Dissimilarities between Data Objects
http://www3.yildiz.edu.tr/~naydin – Similarities between Data Objects
– Examples of Proximity
– Mutual Information
– Issues in Proximity
– Selecting the Right Proximity Measure
1 2

1 2

Similarity and Dissimilarity Measures Similarity and Dissimilarity Measures


• Similarity and dissimilarity are important • Similarity measure
because they are used by a number of data – Numerical measure of how alike two data objects are.
mining techniques, such as clustering, nearest – Is higher when objects are more alike.
neighbor classification, and anomaly detection. – Often falls in the range [0,1]
• In many cases, the initial data set is not needed • Dissimilarity measure
once these similarities or dissimilarities have – Numerical measure of how different two data objects are
been computed. – Lower when objects are more alike
– Minimum dissimilarity is often 0, upper limit varies
• Such approaches can be viewed as transforming
– The term distance is used as a synonym for dissimilarity
the data to a similarity (dissimilarity) space and
then performing the analysis. • Proximity refers to a similarity or dissimilarity
3 4

3 4

Transformations Transformations
• often applied to convert a similarity to a • often applied to convert a similarity to a
dissimilarity, or vice versa, or to transform a dissimilarity, or vice versa, or to transform a
proximity measure to fall within a particular proximity measure to fall within a particular
range, such as [0,1]. range, such as [0,1].
– For instance, we may have similarities that range – For instance, we may have similarities that range
from 1 to 10, but the particular algorithm or software from 1 to 10, but the particular algorithm or software
package that we want to use may be designed to work package that we want to use may be designed to work
only with dissimilarities, or it may work only with only with dissimilarities, or it may work only with
similarities in the interval [0,1] similarities in the interval [0,1]
• Frequently, proximity measures, especially • Frequently, proximity measures, especially
similarities, are defined or transformed to have similarities, are defined or transformed to have
values in the interval [0,1]. values in the interval [0,1].
5 6

5 6

Copyright 2000 N. AYDIN. All rights


reserved. 1
Transformations Transformations
• Example: • However, there can be complications in mapping
proximity measures to the interval [0, 1] using a linear
– If the similarities between objects range from 1 (not transformation.
at all similar) to 10 (completely similar), we can make – If, for example, the proximity measure originally takes values
them fall within the range [0, 1] by using the in the interval [0,∞], then dmax is not defined and a nonlinear
transformation s′=(s-1)/9,where s and s′ are the transformation is needed.
original and new similarity values, respectively. – Values will not have the same relationship to one another on
the new scale.
• The transformation of similarities and • Consider the transformation d=d/(1+d) for a dissimilarity
dissimilarities to the interval [0, 1] measure that ranges from 0 to ∞.
– Given dissimilarities 0, 0.5, 2, 10, 100, 1000
– s′=(s-smin)/(smax- smin),where smax and smin are the
– Transformed dissimilarities 0, 0.33, 0.67, 0.90, 0.99, 0.999.
maximum and minimum similarity values.
• Larger values on the original dissimilarity scale are
– d′=(d-dmin)/(dmax- dmin),where dmax and dmin are the compressed into the range of values near 1, but whether
maximum and minimum dissimilarity values. this is desirable depends on the application.
7 8

7 8

Similarity/Dissimilarity for Simple Attributes Distances - Euclidean Distance


• The following table shows the similarity and dissimilarity • The Euclidean distance, d , between two points, x
between two objects, x and y, with respect to a single,
simple attribute. and y , in one-, two-, three-, or higher-
dimensional space, is given by

• Next, we consider more complicated measures of – where n is the number of dimensions (attributes) and
proximity between objects that involve multiple attributes: xk and yk are, respectively, the kth attributes
– dissimilarities between data objects (components) of data objects x and y.
– similarities between data objects. • Standardization is necessary, if scales differ.
9 10

9 10

Distances - Euclidean Distance Distances - Minkowski Distance


3
point x y • Minkowski Distance is a generalization of
2 p1 p1 0 2 Euclidean Distance, and is given by
p3 p4 p2 2 0
1 p3 3 1
p2
0
p4 5 1
0 1 2 3 4 5 6

– where r is a parameter, n is the number of dimensions


p1 p2 p3 p4 (attributes) and xk and yk are are, respectively, the kth
p1 0 2.828 3.162 5.099 attributes (components) of data objects x and y.
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
11 12

11 12

Copyright 2000 N. AYDIN. All rights


reserved. 2
Distances - Minkowski Distance Distances - Minkowski Distance
• The following are the three most common 3
L1 p1 p2 p3 p4
p1 0 4 4 6
examples of Minkowski distances. 2 p1
p3 p4 p2 4 0 2 4
– r = 1 , City block (Manhattan, taxicab, L1 norm) 1
p2
p3
p4
4
6
2
4
0
2
2
0
distance. 0
0 1 2 3 4 5 6 L2 p1 p2 p3 p4
– A common example of this for binary vectors is the Hamming p1 0 2.828 3.162 5.099
distance, which is just the number of bits that are different between p2 2.828 0 1.414 3.162
two binary vectors point x y
p1 0 2 p3 3.162 1.414 0 2
– r = 2 , Euclidean distance (L2 norm) p2 2 0 p4 5.099 3.162 2 0
p3 3 1
– r = ∞ , Supremum (Lmax norm, L∞ norm) distance. p4 5 1
L p1 p2 p3 p4
p1 0 2 3 5
• This is the maximum difference between any component of p2 2 0 1 3
the vectors p3 3 1 0 2
p4 5 3 2 0
• Do not confuse r with n, i.e., all these distances
Distance Matrix
are defined for all numbers of dimensions.
13 14

13 14

Distances - Mahalanobis Distance Distances - Mahalanobis Distance


• Mahalonobis distance is the distance between a • In the Figure, there are 1000 points, whose x and y
point and a distribution (not between two distinct attributes have a correlation of 0.6.
points). – The Euclidean distance
– It is effectively a multivariate equivalent of the Euclidean between the two large
distance. points at the opposite
• It transforms the columns into uncorrelated variables ends of the long axis of
• Scale the columns to make their variance equal to 1 the ellipse is 14.7, but
• Finally, it calculates the Euclidean distance. Mahalanobis distance
is only 6.
• It is defined as • This is because the
Mahalanobis distance
gives less emphasis to
– where Σ−1 is the inverse of the covariance matrix of the the direction of largest
data. variance.

15 16

15 16

Distances - Mahalanobis Distance Common Properties of a Distance


• Covariance Matrix: • Distances, such as the Euclidean distance, have
some well-known properties.
0.3 0.2 • If d(x, y) is the distance between two points, x and y,
= 
0.2 0.3 then the following properties hold.
– Positivity
A: (0.5, 0.5) C • d(x, y) ≥ 0 for all x and y
• d(x, y) = 0 only if x = y
B: (0, 1) B – Symmetry
• d(x, y) = d(y, x) for all x and y
A – Triangle Inequality
C: (1.5, 1.5) • d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z

Mahal(A,B) = 5 • Measures that satisfy all three properties are known


as metrics
Mahal(A,C) = 4
17 18

17 18

Copyright 2000 N. AYDIN. All rights


reserved. 3
Common Properties of a Similarity A Non-symmetric Similarity Measure Example

• If s(x, y) is the similarity between points x and y, • Consider an experiment in which people are
then the typical properties of similarities are the asked to classify a small set of characters as they
following: flash on a screen.
– Positivity – The confusion matrix for this experiment records
• s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1) how often each character is classified as itself, and
– Symmetry how often each is classified as another character.
• s(x, y) = s(y, x) for all x and y – Using the confusion matrix, we can define a
• For similarities, the triangle inequality typically similarity measure between a character x and a
character y as the number of times that x is
does not hold
misclassified as y,
– However, a similarity measure can be converted to a • but note that this measure is not symmetric.
metric distance
19 20

19 20

A Non-symmetric Similarity Measure Example Similarity Measures for Binary Data


• For example, suppose that “0” appeared 200 • Similarity measures between objects that contain
times and was classified as a “0” 160 times, but only binary attributes are called similarity
as an “o” 40 times. coefficients, and typically have values between 0
• Likewise, suppose that “o” appeared 200 times and 1.
and was classified as an “o” 170 times, but as “0” • Let x and y be two objects that consist of n
only 30 times. binary attributes.
– Then, s(0,o) = 40, but s(o, 0) = 30. – The comparison of two binary vectors, leads to the
• In such situations, the similarity measure can be following quantities (frequencies):
• f00 = the number of attributes where x is 0 and y is 0
made symmetric by setting
• f01 = the number of attributes where x is 0 and y is 1
– s′(x, y) = s′(y, x) = (s(x, y)+s(y, x))/2, • f10 = the number of attributes where x is 1 and y is 0
• where s indicates the new similarity measure. • f11 = the number of attributes where x is 1 and y is 1
21 22

21 22

Similarity Measures for Binary Data Similarity Measures for Binary Data
• Simple Matching Coefficient (SMC) • Jaccard Similarity Coefficient
– One commonly used similarity coefficient – frequently used to handle objects consisting of
asymmetric binary attributes

– This measure counts both presences and absences


equally. – This measure counts both presences and absences
• Consequently, the SMC could be used to find students who equally.
had answered questions similarly on a test that consisted • Consequently, the SMC could be used to find students who
only of true/false questions. had answered questions similarly on a test that consisted
only of true/false questions.

23 24

23 24

Copyright 2000 N. AYDIN. All rights


reserved. 4
SMC versus Jaccard: Example Cosine Similarity
• Calculate SMC and J for the binary vectors, • Cosine Similarity is one of the most common
x = (1 0 0 0 0 0 0 0 0 0) measures of document similarity
y = (0 0 0 0 0 0 1 0 0 1)
• If x and y are two document vectors, then
f01 = 2 (the number of attributes where x was 0 and y was 1)
f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
– where ′ indicates vector or matrix transpose and x,y
f11 = 0 (the number of attributes where x was 1 and y was 1) indicates the inner product of the two vectors,
• and 𝑥 is the length of vector x,
SMC = (f11 + f00) / (f01 + f10 + f11 + f00)
= (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = (f11) / (f01 + f10 + f11)
= 0 / (2 + 1 + 0) =0
25 26

25 26

Cosine Similarity

• Cosine Similarity is one of the most common measures of document similarity
• If x and y are two document vectors, then
  cos(x, y) = ⟨x, y⟩ / (‖x‖ × ‖y‖)
  – where ⟨x, y⟩ = x′y indicates the inner product of the two vectors (′ indicates vector or matrix transpose), and ‖x‖ is the length of vector x
• Cosine similarity really is a measure of the (cosine of the) angle between x and y.
  – Thus, if the cosine similarity is 1, the angle between x and y is 0°, and x and y are the same except for length.
  – If the cosine similarity is 0, then the angle between x and y is 90°, and they do not share any terms (words).
• It can also be written as the inner product of the two length-normalized vectors, x/‖x‖ and y/‖y‖.

Cosine Similarity - Example

• This example calculates the cosine similarity for the following two data objects, which might represent document vectors:
  x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
  y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
  ⟨x, y⟩ = 3×1 + 2×0 + 0×0 + 5×0 + 0×0 + 0×0 + 0×0 + 2×1 + 0×0 + 0×2 = 5
  ‖x‖ = √(3² + 2² + 0² + 5² + 0² + 0² + 0² + 2² + 0² + 0²) = 6.48
  ‖y‖ = √(1² + 0² + 0² + 0² + 0² + 0² + 0² + 1² + 0² + 2²) = 2.45
  cos(x, y) = ⟨x, y⟩ / (‖x‖ × ‖y‖) = 5 / (6.48 × 2.45) = 0.31
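The same computation in Python (a small illustrative sketch, not part of the original slides):

import math

x = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
y = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

dot = sum(a * b for a, b in zip(x, y))        # <x, y> = 5
norm_x = math.sqrt(sum(a * a for a in x))     # ||x|| ~ 6.48
norm_y = math.sqrt(sum(b * b for b in y))     # ||y|| ~ 2.45
print(round(dot / (norm_x * norm_y), 2))      # 0.31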

27 28

27 28

Extended Jaccard Coefficient

• Also known as Tanimoto Coefficient
• The extended Jaccard coefficient can be used for document data and reduces to the Jaccard coefficient in the case of binary attributes.
• This coefficient, which we shall represent as EJ, is defined by the following equation:
  EJ(x, y) = ⟨x, y⟩ / (‖x‖² + ‖y‖² − ⟨x, y⟩)

Correlation

• used to measure the linear relationship between two sets of values that are observed together.
  – Thus, correlation can measure the relationship between two variables (height and weight) or between two objects (a pair of temperature time series).
• Correlation is used much more frequently to measure the similarity between attributes
  – since the values in two data objects come from different attributes, which can have very different attribute types and scales.
• There are many types of correlation
29 30

29 30

Correlation - Pearson’s correlation

• Pearson’s correlation between two sets of numerical values, i.e., two vectors, x and y, is defined by:
  corr(x, y) = covariance(x, y) / (standard_deviation(x) × standard_deviation(y)) = s_xy / (s_x × s_y)
  – where the following standard statistical notation and definitions are used:
    s_xy = (1/(n−1)) Σ_k (x_k − x̄)(y_k − ȳ),   s_x = √((1/(n−1)) Σ_k (x_k − x̄)²),   s_y = √((1/(n−1)) Σ_k (y_k − ȳ)²)

Correlation – Example (Perfect Correlation)

• Correlation is always in the range −1 to 1.
  – A correlation of 1 (−1) means that x and y have a perfect positive (negative) linear relationship;
    • that is, xk = ayk + b, where a and b are constants.
• The following two vectors x and y illustrate cases where the correlation is −1 and +1, respectively.
  x = (−3, 6, 0, 3, −6)         x = (3, 6, 0, 3, 6)
  y = ( 1, −2, 0, −1, 2)        y = (1, 2, 0, 1, 2)
  corr(x, y) = −1 (xk = −3yk)   corr(x, y) = 1 (xk = 3yk)
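A short Python check of the two perfect-correlation cases above (an illustrative sketch, not part of the original slides):

import math

def corr(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)

print(corr([-3, 6, 0, 3, -6], [1, -2, 0, -1, 2]))   # -1 (perfect negative linear relationship)
print(corr([3, 6, 0, 3, 6], [1, 2, 0, 1, 2]))       # +1 (perfect positive linear relationship)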

31 32

31 32

Correlation – Example (Nonlinear Relationships)

• If the correlation is 0, then there is no linear relationship between the two sets of values.
  – However, nonlinear relationships can still exist.
    • In the following example, yk = xk², but their correlation is 0.
  x = (−3, −2, −1, 0, 1, 2, 3)
  y = ( 9, 4, 1, 0, 1, 4, 9)
  mean(x) = 0, mean(y) = 4
  std(x) = 2.16, std(y) = 3.74
  corr(x, y) = [(−3)(5) + (−2)(0) + (−1)(−3) + (0)(−4) + (1)(−3) + (2)(0) + (3)(5)] / (6 × 2.16 × 3.74) = 0

Visually Evaluating Correlation

• Scatter plots showing the similarity from –1 to 1.
  [Figure: a grid of scatter plots of variable pairs with correlation values ranging from −1 to 1.]
33 34

33 34

Correlation vs Cosine vs Euclidean Distance

• Compare the three proximity measures according to their behavior under variable transformation
  – scaling: multiplication by a value
  – translation: adding a constant

  Property                                 Cosine   Correlation   Euclidean Distance
  Invariant to scaling (multiplication)    Yes      Yes           No
  Invariant to translation (addition)      No       Yes           No

• Consider the example
  – x = (1, 2, 4, 3, 0, 0, 0), y = (1, 2, 3, 4, 0, 0, 0)
  – ys = y × 2 = (2, 4, 6, 8, 0, 0, 0)    yt = y + 5 = (6, 7, 8, 9, 5, 5, 5)

  Measure              (x, y)    (x, ys)   (x, yt)
  Cosine               0.9667    0.9667    0.7940
  Correlation          0.9429    0.9429    0.9429
  Euclidean Distance   1.4142    5.8310    14.2127
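The qualitative behavior in the table can be verified directly; the sketch below (Python, added here for illustration) evaluates the three measures on x, y, ys, and yt. Cosine is unchanged by scaling but not by translation, correlation is unchanged by both, and Euclidean distance changes under both.

import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def corr(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x  = [1, 2, 4, 3, 0, 0, 0]
y  = [1, 2, 3, 4, 0, 0, 0]
ys = [2 * v for v in y]        # scaling
yt = [v + 5 for v in y]        # translation

for label, other in (("y", y), ("ys", ys), ("yt", yt)):
    print(label, round(cosine(x, other), 4), round(corr(x, other), 4), round(euclidean(x, other), 4))
# cosine(x, y) == cosine(x, ys), but cosine(x, yt) differs;
# corr(x, y) == corr(x, ys) == corr(x, yt);
# all three Euclidean distances are different.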
35 36

35 36

Correlation vs cosine vs Euclidean distance Korelasyon - Kosinüs - Öklid Mesafesi
• Choice of the right proximity measure depends on • Doğru yakınlık ölçüsünün seçimi, alana bağlıdır
the domain • Aşağıdaki durumlar için doğru yakınlık ölçüsü
• What is the correct choice of proximity measure for seçimi nedir?
the following situations?
– Kelimelerin sıklıklarını kullanarak belgeleri karşılaştırma
– Comparing documents using the frequencies of words
• Kelime sıklıkları benzerse belgeler benzer kabul edilir
• Documents are considered similar if the word frequencies are
similar – İki konumun Santigrat cinsinden sıcaklıklarının
– Comparing the temperature in Celsius of two locations karşılaştırılması
• Two locations are considered similar if the temperatures are • Sıcaklık değerleri benzerse iki konum benzer kabul edilir
similar in magnitude
– Santigrat cinsinden ölçülen iki sıcaklık zaman serisinin
– Comparing two time series of temperature measured in karşılaştırılması
Celsius
• Şekilleri benzerse iki zaman serisi benzer kabul edilir,
• Two time series are considered similar if their shape is similar,
– yani., zaman içinde aynı şekilde değişirler, en yüksek ve en düşük
– i.e., they vary in the same way over time, achieving minimums and
maximums at similar times, etc. değerlerine benzer zamanlarda ulaşırlar, vs.

37 38

37 38

Comparison of Proximity Measures Yakınlık Ölçülerinin Karşılaştırılması


• Domain of application • Uygulama alanı
– Similarity measures tend to be specific to the type of – Benzerlik ölçüleri özellik ve veri türüne özgü olma
attribute and data eğilimindedir.
– Record data, images, graphs, sequences, 3D-protein – Kayıt verileri, görüntüler, grafikler, diziler, 3D-protein yapısı
structure, etc. tend to have different measures vb. farklı ölçümlere sahip olma eğilimindedir.
• However, one can talk about various properties that • Bununla birlikte, bir yakınlık ölçüsünün sahip olmasını
you would like a proximity measure to have istediğiniz çeşitli özelliklerden bahsetmek mümkündür
– Symmetry is a common one – Simetri
– Tolerance to noise and outliers is another – Gürültüye ve aykırı değerlere tolerans
– Ability to find more types of patterns? – Daha fazla desen türü bulma yeteneği
– Many others possible – …..
• The measure must be applicable to the data and • Ölçü, veriye uygulanabilir olmalı ve alan bilgisi ile
produce results that agree with domain knowledge uyumlu sonuçlar üretmelidir
39 40

39 40

Information Based Measures Bilgiye Dayalı Ölçüler


• Information theory is a well-developed and • Bilgi teorisi, geniş uygulamaları olan iyi
fundamental discipline with broad applications gelişmiş ve temel bir disiplindir.

• Some similarity measures are based on • Bazı benzerlik ölçüleri bilgi teorisine
information theory
dayanmaktadır:
– Mutual information in various versions
– Karşılıklı/Ortak Bilgi ve değişik versiyonları
– Maximal Information Coefficient (MIC) and related
measures – Azami Bilgi Katsayısı (MIC) ve ilgili ölçüler
• General and can handle non-linear relationships • Genel ve doğrusal olmayan ilişkileri yönetebilir
• Can be complicated and time intensive to • Hesaplaması karmaşık ve zaman alıcı olabilir
compute
41 42

41 42

Information Based Measures Bilgiye Dayalı Ölçüler
• Information relates to possible outcomes of an event • Bilgi, bir olayın olası sonuçlarıyla ilgilidir
– transmission of a message, flip of a coin, or measurement – bir mesajın iletilmesi, yazı tura atılması veya bir veri
of a piece of data parçasının ölçülmesi

• The more certain an outcome, the less information


• Bir sonuç ne kadar kesinse/muğlaksa, o kadar az/çok
that it contains and vice-versa
bilgi içerir.
– For example, if a coin has two heads, then an outcome of
heads provides no information – Örneğin, bir madeni parada iki tura varsa, o zaman tura
– More quantitatively, the information is related the sonucu hiçbir bilgi sağlamaz.
probability of an outcome – Daha nicel olarak, bilgi bir sonucun olasılığı ile ilgilidir
• The smaller the probability of an outcome, the more information • Bir sonucun olasılığı ne kadar küçükse, o kadar fazla bilgi sağlar
it provides and vice-versa ve bunun tersi de geçerlidir.
– Entropy is the commonly used measure – Entropi yaygın olarak kullanılan ölçüdür
43 44

43 44

Entropy

• For
  – a variable (event), X,
  – with n possible values (outcomes), x1, x2, …, xn
  – each outcome having probability, p1, p2, …, pn
  – the entropy of X, H(X), is given by
    H(X) = − Σ_{i=1..n} p_i log2 p_i
• Entropy is between 0 and log2 n and is measured in bits
  – Thus, entropy is a measure of how many bits it takes to represent an observation of X on average
45 46

45 46

Entropy - Example

• For a coin with probability p of heads and probability q = 1 − p of tails
  H = −p log2 p − q log2 q
  – For p = 0.5, q = 0.5 (fair coin), H = 1
  – For p = 1 or q = 1, H = 0
• What is the entropy of a fair four-sided die?
  H(X) = − Σ_{i=1..4} 0.25 log2 0.25 = 2 bits
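A minimal Python sketch (added for illustration) of the entropy calculation for these examples:

import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                 # fair coin: 1.0 bit
print(entropy([1.0]))                      # two-headed coin: 0.0 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))   # fair four-sided die: 2.0 bits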
47 48

47 48

Entropy for Sample Data

• Suppose we have
  – a number of observations (m) of some attribute, X, e.g., the hair color of students in the class,
  – where there are n different possible values
  – and the number of observations in the ith category is mi
  – Then, for this sample
    H(X) = − Σ_{i=1..n} (mi/m) log2 (mi/m)
• For continuous data, the calculation is harder
49 50

49 50

Entropy - Example

  Hair Color   Count   p      −p log2 p
  Black        75      0.75   0.3113
  Brown        15      0.15   0.4105
  Blond        5       0.05   0.2161
  Red          0       0.00   0
  Other        5       0.05   0.2161
  Total        100     1.0    1.1540

• H(X) = − Σ_{i=1..n} p_i log2 p_i = 1.1540
• Maximum entropy is log2 5 = 2.3219
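The sample entropy for these counts, as a small Python sketch (illustrative, not from the slides):

import math

counts = {"Black": 75, "Brown": 15, "Blond": 5, "Red": 0, "Other": 5}
m = sum(counts.values())

H = -sum((mi / m) * math.log2(mi / m) for mi in counts.values() if mi > 0)
print(round(H, 4))                        # ~1.154, the table total of 1.1540
print(round(math.log2(len(counts)), 4))   # maximum possible entropy: log2(5) = 2.3219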
51 52

51 52

Mutual Information Mutual Information (Karşılıklı/Ortak Bilgi)


• used as a measure of similarity between two sets of • Karşılıklı bilgi, özellikle değer çiftleri arasında
paired values that is sometimes used as an doğrusal olmayan bir ilişkiden şüphelenildiğinde,
alternative to correlation, particularly when a bazen korelasyona alternatif olarak kullanılan iki
nonlinear relationship is suspected between the pairs eşleştirilmiş değerler kümesi arasındaki benzerliğin
of values.
– This measure comes from information theory, which is
bir ölçüsü olarak kullanılır.
the study of how to formally define and quantify – Bu ölçüm, bilginin nasıl tanımlanacağı ve
information. nicelleştirileceği üzerine yapılan çalışma olan bilgi
– It is a measure of how much information one set of values teorisinden gelmektedir.
provides about another, given that the values come in – Değerlerin çiftler halinde geldiği göz önüne alındığında,
pairs, e.g., height and weight. örneğin boy ve kilo gibi, bir değer kümesinin diğeri
• If the two sets of values are independent, i.e., the value of one hakkında ne kadar bilgi sağladığının bir ölçüsüdür.
tells us nothing about the other, then their mutual information is
0. • İki değer kümesi bağımsızsa, yani birinin değeri bize diğeri
hakkında hiçbir şey söylemiyorsa, karşılıklı bilgileri 0'dır.

53 54

53 54

Mutual Information

• Information one variable provides about another. Formally,
  I(X, Y) = H(X) + H(Y) − H(X, Y),
  where H(X, Y) is the joint entropy of X and Y:
  H(X, Y) = − Σ_i Σ_j p_ij log2 p_ij
  where p_ij is the probability that the ith value of X and the jth value of Y occur together
• For discrete variables, this is easy to compute
• Maximum mutual information for discrete variables is log2(min(nX, nY)), where nX (nY) is the number of values of X (Y)
55 56

55 56

Mutual Information Example

• Evaluating Nonlinear Relationships with Mutual Information
  – Recall the example where yk = xk², but their correlation was 0.
    x = (−3, −2, −1, 0, 1, 2, 3)    y = (9, 4, 1, 0, 1, 4, 9)
    I(x, y) = H(x) + H(y) − H(x, y) = 1.9502
  [The slide shows the supporting tables: the entropy for x, the entropy for y, and the joint entropy for x and y.]

57 58

57 58

Mutual Information Example

  Student Status   Count   p      −p log2 p
  Undergrad        45      0.45   0.5184
  Grad             55      0.55   0.4744
  Total            100     1.00   0.9928

  Grade   Count   p      −p log2 p
  A       35      0.35   0.5301
  B       50      0.50   0.5000
  C       15      0.15   0.4105
  Total   100     1.00   1.4406

  Student Status   Grade   Count   p      −p log2 p
  Undergrad        A       5       0.05   0.2161
  Undergrad        B       30      0.30   0.5211
  Undergrad        C       10      0.10   0.3322
  Grad             A       30      0.30   0.5211
  Grad             B       20      0.20   0.4644
  Grad             C       5       0.05   0.2161
  Total            100     1.00    2.2710

• Mutual information of Student Status and Grade = 0.9928 + 1.4406 − 2.2710 = 0.1624
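The same result computed directly from the joint counts (an illustrative Python sketch, not part of the original slides):

import math

joint = {("Undergrad", "A"): 5, ("Undergrad", "B"): 30, ("Undergrad", "C"): 10,
         ("Grad", "A"): 30, ("Grad", "B"): 20, ("Grad", "C"): 5}
n = sum(joint.values())

def entropy(counts):
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

status_counts, grade_counts = {}, {}
for (s, g), c in joint.items():
    status_counts[s] = status_counts.get(s, 0) + c
    grade_counts[g] = grade_counts.get(g, 0) + c

H_status = entropy(status_counts.values())     # ~0.9928
H_grade = entropy(grade_counts.values())       # ~1.4406
H_joint = entropy(joint.values())              # ~2.2710
print(round(H_status + H_grade - H_joint, 3))  # ~0.162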
59 60

59 60

Maximal Information Coefficient Maksimal/Azami Bilgi Katsayısı
• Applies mutual information to two continuous • Karşılıklı Bilgiyi iki sürekli değişkene uygular
variables • Değişkenlerin olası paketleme(binning)lerini ayrık
• Consider the possible binnings of the variables into kategoriler halinde değerlendirin
discrete categories – nX × nY ≤ N0.6 . Burada
– nX × nY ≤ N0.6 where • nX , X değerlerinin sayısıdır
• nX is the number of values of X • nY , Y değerlerinin sayısıdır
• nY is the number of values of Y • N , örneklerin (gözlemlerin, veri nesnelerinin) sayısıdır
• N is the number of samples (observations, data objects) • Karşılıklı Bilgiyi hesaplayın
• Compute the mutual information – log2(min( nX, nY ) kullanılarak normalize edilmiştir
– Normalized by log2(min( nX, nY ) • En büyük değeri al
• Take the highest value • Reshef, David N., Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter J. Turnbaugh, Eric S.
Lander, Michael Mitzenmacher, and Pardis C. Sabeti. "Detecting novel associations in large data sets." science 334, no.
• Reshef, David N., Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter J. Turnbaugh, Eric S. 6062 (2011): 1518-1524.
Lander, Michael Mitzenmacher, and Pardis C. Sabeti. "Detecting novel associations in large data sets." science 334, no.
6062 (2011): 1518-1524.
61 62

61 62

General Approach for Combining Similarities

• Sometimes attributes are of many different types, but an overall similarity is needed.
  – For the kth attribute, compute a similarity, sk(x, y), in the range [0, 1].
  – Define an indicator variable, δk, for the kth attribute as follows:
    • δk = 0 if the kth attribute is an asymmetric attribute and both objects have a value of 0, or if one of the objects has a missing value for the kth attribute
    • δk = 1 otherwise
  – Compute
    similarity(x, y) = Σ_{k=1..n} δk sk(x, y) / Σ_{k=1..n} δk
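A minimal Python sketch of this combination rule; the per-attribute similarities and indicator values below are made-up illustrative numbers, not taken from the slides.

def combined_similarity(s, delta):
    # s[k] is the per-attribute similarity sk(x, y); delta[k] is the indicator variable.
    num = sum(d * sk for sk, d in zip(s, delta))
    den = sum(delta)
    return num / den if den > 0 else 0.0

s = [0.8, 0.0, 0.5, 1.0]   # sk(x, y) for four attributes
delta = [1, 0, 1, 1]       # second attribute ignored (e.g., asymmetric 0-0 match or missing value)
print(combined_similarity(s, delta))   # (0.8 + 0.5 + 1.0) / 3 ~ 0.767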
63 64

63 64

Using Weights to Combine Similarities

• May not want to treat all attributes the same.
  – Use non-negative weights ωk
    similarity(x, y) = Σ_{k=1..n} ωk δk sk(x, y) / Σ_{k=1..n} ωk δk
• Can also define a weighted form of distance
    d(x, y) = ( Σ_{k=1..n} ωk |xk − yk|^r )^(1/r)
65 66

65 66

Copyright 2000 N. AYDIN. All rights


reserved. 11
Data Mining & Knowledge Discovery Veri Madenciliği ve Bilgi Keşfi

Prof. Dr. Nizamettin AYDIN Prof. Dr. Nizamettin AYDIN

naydin@yildiz.edu.tr naydin@yildiz.edu.tr

http://www3.yildiz.edu.tr/~naydin http://www3.yildiz.edu.tr/~naydin

1 2

1 2

Data Mining Veri Madenciliği


Classification Sınıflandırma
Basic Concepts and Techniques Temel Kavramlar ve Teknikler
• Outline • Konular
– Basic Concepts – Temel konseptler
– General Framework for Classification – Sınıflandırma için Genel Çerçeve
– Decision Tree Classifier – Karar Ağacı Sınıflandırıcı
– Characteristics of Decision Tree Classifiers – Karar Ağacı Sınıflandırıcılarının Özellikleri
– Model Overfitting – Model aşırı öğrenme
– Model Selection – Model Seçimi
– Model Evaluation – Model Değerlendirmesi
– Model Comparison – Model karşılaştırması

3 4

3 4

Classification: Definition Sınıflandırma: Tanım


• Given a collection of records (training set ) • Bir kayıt koleksiyonu (eğitim seti) verildiğinde
– Each record is characterized by a tuple (x,y), where x is – Her kayıt, bir (x,y) değişken gurubu (demet) ile karakterize
the attribute set and y is the class label edilir; burada x, öznitelik kümesidir ve y, sınıf etiketidir.
• x: attribute, predictor, independent variable, input • x: nitelik/öznitelik, öngörücü/yordayıcı, bağımsız değişken, girdi
• y: class, response, dependent variable, output • y: sınıf, yanıt, bağımlı değişken, çıktı
• Classification task: • Sınıflandırma görevi:
– Learn a model that maps each attribute set x into one of – Her öznitelik kümesi x'i önceden tanımlanmış y sınıf
the predefined class labels y etiketlerinden birine eşleyen bir model öğrenmek

– A classification model is an abstract representation of the – Bir sınıflandırma modeli, öznitelik kümesi ile sınıf etiketi
relationship between the attribute set and the class label. arasındaki ilişkinin soyut bir temsilidir.
5 6

5 6

Classification: Definition Sınıflandırma: Tanım
• A classification model serves two important roles • Bir sınıflandırma modelinin, veri madenciliğinde
in data mining: iki önemli rolü vardır:
– it is used as a predictive model to classify previously – Önceden etiketlenmemiş örnekleri sınıflandırmak için
unlabeled instances tahmin modeli olarak kullanılır
• A good classification model must provide accurate • İyi bir sınıflandırma modeli, hızlı yanıt süresi ile doğru
predictions with a fast response time tahminler sağlamalıdır.
– it serves as a descriptive model to identify the – Örnekleri farklı sınıflardan ayıran özellikleri
characteristics that distinguish instances from belirlemek için bir tanımlayıcı model olarak hizmet
different classes eder.
• Useful for critical applications, such as medical diagnosis, • Böyle bir karara nasıl ulaştığını gerekçelendirmeden tahmin
where it is insufficient to have a model that makes a yapan bir modele sahip olmanın yetersiz olduğu tıbbi teşhis
prediction without justifying how it reaches such a decision gibi kritik uygulamalar için kullanışlıdır.
7 8

7 8

Examples of Classification Task Sınıflandırma Görevi Örnekleri

Task Attribute set, x Class label, y Görev Nitelik kümesi, x Sınıf etiketi, y

Categorizing Features extracted from spam or non-spam E-posta E-posta mesajı Spam ya da spam
email messages email message header mesajlarını başlığından ve değil
and content sınıflandırma içeriğinden çıkarılan
(Spam filtering) Binary Binary
(Spam filtreleme) özellikler
Identifying Features extracted from malignant or benign Tümör hücrelerini Röntgen veya MRI Kötü huylu veya iyi
tumor cells x-rays or MRI scans cells tanımlama taramalarından elde huylu hücreler
Binary edilen özellikler Binary

Cataloging Features extracted from Elliptical, spiral, or Galaksileri Teleskop Eliptik, sarmal veya
galaxies telescope images irregular-shaped kataloglama görüntülerinden düzensiz şekilli
galaxies çıkarılan özellikler galaksiler
Multiclass Multiclass

9 10

9 10

Example-Vertebrate Classification Örnek-Omurgalı Sınıflandırması


• A sample data set for classifying vertebrates into mammals, • Omurgalıları memeliler, sürüngenler, kuşlar, balıklar ve amfibiler
reptiles, birds, fishes, and amphibians. olarak sınıflandırmak için örnek bir veri seti.
– The attribute set includes characteristics of the vertebrate such as its body – Nitelik seti, vücut ısısı, deri örtüsü ve uçma yeteneği gibi omurgalıların
temperature, skin cover, and ability to fly. özelliklerini içerir.

11 12

11 12

Example-Loan Borrower Classification Örnek-Kredi Alan Sınıflandırması
• A sample data set for the problem of predicting whether a loan • Bir kredi borçlusunun krediyi geri ödeyip ödemeyeceğini veya
borrower will repay the loan or default on the loan payments. kredi ödemelerinde temerrüde düşüp düşmeyeceğini tahmin etme
– The attribute set includes personal information of the borrower such as problemi için örnek bir veri seti.
marital status and annual income, while the class label indicates whether – Öznitelik seti, borçlunun medeni durumu ve yıllık geliri gibi kişisel
the borrower had defaulted on the loan payments. bilgilerini içerirken, sınıf etiketi borçlunun kredi ödemelerinde temerrüde
düşüp düşmediğini gösterir.

13 14

13 14

General Framework for Building Classification Model Sınıflandırma Modeli Oluşturmak için Genel Çerçeve

• Classification is the task of assigning labels to • Sınıflandırma, etiketlenmemiş veri örneklerine


unlabeled data instances (test data) (test verisi) etiket atama görevidir
– a classifier is used to perform such a task – Böyle bir görevi gerçekleştirmek için bir sınıflandırıcı
kullanılır
• The model is created using a given a set of • Model, eğitim seti olarak bilinen, her bir örnek
instances, known as the training set, which için nitelik değerlerinin yanı sıra sınıf etiketlerini
contains attribute values as well as class labels içeren belirli bir örnek kümesi kullanılarak
for each instance. oluşturulur.
• Learning algorithm • Öğrenme algoritması
– the systematic approach for learning a classification – bir eğitim seti verilen bir sınıflandırma modelini
model given a training set öğrenmek için sistematik yaklaşım
15 16

15 16

General Framework for Building Classification Model Sınıflandırma Modeli Oluşturmak için Genel Çerçeve

• Induction • İndüksiyon
– The process of using a learning algorithm to build a – Eğitim verilerinden bir sınıflandırma modeli oluşturmak için
classification model from the training data. bir öğrenme algoritması kullanma süreci.
– AKA learning a model or building a model. – Bir model öğrenmek veya bir model oluşturmak olarak da
bilinir.
• Deduction • Çıkarım
– Process of applying a classification model on unseen test – Sınıf etiketlerini tahmin etmek için görünmeyen test
instances to predict their class labels. örneklerine bir sınıflandırma modeli uygulama süreci.
• Process of classification involves two steps: • Sınıflandırma işlemi iki adımdan oluşur:
– applying a learning algorithm to training data to learn a – bir modeli öğrenmek için eğitim verilerine bir öğrenme
model, algoritması uygulamak,
– applying the model to assign labels to unlabeled instances – etiketlenmemiş örneklere etiket atamak için modeli uygulama
• A classification technique refers to a general • Bir sınıflandırma tekniği, sınıflandırmaya yönelik genel
approach to classification bir yaklaşımı ifade eder.

17 18

17 18

General Framework for Building Classification Model Sınıflandırma Modeli Oluşturmak için Genel Çerçeve

• the induction and • İndükleme ve


deduction steps Çıkarım adımları
should be ayrı ayrı
performed yapılmalıdır.
separately. • İndüklenen
• the training and modelin daha
test sets should be önce hiç
independent of karşılaşmadığı
each other to örneklerin sınıf
ensure that the etiketlerini doğru
induced model bir şekilde tahmin
can accurately edebilmesini
predict the class sağlamak için
labels of instances eğitim ve test
it has never setleri birbirinden
encountered bağımsız
before. olmalıdır.
19 20

19 20

General Framework for Building Classification Model Sınıflandırma Modeli Oluşturmak için Genel Çerçeve

• Models that deliver such predictive insights are said to • Bu tür tahmine dayalı içgörüler sağlayan modellerin iyi bir
have good generalization performance. genelleme performansına sahip oldukları söylenebilir.
• The performance of a model (classifier) can be evaluated • Bir modelin (sınıflandırıcı) performansı, tahmin edilen
by comparing the predicted labels against the true labels etiketleri örneklerin gerçek etiketleriyle karşılaştırarak
of instances. değerlendirilebilir.
• Bu bilgi, karışıklık matrisi adı verilen bir tabloda
• This information can be summarized in a table called a
özetlenebilir.
confusion matrix.

– Each entry fij denotes the number of instances from class i – Her girdi fij , i sınıfından olduğu halde j sınıfına ait olduğu tahmin
predicted to be of class j. edilen örneklerin sayısını belirtir
• number of correct predictions: f11 + f00 • doğru tahmin sayısı : f11 + f00
• number of incorrect predictions: f10 + f01 • yanlış tahmin sayısı : f10 + f01
21 22

21 22

Classification Performance Evaluation Metrics Sınıflandırma Performansı Değerlendirme Metrikleri

𝐷𝑜ğ𝑟𝑢 𝑘𝑒𝑠𝑡𝑖𝑟𝑖𝑚 𝑠𝑎𝑦𝚤𝑠𝚤 𝑓11 +𝑓00


Doğruluk = =
𝑇𝑜𝑝𝑙𝑎𝑚 𝑘𝑒𝑠𝑡𝑖𝑟𝑖𝑚 𝑠𝑎𝑦𝚤𝑠𝚤 𝑓11 +𝑓10 +𝑓01 +𝑓00
𝑌𝑎𝑛𝑙𝚤ş 𝑘𝑒𝑠𝑡𝑖𝑟𝑖𝑚 𝑠𝑎𝑦𝚤𝑠𝚤 𝑓10 +𝑓01
Hata değeri = =
𝑇𝑜𝑝𝑙𝑎𝑚 𝑘𝑒𝑠𝑡𝑖𝑟𝑖𝑚 𝑠𝑎𝑦𝚤𝑠𝚤 𝑓11 +𝑓10 +𝑓01 +𝑓00

• The learning algorithms of most classification • Çoğu sınıflandırma tekniğinin öğrenme


techniques are designed to learn models that algoritmaları, test setine uygulandığında en
attain the highest accuracy, or equivalently, the yüksek doğruluğu veya eşdeğer olarak en düşük
lowest error rate when applied to the test set hata oranını elde eden modelleri öğrenmek için
tasarlanmıştır.
23 24

23 24

Classification Techniques Sınıflandırma Yöntemleri
• Base Classifiers • Temel Sınıflandırıcılar
– Decision Tree based Methods – Karar Ağacı Tabanlı Yöntemler
– Rule-based Methods – Kural Tabanlı Yöntemler
– Nearest-neighbor – En yakın komşu
– Naïve Bayes and Bayesian Belief Networks – Naive Bayes ve Bayes İnanç Ağları
– Support Vector Machines – Destek Vektör Makineleri
– Neural Networks, Deep Neural Nets – Sinir Ağları,
– Derin Sinir Ağları
• Ensemble Classifiers • Topluluk Sınıflandırıcıları
– Boosting, Bagging, Random Forests – Arttırma, Paketleme, Rastgele Ormanlar
25 26

25 26

Decision Tree Classifier


• solve a classification problem by asking a series of
carefully crafted questions about the attributes of the test
instance.
• The series of questions and their possible answers can be
organized into a hierarchical structure called a decision
tree
• The tree has three types of nodes:
– A root node,
• with no incoming links and zero or more outgoing links
– Internal nodes,
• each of which has exactly one incoming link and two or more outgoing
links.
– Leaf or terminal nodes,
• each of which has exactly one incoming link and no outgoing links.
27 28

27 28

Decision Tree Classifier Decision Tree - Example


• Every leaf node in the decision tree is associated • A decision tree for the mammal classification
with a class label problem
• The non-terminal nodes, which include the root – the root node of the
and internal nodes, contain attribute test tree here uses the
attribute Body
conditions that are typically defined using a
Temperature to define
single attribute. an attribute test
• Each possible outcome of the attribute test condition that has two
condition is associated with exactly one child of outcomes, warm and
this node. cold, resulting in two
child nodes.

29 30

29 30

Decision Tree - Example Decision Tree - Example
• Classifying an unlabelled vertebrate.
– The dashed lines Splitting Attributes
represent the ID
Home
Owner
Marital
Status
Annual Defaulted
Income Borrower
outcomes of 1 Yes Single 125K No Home
applying various 2 No Married 100K No Owner
Yes No
attribute test 3 No Single 70K No
4 Yes Married 120K No NO MarSt
conditions on the 5 No Divorced 95K Yes Single, Divorced Married
unlabeled 6 No Married 60K No
Income
vertebrate. 7 Yes Divorced 220K No
NO
< 80K > 80K
– The vertebrate is 8 No Single 85K Yes
9 No Married 75K No NO YES
eventually assigned 10 No Single 90K Yes
to the Non- 10

Training Data Model: Decision Tree


mammals class.
31 32

31 32

Apply Model to Test Data Apply Model to Test Data


Test Data Test Data
Start from the root of tree.
Home Marital Annual Defaulted Home Marital Annual Defaulted
Owner Status Income Borrower Owner Status Income Borrower
No Married 80K ? No Married 80K ?
Home 10

Home 10

Yes Owner No Yes Owner No

NO MarSt NO MarSt
Single, Divorced Married Single, Divorced Married

Income NO Income NO
< 80K > 80K < 80K > 80K

NO YES NO YES

33 34

33 34

Apply Model to Test Data Apply Model to Test Data


Test Data Test Data
Home Marital Annual Defaulted Home Marital Annual Defaulted
Owner Status Income Borrower Owner Status Income Borrower
No Married 80K ? No Married 80K ?
Home 10

Home 10

Yes Owner No Yes Owner No

NO MarSt NO MarSt
Single, Divorced Married Single, Divorced Married

Income NO Income NO
< 80K > 80K < 80K > 80K

NO YES NO YES

35 36

35 36

Apply Model to Test Data Apply Model to Test Data
Test Data Test Data
Home Marital Annual Defaulted Home Marital Annual Defaulted
Owner Status Income Borrower Owner Status Income Borrower
No Married 80K ? No Married 80K ?
Home 10

Home 10

Yes Owner No Yes Owner No

NO MarSt NO MarSt
Single, Divorced Married Single, Divorced Married Assign Defaulted to
“No”
Income NO Income NO
< 80K > 80K < 80K > 80K

NO YES NO YES

37 38

37 38

Another Example of Decision Tree Decision Tree Classification Task

Tid Attrib1 Attrib2 Attrib3 Class


Tree
1 Yes Large 125K No Induction
MarSt Single, 2 No Medium 100K No algorithm
Married Divorced 3 No Small 70K No
Home Marital Annual Defaulted
ID 4 Yes Medium 120K No
Induction
Owner Status Income Borrower
5 No Large 95K Yes
NO Home
1 Yes Single 125K No 6 No Medium 60K No
Yes Owner No 7 Yes Large 220K No Learn
2 No Married 100K No
8 No Small 85K Yes Model
3 No Single 70K No NO Income 9 No Medium 75K No

10 No Small 90K Yes


4 Yes Married 120K No < 80K > 80K Model
10

5 No Divorced 95K Yes


Training Set
NO YES Apply
6 No Married 60K No Decision
Tid Attrib1 Attrib2 Attrib3 Class
Model Tree
7 Yes Divorced 220K No
11 No Small 55K ?
8 No Single 85K Yes 12 Yes Medium 80K ?

There could be more than one tree that 13 Yes Large 110K ?
Deduction
9 No Married 75K No
fits the same data! 14 No Small 95K ?
10 No Single 90K Yes 15 No Large 67K ?
10

10

Test Set

39 40

39 40

Decision Tree Induction

• Many Algorithms:
  – Hunt’s Algorithm (one of the earliest)
  – CART
  – ID3, C4.5
  – SLIQ, SPRINT
• These algorithms employ a greedy strategy to grow the decision tree in a top-down fashion
  – by making a series of locally optimal decisions about which attribute to use when partitioning the training data

General Structure of Hunt’s Algorithm

• Let Dt be the set of training records that reach a node t
• General Procedure:
  – If Dt contains records that belong to the same class yt, then t is a leaf node labeled as yt
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets.
  – Recursively apply the procedure to each subset, as sketched in the code below.
  [The slide illustrates Dt with the loan-borrower training records (Home Owner, Marital Status, Annual Income, Defaulted Borrower) reaching a node marked “?”.]
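A compact sketch of this procedure in Python (illustrative only; a real implementation would choose the splitting attribute with an impurity measure such as Gini or information gain, handle continuous attributes, and deal with empty partitions):

from collections import Counter

def hunt(records, labels, attributes):
    counts = Counter(labels)
    if len(counts) == 1 or not attributes:
        # Pure node (or no attributes left): make a leaf labeled with the (majority) class.
        return {"leaf": counts.most_common(1)[0][0]}
    attr = attributes[0]   # placeholder choice; normally the best attribute test is selected here
    children = {}
    for value in set(r[attr] for r in records):
        subset = [(r, y) for r, y in zip(records, labels) if r[attr] == value]
        sub_r, sub_y = zip(*subset)
        children[value] = hunt(list(sub_r), list(sub_y), attributes[1:])
    return {"split_on": attr, "children": children}

records = [{"HomeOwner": "Yes", "Marital": "Single"},
           {"HomeOwner": "No",  "Marital": "Married"},
           {"HomeOwner": "No",  "Marital": "Single"}]
labels = ["No", "No", "Yes"]
print(hunt(records, labels, ["HomeOwner", "Marital"]))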
41 42

41 42

Hunt’s Algorithm

[Figure: four steps (a)–(d) of Hunt’s algorithm on the loan-borrower training data, with the class counts shown at each node, e.g., (7, 3) at the root. Step (a) is a single leaf predicting Defaulted = No. Step (b) splits on Home Owner (Yes → Defaulted = No (3, 0); No → Defaulted = No (4, 3)). Step (c) further splits the Home Owner = No branch on Marital Status (Married → Defaulted = No (3, 0); Single, Divorced → Defaulted = Yes (1, 3)). Step (d) splits the Single/Divorced branch on Annual Income (< 80K → Defaulted = No (1, 0); >= 80K → Defaulted = Yes (0, 3)).]
43 44

43 44

Design Issues of Decision Tree Induction Methods for Expressing Test Conditions
• How should training records be split? • Decision tree induction algorithms must provide
– Method for expressing test condition a method for expressing an attribute test
• depending on attribute types condition and its corresponding outcomes for
– Measure for evaluating the goodness of a test different attribute types
condition – Binary Attributes
– Nominal Attributes
• How should the splitting procedure stop? – Ordinal Attributes
– Stop splitting if all the records belong to the same – Continuous Attributes
class or have identical attribute values
– Early termination
45 46

45 46

Test Condition for Binary Attributes Test Condition for Nominal Attributes
• generates two potential outcomes • Multi-way split:
– Use as many partitions
Marital
as distinct values. Status

• Binary split: Single Divorced Married

– Divides values into two subsets


Marital Marital Marital
Status Status Status
OR OR

{Married} {Single, {Single} {Married, {Single, {Divorced}


Divorced} Divorced} Married}
47 48

47 48

Test Condition for Ordinal Attributes Test Condition for Continuous Attributes

• Multi-way split: Shirt


Size • For continuous attributes, the attribute test
– Use as many partitions as condition can be expressed as a comparison test
distinct values
Small (e.g., A < v) producing a binary split, or as a
• Binary split: Medium Large Extra Large

range query of the form vi ≤ A < vi+1, for i = 1, . .


– Divides values into two Shirt Shirt
subsets Size Size . , k, producing a multiway split.
• some decision tree algorithms,
such as CART (Classification & Annual Annual
Income Income?
Regression Trees), produce only {Small, {Large, {Small} {Medium, Large,
> 80K?
Medium} Extra Large} Extra Large}
binary splits by considering all
< 10K > 80K
2k−1 - 1 ways of creating a binary Shirt
Size
partition of k attribute values. This grouping
Yes No

– Preserve order property among violates order


property
[10K,25K) [25K,50K) [50K,80K)

attribute values
{Small, {Medium, (i) Binary split (ii) Multi-way split
Large} Extra Large}
49 50

49 50

Splitting Based on Continuous Attributes How to determine the Best Split


• Different ways of handling • There are many measures that can be used to
– Discretization to form an ordinal categorical attribute determine the goodness of an attribute test
• Ranges can be found by equal interval bucketing, equal condition.
frequency bucketing (percentiles), or clustering.
– These measures try to give preference to attribute test
– Static conditions that partition the training instances into
• discretize once at the beginning purer subsets in the child nodes,
– Dynamic • which mostly have the same class labels.
• repeat at each node • Having purer nodes is useful since a node that
• Binary Decision: (A < v) or (A  v) has all of its training instances from the same
– consider all possible splits and finds the best cut class does not need to be expanded further.
– can be more compute intensive
51 52

51 52

How to determine the Best Split How to determine the Best Split
• Before Splitting: • Greedy approach:
– 10 records of class 0, – Nodes with purer class distribution are preferred
– 10 records of class 1
• Need a measure of node impurity:

• Which test condition is the


C0: 5 C0: 9
best? C1: 5 C1: 1
Gender Car Customer
Type ID

Yes No Family Luxury c1 c20


High degree of impurity Low degree of impurity
c10 c11
Sports
C0: 6 C0: 4 C0: 1 C0: 8 C0: 1 C0: 1 ... C0: 1 C0: 0 ... C0: 0
C1: 4 C1: 6 C1: 3 C1: 0 C1: 7 C1: 0 C1: 0 C1: 1 C1: 1
53 54

53 54

Measures of Node Impurity

• Entropy
  Entropy(t) = − Σ_{i=0..c−1} p_i(t) log2 p_i(t)
  – where p_i(t) is the frequency of class i at node t, and c is the total number of classes
• Gini Index
  Gini Index(t) = 1 − Σ_{i=0..c−1} p_i(t)²
• Misclassification error
  Classification error(t) = 1 − max_i [p_i(t)]
• All three measures give a zero impurity value if a node contains instances from a single class and maximum impurity if the node has equal proportion of instances from multiple classes.
• Relative magnitude of the impurity measures when applied to binary classification problems:
  – Since there are only two classes, p0(t) + p1(t) = 1.
55 56

55 56

Measures of Node Impurity Collective Impurity of Child Nodes


• The following examples illustrate how the values • Consider an attribute test condition that splits a
of the impurity measures vary as we alter the node containing N training instances into k
class distribution. children, {v1, v2, · · · , vk}, where every child
node represents a partition of the data resulting
from one of the k outcomes of the attribute test
condition

57 58

57 58

Finding the Best Split Measure of Impurity: GINI


1. Compute impurity measure (P) before splitting • Gini Index for a given node 𝒕
𝑐−1
2. Compute impurity measure (M) after splitting 𝐺𝑖𝑛𝑖 𝐼𝑛𝑑𝑒𝑥 = 1 − ෍ 𝑝𝑖 𝑡 2

– Compute impurity measure of each child node 𝑖=0


• where 𝒑𝒊(𝒕) is the frequency of class 𝒊 at node 𝒕, and 𝒄 is the total
– M is the weighted impurity of child nodes number of classes
– Maximum of 1−1/𝑐 when records are equally distributed
3. Choose the attribute test condition that produces among all classes, implying the least beneficial situation for
the highest gain classification
– Minimum of 0 when all records belong to one class, implying
Gain = P - M the most beneficial situation for classification
– Gini index is used in decision tree algorithms such as
• CART (Classification & Regression Trees)
or equivalently, lowest impurity measure after • SLIQ (Supervised Learning in Quest)
splitting (M) • SPRINT (Scalable Parallelizable Induction of Decision Tree)

59 60

59 60

Measure of Impurity: GINI Computing Gini Index of a Single Node
• Gini Index for a given node 𝒕 • Gini Index for a given node 𝒕
𝑐−1 𝑐−1

𝐺𝑖𝑛𝑖 𝐼𝑛𝑑𝑒𝑥 = 1 − ෍ 𝑝𝑖 𝑡 2 𝐺𝑖𝑛𝑖 𝐼𝑛𝑑𝑒𝑥 = 1 − ෍ 𝑝𝑖 𝑡 2

𝑖=0 𝑖=0

C1 0 P(C1) = 0/6 = 0 P(C2) = 6/6 = 1


C2 6 Gini = 1 – P(C1)2 – P(C2)2 = 1 – 0 – 1 = 0
– For 2-class problem (p, 1–p):
• GINI = 1 – p2 – (1 – p)2 = 2p (1– p)
C1 1 P(C1) = 1/6 P(C2) = 5/6
C1 0 C1 1 C1 2 C1 3 C2 5 Gini = 1 – (1/6)2 – (5/6)2 = 0.278
C2 6 C2 5 C2 4 C2 3
Gini=0.000 Gini=0.278 Gini=0.444 Gini=0.500
C1 2 P(C1) = 2/6 P(C2) = 4/6
C2 4 Gini = 1 – (2/6)2 – (4/6)2 = 0.444
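A one-function Python sketch (added here for illustration) of the single-node Gini calculations above:

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))            # 0.000
print(round(gini([1, 5]), 3))  # 0.278
print(round(gini([2, 4]), 3))  # 0.444
print(round(gini([3, 3]), 3))  # 0.5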

61 62

61 62

Computing Gini Index of a Collection of Nodes Binary Attributes: Computing GINI Index

• When a node 𝑝 is split into 𝑘 partitions (children) • Splits into two partitions (child nodes)
𝑘
𝑛𝑖 • Effect of Weighing partitions:
𝐺𝐼𝑁𝐼𝑠𝑝𝑙𝑖𝑡 = ෍ 𝐺𝐼𝑁𝐼(𝑖) – Larger and purer partitions are sought
𝑛
𝑖=1 Parent
B? C1 7
where, 𝑛𝑖 = number of records at child 𝑖, Yes No C2 5
Gini = 0.486
𝑛 = number of records at parent node 𝑝. Gini(N1)
Node N1 Node N2

= 1 – (5/6)2 – (1/6)2 N1 N2 Weighted Gini of N1 N2


= 0.278
C1 5 2 = 6/12 * 0.278 +
Gini(N2) C2 1 4 6/12 * 0.444
= 1 – (2/6)2 – (4/6)2 = 0.361
Gini=0.361
= 0.444 Gain = 0.486 – 0.361 = 0.125
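The weighted Gini of the split and the resulting gain can be computed the same way (an illustrative Python sketch using the class counts from the example above):

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent = [7, 5]                 # class counts before splitting
n1, n2 = [5, 1], [2, 4]         # class counts in the two child nodes
n = sum(parent)

weighted = (sum(n1) / n) * gini(n1) + (sum(n2) / n) * gini(n2)
print(round(gini(parent), 3))             # 0.486
print(round(weighted, 3))                 # 0.361
print(round(gini(parent) - weighted, 3))  # gain = 0.125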

63 64

63 64

Categorical Attributes: Computing Gini Index Continuous Attributes: Computing Gini Index

• Use Binary Decisions based on one


• For each distinct value, gather counts for each
Home Marital Annual
ID Defaulted
Owner Status Income
value 1 Yes Single 125K No
class in the dataset • Several Choices for the splitting 2 No Married 100K No

value 3 No Single 70K No

• Use the count matrix to make decisions – Number of possible splitting values 4 Yes Married 120K No
5 No Divorced 95K Yes
= Number of distinct values
Multi-way split Two-way split 6 No Married 60K No

(find best partition of values) • Each splitting value has a count 7 Yes Divorced 220K No
matrix associated with it 8 No Single 85K Yes

CarType CarType CarType – Class counts in each of the partitions, 9 No Married 75K No

Family Sports Luxury {Sports,


{Family} {Sports}
{Family, A ≤ v and A > v 10 No Single 90K Yes
Luxury} Luxury}
10

C1 1 8 1 C1 9 1 C1 8 2 • Simple method to choose best v Annual Income ?


C2 3 0 7 C2 7 3 C2 0 10 – For each v, scan the database to
Gini 0.163 Gini 0.468 Gini 0.167 gather count matrix and compute its ≤ 80 > 80
Gini index
Defaulted Yes 0 3
– Computationally inefficient!
Repetition of work. Defaulted No 3 4

65 66

65 66

Continuous Attributes: Computing Gini Index Continuous Attributes: Computing Gini Index

• For efficient computation: for each attribute, • For efficient computation: for each attribute,
– Sort the attribute on values – Sort the attribute on values
– Linearly scan these values, each time updating the – Linearly scan these values, each time updating the
count matrix and computing gini index count matrix and computing gini index
– Choose the split position that has the least gini index – Choose the split position that has the least gini index
Cheat No No No Yes Yes Yes No No No No Cheat No No No Yes Yes Yes No No No No
Annual Income Annual Income
Sorted Values 60 70 75 85 90 95 100 120 125 220
Sorted Values 60 70 75 85 90 95 100 120 125 220
55 65 72 80 87 92 97 110 122 172 230 Split Positions 55 65 72 80 87 92 97 110 122 172 230
<= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= >
Yes 0 3 0 3 0 3 0 3 1 2 2 1 3 0 3 0 3 0 3 0 3 0 Yes 0 3 0 3 0 3 0 3 1 2 2 1 3 0 3 0 3 0 3 0 3 0

No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0 No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0

Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420 Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

67 68

67 68

Continuous Attributes: Computing Gini Index Continuous Attributes: Computing Gini Index

• For efficient computation: for each attribute, • For efficient computation: for each attribute,
– Sort the attribute on values – Sort the attribute on values
– Linearly scan these values, each time updating the – Linearly scan these values, each time updating the
count matrix and computing gini index count matrix and computing gini index
– Choose the split position that has the least gini index – Choose the split position that has the least gini index
Cheat No No No Yes Yes Yes No No No No Cheat No No No Yes Yes Yes No No No No
Annual Income Annual Income
Sorted Values 60 70 75 85 90 95 100 120 125 220
Sorted Values 60 70 75 85 90 95 100 120 125 220
Split Positions 55 65 72 80 87 92 97 110 122 172 230 Split Positions 55 65 72 80 87 92 97 110 122 172 230
<= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= >
Yes 0 3 0 3 0 3 0 3 1 2 2 1 3 0 3 0 3 0 3 0 3 0 Yes 0 3 0 3 0 3 0 3 1 2 2 1 3 0 3 0 3 0 3 0 3 0

No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0 No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0

Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420 Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

69 70

69 70

Continuous Attributes: Computing Gini Index Measure of Impurity: Entropy


• For efficient computation: for each attribute, • Entropy at a given node 𝒕
𝑐−1
– Sort the attribute on values 𝐸𝑛𝑡𝑟𝑜𝑝𝑦 = − ෍ 𝑝𝑖 𝑡 𝑙𝑜𝑔2𝑝𝑖 (𝑡)
– Linearly scan these values, each time updating the 𝑖=0
count matrix and computing gini index where 𝒑𝒊(𝒕) is the frequency of class 𝒊 at node 𝒕, and 𝒄 is the total
number of classes
– Choose the split position that has the least gini index – Maximum of log2𝑐 when records are equally distributed
Cheat No No No Yes Yes Yes No No No No among all classes, implying the least beneficial situation
Annual Income for classification
Sorted Values 60 70 75 85 90 95 100 120 125 220
Split Positions 55 65 72 80 87 92 97 110 122 172 230
– Minimum of 0 when all records belong to one class,
<= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > implying most beneficial situation for classification
• Entropy based computations are quite similar to the
Yes 0 3 0 3 0 3 0 3 1 2 2 1 3 0 3 0 3 0 3 0 3 0

No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0

Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
GINI index computations
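The split-position search in the preceding slides (choosing the Annual Income threshold with the lowest weighted Gini) can be sketched as follows. This simple Python version, added for illustration, rescans the data for every candidate split; the efficient approach described above instead sorts once and updates the count matrix incrementally while scanning.

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
classes = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]

best = None
for i in range(len(incomes) - 1):
    v = (incomes[i] + incomes[i + 1]) / 2          # candidate split position (midpoint)
    left = [c for inc, c in zip(incomes, classes) if inc <= v]
    right = [c for inc, c in zip(incomes, classes) if inc > v]
    w = (len(left) * gini([left.count("Yes"), left.count("No")]) +
         len(right) * gini([right.count("Yes"), right.count("No")])) / len(incomes)
    if best is None or w < best[1]:
        best = (v, w)

print(best)   # (97.5, 0.3) -- the split the table reports as 97 with Gini 0.300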
71 72

71 72

Computing Entropy of a Single Node Computing Information Gain After Splitting

• Entropy at a given node 𝑐−1


𝒕 • Information Gain:
𝑘
𝑛𝑖
𝐸𝑛𝑡𝑟𝑜𝑝𝑦 = − ෍ 𝑝𝑖 𝑡 𝑙𝑜𝑔2𝑝𝑖 (𝑡) 𝐺𝑎𝑖𝑛𝑠𝑝𝑙𝑖𝑡 = 𝐸𝑛𝑡𝑟𝑜𝑝𝑦 𝑝 − ෍ 𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝑖)
𝑖=0
𝑛
𝑖=1

C1 0 P(C1) = 0/6 = 0 P(C2) = 6/6 = 1 Parent Node, 𝑝 is split into 𝑘 partitions (children),
C2 6 Entropy = – 0 log2 0 – 1 log 1 = – 0 – 0 = 0 𝑛𝑖 is number of records in child node 𝑖

– Choose the split that achieves most reduction (maximizes


C1 1 P(C1) = 1/6 P(C2) = 5/6 GAIN)
C2 5 Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65 – Used in the ID3 (Iterative Dichotomiser 3) and C4.5
decision tree algorithms
C1 2 P(C1) = 2/6 P(C2) = 4/6 – Information gain is the mutual information between the
C2 4 Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92
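The entropy-based impurity and the information gain of a split can be computed analogously (a short illustrative Python sketch, not part of the original slides):

import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(round(entropy([1, 5]), 2))        # 0.65
print(round(entropy([2, 4]), 2))        # 0.92

# Information gain of a split: parent entropy minus the weighted entropy of the children.
parent = [3, 3]
children = [[3, 0], [0, 3]]             # a split that perfectly separates the two classes
n = sum(parent)
weighted = sum(sum(ch) / n * entropy(ch) for ch in children)
print(entropy(parent) - weighted)       # gain = 1.0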
class variable and the splitting variable

73 74

73 74

Problem with large number of partitions Gain Ratio


• Node impurity measures tend to prefer splits that • Gain Ratio:
𝑘
result in large number of partitions, each being 𝐺𝑎𝑖𝑛𝑠𝑝𝑙𝑖𝑡 𝑛𝑖 𝑛𝑖
𝐺𝑎𝑖𝑛 𝑅𝑎𝑡𝑖𝑜 = 𝑆𝑝𝑙𝑖𝑡 𝐼𝑛𝑓𝑜 = − ෍ 𝑙𝑜𝑔2
small but pure 𝑆𝑝𝑙𝑖𝑡 𝐼𝑛𝑓𝑜
𝑖=1
𝑛 𝑛
Parent Node, 𝑝 is split into 𝑘 partitions (children),
Gender Car Customer
Type ID 𝑛𝑖 is number of records in child node 𝑖
Yes No Family Luxury c1 c20
Sports
c10 c11 – Adjusts Information Gain by the entropy of the
C0: 6 C0: 4 C0: 1 C0: 8 C0: 1 C0: 1 ... C0: 1 C0: 0 ... C0: 0 partitioning (𝑆𝑝𝑙𝑖𝑡 𝐼𝑛𝑓𝑜).
C1: 4 C1: 6 C1: 3 C1: 0 C1: 7 C1: 0 C1: 0 C1: 1 C1: 1
• Higher entropy partitioning (large number of small partitions) is
penalized!
– Used in C4.5 algorithm
– Customer ID has highest information gain because
entropy for all the children is zero – Designed to overcome the disadvantage of Information
Gain
75 76

75 76

Gain Ratio Measure of Impurity: Classification Error

• Gain Ratio: 𝑘
• Classification error at a node 𝒕
𝐺𝑎𝑖𝑛𝑠𝑝𝑙𝑖𝑡 𝑛𝑖 𝑛𝑖
𝐺𝑎𝑖𝑛 𝑅𝑎𝑡𝑖𝑜 = 𝑆𝑝𝑙𝑖𝑡 𝐼𝑛𝑓𝑜 = − ෍ 𝑙𝑜𝑔2 𝐸𝑟𝑟𝑜𝑟 𝑡 = 1 − max[𝑝𝑖 𝑡 ]
𝑆𝑝𝑙𝑖𝑡 𝐼𝑛𝑓𝑜 𝑛 𝑛 𝑖
𝑖=1
where 𝒑𝒊(𝒕) is the frequency of class 𝒊 at node 𝒕
Parent Node, 𝑝 is split into 𝑘 partitions (children),
𝑛𝑖 is number of records in child node 𝑖
– Maximum of 1−1/𝑐 when records are equally
CarType CarType CarType
Family Sports Luxury {Sports,
{Family} {Sports}
{Family, distributed among all classes, implying the least
Luxury} Luxury}
C1 1 8 1 C1 9 1 C1 8 2 interesting situation
C2 3 0 7 7 3 C2 0 10
– Minimum of 0 when all records belong to one class,
C2
Gini 0.163 Gini 0.468 Gini 0.167

SplitINFO = 1.52 SplitINFO = 0.72 SplitINFO = 0.97 implying the most interesting situation
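Split Info and Gain Ratio for the three CarType partitionings above, as a small Python sketch (illustrative, not from the slides):

import math

def split_info(sizes):
    n = sum(sizes)
    return -sum((ni / n) * math.log2(ni / n) for ni in sizes if ni > 0)

print(round(split_info([4, 8, 8]), 2))   # multi-way split {Family}, {Sports}, {Luxury}: 1.52
print(round(split_info([16, 4]), 2))     # {Sports, Luxury} vs {Family}: 0.72
print(round(split_info([8, 12]), 2))     # {Sports} vs {Family, Luxury}: 0.97

def gain_ratio(gain_split, sizes):
    # Gain Ratio = information gain of the split divided by its Split Info.
    return gain_split / split_info(sizes)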

77 78

77 78

Computing Error of a Single Node Misclassification Error vs Gini Index
• Classification error at a node 𝒕 A? Parent
C1 7
Yes No
𝐸𝑟𝑟𝑜𝑟 𝑡 = 1 − max[𝑝𝑖 𝑡 ] C2 3
𝑖
Node N1 Node N2 Gini = 0.42
C1 0 P(C1) = 0/6 = 0 P(C2) = 6/6 = 1
C2 6 Error = 1 – max (0, 1) = 1 – 1 = 0
Gini(N1) N1 N2
= 1 – (3/3)2 – (0/3)2 Gini(Children)
C1 3 4
C1 1 P(C1) = 1/6 P(C2) = 5/6 =0 = 3/10 * 0
C2 0 3 + 7/10 * 0.489
C2 5 Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6 Gini(N2) Gini=0.342 = 0.342
= 1 – (4/7)2 – (3/7)2
= 0.489 Gini improves but
C1 2 P(C1) = 2/6 P(C2) = 4/6 error remains the
C2 4 Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3
same!!

79 80

79 80

Misclassification Error vs Gini Index Decision Tree Based Classification


Parent • Advantages:
A?
C1 7 – Relatively inexpensive to construct
Yes No – Extremely fast at classifying unknown records
C2 3
– Easy to interpret for small-sized trees
Node N1 Node N2 Gini = 0.42
– Robust to noise (especially when methods to avoid overfitting
are employed)
– Can easily handle redundant attributes
N1 N2 N1 N2 – Can easily handle irrelevant attributes (unless the attributes are
C1 3 4 C1 3 4 interacting)
C2 0 3 C2 1 2 • Disadvantages: .
Gini=0.342 Gini=0.416 – Due to the greedy nature of splitting criterion, interacting
attributes (that can distinguish between classes together but not
individually) may be passed over in favor of other attributes
Misclassification error for all three cases = 0.3 ! that are less discriminating.
– Each decision boundary involves only a single attribute
81 82

81 82

Characteristics of Decision Tree Classifiers Characteristics of Decision Tree Classifiers

• Applicability: • Handling Missing Values:


– Decision trees are a nonparametric approach for – A decision tree classifier can handle missing attribute
building classification models. values in a number of ways, both in the training and
• Expressiveness: the test sets.
– A decision tree provides a universal representation for • Handling Irrelevant Attributes:
discrete-valued functions. – An attribute is irrelevant if it is not useful for the
• Computational Efficiency: classification task.
– Since the number of possible decision trees can be • Handling Redundant Attributes:
very large, many decision tree algorithms employ a – An attribute is redundant if it is strongly correlated
heuristic-based approach to guide their search in the with another attribute in the data.
vast hypothesis space.
83 84

83 84

Characteristics of Decision Tree Classifiers Classification Errors
• Using Rectilinear Splits: • Training errors:
– The test conditions described so far in this chapter – Errors committed on the training set
involve using only a single attribute at a time. • Test errors:
• Choice of Impurity Measure: – Errors committed on the test set
– It should be noted that the choice of impurity measure Tid
1
Attrib1
Yes
Attrib2
Large
Attrib3
125K
Class
No
Learning
algorithm • Generalization
often has little effect on the performance of decision 2

3
No

No
Medium

Small
100K

70K
No
No

4 Yes Medium 120K No


Induction errors:
tree classifiers since many of the impurity measures
5 No Large 95K Yes

6 No Medium 60K No

– Expected error of
7 Yes Large 220K No Learn
Model
are quite consistent with each other.
8 No Small 85K Yes

9 No Medium 75K No

10 No Small 90K Yes


10

Training Set
Model
a model over
Apply
Tid
11
Attrib1
No
Attrib2
Small
Attrib3
55K
Class
?
Model
random selection
12 Yes Medium 80K ?

13
14
Yes
No
Large
Small
110K
95K
?

?
Deduction
of records from
15 No Large 67K ?
10

Test Set same distribution


85 86

85 86

Example Data Set Example Data Set


• Two class problem: • Examples of training and test sets of a two-
+ : 5400 instances dimensional classification problem
• 5000 instances
generated from a
Gaussian centered at
(10,10)
• 400 noisy instances
added
o : 5400 instances
• Generated from a
uniform distribution
• 10% of the data used for training and 90% of the Example of a 2-D data Training set using 10% data
data used for testing
87 88

87 88

Overfitting and Underfitting of Decision Trees Model Overfitting – Impact of Training Data Size

• As the model becomes more and more complex, test errors can Using twice the number of data instances
start increasing even though training error may be decreasing
– Underfitting: • Increasing the size of training data reduces the
• when model is too simple, both training and test errors are large
– Overfitting: difference between training and testing errors at a
• when model is too complex, training error is small, but test error is large given size of model
89 90

89 90

Decision Tree with 5 leaf nodes Decision Tree with 50 leaf nodes

Decision Tree

Decision Tree

Decision boundaries on Training data Decision boundaries on Training data

91 92

91 92

Which tree is better? Model Overfitting – Impact of Training Data Size

Decision Tree with 50 nodes Decision Tree with 50 nodes

Decision Tree with 4 nodes

Which tree is better ? Using twice the number of data instances


Decision Tree with 50 nodes

• Increasing the size of training data reduces the


difference between training and testing errors at a
given size of model
93 94

93 94

Model Overfitting – Impact of Training Data Size Reasons for Model Overfitting
• Performance of decision trees using 20% data for • Limited training size
training (twice the original training size) – In general, as we increase the size of a training set,
the patterns learned from the training set start
resembling the true patterns in the overall data
• the effect of overfitting can be reduced by increasing the
training size
• High model complexity
– Generally, a more complex model has a better ability
to represent complex patterns in the data.
– Multiple Comparison Procedure
– AKA multiple testing problem
95 96

95 96

Effect of Multiple Comparison Procedure

• Consider the task of predicting whether the stock market will rise/fall in the next 10 trading days
  (Day 1: Up, Day 2: Down, Day 3: Down, Day 4: Up, Day 5: Down, Day 6: Down, Day 7: Up, Day 8: Up, Day 9: Up, Day 10: Down)
• Random guessing: P(correct) = 0.5
• Make 10 random guesses in a row:
  P(# correct ≥ 8) = [C(10,8) + C(10,9) + C(10,10)] / 2^10 = 0.0547

• Approach:
  – Get 50 analysts
  – Each analyst makes 10 random guesses
  – Choose the analyst that makes the most number of correct predictions
• Probability that at least one analyst makes at least 8 correct predictions:
  P(# correct ≥ 8) = 1 − (1 − 0.0547)^50 = 0.9399
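These two probabilities can be checked with a few lines of Python (an illustrative sketch, not part of the original slides):

from math import comb

# One analyst making 10 random guesses:
p_one = (comb(10, 8) + comb(10, 9) + comb(10, 10)) / 2 ** 10
print(round(p_one, 4))                 # 0.0547

# At least one of 50 independent analysts getting 8 or more correct:
p_any = 1 - (1 - p_one) ** 50
print(round(p_any, 4))                 # 0.9399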
97 98

97 98

Effect of Multiple Comparison Procedure Effect of Multiple Comparison - Example


• Many algorithms employ the following greedy strategy:
– Initial model: M
– Alternative model: M′ = M ∪ γ, where γ is a component to be added to the model (e.g., a test condition of a decision tree)
– Keep M′ if the improvement Δ(M, M′) > α
• Often, γ is chosen from a set of alternative components, Γ = {γ1, γ2, …, γk}
• If many alternatives are available, one may inadvertently add irrelevant components to the model, resulting in model overfitting
[Figure: decision boundaries using only X and Y as attributes vs. using an additional 100 noisy variables generated from a uniform distribution along with X and Y as attributes; 30% of the data used for training and 70% for testing]
99 100

99 100

Notes on Overfitting Model Selection


• Overfitting results in decision trees that are more • Performed during model building
complex than necessary • Purpose is to ensure that model is not overly
complex (to avoid overfitting)
• Training error does not provide a good estimate • Need to estimate generalization error
of how well the tree will perform on previously – Using Validation Set
unseen records
– Incorporating Model Complexity

• Need ways for estimating generalization errors

101 102

101 102

Model Selection: Using Validation Set Model Selection: Incorporating Model Complexity

• Divide training data into two parts: • Rationale: Occam’s Razor


– Training set: – Given two models of similar generalization errors, one should
prefer the simpler model over the more complex model
• use for model building
– Validation set: – A complex model has a greater chance of being fitted
• use for estimating generalization error accidentally
• Note: validation set is not the same as test set
– Therefore, one should include model complexity when
evaluating a model
• Drawback:
– Less data available for training Gen. Error(Model) = Train. Error(Model, Train. Data) +
α × Complexity(Model)
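As an illustration of the validation-set route, a hedged scikit-learn sketch follows (the 2:1 split and the max_leaf_nodes grid are arbitrary illustrative choices, not part of the slides):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1200, flip_y=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=1/3, random_state=0)  # training vs. validation
best = None
for leaves in (2, 4, 8, 16, 32, 64, 128):                    # candidate model complexities
    tree = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=0).fit(X_tr, y_tr)
    val_err = 1 - tree.score(X_val, y_val)                   # estimate of generalization error
    if best is None or val_err < best[0]:
        best = (val_err, leaves)
print("selected max_leaf_nodes:", best[1])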

103 104

103 104

Estimating the Complexity of Decision Trees Estimating the Complexity of Decision Trees: Example

• Pessimistic Error Estimate of decision tree T with k leaf nodes:
e_gen(T) = err(T) + Ω × k / N_train
– err(T): error rate on all training records
– Ω: trade-off hyper-parameter (similar to α); relative cost of adding a leaf node
– k: number of leaf nodes
– N_train: total number of training records
• Example (with Ω = 1):
e(TL) = 4/24, e(TR) = 6/24
e_gen(TL) = 4/24 + 1 × 7/24 = 11/24 = 0.458
e_gen(TR) = 6/24 + 1 × 4/24 = 10/24 = 0.417
[Figure: Decision Tree TL (7 leaf nodes) and Decision Tree TR (4 leaf nodes) with the class counts (+/−) at each node]
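The same estimate written as a small Python helper (the function name is made up; Ω defaults to 1 as in the example above):

def pessimistic_error(n_errors, n_leaves, n_train, omega=1.0):
    # e_gen(T) = err(T) + Omega * k / N_train, with err(T) = n_errors / n_train
    return (n_errors + omega * n_leaves) / n_train

print(pessimistic_error(4, 7, 24))   # TL: 11/24 ~ 0.458
print(pessimistic_error(6, 4, 24))   # TR: 10/24 ~ 0.417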

105 106

105 106

Estimating the Complexity of Decision Trees Minimum Description Length (MDL)


• Resubstitution Estimate:
– Using training error as an optimistic estimate of generalization error
– Referred to as optimistic error estimate
– Example: e(TL) = 4/24, e(TR) = 6/24
[Figure: the decision trees TL and TR from the previous slide with the class counts (+/−) at each node]

• Cost(Model, Data) = Cost(Data|Model) + α × Cost(Model)
– Cost is the number of bits needed for encoding.
– Search for the least costly model.
• Cost(Data|Model) encodes the misclassification errors.
• Cost(Model) uses node encoding (number of children) plus splitting condition encoding.
[Figure: a data table with attributes X1, …, Xn and class labels y, and a decision tree (A?, B?, C?) used to encode the labels]

107 108

107 108

Model Selection for Decision Trees Model Selection for Decision Trees
• Pre-Pruning (Early Stopping Rule) • Post-pruning
– Stop the algorithm before it becomes a fully-grown tree
– Grow decision tree to its entirety
– Typical stopping conditions for a node:
• Stop if all instances belong to the same class
– Subtree replacement
• Stop if all the attribute values are the same • Trim the nodes of the decision tree in a bottom-up fashion
– More restrictive conditions: • If generalization error improves after trimming, replace
• Stop if number of instances is less than some user-specified threshold sub-tree by a leaf node
• Stop if class distribution of instances is independent of the available • Class label of leaf node is determined from majority class
features (e.g., using the χ² test; a SciPy sketch follows this list) of instances in the sub-tree
• Stop if expanding the current node does not improve impurity
measures (e.g., Gini or information gain).
• Stop if estimated generalization error falls below certain threshold
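The independence condition mentioned in the list above can be tested with a χ² test; a minimal hedged sketch with SciPy (the contingency table and the 0.05 significance level are made-up illustrative values):

from scipy.stats import chi2_contingency

# rows = branches of the candidate split, columns = class counts in each branch
table = [[30, 10],
         [25, 15]]
chi2, p_value, dof, _ = chi2_contingency(table)
if p_value > 0.05:
    print("class distribution appears independent of the split -> stop expanding this node")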

109 110

109 110

Example of Post-Pruning Examples of Post-pruning


Training Error (Before splitting) = 10/30 Decision Tree:
depth = 1 :
Class = Yes 20 Pessimistic error = (10 + 0.5)/30 = 10.5/30 | breadth > 7 : class 1
| breadth <= 7 :
| | breadth <= 3 : Simplified Decision Tree:
Class = No 10 Training Error (After splitting) = 9/30 | | | ImagePages > 0.375 : class 0
| | | ImagePages <= 0.375 : depth = 1 :
| | | | totalPages <= 6 : class 1
Error = 10/30 Pessimistic error (After splitting) | | | | totalPages > 6 : Subtree
| ImagePages <= 0.1333 : class 1
| | | | | breadth <= 1 : class 1 | ImagePages > 0.1333 :
= (9 + 4 × 0.5)/30 = 11/30 | | | | | breadth > 1 : class 0
| | width > 3 :
Raising | | breadth <= 6 : class 0
| | breadth > 6 : class 1
| | | MultiIP = 0:
PRUNE! | | | | ImagePages <= 0.1333 : class 1 depth > 1 :
A? | | | | ImagePages > 0.1333 :
| | | | | breadth <= 6 : class 0
| MultiAgent = 0: class 0
| MultiAgent = 1:
| | | | | breadth > 6 : class 1
| | | MultiIP = 1:
| | totalPages <= 81 : class 0
| | | | TotalTime <= 361 : class 0 | | totalPages > 81 : class 1
| | | | TotalTime > 361 : class 1
A1 A4 depth > 1 :
| MultiAgent = 0:
| | depth > 2 : class 0
A2 A3 | | depth <= 2 : Subtree
| | | MultiIP = 1: class 0
| | | MultiIP = 0:
Replacement
| | | | breadth <= 6 : class 0
| | | | breadth > 6 :
Class = Yes 8 Class = Yes 3 Class = Yes 4 Class = Yes 5 | | | | | RepeatedAccess <= 0.0322 : class 0
| | | | | RepeatedAccess > 0.0322 : class 1
Class = No 4 Class = No 4 Class = No 1 Class = No 1 | MultiAgent = 1:
| | totalPages <= 81 : class 0
| | totalPages > 81 : class 1

111 112

111 112

Model Evaluation Cross-validation Example


• Purpose:
• 3-fold cross-validation
– To estimate performance of classifier on previously unseen
data (test set)

• Holdout
– Reserve k% for training and (100-k)% for testing
– Random subsampling: repeated holdout
• Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k=n
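A minimal k-fold cross-validation sketch with scikit-learn (3 folds to match the example on the right; the data set and classifier are placeholders):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
errors = []
for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    errors.append(1 - model.score(X[test_idx], y[test_idx]))   # error on the held-out fold
print("cross-validation error estimate:", np.mean(errors))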

113 114

113 114

Variations on Cross-validation
• Repeated cross-validation
– Perform cross-validation a number of times
– Gives an estimate of the variance of the
generalization error
• Stratified cross-validation
– Guarantee the same percentage of class labels in
training and test
– Important when classes are imbalanced and the
sample is small
• Use nested cross-validation approach for model
selection and evaluation
115 116

115 116

Data Mining Veri Madenciliği

Prof. Dr. Nizamettin AYDIN Prof. Dr. Nizamettin AYDIN

naydin@yildiz.edu.tr naydin@yildiz.edu.tr

http://www3.yildiz.edu.tr/~naydin http://www3.yildiz.edu.tr/~naydin

1 2

1 2

Data Mining Veri Madenciliği

Cluster Analysis Kümeleme analizi


• Outline • Konular
– Overview – Genel bakış
– K-means – Kümeleme İşlemleri
– Agglomerative Hierarchical Clustering – Kümeleme Tanımı
– DBSCAN – Kümeleme Uygulamaları
– Cluster Evaluation – Kümeleme Yöntemleri

3 4

3 4

Clustering Kümeleme
• Clustering is the process of separating similar pieces • Kümeleme birbirine benzeyen veri parçalarını
of data, and most clustering methods use distances ayırma işlemidir ve kümeleme yöntemlerinin
between data. çoğu veri arasındaki uzaklıkları kullanır.
• The process of separating objects into clusters • Nesneleri kümelere (gruplara) ayırma işlemi
(groups)
• Küme:
• Cluster:
– birbirine benzeyen nesnelerden oluşan grup
– group of similar objects
• Aynı kümedeki nesneler birbirine daha çok benzer
• Objects in the same cluster are more similar to each other
• Farklı kümedeki nesneler birbirine daha az benzer
• Objects in different sets are less alike
• Unsupervised learning: • Danışmansız öğrenme:
– It is not clear which object belongs to which class and – Hangi nesnenin hangi sınıfa ait olduğu ve sınıf sayısı
number of classes belli değil
5 6

5 6

What is Cluster Analysis? Kümeleme Analizi Nedir?
• Given a set of objects, place them in groups such that the objects in • Bir dizi nesne verildiğinde, bunları, bir gruptaki nesneler birbirine
a group are similar (or related) to one another and different from benzer (veya ilişkili) ve diğer gruplardaki nesnelerden farklı (veya
(or unrelated to) the objects in other groups ilgisiz) olacak şekilde gruplara yerleştirmek

Inter-cluster Kümeler arası


Intra-cluster distances are Küme içi mesafeler
distances are maximized mesafeler en maksimize edilir
minimized aza indirilir

7 8

7 8

Applications of Cluster Analysis Kümeleme Analizi Uygulamaları


• Understanding Discovered Clusters
Applied-Matl-DOWN,Bay-Network-Down,3-COM-DOWN,
Industry Group
• Anlamak Discovered Clusters
Applied-Matl-DOWN,Bay-Network-Down,3-COM-DOWN,
Industry Group

– Group related documents for 1 Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,


DSC-Comm-DOWN,INTEL-DOWN,LSI-Logic-DOWN,
Micron-Tech-DOWN,Texas-Inst-Down,Tellabs-Inc-Down,
Technology1-DOWN – Tarama için ilgili belgeleri 1 Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,
DSC-Comm-DOWN,INTEL-DOWN,LSI-Logic-DOWN,
Micron-Tech-DOWN,Texas-Inst-Down,Tellabs-Inc-Down,
Technology1-DOWN

browsing, Natl-Semiconduct-DOWN,Oracl-DOWN,SGI-DOWN,
Sun-DOWN gruplandırma, Natl-Semiconduct-DOWN,Oracl-DOWN,SGI-DOWN,
Sun-DOWN
Apple-Comp-DOWN,Autodesk-DOWN,DEC-DOWN, Apple-Comp-DOWN,Autodesk-DOWN,DEC-DOWN,

– Group genes and proteins 2 ADV-Micro-Device-DOWN,Andrew-Corp-DOWN,


Computer-Assoc-DOWN,Circuit-City-DOWN,
Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN,
Technology2-DOWN
– Benzer işlevselliğe sahip 2 ADV-Micro-Device-DOWN,Andrew-Corp-DOWN,
Computer-Assoc-DOWN,Circuit-City-DOWN,
Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN,
Technology2-DOWN
that have similar Motorola-DOWN,Microsoft-DOWN,Scientific-Atl-DOWN genleri ve proteinleri Motorola-DOWN,Microsoft-DOWN,Scientific-Atl-DOWN

gruplandırma
Fannie-Mae-DOWN,Fed-Home-Loan-DOWN, Fannie-Mae-DOWN,Fed-Home-Loan-DOWN,
functionality, 3 MBNA-Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN 3 MBNA-Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

– Group stocks with similar 4


Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP,
Louisiana-Land-UP,Phillips-Petro-UP,Unocal-UP, Oil-UP – Benzer fiyat 4
Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP,
Louisiana-Land-UP,Phillips-Petro-UP,Unocal-UP, Oil-UP
Schlumberger-UP Schlumberger-UP
price fluctuations dalgalanmalarına sahip
stokları gruplandırma

• Summarization • Özetlemek
– Büyük veri kümelerinin
– Reduce the size of large data
boyutunu azaltma
sets

Clustering precipitation Clustering precipitation


in Australia in Australia

9 10

9 10

Cluster Analysis Applications Kümeleme Analizi Uygulamaları


• Understanding the distribution of data • Verinin dağılımını anlama
• A preprocessing step for other data mining algorithms • Diğer veri madenciliği algoritmaları için bir önişleme adımı
• Pattern recognition • Örüntü tanıma
• Image processing • Görüntü işleme
• Economy • Ekonomi
• Identifying outliers • Aykırılıkları belirleme
• WWW • WWW
– Document clustering – Doküman kümeleme
– Clustering user behaviors – Kullanıcı davranışlarını kümeleme
– Clustering users – Kullanıcıları kümeleme
• Data reduction • Veri azaltma
– using cluster centers to represent objects within the cluster – küme içindeki nesnelerin temsil edilmesi için küme
merkezlerinin kullanılması
11 12

11 12

Notion of a Cluster can be Ambiguous Küme kavramı belirsiz olabilir

How many clusters? Six Clusters kaç küme? Altı küme

Two Clusters Four Clusters İki küme Dört küme

13 14

13 14

Types of Clusterings Kümeleme Türleri


• A clustering is a set of clusters • Bir kümeleme, bir küme takımıdır
• Hiyerarşik ve bölünmeli küme takımları
• Important distinction between hierarchical and
arasındaki önemli ayrım
partitional sets of clusters
– Bölünmeli Kümeleme
– Partitional Clustering • Veri nesnelerinin örtüşmeyen alt takımlara (kümelere)
• A division of data objects into non-overlapping subsets bölünmesi
(clusters) – Hiyerarşik kümeleme
– Hierarchical clustering • Hiyerarşik bir ağaç olarak düzenlenmiş bir dizi iç içe
geçmiş kümeler takımı
• A set of nested clusters organized as a hierarchical tree

15 16

15 16

Partitional Clustering Bölünmeli Kümeleme

Original Points A Partitional Clustering Orijinal noktalar Bir bölünmeli kümeleme

17 18

17 18

Hierarchical Clustering Hiyerarşik Kümeleme

[Figure: a traditional hierarchical clustering of points p1–p4 and its traditional dendrogram (top); a non-traditional hierarchical clustering and its non-traditional dendrogram (bottom) / Geleneksel ve geleneksel olmayan hiyerarşik kümeleme ve dendrogramları]
19 20

19 20

Other Distinctions Between Sets of Clusters Küme Takımları Arasındaki Diğer Ayrımlar

• Exclusive versus non-exclusive • Münhasır (sınırlı) ve münhasır olmayan


– In non-exclusive clusterings, points may belong to – Münhasır olmayan kümelemelerde, noktalar
multiple clusters. birden çok kümeye ait olabilir.
• Can belong to multiple classes or could be ‘border’ • Birden çok sınıfa ait olabilir veya "sınır" noktaları
points olabilir
– Fuzzy clustering (one type of non-exclusive) – Bulanık kümeleme (bir tür münhasır olmayan)
• In fuzzy clustering, a point belongs to every cluster • Bulanık kümelemede, bir nokta, 0 ile 1 arasında olan
with some weight between 0 and 1 bir ağırlık değeri ile her kümeye aittir.
• Weights must sum to 1 • Ağırlıkların toplamı 1 olmalıdır
• Probabilistic clustering has similar characteristics • Olasılığa dayalı kümeleme benzer özelliklere sahiptir
• Partial versus complete • Kısmi ve tam
– In some cases, we only want to cluster some of the – Bazı durumlarda, verilerin yalnızca bir kısmını
data kümelemek isteriz

21 22

21 22

Types of Clusters Küme Türleri


• Well-separated clusters • İyi ayrılmış kümeler

• Prototype-based clusters • Prototip tabanlı kümeler

• Contiguity-based clusters • Bitişiklik (Yakınlık) tabanlı kümeler

• Density-based clusters • Yoğunluk tabanlı kümeler

• Described by an Objective Function • Bir Amaç Fonksiyonu tarafından tanımlanan


kümeler
23 24

23 24

Types of Clusters: Well-Separated Küme Türleri: İyi Ayrılmış
• Well-Separated Clusters: • İyi Ayrılmış Kümeler:
– A cluster is a set of points such that any point in – Bir küme, bir kümedeki herhangi bir noktanın
a cluster is closer (or more similar) to every other kümedeki diğer tüm noktalara kümede olmayan
point in the cluster than to any point not in the herhangi bir noktadan daha yakın (veya daha
cluster. fazla benzer) olduğu bir noktalar takımıdır

3 well-separated clusters 3 iyi ayrılmış küme

25 26

25 26

Types of Clusters: Prototype-Based Küme Türleri: Prototip Tabanlı


• Prototype-based • Prototip Tabanlı
– A cluster is a set of objects such that an object in a – Bir küme, bir kümedeki bir nesnenin bir kümenin
cluster is closer (more similar) to the prototype or prototipine veya "merkezine" diğer herhangi bir
“center” of a cluster, than to the center of any other kümenin merkezinden daha yakın (daha benzer)
cluster olduğu bir nesneler takımıdır.
– The center of a cluster is often a centroid, the • Bir kümenin merkezi genellikle kümedeki tüm noktaların
average of all the points in the cluster, or a medoid, ortalaması olan bir ağırlık merkezidir veya bir kümenin
en "temsili" noktası olan bir medoiddir.
the most “representative” point of a cluster

4 center-based clusters 4 merkez tabanlı küme


27 28

27 28

Types of Clusters: Contiguity-Based Küme Türleri: Bitişiklik Tabanlı


• Contiguous Cluster (Nearest neighbor or • Bitişik Küme (En yakın komşu veya Geçişli)
Transitive) – Küme, kümedeki bir noktanın kümedeki bir veya
– A cluster is a set of points such that a point in a daha fazla başka noktaya kümede olmayan
cluster is closer (or more similar) to one or more herhangi bir noktadan daha yakın (veya daha
other points in the cluster than to any point not in fazla benzer) olduğu noktalar takımıdır.
the cluster.

8 contiguous clusters 8 bitişik küme

29 30

29 30

Types of Clusters: Density-Based Küme Türleri: Yoğunluk Tabanlı
• Density-based • Density-based
– A cluster is a dense region of points, which is – Bir küme, düşük yoğunluklu bölgelerle diğer
separated by low-density regions, from other yüksek yoğunluklu bölgelerden ayrılan yoğun bir
regions of high density. noktalar bölgesidir.
– Used when the clusters are irregular or intertwined, – Kümeler düzensiz veya iç içe olduğunda ve gürültü
and when noise and outliers are present. ve aykırı değerler mevcut olduğunda kullanılır.

6 density-based clusters 6 yoğunluk tabanlı küme

31 32

31 32

Types of Clusters: Objective Function Küme Tipleri: Amaç Fonksiyonu


• Clusters Defined by an Objective Function • Nesnel Fonksiyon Tarafından Tanımlanan Kümeler
– Finds clusters that minimize or maximize an objective – Bir amaç fonksiyonunu minimize eden veya maksimize
function. eden kümeleri bulur.
– Enumerate all possible ways of dividing the points into – Noktaları kümelere ayırmanın tüm olası yollarını sıralar
clusters and evaluate the `goodness' of each potential set ve verilen amaç fonksiyonunu kullanarak her potansiyel
of clusters by using the given objective function. (NP küme takımınım 'iyiliğini' değerlendirir. (NP hard)
Hard) – Küresel veya yerel hedeflere sahip olabilir.
– Can have global or local objectives. • Hiyerarşik kümeleme algoritmaları tipik olarak yerel amaçlara
• Hierarchical clustering algorithms typically have local sahiptir.
objectives • Bölmeli algoritmaların tipik olarak küresel amaçları vardır
• Partitional algorithms typically have global objectives – Küresel amaç fonksiyonu yaklaşımının bir varyasyonu,
– A variation of the global objective function approach is verileri parametreleştirilmiş bir modele uydurmaktır.
to fit the data to a parameterized model. • Model için parametreler verilerden belirlenir.
• Karışım modelleri, verilerin bir dizi istatistiksel dağılımın bir "karışımı" olduğunu
• Parameters for the model are determined from the data. varsayar.
• Mixture models assume that the data is a ‘mixture' of a number of statistical
distributions.
33 34

33 34

Characteristics of the Input Data Are Important Giriş Verilerinin Özellikleri Önemlidir

• Type of proximity or density measure • Yakınlık veya yoğunluk ölçüsü türü


– Central to clustering – Kümeleme için merkezi (önemli)
– Depends on data and application – Verilere ve uygulamaya bağlıdır
• Data characteristics that affect proximity and/or density • Yakınlığı ve/veya yoğunluğu etkileyen veri özellikleri
şunlardır
are – Boyutluluk
– Dimensionality • seyreklik
• Sparseness – Öznitelik türü
– Attribute type – Verilerdeki özel ilişkiler
– Special relationships in the data • Örneğin, otokorelasyon
• For example, autocorrelation – Verilerin dağılımı
– Distribution of the data • Gürültü ve Aykırı Değerler
• Noise and Outliers – Genellikle kümeleme algoritmasının çalışmasına müdahale
eder
– Often interfere with the operation of the clustering algorithm • Farklı boyutlarda, yoğunluklarda ve şekillerde kümeler
• Clusters of differing sizes, densities, and shapes
35 36

35 36

Clustering Algorithms Kümeleme Algoritmaları
• K-means and its variants • K-ortalamalar ve çeşitleri

• Hierarchical clustering
• Hiyerarşik kümeleme
• Density-based clustering
• Yoğunluk tabanlı kümeleme

37 38

37 38

K-means Clustering K-ortalamalar Kümeleme


• Partitional clustering approach • Bölümlü kümeleme yaklaşımı
• Number of clusters, K, must be specified • Küme sayısı, K, belirtilmelidir
• Each cluster is associated with a centroid (center • Her küme bir ağırlık merkezi (centroid, merkez
point) noktası) ile ilişkilendirilir
• Each point is assigned to the cluster with the • Her nokta, merkeze en yakın kümeye atanır.
closest centroid • Temel algoritma çok basittir
• The basic algorithm is very simple:
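The algorithm box itself does not survive the slide export; a minimal sketch of the basic K-means loop in Python/NumPy (assumptions: Euclidean distance, random initial centroids, a data matrix X with one point per row) could look like this:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # choose k initial centroids
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assign each point to the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):                 # centroids stopped changing
            break
        centroids = new_centroids
    return labels, centroids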

39 40

39 40

Example of K-means Clustering K-ortalamalar Kümeleme Örneği


[Figure: K-means clustering of a two-dimensional data set; the panel shows the assignments and centroids over iterations 1–6 / K-ortalamalar yinelemeleri 1–6]

41 42

Example of K-means Clustering K-ortalamalar Kümeleme Örneği
[Figure: the six iterations of K-means shown as a grid of scatter plots (iterations 1–6) / K-ortalamalar yinelemeleri 1–6]

43 44

43 44

K-means Clustering – Details K-ortalamalar Kümeleme – Ayrıntılar


• Simple iterative algorithm. • Basit yinelemeli bir algoritma.
– Choose initial centroids; – ilk ağırlık merkezlerini seç;
– repeat {assign each point to a nearest centroid; re-compute – tekrar {her noktayı en yakın merkeze atayın; küme
cluster centroids} merkezlerini yeniden hesapla}
– until centroids stop changing. – merkezler değişmez hale gelene kadar.
• Initial centroids are often chosen randomly. • İlk ağırlık merkezleri genellikle rastgele seçilir.
– Clusters produced can vary from one run to another – Üretilen kümeler bir çalışmadan diğerine değişebilir
• The centroid is (typically) the mean of the points in the • Ağırlık merkezi (tipik olarak) kümedeki noktaların
cluster, but other definitions are possible. ortalamasıdır, ancak başka tanımlar da mümkündür.
• K-means will converge for common proximity measures • K-ortalamalar, uygun şekilde tanımlanmış ağırlık
with appropriately defined centroid merkezi ile ortak yakınlık ölçüleri için birleşecek.
• Most of the convergence happens in the first few • Yakınsamanın çoğu ilk birkaç yinelemede gerçekleşir.
iterations. – Genellikle durma koşulu, "Nispeten birkaç nokta kümeleri
– Often the stopping condition is changed to ‘Until relatively few değiştirene kadar" olarak değiştirilir.
points change clusters’ • Karmaşıklık O( n * K * I * d ) şeklindedir.
• Complexity is O( n * K * I * d ) – n = nokta sayısı, K = küme sayısı,
– n = number of points, K = number of clusters, – I = yineleme sayısı, d = özellik sayısı
– I = number of iterations, d = number of attributes
45 46

45 46

K-means Objective Function K-ortalamalar Amaç Fonksiyonu


• A common objective function (used with • Yaygın olarak kullanılan bir amaç fonksiyonu
Euclidean distance measure) is Sum of Squared (Öklid mesafe ölçüsü ile kullanılır) Hatanın Karesi
Toplamı'dır (SSE)
Error (SSE) – Her nokta için hata, en yakın küme merkezine olan
– For each point, the error is the distance to the nearest uzaklıktır.
cluster center – SSE'yi elde etmek için bu hataların karesini alır ve
– To get SSE, we square these errors and sum them. toplarız.

SSE = Σ_{i=1}^{K} Σ_{x ∈ C_i} dist²(m_i, x)
– x, Ci kümesindeki bir veri noktasıdır ve mi , Ci kümesi
– x is a data point in cluster Ci and mi is the centroid için ağırlık merkezidir (ortalama)
(mean) for cluster Ci – SSE, yerel veya küresel bir minimuma ulaşana kadar K-
– SSE improves in each iteration of K-means until it K-ortalamalarının her yinelemesinde gelişir.
reaches a local or global minimum.
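Given the labels and centroids returned by the kmeans() sketch shown earlier, SSE can be computed with a small helper (hypothetical name, NumPy assumed):

import numpy as np

def sse(X, labels, centroids):
    # sum over clusters of squared distances of the points to their centroid
    return sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(len(centroids)))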
47 48

47 48

Two different K-means Clusterings İki farklı K-ortalamalar Kümesi
[Figure: the same set of original points clustered by K-means in two different ways – an optimal clustering and a sub-optimal clustering / Orijinal noktalar, Optimum Kümeleme ve Optimum Olmayan Kümeleme]
49 50

49 50

Importance of Choosing Initial Centroids … Başlangıç Merkezlerini Seçmenin Önemi…


[Figure: K-means run from one choice of initial centroids; snapshots of iterations 1–5 / Yinelemeler 1–5]
51 52

51 52

Importance of Choosing Initial Centroids … Başlangıç Merkezlerini Seçmenin Önemi…


[Figure: K-means iterations 1–5 for another choice of initial centroids / Başka bir başlangıç merkezi seçimi için yinelemeler 1–5]

53 54

53 54

Importance of Choosing Intial Centroids Başlangıç Merkezlerini Seçmenin Önemi
• Depending on the • İlk merkez
choice of initial noktalarının seçimine
centroids, B and C bağlı olarak, B ve C
may get merged or birleşebilir veya ayrı
remain separate kalabilir

55 56

55 56

Problems with Selecting Initial Points İlk Noktaları Seçmeyle İlgili Sorunlar
• If there are K ‘real’ clusters then the chance of • K adet "gerçek" küme varsa, her kümeden bir
selecting one centroid from each cluster is small. merkez noktası seçme şansı düşüktür.
– Chance is relatively small when K is large – K büyük olduğunda şans nispeten küçüktür
– If clusters are the same size, n, then the probability of selecting one centroid from each cluster is P = (K! n^K) / (Kn)^K = K!/K^K – Kümeler n boyuttaysa (aynı), o zaman

For example, if K = 10, then Örneğin, eğer K = 10, o zaman
probability P = 10!/10^10 = 0.00036 olasılık P = 10!/10^10 = 0.00036
– Sometimes the initial centroids will readjust – Bazen ilk ağırlık merkezleri kendilerini 'doğru'
themselves in ‘right’ way, and sometimes they don’t
şekilde yeniden ayarlarlar ve bazen de beş çift
küme örneğini dikkate almazlar.
consider an example of five pairs of clusters
57 58

57 58

10 Clusters Example 10 Küme Örneği
[Figure: K-means on the 10-cluster example, starting with two initial centroids in one cluster of each pair of clusters / Her bir küme çiftinin bir kümesinde iki ilk ağırlık merkezi ile başlama]
59 60

10 Clusters Example 10 Küme Örneği
[Figure: iterations 1–4 of K-means for this initialization / Bu başlangıç için yinelemeler 1–4]

Starting with two initial centroids in one cluster of each pair of clusters Her bir küme çiftinin bir kümesinde iki ilk ağırlık merkezi ile başlama

61 62

61 62

10 Clusters Example 10 Küme Örneği
[Figure: K-means on the 10-cluster example, starting with some pairs of clusters having three initial centroids, while others have only one / Diğerleri yalnızca bir taneye sahipken üç başlangıç ağırlık merkezine sahip bazı küme çiftleriyle başlama]
63 64

10 Clusters Example 10 Küme Örneği


[Figure: iterations 1–4 of K-means for this initialization / Bu başlangıç için yinelemeler 1–4]

Starting with some pairs of clusters having three initial centroids, while other have only one. Diğerleri yalnızca bir taneye sahipken üç başlangıç ağırlık merkezine sahip bazı küme
çiftleriyle başlama
65 66

65 66

Solutions to Initial Centroids Problem Başlangıç Merkezleri Sorununa Çözümler

• Multiple runs • Çoklu çalıştırma


– Helps, but probability is not on your side – Yardımcı olur, ancak olasılık sizden yana değil
• Use some strategy to select the k initial centroids • k başlangıç ağırlık merkezini seçmek için bazı
and then select among these initial centroids stratejiler kullanın ve ardından bu başlangıç
– Select most widely separated ağırlık merkezleri arasından seçim yapın
• K-means++ is a robust way of doing this selection – En yaygın şekilde ayrılmış olanı seçin
– Use hierarchical clustering to determine initial • K-means++, bu seçimi yapmanın sağlam bir yoludur
centroids – İlk merkez noktalarını belirlemek için hiyerarşik
• Bisecting K-means kümelemeyi kullanın
– Not as susceptible to initialization issues • K-ortalamalarını ikiye bölme
– Başlatma sorunlarına duyarlı değil
67 68

67 68

K-means++ K-means++
• This approach can be slower than random initialization, • Bu yaklaşım rasgele başlatmadan daha yavaş olabilir,
but very consistently produces better results in terms of ancak çok tutarlı bir şekilde SSE (Sum of Squared Error)
SSE açısından daha iyi sonuçlar verir.
– The k-means++ algorithm guarantees an approximation ratio – K-means++ algoritması, k'nin merkezlerin sayısı olduğu bir
O(log k) in expectation, where k is the number of centers beklentide O(log k) yaklaşık oranını garanti eder.
• To select a set of initial centroids, C, perform the • Bir C başlangıç merkez noktaları kümesi seçmek için,
following aşağıdakileri gerçekleştirin

1. Select an initial point at random to be the first centroid
2. For k – 1 steps
3. For each of the N points, xi, 1 ≤ i ≤ N, find the minimum squared distance to the currently selected centroids, C1, …, Cj, 1 ≤ j < k, i.e., min_j d²(Cj, xi)
4. Randomly select a new centroid by choosing a point with probability proportional to min_j d²(Cj, xi) / Σ_i min_j d²(Cj, xi)
5. End For
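A hedged sketch of this selection procedure in Python/NumPy (kmeans_pp_init is a made-up name; X is assumed to be a data matrix with one point per row):

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]                          # step 1: first centroid at random
    for _ in range(k - 1):                                         # step 2
        d2 = np.min(np.linalg.norm(X[:, None, :] - np.asarray(centroids)[None, :, :],
                                   axis=2) ** 2, axis=1)           # step 3: min squared distance
        probs = d2 / d2.sum()                                      # step 4: proportional probabilities
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.asarray(centroids)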
69 70

69 70

Bisecting K-means K-ortalamalarını ikiye bölme


• Bisecting K-means algorithm • K-ortalamalarını ikiye bölme algoritması
– Variant of K-means that can produce a partitional – Bölümlü veya hiyerarşik bir kümeleme üretebilen
or a hierarchical clustering K-ortalamalarının bir değişiği

CLUTO: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview
CLUTO: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview
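The bisecting K-means pseudo-code on the original slide does not survive the export; one hedged sketch (reusing the kmeans() function from the earlier example and always splitting the cluster with the largest SSE, which is only one of several possible criteria):

import numpy as np

def bisecting_kmeans(X, k):
    clusters = [np.arange(len(X))]                                  # start with one all-inclusive cluster
    while len(clusters) < k:
        sses = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        worst = clusters.pop(int(np.argmax(sses)))                  # pick the cluster to split
        labels, _ = kmeans(X[worst], 2)                             # bisect it with 2-means
        clusters += [worst[labels == 0], worst[labels == 1]]
    return clusters                                                 # list of index arrays, one per cluster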

71 72

71 72

Bisecting K-means Example K-ortalamalarını ikiye bölme örneği

73 74

73 74

Limitations of K-means K-ortalamalarının kısıtları


• K-means has problems when clusters are of • Kümeler farklı
differing – Boyutlara
– Sizes – Yoğunluklara
– Densities – Küresel olmayan şekillere
– Non-globular shapes sahip olduğunda K-ortalamalarının sorunları
vardır
• K-means has problems when the data contains • Veriler aykırı değerler içerdiğinde K-
outliers. ortalamalarının sorunları vardır.
– One possible solution is to remove outliers before – Muhtemel bir çözüm, kümelemeden önce aykırı
clustering değerleri kaldırmaktır.
75 76

75 76

Limitations of K-means: Differing Sizes K-ortalamalarının kısıtları: Farklı Boyutlar

Original Points K-means (3 Clusters) Orijinal Noktalar K-ortalamalar (3 Küme)

77 78

77 78

Limitations of K-means: Differing Density K-ortalamalarının kısıtları: Farklı Yoğunluk

Original Points K-means (3 Clusters) Orijinal Noktalar K-ortalamalar (3 Küme)

79 80

79 80

Limitations of K-means: Non-globular Shapes K-ortalamalarının kısıtları : Küresel olmayan şekiller

Original Points K-means (2 Clusters) Orijinal Noktalar K-ortalamalar (2 Küme)

81 82

81 82

Overcoming K-means Limitations K-ortalamalar Kısıtlarının Üstesinden Gelmek

Original Points K-means Clusters Orijinal Noktalar K-ortalamalar Kümeleri

• One solution is to find a large number of clusters such that each • Çözümlerden biri, her biri doğal bir kümenin parçasını temsil
of them represents a part of a natural cluster. edecek şekilde çok sayıda küme bulmaktır.
• But these small clusters need to be put together in a post- • Ancak bu küçük kümelerin bir son işlem adımında bir araya
processing step. getirilmesi gerekir.
83 84

83 84

Overcoming K-means Limitations K-ortalamalar Kısıtlarının Üstesinden Gelmek

Original Points K-means Clusters Orijinal Noktalar K-ortalamalar Kümeleri

• One solution is to find a large number of clusters such that each • Çözümlerden biri, her biri doğal bir kümenin parçasını temsil
of them represents a part of a natural cluster. edecek şekilde çok sayıda küme bulmaktır.
• But these small clusters need to be put together in a post- • Ancak bu küçük kümelerin bir son işlem adımında bir araya
processing step. getirilmesi gerekir.
85 86

85 86

Overcoming K-means Limitations K-ortalamalar Kısıtlarının Üstesinden Gelmek

Original Points K-means Clusters Orijinal Noktalar K-ortalamalar Kümeleri

• One solution is to find a large number of clusters such that each • Çözümlerden biri, her biri doğal bir kümenin parçasını temsil
of them represents a part of a natural cluster. edecek şekilde çok sayıda küme bulmaktır.
• But these small clusters need to be put together in a post- • Ancak bu küçük kümelerin bir son işlem adımında bir araya
processing step. getirilmesi gerekir.
87 88

87 88

Hierarchical Clustering Hiyerarşik Kümeleme


• Produces a set of nested clusters organized as a • Hiyerarşik bir ağaç olarak düzenlenmiş bir dizi iç
hierarchical tree içe geçmiş küme üretir
• Can be visualized as a dendrogram • Bir dendrogram olarak görselleştirilebilir
– A tree like diagram that records the sequences of – Birleştirme veya bölme dizilerini kaydeden ağaç
merges or splits benzeri bir diyagram
[Figure: a nested clustering of six points and the corresponding dendrogram / Altı noktanın iç içe kümeleri ve dendrogramı]

89 90

89 90

Strengths of Hierarchical Clustering Hiyerarşik Kümelemenin Güçlü Yönleri
• Do not have to assume any particular number of • Belirli sayıda küme varsaymak zorunda
clusters değilsiniz
– Any desired number of clusters can be obtained by – Dendrogramı uygun seviyede 'keserek' istenilen
‘cutting’ the dendrogram at the proper level sayıda küme elde edilebilir.
• They may correspond to meaningful taxonomies • Anlamlı taksonomilere karşılık gelebilirler
– Example in biological crocodiles – Biyolojik bilimlerden crocodiles
sciences (e.g., animal birds örnek (örneğin, birds
kingdom, phylogeny lizards hayvanlar alemi, lizards
reconstruction, …) snakes soyoluş (phylogeny) snakes
rodents oluşturma, …) rodents
primates primates
marsupials marsupials

91 92

91 92

Tree Terminology Ağaç Terminolojisi


• Relationships are illustrated by a phylogenetic • İlişkiler bir filogenetik ağaç / dendrogram ile
tree / dendrogram gösterilmektedir.
– Combination of Greek dendro/tree and – Yunanca dendro/ağaç ve gramma/çizim kelimelerinin
gramma/drawing
birleşimidir
– A dendrogram is a tree diagram
– Bir dendrogram bir ağaç diyagramıdır
• frequently used to illustrate the arrangement of the
clusters produced by hierarchical clustering. • hiyerarşik kümeleme tarafından üretilen kümelerin düzenini
göstermek için sıklıkla kullanılır.
• Dendrograms are often used in computational
biology • Dendrogramlar sıklıkla hesaplamalı biyolojide
– to illustrate the clustering of genes or samples, kullanılır
sometimes on top of heatmaps. – bazen ısı haritalarının üstünde, genlerin veya
örneklerin kümelenmesini göstermek için kullanılır.
93 94

93 94

Tree Terminology Ağaç Terminolojisi


Operational taxonomic units (OTU) / Taxa Operasyonel taksonomik birimler(OTU) / Taxa
Internal nodes A İç düğümler A

B B

C C
Terminal nodes Son düğümler

D D
Sisters Kızkardeşler
Root E Kök E

F F

Branches Polytomy Dallar Politomi

95 96

95 96

Hierarchical Clustering Hiyerarşik kümeleme
• Two main types of hierarchical clustering • İki ana hiyerarşik kümeleme türü vardır:
– Agglomerative (bottom-up): – Birleştirici (parçadan bütüne):
• Start with the points as individual clusters • Bireysel kümeler olarak noktalarla başlanır
• At each step, merge the closest pair of clusters until only • Her adımda, yalnızca bir küme (veya k küme) kalana kadar
one cluster (or k clusters) left en yakın küme çifti birleştirilir
– Divisive (top-down): – Bölücü (bütünden parçaya):
• Start with one, all-inclusive cluster • Her şey dahil tek bir kümeyle başlanır
• At each step, split a cluster until each cluster contains an • Her adımda, her küme ayrı bir nokta içerene kadar (veya k
individual point (or there are k clusters) küme olana kadar) bir küme bölünür.
• Traditional hierarchical algorithms use a • Geleneksel hiyerarşik algoritmalar bir benzerlik
similarity or distance matrix veya mesafe matrisi kullanır
– Merge or split one cluster at a time – Her seferinde bir küme birleştirilir veya bölünür
97 98

97 98

Agglomerative Clustering Algorithm Birleştirici Kümeleme Algoritması


• Key Idea: • Anahtar fikir:
– Successively merge closest clusters – En yakın kümeleri art arda birleştir
• Basic algorithm • Temel algoritma
1. Compute the proximity matrix 1. Yakınlık matrisini hesapla
2. Let each data point be a cluster 2. Her veri noktası bir küme olsun
3. Repeat 3. Tekrar et
4. Merge the two closest clusters 4. En yakın iki kümeyi birleştir
5. Update the proximity matrix 5. Yakınlık matrisini güncelle
6. Until only a single cluster remains 6. Sadece tek bir küme kalana kadar
• Key operation is the computation of the proximity • Anahtar işlem, iki kümenin yakınlığının
of two clusters hesaplanmasıdır
– Different approaches to defining the distance between – Kümeler arasındaki mesafeyi tanımlamaya yönelik
clusters distinguish the different algorithms farklı yaklaşımlar, farklı algoritmaları birbirinden ayırır
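For experimentation, SciPy already implements this loop; a small hedged example (random 2-D points as a stand-in data set) producing single-link (MIN), complete-link (MAX) and group-average clusterings:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).random((20, 2))        # any small 2-D data set
D = pdist(X)                                        # condensed proximity (distance) matrix
for method in ("single", "complete", "average"):    # MIN, MAX, group average
    Z = linkage(D, method=method)                   # the sequence of merges
    labels = fcluster(Z, t=3, criterion="maxclust") # cut the dendrogram into 3 clusters
    print(method, labels)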
99 100

99 100

Steps 1 and 2 Adım 1 ve 2


• Start with clusters of individual points and a proximity matrix • Bireysel nokta kümeleri ve bir yakınlık matrisi ile başlayın
[Figure: twelve points p1–p12, each its own cluster, and the proximity matrix over p1, p2, p3, p4, p5, … / Yakınlık Matrisi]

101 102

101 102

Intermediate Situation Ara Durum
• After some merging steps, we have some clusters • Bazı birleştirme adımlarından sonra, bazı kümelerimiz var.
[Figure: the current clusters C1–C5 and their proximity matrix / Yakınlık Matrisi]

103 104

103 104

Step 4 Adım 4
• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix. • En yakın iki kümeyi (C2 ve C5) birleştirmek ve yakınlık matrisini güncellemek istiyoruz.
[Figure: clusters C1–C5 and their proximity matrix, with C2 and C5 highlighted for merging / Yakınlık Matrisi]

105 106

105 106

Step 5 Adım 5
• The question is "How do we update the proximity matrix?" • Soru, "Yakınlık matrisini nasıl güncelleriz?"
[Figure: the proximity matrix after merging C2 and C5 into C2 U C5; the entries involving the merged cluster are marked "?" / Yakınlık Matrisi]

107 108

107 108

How to Define Inter-Cluster Distance Kümeler Arası Uzaklık Nasıl Tanımlanır?
Similarity? Benzerlik?
• MIN (single link) • MIN (tek bağlantı)
• MAX (complete link) • MAX (tam bağlantı)
• Group Average • Grup Ortalaması
• Distance Between Centroids • Merkezler Arası Uzaklık
• Other methods driven by an objective function • Bir amaç fonksiyonu tarafından yönlendirilen diğer yöntemler
– Ward's Method uses squared error – Ward'ın Yöntemi kare hatası kullanır
[Figure: two clusters of points p1…p5 and their proximity matrix / Yakınlık Matrisi]

109 110

109 110

How to Define Inter-Cluster Similarity Kümeler Arası Uzaklık Nasıl Tanımlanır?



111 112

111 112

How to Define Inter-Cluster Similarity Kümeler Arası Uzaklık Nasıl Tanımlanır?



113 114

113 114

How to Define Inter-Cluster Similarity Kümeler Arası Uzaklık Nasıl Tanımlanır?

115 116

115 116

How to Define Inter-Cluster Similarity Kümeler Arası Uzaklık Nasıl Tanımlanır?



117 118

117 118

MIN or Single Link MIN veya Tek Bağlantı

• Proximity of two clusters is based on the two • İki kümenin yakınlığı, farklı kümelerdeki en
closest points in the different clusters yakın iki noktayı temel alır.
– Determined by one pair of points, i.e., by one link in – Bir nokta çifti tarafından, yani yakınlık grafiğindeki
the proximity graph bir bağlantı ile belirlenir
• Example: • Örnek: Altı noktanın xy koordinatları:
xy-coordinates of six points:

119 120

119 120

MIN or Single Link MIN veya Tek Bağlantı
Euclidean distance matrix for six points: Altı nokta için Öklid uzaklık matrisi:

dist({3, 6}, {2, 5}) = min(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5)) dist({3, 6}, {2, 5}) = min(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5))
= min(0.15, 0.25, 0.28, 0.39) = min(0.15, 0.25, 0.28, 0.39)
= 0.15. = 0.15.

121 122

121 122

Hierarchical Clustering: MIN Hiyerarşik kümeleme : MIN

[Figure: Nested Clusters – single link clustering of the six points – and the corresponding Dendrogram / İç İçe Kümeler: altı noktanın tek bağlantı kümelemesi ve Dendrogram]

123 124

123 124

Strength of MIN "MIN" in güçlü yönü

Original Points Six Clusters Orijinal noktalar 6 küme

• Can handle non-elliptical shapes • Eliptik olmayan şekilleri işleyebilir

125 126

125 126

Limitations of MIN "MIN" in kısıtları

Two Clusters İki küme

Original Points Orijinal Noktalar

• Sensitive to noise • Gürültüye duyarlı


Three Clusters Üç küme

127 128

127 128

MAX or Complete Linkage MAX veya Tam Bağlantı

• Proximity of two clusters is based on the two • İki kümenin yakınlığı, farklı kümelerdeki en uzak
most distant points in the different clusters iki noktayı temel alır.
– Determined by all pairs of points in the two clusters – İki kümedeki tüm nokta çiftleri tarafından belirlenir

Distance Matrix: Uzaklık Matrisi:

129 130

129 130

MAX or Complete Linkage MAX veya Tam Bağlantı

• As with single link, points 3 and 6 are merged first. • Tek bağlantıda olduğu gibi, önce 3 ve 6 noktaları
• However, {3, 6} is merged with {4}, instead of {2, 5} or birleştirilir.
{1} because • Ancak {3, 6}, {2, 5} veya {1} yerine {4} ile birleştirilir,
dist({3, 6}, {4}) = max(dist(3, 4), dist(6, 4)) çünkü
dist({3, 6}, {4}) = max(dist(3, 4), dist(6, 4))
= max(0.15, 0.22) = max(0.15, 0.22)
= 0.22. = 0.22.
dist({3, 6}, {2, 5}) = max(dist(3, 2), dist(6, 2), dist(3, 5), dist({3, 6}, {2, 5}) = max(dist(3, 2), dist(6, 2), dist(3, 5),
dist(6, 5)) dist(6, 5))
= max(0.15, 0.25, 0.28, 0.39) = max(0.15, 0.25, 0.28, 0.39)
= 0.39. = 0.39.
dist({3, 6}, {1}) = max(dist(3, 1), dist(6, 1)) dist({3, 6}, {1}) = max(dist(3, 1), dist(6, 1))
= max(0.22, 0.23) = max(0.22, 0.23)
= 0.23. = 0.23.
131 132

131 132

Hierarchical Clustering: MAX Hiyerarşik kümeleme : MAX

[Figure: Nested Clusters – complete link (MAX) clustering of the six points – and the corresponding Dendrogram / İç içe kümeler ve Dendrogram]

133 134

133 134

Strength of MAX "MAX" ın güçlü yönü

Original Points Two Clusters Orijinal Noktalar İki küme

• Less susceptible to noise • Gürültüye daha az duyarlı

135 136

135 136

Limitations of MAX "MAX" ın kısıtları

Original Points Two Clusters Orijinal Noktalar İki küme

• Tends to break large clusters • Büyük kümeleri kırma eğilimindedir


• Biased towards globular clusters • Küresel kümelere eğilimli

137 138

137 138

Group Average Grup Ortalaması
• Proximity of two clusters is the average of pairwise proximity between points in the two clusters. • İki kümenin yakınlığı, iki kümedeki noktalar arasındaki ikili yakınlığın ortalamasıdır.

proximity(Cluster_i, Cluster_j) = [ Σ_{p_i ∈ Cluster_i, p_j ∈ Cluster_j} proximity(p_i, p_j) ] / ( |Cluster_i| × |Cluster_j| )

Distance Matrix: Uzaklık Matrisi:

139 140

139 140

Group Average Grup Ortalaması


• To illustrate how group average works, we calculate the • Grup ortalamasının nasıl çalıştığını göstermek için bazı
distance between some clusters. kümeler arasındaki mesafeyi hesaplıyoruz.

dist({3, 6, 4}, {1}) = (0.22 + 0.37 + 0.23)/(3 × 1) dist({3, 6, 4}, {1}) = (0.22 + 0.37 + 0.23)/(3 × 1)
= 0.28 = 0.28
dist({2, 5}, {1}) = (0.24 + 0.34)/(2 × 1) dist({2, 5}, {1}) = (0.24 + 0.34)/(2 × 1)
= 0.29 = 0.29
dist({3, 6, 4}, {2, 5}) = (0.15 + 0.28 + 0.25 + 0.39 + 0.20 + 0.29)/(3 × 2) dist({3, 6, 4}, {2, 5}) = (0.15 + 0.28 + 0.25 + 0.39 + 0.20 + 0.29)/(3 × 2)
= 0.26 = 0.26

• Because dist({3, 6, 4}, {2, 5}) is smaller than dist({3, 6, • dist({3, 6, 4}, {2, 5}), dist({3, 6, 4}, {1}) ve dist({2, 5},
4}, {1}) and dist({2, 5}, {1}), clusters {3, 6, 4} and {2, {1}) den daha küçük olduğu için, { 3, 6, 4} ve {2, 5}
5} are merged at the fourth stage. kümeleri dördüncü aşamada birleştirilir.
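These values are just means of the pairwise distances listed above; a quick check in Python (the first average evaluates to about 0.273 with the rounded matrix entries, which the slide reports as 0.28):

print((0.22 + 0.37 + 0.23) / 3)                          # dist({3,6,4}, {1})
print((0.24 + 0.34) / 2)                                 # dist({2,5}, {1}) = 0.29
print((0.15 + 0.28 + 0.25 + 0.39 + 0.20 + 0.29) / 6)     # dist({3,6,4}, {2,5}) = 0.26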
141 142

141 142

Hierarchical Clustering: Group Average Hiyerarşik Kümeleme: Grup Ortalaması

[Figure: Nested Clusters – group average clustering of the six points shown – and the corresponding Dendrogram / İç içe kümeler: gösterilen altı noktanın grup ortalama kümelemesi ve Dendrogram]
143 144

143 144

Hierarchical Clustering: Group Average Hierarchical Clustering: Group Average
• Compromise between Single and Complete Link • Tek ve Tam Bağlantı Arasında Uzlaşma

• Strengths • Güçlü yönü


– Less susceptible to noise – Gürültüye daha az duyarlı

• Limitations • Kısıtları
– Biased towards globular clusters – Küresel kümelere eğilimli

145 146

145 146

Cluster Similarity: Ward’s Method Küme Benzerliği: Ward Yöntemi


• Similarity of two clusters is based on the increase • İki kümenin benzerliği, iki küme
in squared error when two clusters are merged birleştirildiğinde kare hatasındaki artışa bağlıdır
– Similar to group average if distance between points is – Noktalar arasındaki uzaklığın karesi alındığında grup
distance squared ortalamasına benzer

• Less susceptible to noise • Gürültüye daha az duyarlı

• Biased towards globular clusters • Küresel kümelere eğilimli

• Hierarchical analogue of K-means • K-ortalamalarının hiyerarşik benzeri


– Can be used to initialize K-means – K-ortalamalarını başlatmak için kullanılabilir
147 148

147 148

Hierarchical Clustering: Comparison Hiyerarşik Kümeleme: Karşılaştırma


[Figure: clusterings of the six points produced by MIN, MAX, Group Average, and Ward's Method / MIN, MAX, Grup Ortalaması ve Ward Yöntemi ile elde edilen kümelemeler]

149 150

149 150

Hierarchical Clustering: Time and Space requirements Hiyerarşik Kümeleme: Zaman ve Uzay gereksinimleri

• O(N²) space since it uses the proximity matrix. • O(N²) uzayı çünkü yakınlık matrisini kullanıyor.
– N is the number of points. – N nokta sayısıdır.

• O(N³) time in many cases • Birçok durumda O(N³) süresi
– There are N steps and at each step the proximity matrix, of size N², must be updated and searched – N adım vardır ve her adımda N² boyutundaki yakınlık matrisi güncellenmeli ve aranmalıdır
– Complexity can be reduced to O(N² log(N)) time with some cleverness – Karmaşıklık, biraz zekice yaklaşımlarla O(N² log(N)) zamanına indirgenebilir

151 152

151 152

Hierarchical Clustering: Problems and Limitations Hiyerarşik Kümeleme: Sorunlar ve Kısıtlar

• Once a decision is made to combine two clusters, • İki kümeyi birleştirme kararı verildikten sonra
it cannot be undone geri alınamaz

• No global objective function is directly • Hiçbir küresel amaç fonksiyonu doğrudan


minimized minimize edilmez

• Different schemes have problems with one or • Farklı tasarılarda aşağıdakilerden bir veya daha
more of the following: fazlasıyla ilgili sorunlar vardır :
– Sensitivity to noise – Gürültüye duyarlılık
– Difficulty handling clusters of different sizes and – Farklı boyutlardaki ve küresel olmayan şekillerdeki
non-globular shapes kümeleri işleme zorluğu
– Breaking large clusters – Büyük kümeleri kırma
153 154

153 154

Density Based Clustering Yoğunluk Tabanlı Kümeleme


• Clusters are regions of high density that are • Kümeler, düşük yoğunluklu bölgelerle
separated from one another by regions on low birbirinden ayrılan yüksek yoğunluklu
density. bölgelerdir.

155 156

155 156

DBSCAN DBSCAN
• Density-Based Spatial Clustering of Applications • Density-Based Spatial Clustering of Applications
with Noise with Noise
• DBSCAN is a density-based algorithm. • DBSCAN, yoğunluğa dayalı bir algoritmadır.
– Density = number of points within a specified radius – Density (Yoğunluk) = belirli bir yarıçap içindeki nokta
(Eps) sayısı (Eps)
– A point is a core point if it has at least a specified – Bir nokta, Eps içinde en az belirli sayıda noktaya
number of points (MinPts) within Eps (MinPts) sahipse çekirdek noktadır
• These are points that are at the interior of a cluster • Bunlar bir kümenin iç kısmındaki noktalardır
• Counts the point itself • Noktanın kendisini sayar
– A border point is not a core point, but is in the – Bir sınır noktası, bir çekirdek nokta değildir, ancak bir
neighborhood of a core point çekirdek noktanın yakınındadır
– A noise point is any point that is not a core point or a – Gürültü noktası, çekirdek nokta veya sınır noktası
border point olmayan herhangi bir noktadır.

157 158

157 158

DBSCAN: Core, Border, and Noise Points DBSCAN: Çekirdek, Sınır ve Gürültü Noktaları

MinPts = 7 MinPts = 7

159 160

159 160

DBSCAN: Core, Border and Noise Points DBSCAN: Çekirdek, Sınır ve Gürültü Noktaları

Original Points Point types: core, Orijinal Noktalar Nokta türleri:


border and noise çekirdek, sınır ve
gürültü
Eps = 10, MinPts = 4 Eps = 10, MinPts = 4

161 162

161 162

DBSCAN Algorithm DBSCAN Algoritması
• Form clusters using core points, and assign • Çekirdek noktaları kullanarak kümeler oluşturun
border points to one of its neighboring clusters ve komşu kümelerden birine sınır noktaları atayın

1. Label all points as core, border, or noise points. 1. Tüm noktaları çekirdek, sınır veya gürültü noktaları olarak
2. Eliminate noise points. etiketleyin.
3. Put an edge between all core points within a distance Eps of 2. Gürültü noktalarını ortadan kaldırın.
each other. 3. Birbirinden Eps mesafe içinde tüm çekirdek noktaların
4. Make each group of connected core points into a separate arasına bir kenar koyun.
cluster. 4. Birbirine bağlı çekirdek noktaların her grubunu ayrı bir küme
5. Assign each border point to one of the clusters of its haline getirin.
associated core points 5. Her sınır noktasını, ilişkili çekirdek noktalarının
kümelerinden birine atayın
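In practice scikit-learn provides an implementation; a hedged usage sketch (Eps and MinPts map to the eps and min_samples parameters; the data set and parameter values are placeholders):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).random((200, 2))       # any 2-D data set
db = DBSCAN(eps=0.1, min_samples=4).fit(X)          # Eps = 0.1, MinPts = 4
labels = db.labels_                                 # cluster labels; -1 marks noise points
core = np.zeros(len(X), dtype=bool)
core[db.core_sample_indices_] = True                # True for core points; clustered non-core points are border points
print(sorted(set(labels)))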
163 164

163 164

When DBSCAN Works Well DBSCAN Ne Zaman İyi Çalışır

Original Points Clusters (dark blue points indicate noise) Orijinal Noktalar Kümeler (koyu mavi noktalar
gürültüyü gösterir)

• Can handle clusters of different shapes and sizes • Farklı şekil ve boyutlardaki kümeleri işleyebilir
• Resistant to noise • Gürültüye dayanıklı

165 166

165 166

When DBSCAN Does NOT Work Well DBSCAN Ne Zaman İyi Çalışmaz

Original Points Orijinal Points

167 168

167 168

When DBSCAN Does NOT Work Well DBSCAN Ne Zaman İyi Çalışmaz

(MinPts=4, Eps=9.92). (MinPts=4, Eps=9.92).

Original Points Orijinal Noktalar

• Varying densities • Değişen yoğunluklar


• High-dimensional data • Yüksek boyutlu veriler
(MinPts=4, Eps=9.75) (MinPts=4, Eps=9.75)

169 170

169 170

DBSCAN: Determining EPS and MinPts DBSCAN: EPS ve MinPt'lerin Belirlenmesi

• Idea is that for points in a cluster, their kth nearest • Fikir, bir kümedeki noktalar için k'inci en yakın
neighbors are at close distance komşularının yakın mesafede olmasıdır.
• Noise points have the kth nearest neighbor at farther • Gürültü noktaları, daha uzak mesafedeki k. en
distance yakın komşuya sahiptir.
• So, plot sorted distance of every point to its kth • Böylece, her noktanın sıralanmış mesafesini k'inci
nearest neighbor en yakın komşusuna çizin
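A hedged sketch of this k-distance plot with scikit-learn and matplotlib (k = MinPts = 4 is an assumed value; the query point is returned as its own nearest neighbor, hence k + 1 neighbors are requested):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(0).random((500, 2))        # the data to be clustered
k = 4                                                # k = MinPts
dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
k_dist = np.sort(dist[:, -1])                        # distance of each point to its k-th nearest neighbor
plt.plot(k_dist)                                     # the "knee" of this curve suggests a value for Eps
plt.xlabel("points sorted by k-dist")
plt.ylabel("k-dist")
plt.show()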

171 172

171 172

Cluster Validity

• For supervised classification we have a variety of measures to evaluate how good our model is
  – Accuracy, precision, recall
• For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters
• But "clusters are in the eye of the beholder"!
  – In practice the clusters we find are defined by the clustering algorithm
• Then why do we want to evaluate them?
  – To avoid finding patterns in noise
  – To compare clustering algorithms
  – To compare two sets of clusters
  – To compare two clusters
Clusters Found in Random Data

[Figure: four scatter plots of the same random points, clustered by random assignment, DBSCAN, K-means, and Complete Link]


Measures of Cluster Validity

• Numerical measures that are applied to judge various aspects of cluster validity are classified into the following two types:
  – Supervised: used to measure the extent to which cluster labels match externally supplied class labels.
    • Entropy
    – Often called external indices because they use information external to the data
  – Unsupervised: used to measure the goodness of a clustering structure without respect to external information.
    • Sum of Squared Error (SSE)
    – Often called internal indices because they only use information in the data
• You can use supervised or unsupervised measures to compare clusters or clusterings

Unsupervised Measures: Cohesion and Separation

• Cluster Cohesion (compactness, tightness):
  – Measures how closely related the objects in a cluster are
  • Example: SSE
    – Cohesion is measured by the within-cluster sum of squares (SSE)
      $$SSE = \sum_i \sum_{x \in C_i} (x - m_i)^2$$
• Cluster Separation (isolation):
  – Measures how distinct or well-separated a cluster is from other clusters
  • Example: Squared Error
    – Separation is measured by the between-cluster sum of squares (a sketch of both measures follows)
      $$SSB = \sum_i |C_i| \, (m - m_i)^2$$
    where |C_i| is the size of cluster i
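A minimal sketch of the two measures above, assuming X is an (n, d) NumPy array and labels assigns each row to a cluster (the function name is an assumption, not from the slides).

```python
import numpy as np

def sse_ssb(X, labels):
    m = X.mean(axis=0)                               # overall mean of the data
    sse, ssb = 0.0, 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)                         # centroid of cluster c
        sse += np.sum((Xc - mc) ** 2)                # within-cluster sum of squares
        ssb += len(Xc) * np.sum((m - mc) ** 2)       # between-cluster sum of squares
    return sse, ssb                                  # SSE + SSB = total sum of squares
```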

Unsupervised Measures: Cohesion and Separation

• Example: SSE and SSB for the points {1, 2, 4, 5} (cluster means m1 = 1.5, m2 = 4.5; overall mean m = 3)
  – SSB + SSE = constant

K=1 cluster:
  SSE = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
  SSB = 4 × (3-3)^2 = 0
  Total = 10 + 0 = 10

K=2 clusters:
  SSE = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1
  SSB = 2 × (3-1.5)^2 + 2 × (4.5-3)^2 = 9
  Total = 1 + 9 = 10


Unsupervised Measures: Cohesion and Separation

• A proximity graph-based approach can also be used for cohesion and separation.
  – Cluster cohesion is the sum of the weights of all links within a cluster.
  – Cluster separation is the sum of the weights between nodes in the cluster and nodes outside the cluster.

[Figure: cohesion (edges within a cluster) versus separation (edges between clusters)]



Unsupervised Measures: Silhouette Coefficient

• The silhouette coefficient combines the ideas of cohesion and separation, but for individual points, as well as for clusters and clusterings
• For an individual point i:
  – Calculate a = average distance of i to the points in its own cluster
  – Calculate b = min (average distance of i to the points in another cluster)
  – The silhouette coefficient for the point is then given by
      s = (b – a) / max(a, b)
  – The value can vary between -1 and 1
  – Typically it ranges between 0 and 1; the closer to 1, the better
  – The average silhouette coefficient can be calculated for a cluster or a clustering (see the sketch below)

[Figure: distances used to calculate a (within the cluster of i) and b (to points in another cluster)]
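A minimal sketch of the silhouette coefficient of a single point, following the a/b definition above; it assumes the point's cluster has more than one member (scikit-learn's silhouette_score returns the average directly).

```python
import numpy as np
from scipy.spatial.distance import cdist

def silhouette_point(i, X, labels):
    own = labels[i]
    d = cdist(X[i:i + 1], X)[0]                      # distances from point i to all points
    # a = average distance to the other points in i's own cluster
    a = d[(labels == own) & (np.arange(len(X)) != i)].mean()
    # b = smallest average distance to the points of any other cluster
    b = min(d[labels == c].mean() for c in np.unique(labels) if c != own)
    return (b - a) / max(a, b)
```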
Measuring Cluster Validity Via Correlation

• Two matrices
  – Proximity Matrix
  – Ideal Similarity Matrix
    • One row and one column for each data point
    • An entry is 1 if the associated pair of points belongs to the same cluster
    • An entry is 0 if the associated pair of points belongs to different clusters
• Compute the correlation between the two matrices (a sketch follows)
  – Since the matrices are symmetric, only the correlation between the n(n-1)/2 entries above the diagonal needs to be calculated.
• A high magnitude of correlation indicates that points that belong to the same cluster are close to each other.
  – The correlation may be positive or negative depending on whether the proximity matrix is a similarity or a dissimilarity matrix
• Not a good measure for some density- or contiguity-based clusters.
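A minimal sketch of the measure above: correlate the entries of a distance (proximity) matrix with the ideal same-cluster indicator matrix (function name and use of SciPy/NumPy are assumptions).

```python
import numpy as np
from scipy.spatial.distance import cdist

def cluster_label_correlation(X, labels):
    labels = np.asarray(labels)
    D = cdist(X, X)                                               # proximity (distance) matrix
    ideal = (labels[:, None] == labels[None, :]).astype(float)    # 1 if the pair shares a cluster
    iu = np.triu_indices(len(X), k=1)                             # the n(n-1)/2 upper-triangle entries
    # A distance matrix gives a negative correlation: small distance ~ same cluster
    return np.corrcoef(D[iu], ideal[iu])[0, 1]
```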

Measuring Cluster Validity Via Correlation

• Correlation of the ideal similarity and proximity matrices for the K-means clustering of the following well-clustered data set.

[Figure: the well-clustered points and the corresponding similarity matrix; Corr = 0.9235]


Measuring Cluster Validity Via Correlation

• Correlation of the ideal similarity and proximity matrices for the K-means clustering of the following random data set.

[Figure: the random points and the corresponding similarity matrix; K-means, Corr = 0.5810]

Judging a Clustering Visually by its Similarity Matrix

• Order the similarity matrix with respect to cluster labels and inspect visually.

[Figure: well-separated clusters and the block-diagonal structure of the reordered similarity matrix]


Judging a Clustering Visually by its Similarity Matrix

• Clusters in random data are not so crisp

[Figure: DBSCAN clustering of random data and its similarity matrix, which shows no clear block structure]


Judging a Clustering Visually by its Similarity Matrix

[Figure: DBSCAN clustering of a more complex data set (clusters labeled 1–7) and its reordered similarity matrix]

Determining the Correct Number of Clusters

• SSE is good for comparing two clusterings or two clusters
• SSE can also be used to estimate the number of clusters: plot SSE against K and look for the point where the curve stops dropping sharply (see the sketch below)

[Figure: a well-clustered data set and its SSE-versus-K curve for K = 2 to 30]
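A minimal sketch of the SSE-versus-K curve described above, using scikit-learn's KMeans (an assumption; any K-means implementation with a within-cluster SSE would do).

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def sse_curve(X, k_values=range(2, 31)):
    sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
           for k in k_values]                      # inertia_ is the within-cluster SSE
    plt.plot(list(k_values), sse, marker="o")
    plt.xlabel("K")
    plt.ylabel("SSE")
    plt.show()                                     # the flattening point suggests the number of clusters
```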


Determining the Correct Number of Clusters

• SSE curve for a more complicated data set

[Figure: a more complicated data set (clusters labeled 1–7) and the SSE of the clusters found using K-means]


Supervised Measures of Cluster Validity: Entropy and Purity
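A minimal sketch of the two supervised measures named above, computed from the class-label distribution inside each cluster and weighted by cluster size (the function name and integer-coded class labels are assumptions, not from the slides).

```python
import numpy as np

def entropy_and_purity(cluster_labels, class_labels):
    cluster_labels = np.asarray(cluster_labels)
    class_labels = np.asarray(class_labels)          # assumed to be non-negative integers
    n = len(cluster_labels)
    total_entropy, total_purity = 0.0, 0.0
    for c in np.unique(cluster_labels):
        members = class_labels[cluster_labels == c]
        p = np.bincount(members) / len(members)      # class probabilities within cluster c
        p = p[p > 0]
        total_entropy += len(members) / n * -(p * np.log2(p)).sum()
        total_purity += len(members) / n * p.max()
    return total_entropy, total_purity
```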

Assessing the Significance of Cluster Validity Measures

• A framework is needed to interpret any measure.
  – For example, if our measure of evaluation has the value 10, is that good, fair, or poor?
• Statistics provide a framework for cluster validity
  – The more "atypical" a clustering result is, the more likely it represents valid structure in the data
  – Compare the value of an index obtained from the given data with the values obtained from random data.
    • If the value of the index is unlikely to arise by chance, then the cluster results are valid

Statistical Framework for SSE

• Example
  – Compare the SSE of three cohesive clusters against three clusters in random data

[Figure: the three cohesive clusters have SSE = 0.005; the histogram shows the SSE of three clusters found in each of 500 sets of 100 random data points distributed over the range 0.2–0.8 for the x and y values]

Statistical Framework for Correlation

• Correlation of the ideal similarity and proximity matrices for the K-means clusterings of the following two data sets.

[Figure: the well-clustered data set gives Corr = -0.9235 and the random data set gives Corr = -0.5810; the histogram shows the correlation for 500 random data sets of size 100 with x and y values between 0.2 and 0.8. The correlation is negative because it is calculated between a distance matrix and the ideal similarity matrix; a higher magnitude is better.]
Final Comment on Cluster Validity

• "The validation of clustering structures is the most difficult and frustrating part of cluster analysis.
• Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

Algorithms for Clustering Data, Jain and Dubes

• H. Xiong and Z. Li. Clustering Validation Measures. In C. C. Aggarwal and C. K. Reddy, editors, Data Clustering: Algorithms and Applications, pages 571–605. Chapman & Hall/CRC, 2013.

Data Mining

Prof. Dr. Nizamettin AYDIN
naydin@yildiz.edu.tr
http://www3.yildiz.edu.tr/~naydin

Anomaly Detection

• Outline
  – Characteristics of Anomaly Detection Problems
  – Characteristics of Anomaly Detection Methods
  – Statistical Approaches
  – Proximity-based Approaches
  – Clustering-based Approaches
  – Reconstruction-based Approaches
  – One-class Classification
  – Information Theoretic Approaches
  – Evaluation of Anomaly Detection

Anomaly/Outlier Detection

• What are anomalies/outliers?
  – The set of data points that are considerably different than the remainder of the data
• The natural implication is that anomalies are relatively rare
  – One in a thousand occurs often if you have lots of data
  – Context is important, e.g., freezing temperatures in July
• Can be important or a nuisance
  – Unusually high blood pressure
  – A 100 kg, 2-year-old

Importance of Anomaly Detection

Ozone Depletion History
• In 1985 three researchers (Farman, Gardinar and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels
• Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
• The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!

Source: http://www.epa.gov/ozone/science/hole/size.html
Causes of Anomalies

• Data from different classes
  – Measuring the weights of oranges, but a few grapefruit are mixed in
• Natural variation
  – Unusually tall people
• Data errors
  – A 100 kg, 2-year-old

Distinction Between Noise and Anomalies

• Noise doesn't necessarily produce unusual values or objects
• Noise is not interesting
• Noise and anomalies are related but distinct concepts


Model-based vs Model-free

• Model-based Approaches
  – The model can be parametric or non-parametric
  – Anomalies are those points that don't fit the model well
  – Anomalies are those points that distort the model
• Model-free Approaches
  – Anomalies are identified directly from the data without building a model
• Often the underlying assumption is that most of the points in the data are normal

General Issues

• Global vs. Local Perspective
  – An instance can be identified as an anomaly by
    • building a model over all normal instances and using this global model for anomaly detection
    • considering the local perspective of every data instance
      – an anomaly detection approach is termed local if its output on a given instance does not change when instances outside its local neighborhood are modified or removed
• Label vs. Score
  – Some anomaly detection techniques provide only a binary categorization (anomaly or normal)
  – Other approaches measure the degree to which an object is an anomaly
    • This allows objects to be ranked
    • Scores can also have an associated meaning (e.g., statistical significance)

Anomaly Detection Techniques

• Statistical Approaches
• Proximity-based
  – Anomalies are points far away from other points
• Clustering-based
  – Points far away from cluster centers are outliers
  – Small clusters are outliers
• Reconstruction-based
  – Rely on the assumption that the normal class resides in a space of lower dimensionality than the original space of attributes

Statistical Approaches

• Probabilistic definition of an outlier:
  – An outlier is an object that has a low probability with respect to a probability distribution model of the data.
• Usually assume a parametric model describing the distribution of the data (e.g., normal distribution)
• Apply a statistical test that depends on
  – Data distribution
  – Parameters of the distribution (e.g., mean, variance)
  – Number of expected outliers (confidence limit)
• Issues
  – Identifying the distribution of a data set
    • Heavy-tailed distributions
  – Number of attributes
  – Is the data a mixture of distributions?
Boxplot

• This simplest possible box plot displays the full range of variation (from min to max), the likely range of variation (the IQR), and a typical value (the median).
• Not uncommonly, real datasets will display surprisingly high maximums or surprisingly low minimums, called outliers.
• John Tukey has provided a precise definition for two types of outliers (a sketch of this rule follows):
  – Outliers are either 3×IQR or more above the third quartile or 3×IQR or more below the first quartile.
  – Suspected outliers are slightly more central versions of outliers:
    • either 1.5×IQR or more above the third quartile (Q3 + 1.5×IQR)
    • or 1.5×IQR or more below the first quartile (Q1 − 1.5×IQR)
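A minimal sketch of Tukey's rule above: flag values beyond Q1 − k×IQR or Q3 + k×IQR, with k = 1.5 for suspected outliers and k = 3 for outliers (the function name is an assumption).

```python
import numpy as np

def tukey_outliers(x, k=1.5):
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr      # the "fences"
    return x[(x < lower) | (x > upper)]            # values flagged as outliers
```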
– (Q1-1.5 x IQR)


Boxplot

• If either type of outlier is present
  – the whisker on the appropriate side is taken to 1.5×IQR from the quartile (the "inner fence") rather than to the max or min,
  – individual outlying data points are displayed as
    • unfilled circles for suspected outliers
    • or filled circles for outliers.
• The "outer fence" is 3×IQR from the quartile.

Normal Distributions

[Figure: a one-dimensional Gaussian and a two-dimensional Gaussian probability density]

Grubbs' Test

• Detects outliers in univariate data
• Assumes the data come from a normal distribution
• Detects one outlier at a time: remove the outlier and repeat (see the sketch below)
  – H0: There is no outlier in the data
  – HA: There is at least one outlier
• Grubbs' test statistic:
  $$G = \frac{\max_i |X_i - \bar{X}|}{s}$$
• Reject H0 if:
  $$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{(\alpha/N,\,N-2)}}{N - 2 + t^2_{(\alpha/N,\,N-2)}}}$$
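A minimal sketch of a single round of Grubbs' test above (repeated removal is omitted). The critical value follows the slide's α/N form; other references use α/(2N). SciPy is an assumption, not prescribed by the slides.

```python
import numpy as np
from scipy import stats

def grubbs_test(x, alpha=0.05):
    x = np.asarray(x, dtype=float)
    n = len(x)
    g = np.max(np.abs(x - x.mean())) / x.std(ddof=1)        # Grubbs' statistic G
    t2 = stats.t.ppf(1 - alpha / n, n - 2) ** 2             # squared critical t value (slide's alpha/N form)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t2 / (n - 2 + t2))
    return g, g_crit, g > g_crit                            # True means: reject H0, an outlier is present
```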

Statistically-based – Likelihood Approach

• Assume the data set D contains samples from a mixture of two probability distributions:
  – M (majority distribution)
  – A (anomalous distribution)
• General Approach:
  – Initially, assume all the data points belong to M
  – Let Lt(D) be the log likelihood of D at time t
  – For each point xt that belongs to M, move it to A
    • Let Lt+1(D) be the new log likelihood.
    • Compute the difference, Δ = Lt(D) – Lt+1(D)
    • If Δ > c (some threshold), then xt is declared as an anomaly and moved permanently from M to A

Statistically-based – Likelihood Approach

• Data distribution: D = (1 − λ) M + λ A
• M is a probability distribution estimated from the data
  – Can be based on any modeling method (naïve Bayes, maximum entropy, etc.)
• A is initially assumed to be a uniform distribution
• Likelihood at time t:
  $$L_t(D) = \prod_{i=1}^{N} P_D(x_i) = \left[(1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i)\right] \left[\lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i)\right]$$
  $$LL_t(D) = |M_t|\log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t|\log\lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i)$$

Strengths/Weaknesses of Statistical Approaches

• Firm mathematical foundation
• Can be very efficient
• Good results if the distribution is known
• In many cases, the data distribution may not be known
• For high-dimensional data, it may be difficult to estimate the true distribution
• Anomalies can distort the parameters of the distribution

Distance-Based Approaches

• The outlier score of an object is the distance to its kth nearest neighbor (a sketch follows)

One Nearest Neighbor - One Outlier

[Figure: outlier scores based on the distance to the first nearest neighbor; the single isolated point receives the highest score]
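A minimal sketch of the distance-based outlier score above: the distance of each point to its kth nearest neighbor, with larger values indicating more anomalous points (scikit-learn is an assumption).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outlier_score(X, k=5):
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own nearest neighbor
    distances, _ = nbrs.kneighbors(X)
    return distances[:, k]                              # distance to the kth nearest neighbor
```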


One Nearest Neighbor - Two Outliers

[Figure: outlier scores based on the distance to the first nearest neighbor when two outliers lie close together; note the much smaller score scale]

Five Nearest Neighbors - Small Cluster

[Figure: outlier scores based on the distance to the fifth nearest neighbor for data containing a small cluster]

Five Nearest Neighbors - Differing Density

[Figure: outlier scores based on the distance to the fifth nearest neighbor for data with regions of differing density]

Strengths/Weaknesses of Distance-Based Approaches

• Simple
• Expensive – O(n²)
• Sensitive to parameters
• Sensitive to variations in density
• Distance becomes less meaningful in high-dimensional space

Density-Based Approaches

• Density-based Outlier:
  – The outlier score of an object is the inverse of the density around the object.
  – Can be defined in terms of the k nearest neighbors
  – One definition: inverse of the distance to the kth neighbor
  – Another definition: inverse of the average distance to the k neighbors
  – DBSCAN definition
• If there are regions of different density, this approach can have problems

Relative Density

• Consider the density of a point relative to that of its k nearest neighbors (see the sketch below)
• Let y_1, ..., y_k be the k nearest neighbors of x
  $$\text{density}(\mathbf{x}, k) = \frac{1}{\text{dist}(\mathbf{x}, k)} = \frac{1}{\text{dist}(\mathbf{x}, \mathbf{y}_k)}$$
  $$\text{relative density}(\mathbf{x}, k) = \frac{\sum_{i=1}^{k} \text{density}(\mathbf{y}_i, k)/k}{\text{density}(\mathbf{x}, k)} = \frac{\text{dist}(\mathbf{x}, k)}{\sum_{i=1}^{k} \text{dist}(\mathbf{y}_i, k)/k}$$
• Can use the average distance instead
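A minimal sketch of the relative density score above, with density taken as the inverse of the distance to the kth nearest neighbor (scikit-learn is an assumption; the slides describe the measure, not a library).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def relative_density_score(X, k=5):
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nbrs.kneighbors(X)                     # column 0 is each point itself
    density = 1.0 / dist[:, k]                         # density(x, k) = 1 / dist to kth neighbor
    neighbor_density = density[idx[:, 1:]].mean(axis=1)
    return neighbor_density / density                  # values well above 1 suggest outliers
```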

Relative Density Outlier Scores

[Figure: relative density outlier scores; the isolated point C scores 6.85, while D scores 1.40 and A scores 1.33]

Relative Density-based: LOF Approach

• For each point, compute the density of its local neighborhood
• Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors
• Outliers are points with the largest LOF values

[Figure: in the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers]
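A minimal sketch of LOF scoring using scikit-learn's LocalOutlierFactor (an assumption; the slides describe the method, not a specific implementation).

```python
from sklearn.neighbors import LocalOutlierFactor

def lof_scores(X, k=20):
    lof = LocalOutlierFactor(n_neighbors=k)
    lof.fit(X)                                   # computes the factors for the training data
    return -lof.negative_outlier_factor_         # larger value = more outlying
```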

Strengths/Weaknesses of Density-Based Approaches

• Simple
• Expensive – O(n²)
• Sensitive to parameters
• Density becomes less meaningful in high-dimensional space

Clustering-Based Approaches

• An object is a cluster-based outlier if it does not strongly belong to any cluster
  – For prototype-based clusters, an object is an outlier if it is not close enough to a cluster center
    • Outliers can impact the clustering produced
  – For density-based clusters, an object is an outlier if its density is too low
    • Can't distinguish between noise and outliers
  – For graph-based clusters, an object is an outlier if it is not well connected


Distance of Points from Closest Centroids

[Figure: outlier score as the distance to the closest centroid; point C scores 4.6, D scores 0.17, and A scores 1.2]

Relative Distance of Points from Closest Centroid

[Figure: outlier score as the relative distance to the closest centroid]



Strengths/Weaknesses of Clustering-Based Approaches

• Simple
• Many clustering techniques can be used
• Can be difficult to decide on a clustering technique
• Can be difficult to decide on the number of clusters
• Outliers can distort the clusters

Reconstruction-Based Approaches

• Based on the assumption that there are patterns in the distribution of the normal class that can be captured using lower-dimensional representations
• Reduce the data to a lower-dimensional representation
  – E.g., use Principal Components Analysis (PCA) or autoencoders
• Measure the reconstruction error for each object
  – The difference between the original and the reduced-dimensionality version


Reconstruction Error

• Let x be the original data object
• Find the representation of the object in a lower-dimensional space
• Project the object back to the original space
• Call this object x̂

  Reconstruction Error(x) = ‖x − x̂‖

• Objects with large reconstruction errors are anomalies (see the sketch below)

Reconstruction of Two-Dimensional Data
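A minimal sketch of a PCA-based reconstruction error score, following the steps above (scikit-learn's PCA is an assumption; any linear projection would work the same way).

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_reconstruction_error(X, n_components=1):
    pca = PCA(n_components=n_components).fit(X)
    Z = pca.transform(X)                          # representation in the lower-dimensional space
    X_hat = pca.inverse_transform(Z)              # project back to the original space
    return np.linalg.norm(X - X_hat, axis=1)      # large error suggests an anomaly
```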

Basic Architecture of an Autoencoder

• An autoencoder is a multi-layer neural network
• The number of input and output neurons is equal to the number of original attributes.

Strengths and Weaknesses

• Does not require assumptions about the distribution of the normal class
• Can use many dimensionality reduction approaches
• The reconstruction error is computed in the original space
  – This can be a problem if the dimensionality is high


One-Class SVM

• Uses an SVM approach to classify normal objects
• Uses the given data to construct such a model
• This data may contain outliers
• But the data does not contain class labels
• How to build a classifier given one class?

How Does One-Class SVM Work?

• Uses the "origin" trick
• Uses a Gaussian kernel
  – Every point is mapped to a unit hypersphere
  – Every point lies in the same orthant (quadrant)
• Aims to maximize the distance of the separating plane from the origin
Two-Dimensional One-Class SVM

[Figure: a two-dimensional one-class SVM decision boundary]

Equations for One-Class SVM

• Equation of the hyperplane
• 𝜙 is the mapping to the high-dimensional space
• The weight vector
• ν is the fraction of outliers
• The optimization condition


Finding Outliers with a One-Class SVM

• Decision boundary with ν = 0.1
• Decision boundary with ν = 0.05 and ν = 0.2 (a sketch follows)
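A minimal sketch of one-class SVM outlier detection with a Gaussian (RBF) kernel, using scikit-learn's OneClassSVM (an assumption; the slides describe the method, not this library).

```python
from sklearn.svm import OneClassSVM

def one_class_svm_outliers(X, nu=0.1):
    model = OneClassSVM(kernel="rbf", nu=nu, gamma="scale")   # nu ~ expected fraction of outliers
    labels = model.fit_predict(X)                              # +1 = normal, -1 = outlier
    return labels == -1                                        # boolean mask of outliers
```

Varying nu (e.g., 0.05, 0.1, 0.2) loosens or tightens the decision boundary, as in the figures referenced above.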


Strengths and Weaknesses

• Strong theoretical foundation
• Choice of ν is difficult
• Computationally expensive

Information Theoretic Approaches

• The key idea is to measure how much the information decreases when you delete an observation
• Anomalies should show a higher gain
• Normal points should show less gain

Information Theoretic Example

• Survey of height and weight for 100 participants
• Eliminating the last group gives a gain of 2.08 − 1.89 = 0.19

Strengths and Weaknesses

• Solid theoretical foundation
• Theoretically applicable to all kinds of data
• Difficult and computationally expensive to implement in practice

Evaluation of Anomaly Detection

• If class labels are present, then use standard evaluation approaches for a rare class, such as precision, recall, or false positive rate
  – FPR is also known as the false alarm rate
• For unsupervised anomaly detection, use the measures provided by the anomaly detection method
  – E.g., reconstruction error or gain
• Can also look at histograms of anomaly scores (see the sketch below).

Distribution of Anomaly Scores

• Anomaly scores should show a tail
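A minimal sketch of the two evaluation routes above: rare-class metrics when ground-truth labels exist, and a score histogram (look for a tail) when they do not. Function and argument names are assumptions; `predicted` and `true_labels` are binary with 1 = anomaly.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score, recall_score

def evaluate_anomalies(scores, predicted, true_labels=None):
    if true_labels is not None:                    # supervised case: rare-class metrics
        print("precision:", precision_score(true_labels, predicted))
        print("recall:   ", recall_score(true_labels, predicted))
    plt.hist(scores, bins=50)                      # unsupervised case: inspect the score distribution
    plt.xlabel("anomaly score")
    plt.ylabel("count")
    plt.show()
```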

