
Regulation-2020 Academic Year-2022-2023

DEPARTMENT OF IT
U20ITT511 – DATA WAREHOUSING & DATA
MINING (REGULATION-2020)
(For V Semester B.Tech IT)
UNIT-I: INTRODUCTION TO DATA WAREHOUSING

Prepared by

P.PRAVEENKUMAR,AP/IT


SRI MANAKULA VINAYAGAR ENGINEERING COLLEGE


DEPARTMENT OF IT
U20ITT511 – DATA WAREHOUSING AND DATA MINING
UNIT – I: INTRODUCTION TO DATA WAREHOUSING

LECTURE NOTES
Data Warehouse: Data Warehouse-basic Concepts-Modeling-Design and Usage-Implementation-Data
generalization by Attribute-Oriented induction approach-Data Cube Computation methods.

DATA WAREHOUSE:
A data warehouse is a collection of data marts representing historical data from different operations in
the company. This data is stored in a structure optimized for querying and data analysis.
BASIC CONCEPTS:
CHARACTERISTICS OF DATAWAREHOUSE:
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in
support of management's decision making process.
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example,
"sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and
source B may have different ways of identifying a product, but in a data warehouse, there will be only a
single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3
months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a
transaction system, where often only the most recent data is kept. For example, a transaction system may
hold the most recent address of a customer, whereas a data warehouse can hold all addresses associated
with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data
warehouse should never be altered.

A Three Tier Data Warehouse Architecture:

Tier-1:


The bottom tier is a warehouse database server that is almost always a relational database system. Back-end
tools and utilities are used to feed data into the bottom tier from operational databases or other external
sources (such as customer profile information provided by external consultants). These tools and utilities
perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a
unified format), as well as load and refresh functions to update the data warehouse. The data are extracted
using application program interfaces known as gateways. A gateway is supported by the underlying DBMS
and allows client programs to generate SQL code to be executed at a server.

Tier-2:

The middle tier is an OLAP server that is typically implemented using either a relational OLAP (ROLAP)
model or a multidimensional OLAP (MOLAP) model.

 A relational OLAP (ROLAP) model is an extended relational DBMS that maps operations on
multidimensional data to standard relational operations.
 A multidimensional OLAP (MOLAP) model, that is, a special-purpose server that directly
implements multidimensional data and operations.


Tier-3:

The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data
mining tools (e.g., trend analysis, prediction, and so on).
Benefits of data warehousing
 Data warehouses are designed to perform well with aggregate queries running on
large amounts of data.
 The structure of data warehouses is easier for end users to navigate, understand
and query against unlike the relational databases primarily designed to handle
lots of transactions.
 Data warehouses enable queries that cut across different segments of a
company's operation. E.g. production data could be compared against inventory
data even if they were originally stored in different databases with different
structures.
 Queries that would be complex in very normalized databases could be easier to
build and maintain in data warehouses, decreasing the workload on transaction
systems.
 Data warehousing is an efficient way to manage and report on data that is
from a variety of sources, non-uniform and scattered throughout a company.
 Data warehousing is an efficient way to manage demand for lots of information
from lots of users.
 Data warehousing provides the capability to analyze large amounts of historical data.
Operational and Informational Data
Operational Data:
 Focusing on transactional functions such as bank card withdrawals and deposits
 Detailed
 Updateable
 Reflects current data
Informational Data:
 Focusing on providing answers to problems posed by decision makers
 Summarized
 Non updateable

Differences between Operational Database Systems and Data Warehouses


Features of OLTP and OLAP

The major distinguishing features between OLTP and OLAP are summarized as follows.

1. Users and system orientation: An OLTP system is customer-oriented and is used for
transaction and query processing by clerks, clients, and information technology professionals. An
OLAP system is market-oriented and is used for data analysis by knowledge workers, including
managers, executives, and analysts.

2. Data contents: An OLTP system manages current data that, typically, are too detailed to be


easily used for decision making. An OLAP system manages large amounts of historical data,
provides facilities for summarization and aggregation, and stores and manages information at
different levels of granularity. These features make the data easier for use in informed decision
making.

3. Database design: An OLTP system usually adopts an entity-relationship (ER) data model and
an application oriented database design. An OLAP system typically adopts either a star or
snowflake model and a subject-oriented database design.
4. View: An OLTP system focuses mainly on the current data within an enterprise or department,
without referring to historical data or data in different organizations. In contrast, an OLAP
system often spans multiple versions of a database schema. OLAP systems also deal with
information that originates from different organizations, integrating information from many data
stores. Because of their huge volume, OLAP data are stored on multiple storage media.

5. Access patterns: The access patterns of an OLTP system consist mainly of short, atomic
transactions. Such a system requires concurrency control and recovery mechanisms. However,
accesses to OLAP systems are mostly read-only operations, although many could be complex
queries.

Difference between OLTP and OLAP


Multidimensional Data Model
The most popular data model for data warehouses is a multidimensional model. This model can exist in the
form of a star schema, a snowflake schema, or a fact constellation schema. Let's have a look at each of these
schema types.

Tables and Spreadsheets to Data Cubes

Stars, Snowflakes, and Fact Constellations:

Schemas for Multidimensional Databases


 Star schema: The star schema is a modeling paradigm in which the data warehouse
contains (1) a large central table (fact table), and (2) a set of smaller attendant tables
(dimension tables), one for each dimension. The schema graph resembles a starburst,
with the dimension tables displayed in a radial pattern around the central fact table.

 Snowflake schema: The snowflake schema is a variant of the star schema model, where
some dimension tables are normalized, thereby further splitting the data into additional
tables. The resulting schema graph forms a shape similar to a snowflake. The major
difference between the snowflake and star schema models is that the dimension tables of
the snowflake model may be kept in normalized form. Such tables are easy to maintain
and save storage space, because a dimension table can become very large when the
dimensional structure is included as redundant columns.


 Fact constellation: Sophisticated applications may require multiple fact tables to share
dimension tables. This kind of schema can be viewed as a collection of stars, and hence is
called a galaxy schema or a fact constellation.
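
To make the star schema concrete, the following minimal sketch (in Python with pandas, using invented table and column names such as sales_fact and dim_item that are not part of these notes) builds a tiny fact table and one dimension table and answers an aggregate query by joining them.

import pandas as pd

# Dimension table: one row per item, descriptive attributes only
dim_item = pd.DataFrame({
    "item_key": [1, 2],
    "item_name": ["laptop", "printer"],
    "category": ["computer", "peripheral"],
})

# Central fact table: foreign keys into the dimensions plus numeric measures
sales_fact = pd.DataFrame({
    "item_key": [1, 1, 2],
    "time_key": [101, 102, 101],
    "dollars_sold": [1200.0, 1150.0, 300.0],
    "units_sold": [1, 1, 2],
})

# A typical warehouse query: resolve the foreign key, then aggregate a measure
report = (sales_fact.merge(dim_item, on="item_key")
                    .groupby("category")["dollars_sold"]
                    .sum())
print(report)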

Meta Data Repository:


Metadata are data about data. When used in a data warehouse, metadata are the data that define warehouse
objects. Metadata are created for the data names and definitions of the given warehouse. Additional
metadata are created and captured for timestamping any extracted data, the source of the extracted data,
and missing fields that have been added by data cleaning or integration processes.
A metadata repository should contain the following:
 A description of the structure of the data warehouse, which includes the warehouse schema, view,
dimensions, hierarchies, and derived data definitions, as well as data mart locations and contents.
 Operational metadata, which include data lineage (history of migrated data and the sequence of
transformations applied to it), currency of data (active, archived, or purged), and monitoring
information (warehouse usage statistics, error reports, and audit trails).
 The algorithms used for summarization, which include measure and dimension definition
algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and
predefined queries and reports.
 Data related to system performance, which include indices and profiles that improve data access
and retrieval performance, in addition to rules for the timing and scheduling of refresh, update, and
replication cycles.


OLAP (Online Analytical Processing):

 OLAP is an approach to answering multi-dimensional analytical (MDA) queries swiftly.


 OLAP is part of the broader category of business intelligence, which also encompasses
relational database, report writing and data mining.
 OLAP tools enable users to analyze multidimensional data interactively from multiple
perspectives.

OLAP consists of three basic analytical operations:


 Consolidation (Roll-Up)
 Drill-Down
 Slicing And Dicing

 Consolidation involves the aggregation of data that can be accumulated and computed in one or more
dimensions. For example, all sales offices are rolled up to the sales department or sales division to
anticipate sales trends.

 The drill-down is a technique that allows users to navigate through the details. For instance,
users can view the sales by individual products that make up a region’s sales.

 Slicing and dicing is a feature whereby users can take out (slicing) a specific set of data of the
OLAP cube and view (dicing) the slices from different viewpoints.

TYPES OF OLAP:

Relational OLAP (ROLAP):


 ROLAP works directly with relational databases. The base data and the dimension tables are
stored as relational tables and new tables are created to hold the aggregated information. It
depends on a specialized schema design.
 This methodology relies on manipulating the data stored in the relational database to give the
appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of
slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.

 ROLAP tools do not use pre-calculated data cubes but instead pose the query to the standard
relational database and its tables in order to bring back the data required to answer the question.

Multidimensional OLAP (MOLAP):


 MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP.
 MOLAP stores data in an optimized multi-dimensional array storage, rather than in a


relational database. Therefore it requires the pre-computation and storage of information in the
cube - the operation known as processing.

 MOLAP tools generally utilize a pre-calculated data set referred to as a data cube. The data cube
contains all the possible answers to a given range of questions.

 MOLAP tools have a very fast response time and the ability to quickly write back data into the
data set.

Hybrid OLAP (HOLAP):


 There is no clear agreement across the industry as to what constitutes Hybrid OLAP, except that a
database will divide data between relational and specialized storage.

 For example, for some vendors, a HOLAP database will use relational tables to hold the larger
quantities of detailed data, and use specialized storage for at least some aspects of the smaller
quantities of more-aggregate or less-detailed data.
 HOLAP addresses the shortcomings of MOLAP and ROLAP by combining the capabilities of
both approaches.
 HOLAP tools can utilize both pre-calculated cubes and relational data sources.
DATA WAREHOUSE MODEL:
From the architecture point of view, there are three data warehouse models: the enterprise
warehouse, the data mart, and the virtual warehouse.

 Enterprise warehouse: An enterprise warehouse collects all of the information about


subjects spanning the entire organization. It provides corporate-wide data integration,
usually from one or more operational systems or external information providers, and is
cross-functional in scope. It typically contains detailed data as well as summarized data,
and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or
beyond.

 Data mart: A data mart contains a subset of corporate-wide data that is of value to a
specific group of users. The scope is confined to specific selected subjects. For example,
a marketing data mart may confine its subjects to customer, item, and sales. The data
contained in data marts tend to be summarized. Depending on the source of data, data
marts can be categorized into the following two classes:

(i).Independent data marts are sourced from data captured from one or more operational systems or
external information providers, or from data generated locally within a particular department or
geographic area.
(ii).Dependent data marts are sourced directly from enterprise data warehouses.

 Virtual warehouse: A virtual warehouse is a set of views over operational databases. For


efficient query processing, only some of the possible summary views may be
materialized. A virtual warehouse is easy to build but requires excess capacity on
operational database servers.
Figure: A recommended approach for data warehouse development.

DATA WAREHOUSE DESIGN PROCESS:


A data warehouse can be built using a top-down approach, a bottom-up approach, or a

combination of both.

 The top-down approach starts with the overall design and planning. It is useful in cases where the
technology is mature and well known, and where the business problems that must be solved are
clear and well understood.

 The bottom-up approach starts with experiments and prototypes. This is useful in the early stage of
business modeling and technology development. It allows an organization to move forward at
considerably less expense and to evaluate the benefits of the technology before making significant
commitments.

 In the combined approach, an organization can exploit the planned and strategic nature of the top-
down approach while retaining the rapid implementation and opportunistic application of the
bottom-up approach.

The warehouse design process consists of the following steps:


 Choose a business process to model, for example, orders, invoices, shipments, inventory, account
administration, sales, or the general ledger. If the business process is organizational and involves
multiple complex object collections, a data warehouse model should be followed. However, if the
process is departmental and focuses on the analysis of one kind of business process, a data mart
model should be chosen.
 Choose the grain of the business process. The grain is the fundamental, atomic level of data to be
represented in the fact table for this process, for example, individual transactions, individual daily
snapshots, and so on.
 Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item,
customer, supplier, warehouse, transaction type, and status.


 Choose the measures that will populate each fact table record. Typical measures are numeric
additive quantities like dollars sold and units sold.

CONCEPT HIERARCHY
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level,
more general concepts. Concept hierarchies allow data to be handled at varying levels of
abstraction.

OLAP operations on multidimensional data.


1. Roll-up: The roll-up operation performs aggregation on a data cube, either by climbing-up a
concept hierarchy for a dimension or by dimension reduction. Figure shows the result of a roll-up
operation performed on the central cube by climbing up the concept hierarchy for location. This
hierarchy was defined as the total order street < city < province_or_state < country.

2. Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more
detailed data. Drill-down can be realized by either stepping-down a concept hierarchy for a
dimension or introducing additional dimensions. Figure shows the result of a drill-down operation
performed on the central cube by stepping down a concept hierarchy for time defined as day <
month < quarter < year. Drill-down occurs by descending the time hierarchy from the level of quarter
to the more detailed level of month.

3. Slice and dice: The slice operation performs a selection on one dimension of the given cube,
resulting in a sub cube. Figure shows a slice operation where the sales data are selected from the
central cube for the dimension time using the criterion time = "Q2". The dice operation defines a sub
cube by performing a selection on two or more dimensions.

4. Pivot (rotate): Pivot is a visualization operation which rotates the data axes in view in order to
provide an alternative presentation of the data. Figure shows a pivot operation where the item and
location axes in a 2-D slice are rotated.
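
The pandas sketch below imitates these four operations on a small, made-up sales table (all column names and values are illustrative only): roll-up and drill-down as group-bys at coarser and finer levels of the location hierarchy, slice and dice as row selections, and pivot as a reshaping of a 2-D view.

import pandas as pd

sales = pd.DataFrame({
    "country": ["Canada", "Canada", "USA", "USA"],
    "city":    ["Toronto", "Vancouver", "Chicago", "New York"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "item":    ["phone", "phone", "security", "phone"],
    "dollars": [605, 825, 400, 968],
})

# Roll-up: climb the location hierarchy from city to country
rollup = sales.groupby(["country", "quarter"])["dollars"].sum()

# Drill-down: go back to the more detailed city level
drilldown = sales.groupby(["country", "city", "quarter"])["dollars"].sum()

# Slice: select on one dimension (time = "Q2")
slice_q2 = sales[sales["quarter"] == "Q2"]

# Dice: select on two or more dimensions
dice = sales[(sales["quarter"] == "Q1") & (sales["country"] == "Canada")]

# Pivot (rotate): present an alternative 2-D view of item vs. city
pivot = sales.pivot_table(index="item", columns="city",
                          values="dollars", aggfunc="sum")
print(rollup, pivot, sep="\n\n")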

Data warehouse Back-End Tools and Utilities


The ETL (Extract Transformation Load) process

This section discusses the major processes that feed a data warehouse: extract (data from the
operational systems and bring it to the data warehouse), transform (the data into the internal format and
structure of the data warehouse), cleanse (to make sure it is of sufficient quality to be used for decision
making), and load (the cleansed data is put into the data warehouse).

EXTRACT
Some of the data elements in the operational database can reasonably be expected to be useful in
decision making, but others are of less value for that purpose. For this reason, it is necessary to extract the
relevant data from the operational database before bringing it into the data warehouse. Many commercial tools
are available to help with the extraction process.


TRANSFORM
The operational databases developed can be based on any set of priorities, which keep changing with the
requirements. Therefore, those who develop a data warehouse based on these databases are typically faced
with inconsistencies among their data sources. The transformation process deals with rectifying any such
inconsistencies.

LOADING
Loading often implies physical movement of the data from the computer(s) storing the source database(s) to
that which will store the data warehouse database, assuming it is different. This takes place immediately
after the extraction phase. The most common channel for data movement is a high-speed communication
link. Ex: Oracle Warehouse Builder is a tool from Oracle that provides the features to perform the ETL
task on an Oracle data warehouse.
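
A minimal end-to-end ETL sketch in Python, assuming a hypothetical source file operational_sales.csv with columns order_id, order_date, amount, and region, and SQLite as the warehouse store; the file, columns, and cleaning rules are invented for illustration and are not tied to any specific product such as Oracle Warehouse Builder.

import pandas as pd
import sqlite3

# Extract: pull only the relevant columns from the operational source
source = pd.read_csv("operational_sales.csv",
                     usecols=["order_id", "order_date", "amount", "region"])

# Transform/cleanse: unify formats and drop records of insufficient quality
source["order_date"] = pd.to_datetime(source["order_date"], errors="coerce")
source["region"] = source["region"].str.strip().str.upper()
clean = source.dropna(subset=["order_date", "amount"])

# Load: write the cleansed data into the warehouse table
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("sales_fact", conn, if_exists="append", index=False)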

Data Warehouse Usage:


Three kinds of data warehouse applications
1. Information processing
 supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and
graphs
2. Analytical processing
 multidimensional analysis of data warehouse data
 supports basic OLAP operations, slice-dice, drilling, pivoting
3. Data mining
 knowledge discovery from hidden patterns
 supports associations, constructing analytical models, performing classification and prediction,
and presenting the mining results using visualization tools.
 Differences among the three tasks
Data Warehouse Implementation
Efficient Computation of Data Cubes
Data cube can be viewed as a lattice of cuboids
 The bottom-most cuboid is the base cuboid
 The top-most cuboid (apex) contains only one cell
 How many cuboids are there in an n-dimensional cube with L levels? (See the formula below.)
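
For an n-dimensional cube in which dimension i has L_i levels in its concept hierarchy (not counting the virtual top level all), the total number of cuboids in the lattice is

T = \prod_{i=1}^{n} (L_i + 1)

For example, a cube with three dimensions, each having three hierarchy levels, has (3 + 1)^3 = 64 cuboids, which is why full materialization quickly becomes impractical.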

Materialization of data cube


 Materialize every cuboid (full materialization), none (no materialization), or some (partial
materialization)
 Selection of which cuboids to materialize
 Based on size, sharing, access frequency, etc.

Cube Operation
 Cube definition and computation in DMQL:
define cube sales[item, city, year]: sum(sales_in_dollars)
compute cube sales
 Transform it into an SQL-like language (with a new operator CUBE BY, introduced by Gray et
al. '96):
SELECT item, city, year, SUM(amount)
FROM SALES
CUBE BY item, city, year


 Computing this cube requires the following group-bys (one per subset of the dimensions, as
enumerated in the sketch below): (item, city, year), (item, city), (item, year), (city, year),
(item), (city), (year), and the apex group-by ().
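
The short Python sketch below emulates CUBE BY on a toy pandas table (the columns item, city, year, amount and the rows are made up) by running one GROUP BY per subset of the dimensions, which is exactly the set of group-bys listed above.

from itertools import combinations
import pandas as pd

sales = pd.DataFrame({
    "item": ["TV", "TV", "phone"],
    "city": ["Chennai", "Delhi", "Chennai"],
    "year": [2022, 2022, 2023],
    "amount": [500, 700, 300],
})

dims = ["item", "city", "year"]
cube = {}
for k in range(len(dims), -1, -1):
    for group in combinations(dims, k):
        if group:                          # e.g. (item, city), (year), ...
            cube[group] = sales.groupby(list(group))["amount"].sum()
        else:                              # the apex cuboid ()
            cube[()] = sales["amount"].sum()

for group, result in cube.items():
    print(group, result, sep="\n")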
DATA CUBE COMPUTATION METHOD
Cube Computation: ROLAP-Based Method

 Efficient cube computation methods


o ROLAP-based cubing algorithms (Agarwal et al. '96)
o Array-based cubing algorithm (Zhao et al. '97)
o Bottom-up computation method (Beyer & Ramakrishnan '99)
 ROLAP-based cubing algorithms
o Sorting, hashing, and grouping operations are applied to the dimension attributes
in order to reorder and cluster related tuples
 Grouping is performed on some sub aggregates as a "partial grouping step"
 Aggregates may be computed from previously computed aggregates, rather than from the
base fact table

Multi-way Array Aggregation for Cube Computation

 Partition arrays into chunks (a small sub cube which fits in memory).
 Compressed sparse array addressing: (chunk_id, offset)
 Compute aggregates in "multiway" by visiting cube cells in the order which minimizes the
number of times each cell must be visited, thereby reducing memory access and storage cost.
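
The sketch below illustrates the chunking idea on a dense 3-D NumPy cube with made-up sizes: each chunk is brought into memory once, and in that single visit it contributes to all three 2-D aggregates. It omits the chunk-ordering and memory-minimization analysis of the full MultiWay algorithm.

import numpy as np

A, B, C = 8, 8, 8                 # hypothetical dimension sizes
chunk = 4                         # chunk edge length (each chunk fits in memory)
cube = np.random.rand(A, B, C)    # base cuboid ABC

agg_AB = np.zeros((A, B))         # aggregate over C
agg_AC = np.zeros((A, C))         # aggregate over B
agg_BC = np.zeros((B, C))         # aggregate over A

for a0 in range(0, A, chunk):
    for b0 in range(0, B, chunk):
        for c0 in range(0, C, chunk):
            blk = cube[a0:a0+chunk, b0:b0+chunk, c0:c0+chunk]
            # one visit of this chunk updates every 2-D cuboid
            agg_AB[a0:a0+chunk, b0:b0+chunk] += blk.sum(axis=2)
            agg_AC[a0:a0+chunk, c0:c0+chunk] += blk.sum(axis=1)
            agg_BC[b0:b0+chunk, c0:c0+chunk] += blk.sum(axis=0)

# 1-D cuboids and the apex can then be derived from the smaller 2-D aggregates
agg_A = agg_AB.sum(axis=1)
apex = agg_A.sum()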


SRI MANAKULA VINAYAGAR ENGINEERING COLLEGE


DEPARTMENT OF IT
U20ITT511 – DATA WAREHOUSING AND DATA MINING
UNIT – II: DATA MINING

LECTURE NOTES
Data Mining: Introduction-Kinds of Data and Patterns-Major Issues in Data Mining-Data Objects
and Attribute Types-Statistical Description of Data-Measuring Data Similarity and Dissimilarity. Data
Preprocessing: Overview-Data Cleaning-Data Integration-Data Reduction-Data Transformation
and Discretization.
DATA MINING

Data mining refers to extracting or mining knowledge from large amounts of data. The term is actually
a misnomer; a more appropriate name would be knowledge mining from data, which emphasizes
mining knowledge from large amounts of data.
It is the computational process of discovering patterns in large data sets involving methods at the
intersection of artificial intelligence, machine learning, statistics, and database systems.
The overall goal of the data mining process is to extract information from a data set and transform it
into an understandable structure for further use.
The key properties of data mining are
Automatic discovery of patterns
Prediction of likely outcomes
Creation of actionable information
Focus on large datasets and databases

The Scope of Data Mining

Data mining derives its name from the similarities between searching for valuable business information
in a large database — for example, finding linked products in gigabytes of store scanner data — and
mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense
amount of material, or intelligently probing it to find exactly where the value resides. Given databases
of sufficient size and quality, data mining technology can generate new business opportunities by
providing these capabilities:

Automated prediction of trends and behaviors.



Data mining automates the process of finding predictive information in large databases. Questions that
traditionally required extensive hands- on analysis can now be answered directly from the data —
quickly. A typical example of a predictive problem is targeted marketing. Data mining uses data on
past promotional mailings to identify the targets most likely to maximize return on investment in future
mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and
identifying segments of a population likely to respond similarly to given events.
Automated discovery of previously unknown patterns.
Data mining tools sweep through databases and identify previously hidden patterns in one step. An
example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products
that are often purchased together. Other pattern discovery problems include detecting fraudulent credit
card transactions and identifying anomalous data that could represent data entry keying errors.
Tasks of Data Mining

Data mining involves six common classes of tasks:


Anomaly detection (Outlier/change/deviation detection) – The identification of unusual data
records, that might be interesting or data errors that require further investigation.
Association rule learning (Dependency modelling) – Searches for relationships between variables.
For example a supermarket might gather data on customer purchasing habits. Using association rule
learning, the supermarket can determine which products are frequently bought together and use this
information for marketing purposes. This is sometimes referred to as market basket analysis.
Clustering – is the task of discovering groups and structures in the data that are in some way or another
"similar", without using known structures in the data.
Classification – is the task of generalizing known structure to apply to new data. For example, an e-mail
program might attempt to classify an e-mail as "legitimate" or as "spam".
Regression – attempts to find a function which models the data with the least error.
Summarization – providing a more compact representation of the data set, including visualization and
report generation.

ARCHITECTURE OF DATA MINING

A typical data mining system may have the following major components.


KNOWLEDGE DISCOVERY PROCESS


 Data cleaning: to remove noise or irrelevant data
 Data integration: where multiple data sources may be combined
 Data selection: where data relevant to the analysis task are retrieved from the database
 Data transformation: where data are transformed or consolidated into forms appropriate for mining
by performing summary or aggregation operations
 Data mining: an essential process where intelligent methods are applied in order to extract data
patterns
 Pattern evaluation: to identify the truly interesting patterns representing knowledge, based on some
interestingness measures
 Presentation: where visualization and knowledge representation techniques are used to present the
mined knowledge to the user.

MAJOR ISSUES IN DATA MINING:


 Mining different kinds of knowledge in databases. - The needs of different users are not the same, and
different users may be interested in different kinds of knowledge. Therefore, data mining needs to cover a
broad range of knowledge discovery tasks.


 Interactive mining of knowledge at multiple levels of abstraction. - The data mining process needs to be
interactive because it allows users to focus the search for patterns, providing and refining data mining
requests based on returned results.

 Incorporation of background knowledge. - To guide the discovery process and to express the discovered
patterns, background knowledge can be used. Background knowledge may be used to express the
discovered patterns not only in concise terms but at multiple levels of abstraction.

 Data mining query languages and ad hoc data mining. - Data Mining Query language that allows the
user to describe ad hoc mining tasks, should be integrated with a data warehouse query language and
optimized for efficient and flexible data mining.

 Presentation and visualization of data mining results. - Once the patterns are discovered, they need to be
expressed in high-level languages and visual representations. These representations should be easily
understandable by the users.

 Handling noisy or incomplete data. - Data cleaning methods that can handle noise and incomplete
objects are required while mining the data regularities. If such data cleaning methods are not available,
the accuracy of the discovered patterns will be poor.

 Pattern evaluation. - It refers to the interestingness of the problem. The patterns discovered should be
interesting; a pattern is of little value if it merely represents common knowledge or lacks novelty.

 Efficiency and scalability of data mining algorithms. - In order to effectively extract the information
from huge amounts of data in databases, data mining algorithms must be efficient and scalable.

DATA OBJECTS AND ATTRIBUTE TYPES


Data objects are the essential part of a database. A data object represents an entity and is described by a
group of attributes. For example, in a sales database the objects may be customers, sales, or purchases.
When data objects are stored in a database, they are called data tuples.
ATTRIBUTE
The attribute is the property of the object. The attribute represents different features of the object.
Example of attribute:
In this example, Roll_No, Name, and Result are attributes of the object named as a student.
Roll_No Name Result
1 Ali Pass
2 Akbar Fail
3 Antony Pass
Types Of attributes
 Binary
 Nominal
 Numeric
 Interval-scaled
 Ratio-scaled
Nominal data with example:
 Nominal data are categorical (name-like) values rather than integers.
Example:
Attribute           Value
Categorical data    Lecturer, Assistant Professor, Professor
States              New, Pending, Working, Complete, Finish
Colors              Black, Brown, White, Red
Binary data with example:
 Binary data have only two values/states.


Example:
Attribute Value
HIV detected Yes, No
Result Pass, Fail
The binary attribute is of two types:
 Symmetric binary
 Asymmetric binary
Symmetric data with example:

 Both values are equally important


Example:
Attribute Value
Gender Male, Female
Asymmetric data with example:

 Both values are not equally important


Example:
Attribute Value
HIV detected Yes, No
Result Pass, Fail
Ordinal data:
 All Values have a meaningful order.

Example:
Attribute Value
Grade A, B, C, D, F
BPS- Basic pay scale 16, 17, 18
Discrete data with example:
 Discrete data have a finite number of values. They can be in numerical form and can also be in
categorical form.
Example:
Attribute       Value
Profession      Teacher, Businessman, Peon, etc.
Postal Code     42200, 42300, etc.

Continuous data with example:


 Continuous data technically have an infinite number of steps.
 Continuous data are typically of float type; there can be infinitely many values between 1 and 2.
Example:
Attribute   Value
Height      5.4..., 6.5..., etc.
Weight      50.09..., etc.


STATISTICAL DESCRIPTION OF DATA
Basic statistical descriptions can be used to identify properties of the data and highlight which data values
should be treated as noise or outliers.
 Measures of central tendency
 Dispersion of the data
 Graphical data presentation
Measures of central tendency
 Mean: Let x1, x2, ..., xN be a set of N values or observations of a numeric attribute X. The mean of
this set of values is x̄ = (x1 + x2 + ... + xN) / N.
 Median: For an odd number of values, the median is the middle value of the ordered set; for an even
number of values, it is taken as the average of the two middlemost values.

 Mode: The mode for a set of data is the value that occurs most frequently in the set. Data sets with
one, two, or three modes are respectively called unimodal, bimodal, and trimodal.
 Midrange: The average of the largest and smallest values in the set.
Dispersion of the data
 Range: The range of the set is the difference between the largest (max()) and smallest (min())
values.
 Quartiles: Quantiles are points taken at regular intervals of a data distribution, dividing it into
essentially equal-size consecutive sets; the quartiles Q1, Q2, and Q3 are the three points that split the
distribution into four equal parts.
 Interquartile Range: The distance between the first and third quartiles is a simple measure of spread
that gives the range covered by the middle half of the data. This distance is called the interquartile
range (IQR)
 Five-Number Summary: The five-number summary of a distribution consists of the median (Q2),
the quartiles Q1 and Q3, and the smallest and largest individual observations, written in the order of
Minimum, Q1, Median, Q3, Maximum.
 Boxplots: Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the five-
number summary as follows:
 Typically, the ends of the box are at the quartiles so that the box length is the interquartile
range.
 The median is marked by a line within the box.
 Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest
(Maximum) observations.
 Variance: The variance of N observations, x1, x2, ..., xN, for a numeric attribute X is
σ² = (1/N) Σ (xi − x̄)², where x̄ is the mean of the observations.


 Standard Deviation: The standard deviation, σ, of the observations is the square root of the
variance, σ2.
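
The measures listed above can be computed directly with NumPy and the standard library; the small data array below is made up for illustration.

import numpy as np
from statistics import mode

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 110])

mean = x.mean()
median = np.median(x)                        # middle value of the ordered set
modal_value = mode(x.tolist())               # most frequent value
midrange = (x.min() + x.max()) / 2
value_range = x.max() - x.min()
q1, q2, q3 = np.percentile(x, [25, 50, 75])  # quartiles
iqr = q3 - q1                                # interquartile range
five_num = (x.min(), q1, median, q3, x.max())
variance = x.var()                           # population variance (divide by N)
std_dev = x.std()                            # square root of the variance

print(mean, median, modal_value, midrange, value_range,
      iqr, five_num, variance, std_dev)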
Graphical data presentation
 Quantile plots: A quantile plot is a simple and effective way to have a first look at a univariate
data distribution

 Quantile–Quantile plots: A quantile–quantile plot, or q-q plot, graphs the quantiles of one
univariate distribution against the corresponding quantiles of another.
 Histograms: A bar chart in which the height of each bar indicates the frequency of the value or bucket.

 Scatter plots: A scatter plot is one of the most effective graphical methods for determining if there
appears to be a relationship, pattern, or trend between two numeric attributes.

Data Mining Process:

Data Mining is a process of discovering various models, summaries, and derived values from a
given collection of data.
The general experimental procedure adapted to data-mining problems involves the following


steps:
1. State the problem and formulate the hypothesis
Most data-based modeling studies are performed in a particular application domain. Hence, domain-
specific knowledge and experience are usually necessary in order to come up with a meaningful problem
statement. Unfortunately, many application studies tend to focus on the data-mining technique at the
expense of a clear problem statement. In this step, a modeler usually specifies a set of variables for the
unknown dependency and, if possible, a general form of this dependency as an initial hypothesis. There
may be several hypotheses formulated for a single problem at this stage. The first step requires the
combined expertise of an application domain and a data-mining model. In practice, it usually means a
close interaction between the data-mining expert and the application expert. In successful data-mining
applications, this cooperation does not stop in the initial phase; it continues during the entire data-mining
process.
2. Collect the data

This step is concerned with how the data are generated and collected. In general, there are two distinct
possibilities. The first is when the data-generation process is under the control of an expert (modeler):
this approach is known as a designed experiment. The second possibility is when the expert cannot
influence the data- generation process: this is known as the observational approach. An observational
setting, namely, random data generation, is assumed in most data-mining applications. Typically, the
sampling distribution is completely unknown after data are collected, or it is partially and implicitly given
in the data-collection procedure. It is very important, however, to understand how data collection affects
its theoretical distribution, since such a priori knowledge can be very useful for modeling and, later, for
the final interpretation of results. Also, it is important to make sure that the data used for estimating a
model and the data used later for testing and applying a model come from the same, unknown, sampling
distribution. If this is not the case, the estimated model cannot be successfully used in a final application
of the results.
3. Preprocessing the data
In the observational setting, data are usually "collected" from the existing databases, data warehouses,
and data marts. Data preprocessing usually includes at least two common tasks:
i. Outlier detection (and removal) – Outliers are unusual data values that are not consistent with most
observations. Commonly, outliers result from measurement errors, coding and recording errors, and,


sometimes, are natural, abnormal values. Such non representative samples can seriously affect the model
produced later. There are two strategies for dealing with outliers:
a. Detect and eventually remove outliers as a part of the preprocessing phase, or
b. Develop robust modeling methods that are insensitive to outliers.

ii. Scaling, encoding, and selecting features – Data preprocessing includes several steps such as variable
scaling and different types of encoding. For example, one feature with the range [0, 1] and the other with
the range [−100, 1000] will not have the same weights in the applied technique; they will also influence
the final data-mining results differently. Therefore, it is recommended to scale them and bring both
features to the same weight for further analysis. Also, application-specific encoding methods usually
achieve dimensionality reduction by providing a smaller number of informative features for subsequent
data modeling.
4. Estimate the model
The selection and implementation of the appropriate data-mining technique is the main task in this phase.
This process is not straightforward; usually, in practice, the implementation is based on several models,
and selecting the best one is an additional task

5. Interpret the model and draw conclusions


In most cases, data-mining models should help in decision making. Hence, such models need to be
interpretable in order to be useful because humans are not likely to base their decisions on complex
"black-box" models. Note that the goals of accuracy of the model and accuracy of its interpretation are
somewhat contradictory. Usually, simple models are more interpretable, but they are also less accurate.
Modern data-mining methods are expected to yield highly accurate results using high dimensional
models.

MEASURING DATA SIMILARITY AND DISSIMILARITY


 Dissimilarity matrix (or object-by-object structure): This structure stores a collection of
proximities that are available for all pairs of n objects. It is often represented by an n-by-n table:

    0
    d(2,1)   0
    d(3,1)   d(3,2)   0
    ...      ...      ...
    d(n,1)   d(n,2)   ...   ...   0

where d(i, j) is the measured dissimilarity or "difference" between objects i and j.


 Similarity between nominal data: sim(i, j) = 1 − d(i, j)
 Dissimilarity between nominal data: d(i, j) = (p − m) / p

where m is the number of matches (i.e., the number of attributes for which i and j are in the same
state), and p is the total number of attributes describing the objects.
 Symmetric binary dissimilarity: d(i, j) = (r + s) / (q + r + s + t)
 Asymmetric binary dissimilarity: d(i, j) = (r + s) / (q + r + s)
 Asymmetric binary similarity (the Jaccard coefficient): sim(i, j) = q / (q + r + s) = 1 − d(i, j)
(Here q is the number of attributes that equal 1 for both objects i and j, r the number that equal 1 for i
but 0 for j, s the number that equal 0 for i but 1 for j, and t the number that equal 0 for both.)
 Dissimilarity of Numeric Data:

o Euclidean distance: d(i, j) = sqrt( (xi1 − xj1)² + (xi2 − xj2)² + ... + (xip − xjp)² )
o Manhattan distance: d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|
o Minkowski distance: d(i, j) = ( |xi1 − xj1|^h + |xi2 − xj2|^h + ... + |xip − xjp|^h )^(1/h)

 Proximity Measures for Ordinal Attributes


 Dissimilarity for Attributes of Mixed Types

 Cosine similarity for vectors: sim(x, y) = (x · y) / (||x|| ||y||)

 Cosine similarity for binary values (Tanimoto coefficient): sim(x, y) = (x · y) / (x · x + y · y − x · y)
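
A short NumPy sketch of the numeric distance measures and cosine similarity above, plus the nominal dissimilarity d(i, j) = (p − m)/p; the two example vectors and the nominal tuples are arbitrary.

import numpy as np

x = np.array([22.0, 1.0, 42.0, 10.0])
y = np.array([20.0, 0.0, 36.0, 8.0])

euclidean = np.sqrt(((x - y) ** 2).sum())
manhattan = np.abs(x - y).sum()

def minkowski(a, b, h):
    # reduces to Manhattan for h = 1 and Euclidean for h = 2
    return (np.abs(a - b) ** h).sum() ** (1.0 / h)

cosine = x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Dissimilarity for nominal attributes: d(i, j) = (p - m) / p
i = ["red", "M", "India"]
j = ["red", "L", "India"]
m = sum(a == b for a, b in zip(i, j))     # number of matching attributes
d_nominal = (len(i) - m) / len(i)

print(euclidean, manhattan, minkowski(x, y, 3), cosine, d_nominal)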

Descriptive Data Summarization


Categorize the measures
 A measure is distributive, if we can partition the dataset into smaller subsets, compute the measure
on the individual subsets, and then combine the partial results in order to arrive at the measure's
value on the entire (original) dataset

 A measure is algebraic if it can be computed by applying an algebraic function to one or more


distributive measures
 A measure is holistic if it must be computed on the entire dataset as a whole
DATA PREPROCESSING:
Data Integration:
It combines data from multiple sources into a coherent data store, as in data warehousing. These
sources may include multiple databases, data cubes, or flat files.

The data integration systems are formally defined as a triple <G, S, M>,

Where G: The global schema

S:Heterogeneous source of schemas

M: Mapping between the queries of source and global schema

Data Transformation:

In data transformation, the data are transformed or consolidated into forms appropriate for
mining.

Data transformation can involve the following:

Smoothing, which works to remove noise from the data. Such techniques include

binning, regression, and clustering.


Aggregation, where summary or aggregation operations are applied to the data.
For example, the daily sales data may be aggregated so as to compute monthly
and annual total amounts. This step is typically used in constructing a data cube
for analysis of the data at multiple granularities.
Generalization of the data, where low-level or "primitive" (raw) data are
replaced by higher-level concepts through the use of concept hierarchies. For
example, categorical attributes, like street, can be generalized to higher-level
concepts, like city or country.
Normalization, where the attribute data are scaled so as to fall within a small
specified range, such as −1.0 to 1.0, or 0.0 to 1.0.
Attribute construction (or feature construction),where new attributes are
constructed and added from the given set of attributes to help the mining
process.

Data Reduction:
Data reduction techniques can be applied to obtain a reduced representation of the data set that is
much smaller in volume, yet closely maintains the integrity of the original data. That is, mining
on the reduced data set should be more efficient yet produce the same (or almost the same)
analytical results.
Strategies for data reduction include the following:
Data cube aggregation, where aggregation operations are applied to the data in the
construction of a data cube.
Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or
dimensions may be detected and removed.

Dimensionality reduction, where encoding mechanisms are used to reduce the dataset size.

Numerosity reduction, where the data are replaced or estimated by alternative, smaller


data representations such as parametric models (which need store only the model
parameters instead of the actual data) or nonparametric methods such as clustering,
sampling, and the use of histograms.
Discretization and concept hierarchy generation, where raw data values for attributes
are replaced by ranges or higher conceptual levels. Data discretization is a form of
numerosity reduction that is very useful for the automatic generation of concept

hierarchies. Discretization and concept hierarchy generation are powerful tools for


data mining, in that they allow the mining of data at multiple levels of abstraction.
Data Transformation and Data Discretization
 Data are transformed or consolidated so that the resulting mining process may be more
efficient, and the patterns found may be easier to understand.
 Data discretization, a form of data transformation.
Data Transformation Strategies Overview
 Smoothing
 Attribute Construction
 Aggregation
 Normalization
 Discretization
 Concept hierarchy generation for nominal data
Data Transformation by Normalization
 Normalizing the data attempts to give all attributes an equal weight.
 Min-max normalization: v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
 Z-score normalization: v' = (v − Ā) / σA, where Ā and σA are the mean and standard deviation of A
 Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1

 Discretization by Binning: smoothing by bin means or smoothing by bin medians


 Discretization by Histogram Analysis: equal width, equal frequency, minimum interval size
o Discretization by Cluster, Decision Tree, and Correlation Analyses
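
A minimal sketch of min-max normalization, z-score normalization, decimal scaling, and equal-width binning on a made-up numeric attribute:

import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to the new range [0.0, 1.0]
new_min, new_max = 0.0, 1.0
min_max = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization
z_score = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j so that the largest absolute value is < 1
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal_scaled = v / (10 ** j)

# Equal-width binning into 3 intervals (a simple discretization)
edges = np.linspace(v.min(), v.max(), num=4)
bins = np.digitize(v, edges[1:-1])        # bin index 0, 1, or 2 for each value

print(min_max, z_score, decimal_scaled, bins, sep="\n")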


Concept Hierarchy Generation for Nominal Data
o Specification of a partial ordering of attributes explicitly at the schema level by users or experts
o Specification of a portion of a hierarchy by explicit data grouping
o Specification of a set of attributes, but not of their partial ordering
o Specification of only a partial set of attributes


SRI MANAKULA VINAYAGAR ENGINEERING COLLEGE


DEPARTMENT OF IT
U20ITT511 – DATA WAREHOUSING AND DATA MINING
UNIT – III: ASSOCIATION RULE MINING

LECTURE NOTES
Association Rule Mining: Basic Concepts-Frequent Itemset Mining Methods: Apriori Algorithm-
A Pattern Growth Approach for Mining Frequent Itemsets-Pattern Evaluation Methods-Mining
Multilevel, Multidimensional Space-Constraint-Based Frequent Pattern Mining

ASSOCIATION RULE MINING


 Frequent patterns are patterns (such as itemsets, subsequences, or substructures) that appear in a
data set frequently. For example, a set of items, such as milk and bread, that appear frequently
together in a transaction data set is a frequent itemset.
 A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it

occurs frequently in a shopping history database, is a (frequent) sequential pattern.


 A substructure can refer to different structural forms, such as subgraphs, subtrees, or sublattices,
which may be combined with itemsets or subsequences.
 If a substructure occurs frequently, it is called a (frequent) structured pattern.
EFFICIENT AND SCALABLE FREQUENT ITEMSET MINING METHODS
Mining Frequent Patterns
The Apriori algorithm mines the complete set of frequent itemsets with candidate
generation, based on the Apriori property.

Apriori property
 All nonempty subsets of a frequent itemset must also be frequent.
 If an itemset I does not satisfy the minimum support threshold, min_sup, then I is not
frequent, i.e., support(I) < min_sup.
 If an item A is added to the itemset I, then the resulting itemset (I ∪ A) cannot
occur more frequently than I.
 Monotonic functions are functions that move in only one direction.
 This property is called anti-monotonic.
 If a set cannot pass a test, all of its supersets will fail the same test as well.
 This property is monotonic in failing the test.

The Apriori Algorithm

 Join Step: Ck is generated by joining Lk-1 with itself


 Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent
k-itemset
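
A compact (and deliberately unoptimized) Python sketch of Apriori showing the join and prune steps; the toy transactions and the min_sup value are invented for illustration.

from itertools import combinations

def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def support(itemset):
        # number of transactions containing the itemset
        return sum(1 for t in transactions if itemset <= t)

    # L1: frequent 1-itemsets
    Lk = {frozenset([i]) for i in items if support(frozenset([i])) >= min_sup}
    frequent = {}
    k = 1
    while Lk:
        frequent.update({s: support(s) for s in Lk})
        # Join step: candidates C(k+1) from Lk joined with itself
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune step: drop any candidate with an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        Lk = {c for c in candidates if support(c) >= min_sup}
        k += 1
    return frequent

# usage on a toy dataset
txns = [{"milk", "bread"}, {"milk", "bread", "butter"},
        {"bread", "butter"}, {"milk", "butter"}]
print(apriori(txns, min_sup=2))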


Mining Frequent Patterns Using FP-tree


 General idea (divide-and-conquer)
 Recursively grow frequent pattern path using the FP-tree Method
 For each item, construct its conditional pattern-base, and then its conditional
FP-tree
 Repeat the process on each newly created conditional FP-tree
 Until the resulting FP-tree is empty, or it contains only one path (a single path
will generate all the combinations of its sub-paths, each of which is a frequent
pattern)

Principles of Frequent Pattern Growth

• Pattern growth property

– Let α be a frequent itemset in DB, B be α's conditional pattern base,
and β be an itemset in B. Then α ∪ β is a frequent itemset in DB if and only if β
is frequent in B.
• "abcdef" is a frequent pattern, if and only if
– "abcde" is a frequent pattern, and
– "f" is frequent in the set of transactions containing "abcde"

PATTERN MINING IN MULTILEVEL, MULTIDIMENSIONAL SPACE; CONSTRAINT-BASED
FREQUENT PATTERN MINING

Mining Multilevel Association Rules


For many applications, it is difficult to find strong associations among data items at low or
primitive levels of abstraction due to the sparsity of data at those levels. Strong associations
discovered at high levels of abstraction may represent commonsense knowledge. Moreover, what
may represent common sense to one user may seem novel to another. Therefore, data mining
systems should provide capabilities for mining association rules at multiple levels of abstraction,
with sufficient flexibility for easy traversal among different abstraction spaces.

Mining multilevel association rules. Suppose we are given the task-relevant set of transactional data in
the table for sales in an AllElectronics store, showing the items purchased for each transaction. The
concept hierarchy for the items is shown in Figure 4. A concept hierarchy defines a sequence of
mappings from a set of low-level concepts to higher level, more general concepts. Data can be
generalized by replacing low-level concepts within the data by their higher-level concepts, or
ancestors, from a concept hierarchy.


Association rules generated from mining data at multiple levels of abstraction are called multiple-level or
multilevel association rules. Multilevel association rules can be mined efficiently using concept hierarchies
under a support-confidence framework. In general, a top-down strategy is employed, where counts are
accumulated for the calculation of frequent itemsets at each concept level, starting at the concept level 1
and working downward in the hierarchy toward the more specific concept levels, until no more frequent
itemsets can be found. For each level, any algorithm for discovering frequent itemsets may be used, such as
Apriori or its variations.

 Using uniform minimum support for all levels (referred to as uniform

support): The same minimum support threshold is used when mining at each level of abstraction. For
example, in Figure 5, a minimum support threshold of 5% is used throughout (e.g., for mining from
“computer” down to “laptop computer”). Both “computer” and “laptop computer” are found to be
frequent, while “desktop computer” is not. When a uniform minimum support threshold is used, the search
procedure is simplified. The method is also simple in that users are required to specify only one minimum
support threshold. An Apriori-like optimization technique can be adopted, based on the knowledge that an


ancestor is a superset of its descendants: The search avoids examining itemsets containing any item whose
ancestors do not have minimum support.

uniform minimum support

 Using reduced minimum support at lower levels (referred to as reduced


support): Each level of abstraction has its own minimum support threshold. The deeper the level of
abstraction, the smaller the corresponding threshold is. For example, in Figure, the minimum support
thresholds for levels 1 and 2 are 5% and 3%, respectively. In this way, “computer,” “laptop
computer,” and “desktop computer” are all considered frequent.

 Using item or group-based minimum support (referred to as group-based support): Because


users or experts often have insight as to which groups are more important than others, it is sometimes
more desirable to set up user-specific, item, or group based minimal support thresholds when mining
multilevel rules. For example, a user could set up the minimum support thresholds based on product
price, or on items of interest, such as by setting particularly low support thresholds for laptop computers
and flash drives in order to pay particular attention to the association patterns containing items in these
categories.
 Mining Multidimensional Association Rules from Relational Databases and
Data Warehouses
Association rules that imply a single predicate, that is, the predicate buys, are single-dimensional. For
instance, in mining our AllElectronics database, we may discover a Boolean association rule of the form

buys(X, item1) ⇒ buys(X, item2)

Following the terminology used in multidimensional databases, each distinct predicate in a rule is
referred to as a dimension. Hence, the rule above is a single-dimensional or intradimensional
association rule because it contains a single distinct predicate (e.g., buys) with multiple occurrences
(i.e., the predicate occurs more than once within the rule). Such rules are commonly mined from
transactional data.

Considering each database attribute or warehouse dimension as a predicate, we can therefore mine
association rules containing multiple predicates, such as

age(X, "20..29") ∧ occupation(X, "student") ⇒ buys(X, "laptop")
 Association rules that involve two or more dimensions or predicates can be referred to as
multidimensional association rules. Rule above contains three predicates (age, occupation,
and buys), each of which occurs only once in the rule.
 Hence, it has no repeated predicates. Multidimensional association rules with no repeated
predicates are called inter dimensional association rules.
 Mine multidimensional association rules with repeated predicates, which contain multiple
occurrences of some predicates. These rules are called hybrid-dimensional association
rules. An example of such a rule is the following, where the predicate buys is repeated:
age(X, "20..29") ∧ buys(X, "laptop") ⇒ buys(X, "printer")
 Database attributes can be categorical or quantitative.

 Categorical attributes have a finite number of possible values, with no ordering among
the values (e.g., occupation, brand, color).
 Categorical attributes are also called nominal attributes, because their values are "names
of things."
 Quantitative attributes are numeric and have an implicit ordering among values (e.g.,
age, income, price).
 Techniques for mining multidimensional association rules can be categorized into two
basic approaches regarding the treatment of quantitative attributes.
Constraint Based Frequent Pattern Mining
A data mining process may uncover thousands of rules from a given set of data, most of which end up
being unrelated or uninteresting to the users. Often, users have a good sense of which "direction" of mining
may lead to interesting patterns and the "form" of the patterns or rules they would like to find. Thus, a


good heuristic is to have the users specify such intuition or expectations as constraints to confine the
search space. This strategy is known as constraint-based mining. The constraints can include the
following:

Knowledge type constraints: These specify the type of knowledge to be mined, such

as association or correlation.

Data constraints: These specify the set of task-relevant data.

Dimension/level constraints: These specify the desired dimensions (or attributes) of the data, or levels of the

concept hierarchies, to be used in mining

Interestingness constraints: These specify thresholds on statistical measures of rule

interestingness, such as support, confidence, and correlation.

Rule constraints: These specify the form of rules to be mined. Such constraints may be

expressed as metarules (rule templates), as the maximum or minimum number of predicates that

can occur in the rule antecedent or consequent, or as relationships among attributes, attribute

values, and/or aggregates.


SRI MANAKULA VINAYAGAR ENGINEERING COLLEGE


DEPARTMENT OF IT
U20ITT511 – DATA WAREHOUSING AND DATA MINING
UNIT – IV: CLASSIFICATION

LECTURE NOTES
Classification: Basic Concepts-Decision Tree Induction-Bayes Classification Methods-Rule-Based
Classification-Model Evaluation and Selection-Techniques to Improve Classification Accuracy-
Support Vector Machines-Classification Using Frequent Patterns-Other Classification Methods

CLASSIFICATION
The process of finding a model (or function) that describes and distinguishes data classes or concepts,
for the purpose of being able to use the model to predict the class of objects whose class label is
unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose
class label is known).
 It predicts categorical class labels
 It classifies data (constructs a model) based on the training set and the
values (class labels) in a classifying attribute and uses it in classifying new
data
 Typical Applications
 Credit approval, target marketing
 medical diagnosis
 treatment effectiveness analysis
DECISION TREE INDUCTION
 A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node
denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class
label.


 The topmost node in the tree is the root node. The following decision tree is for the concept buy_computer
that indicates whether a customer at a company is likely to buy a computer or not. Each internal node
represents a test on an attribute. Each leaf node represents a class.
Algorithm: Generate_decision_tree. Generate a decision tree from the training tuples of data partition D.
Input:
 Data partition, D, which is a set of training tuples and their associated class labels;
 attribute_list, the set of candidate attributes;
 Attribute_selection method, a procedure to determine the splitting criterion that "best" partitions the
data tuples into individual classes. This criterion consists of a splitting_attribute and, possibly, either a
split point or splitting subset.
Output: A decision tree.
Method:
(1) create a node N;
(2) if tuples in D are all of the same class, C, then
(3) return N as a leaf node labeled with the class C;
(4) if attribute_list is empty then
(5) return N as a leaf node labeled with the majority class in D; // majority voting
(6) apply Attribute_selection_method(D, attribute_list) to find the "best" splitting_criterion;
(7) label node N with splitting_criterion;
(8) if splitting_attribute is discrete-valued and multiway splits allowed then // not restricted to binary trees
(9) attribute_list ← attribute_list − splitting_attribute; // remove splitting_attribute
(10) for each outcome j of splitting_criterion // partition the tuples and grow subtrees for each
partition
(11) let Dj be the set of data tuples in D satisfying outcome j; // a partition
(12) if Dj is empty then
(13) attach a leaf labeled with the majority class in D to node N;
(14) else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
endfor
(15) return N;

Basic algorithm for inducing a decision tree from training tuples.


 More specifically, the splitting criterion indicates the splitting attribute and may also indicate either a
split-point or a splitting subset. The splitting criterion is determined so that, ideally, the resulting
partitions at each branch are as “pure” as possible. A partition is pure if all of the tuples in it belong to
the same class.
 In other words, when we split up the tuples in D according to the mutually exclusive outcomes of the
splitting criterion, we hope for the resulting partitions to be as pure as possible.
 The node N is labeled with the splitting criterion, which serves as a test at the node (step 7). A branch is
grown from node N for each of the outcomes of the splitting criterion. The tuples in D are partitioned
accordingly (steps 10 to 11). There are three possible scenarios, as illustrated. Let A be the splitting
attribute. A has v distinct values, {a1, a2, . . . , av}, based on the training data.
o A is discrete-valued: In this case, the outcomes of the test at node N correspond directly to the
known values of A. A branch is created for each known value, aj, of A and labeled with that
value .Partition Dj is the subset of class-labeled tuples in D having value aj of
A. Because all of the tuples in a given partition have the same value for A, then A need not be
considered in any future partitioning of the tuples. Therefore, it is removed from attribute_list
(steps 8 to 9).
o A is continuous-valued: In this case, the test at node N has two possible outcomes, corresponding
to the conditions A ≤ split_point and A > split_point, respectively, where split_point is the split-
point returned by Attribute_selection_method as part of the splitting criterion. (In practice, the
split-point, a, is often taken as the midpoint of two known adjacent values of A and therefore
may not actually be a pre-existing value of A from the training data.) Two branches are grown
from N and labeled according to the above outcomes. The tuples are partitioned such that D1
holds the subset of class-labeled tuples in D for which A ≤ split_point, while D2 holds the rest.
o A is discrete-valued and a binary tree must be produced (as dictated by the attribute selection
measure or algorithm being used): The test at node N is of the form “A ∈ SA?”. SA is the splitting
subset for A, returned by Attribute selection method as part of the splitting criterion. It is a subset
of the known values of A.
o If a given tuple has value aj of A and if aj ∈ SA, then the test at node N is satisfied. Two
branches are grown from N .By convention, the left branch out of N is labeled yes so that D1
corresponds to the subset of class-labeled tuples in D that satisfy the test. The right branch out of
N is labeled no so that D2 corresponds to the subset of class-labeled tuples from D that do not
satisfy the test.
 The algorithm uses the same process recursively to form a decision tree for the tuples at each resulting
partition, Dj, of D (step 14).
 The recursive partitioning stops only when any one of the following terminating conditions is true:
 All of the tuples in partition D (represented at node N) belong to the same class (steps 2 and 3), or
 There are no remaining attributes on which the tuples may be further partitioned (step 4). In this case,
majority voting is employed (step 5). This involves converting node N into a leaf and labeling it with the
most common class in D. Alternatively, the class distribution of the node tuples may be stored.
 There are no tuples for a given branch, that is, a partition Dj is empty (step 12). In this case, a leaf is
created with the majority class in D (step 13).
 The resulting decision tree is returned (step 15)

Classification by Decision Tree Induction


 Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree
is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute,
each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label.
The topmost node in a tree is the root node.
A decision tree for the concept buys computer, indicating whether a customer at AllElectronics is likely to
purchase a computer.

 Each internal (non-leaf) node represents a test on an attribute. Each leaf node represents a class (either
buys computer = yes or buys computer = no).
 A typical decision tree is shown in the figure. It represents the concept buys computer, that is, it predicts
whether a customer at AllElectronics is likely to purchase a computer. Internal nodes are denoted by
rectangles, and leaf nodes are denoted by ovals.
 Some decision tree algorithms produce only binary trees (where each internal node branches to exactly two
other nodes), whereas others can produce non-binary trees
 A path is traced from the root to a leaf node, which holds the class prediction for that tuple. Decision trees
can easily be converted to classification rules.
Most algorithms for decision tree induction also follow such a top-down approach, which starts with a
training set of tuples and their associated class labels. The training set is recursively partitioned into
smaller subsets as the tree is being built. A basic decision tree algorithm is summarized in Figure 3.2.
 The algorithm is called with three parameters: D, attribute list, and Attribute selection method. Refer to D
as a data partition. Initially, it is the complete set of training tuples and their associated class labels.
 The parameter attribute list is a list of attributes describing the tuples. Attribute selection method specifies a
heuristic procedure for selecting the attribute that “best” discriminates the given tuples according to class.
 This procedure employs an attribute selection measure, such as information gain or the Gini index (a short sketch of computing information gain appears after this list).
 The tree starts as a single node, N, representing the training tuples in D (step 1).
 If the tuples in D are all of the same class, then node N becomes a leaf and is labeled with that class
(steps 2 and 3).
 Note that steps 4 and 5 are terminating conditions. All of the terminating conditions are explained at the
end of the algorithm.
 Otherwise, the algorithm calls Attribute selection method to determine the splitting criterion. The
splitting criterion tells us which attribute to test at node N by determining the “best” way to separate or
partition the tuples in D into individual classes (step 6).
 The splitting criterion also tells us which branches to grow from node N with respect to the outcomes of
the chosen test.
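
A minimal Python sketch (not part of the original notes) of the information-gain measure that Attribute_selection_method may employ for a discrete-valued attribute; the tiny data set, its attribute names, and its class labels are illustrative assumptions.

import math
from collections import Counter

def entropy(labels):
    """Info(D): the expected number of bits needed to classify a tuple in D."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Gain(A) = Info(D) - Info_A(D), where Info_A(D) weights each partition by its size."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    expected_info = sum(len(part) / len(labels) * entropy(part) for part in partitions.values())
    return entropy(labels) - expected_info

# Illustrative training tuples (assumed, not taken from the notes)
rows = [{"age": "youth", "student": "no"},
        {"age": "youth", "student": "yes"},
        {"age": "senior", "student": "no"}]
labels = ["no", "yes", "yes"]
print(information_gain(rows, labels, "age"))   # the attribute with the highest gain is chosen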
BAYESIAN CLASSIFICATION
Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the
probability that a given tuple belongs to a particular class.Bayesian classification is based on Bayes’
theorem, described below.
Bayes’ Theorem:
 Let X be a data tuple. In Bayesian terms, X is considered "evidence" and it is described by
measurements made on a set of n attributes.
 Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
 For classification problems, we want to determine P(H|X), the probability that the hypothesis H
holds given the "evidence" or observed data tuple X.
 P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
Bayes' theorem is useful in that it provides a way of calculating the posterior probability,
P(H|X), from P(H), P(X|H), and P(X):
P(H|X) = P(X|H) P(H) / P(X)

A Multilayer Feed-Forward Neural Network:

 The backpropagation algorithm performs learning on a multilayer feed-forward neural network.

 It iteratively learns a set of weights for prediction of the class label of tuples. A multilayer feed-forward
neural network consists of an input layer, one or more hidden layers, and an output layer.

 Example:

 The inputs to the network correspond to the attributes measured for each training tuple. The inputs are
fed simultaneously into the units making up the input layer. These inputs pass through the input layer
and are then weighted and fed simultaneously to a second layer known as a hidden layer.
 The outputs of the hidden layer units can be input to another hidden layer, and so on. The number of
hidden layers is arbitrary.
The weighted outputs of the last hidden layer are input to units making up the output layer,
which emits the network's prediction for the given tuples.
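
As a rough illustration of how inputs are weighted and passed from layer to layer, the Python sketch below performs a single forward pass through a small fully connected network with sigmoid units; the layer sizes and the random weights are assumptions, and the weight-learning (backpropagation) step is not shown.

import math, random

def forward(x, layers):
    """One forward pass; `layers` is a list of (weights, biases), weights[j] feeding unit j."""
    for weights, biases in layers:
        x = [1.0 / (1.0 + math.exp(-(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)))
             for w, b in zip(weights, biases)]   # sigmoid of the weighted sum at each unit
    return x

random.seed(0)
hidden = ([[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)], [0.0, 0.0])  # 2 -> 2
output = ([[random.uniform(-1, 1) for _ in range(2)]], [0.0])                         # 2 -> 1
print(forward([0.5, 1.0], [hidden, output]))   # the network's output for one input tuple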
Studies comparing classification algorithms have found a simple Bayesian classifier known as the naïve
Bayesian classifier to be comparable in performance with decision tree and selected neural network
classifiers. Bayesian classifiers have also exhibited high accuracy and speed when applied to large
databases.
Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the
values of the other attributes. This assumption is called class conditional independence. It is made to
simplify the computations involved and, in this sense, is considered “naïve.” Bayesian belief networks are
graphical models, which unlike naïve Bayesian classifiers allow the representation of dependencies among
subsets of attributes. Bayesian belief networks can also be used for classification.
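
The Python sketch below (an illustration, not the textbook's code) shows how a naïve Bayesian classifier combines the class prior P(C) with per-attribute likelihoods P(xk | C) under the class-conditional independence assumption; the attribute names and the absence of Laplace smoothing are simplifying assumptions.

from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies (for P(C)) and per-class attribute-value frequencies (for P(xk | C))."""
    priors = Counter(labels)
    cond = defaultdict(Counter)              # (class, attribute) -> Counter of values
    for row, c in zip(rows, labels):
        for attr, value in row.items():
            cond[(c, attr)][value] += 1
    return priors, cond

def predict(priors, cond, row):
    """Choose the class maximizing P(C) * product of P(xk | C); no smoothing, for brevity."""
    total = sum(priors.values())
    best, best_score = None, -1.0
    for c, count in priors.items():
        score = count / total
        for attr, value in row.items():
            score *= cond[(c, attr)][value] / count
        if score > best_score:
            best, best_score = c, score
    return best

priors, cond = train_naive_bayes(
    [{"age": "youth", "student": "yes"}, {"age": "senior", "student": "no"}], ["yes", "no"])
print(predict(priors, cond, {"age": "youth", "student": "yes"}))   # -> "yes"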

There are three possible scenarios. Let A be the splitting attribute. A has v distinct values,
{a1, a2, ..., av}, based on the training data.

1. A is discrete-valued:
 In this case, the outcomes of the test at node N correspond directly to the known values of A.
 A branch is created for each known value, aj, of A and labeled with that value. A need not be considered in any future partitioning of the tuples.
2. A is continuous-valued:
In this case, the test at node N has two possible outcomes, corresponding to the conditions
A <= split_point and A > split_point, respectively,
where split_point is the split-point returned by Attribute_selection_method as part of the
splitting criterion.
3. A is discrete-valued and a binary tree must be produced:

The test at node N is of the form "A ∈ SA?".

SA is the splitting subset for A, returned by Attribute_selection_method as part of the splitting
criterion. It is a subset of the known values of A.
(a) If A is discrete-valued (b) If A is continuous-valued (c) If A is discrete-valued and a binary
tree must be produced
K-NEAREST-NEIGHBOR CLASSIFIER:

Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given
test tuple with training tuples that are similar to it.
 The training tuples are described by n attributes. Each tuple represents a point in an n-dimensional space. In this way, all of the training tuples are stored in an n-dimensional pattern space. When given an unknown tuple, a k-nearest-neighbor classifier searches the pattern space for the k training tuples that are closest to the unknown tuple. These k training tuples are the k nearest neighbors of the unknown tuple.
 For k-nearest-neighbor classification, the unknown tuple is assigned the most common class among its k nearest neighbors.
 When k = 1, the unknown tuple is assigned the class of the training tuple that is closest to it in pattern space.
Nearest-neighbor classifiers can also be used for prediction, that is, to return a real-valued prediction for a given unknown tuple.
In this case, the classifier returns the average value of the real-valued labels associated with the k nearest neighbors of the unknown tuple.
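
A minimal Python sketch of the k-nearest-neighbor rule described above, using Euclidean distance and majority voting; the training points, labels, and the value of k are illustrative assumptions.

import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Assign `query` the most common class among its k closest training tuples."""
    nearest = sorted((math.dist(p, query), label)
                     for p, label in zip(train_points, train_labels))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [(1.0, 1.0), (1.2, 0.8), (6.0, 6.0), (5.5, 6.2)]
labels = ["A", "A", "B", "B"]
print(knn_predict(train, labels, (5.8, 5.9), k=3))   # -> "B"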
RULE BASED CLASSIFICATION


IF-THEN Rules
 Rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the
following form:
IF condition THEN conclusion
Let us consider a rule R1,
R1: IF age = youth AND student = yes THEN buys_computer = yes
Points to remember −
 The IF part of the rule is called rule antecedent or precondition.
 The THEN part of the rule is called rule consequent.
 The antecedent part (the condition) consists of one or more attribute tests, and these tests are logically
ANDed.
 The consequent part consists of the class prediction.
Note − We can also write rule R1 as follows −
 R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)
 If the condition holds true for a given tuple, then the antecedent is satisfied.
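
The short Python sketch below (illustrative, not from the notes) shows how a tuple can be matched against IF-THEN rules of this form: the antecedent is satisfied when all of its ANDed attribute tests hold, and the consequent then gives the class prediction; the default class returned when no rule fires is an assumption.

# Each rule is (antecedent: attribute -> required value, consequent: predicted class)
rules = [({"age": "youth", "student": "yes"}, "buys_computer = yes")]

def apply_rules(rules, tuple_, default="buys_computer = no"):
    """Return the consequent of the first rule whose antecedent the tuple satisfies."""
    for antecedent, consequent in rules:
        if all(tuple_.get(attr) == value for attr, value in antecedent.items()):
            return consequent
    return default

print(apply_rules(rules, {"age": "youth", "student": "yes"}))   # -> buys_computer = yes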
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a decision tree.
Points to remember −
 To extract a rule from a decision tree −
 One rule is created for each path from the root to the leaf node.
 To form a rule antecedent, each splitting criterion is logically ANDed.
 The leaf node holds the class prediction, forming the rule consequent.
 Rule Induction Using Sequential Covering Algorithm
Sequential Covering Algorithm can be used to extract IF-THEN rules from the training data. We do not
need to generate a decision tree first. In this algorithm, each rule for a given class covers many of the
tuples of that class.

Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general strategy, the
rules are learned one at a time. Each time a rule is learned, the tuples covered by that rule are removed,
and the process continues for the rest of the tuples. (By contrast, when rules are extracted from a decision
tree, the path to each leaf corresponds to a rule, so all rules are obtained at once.)

OTHER CLASSIFICATION METHODS:


Genetic Algorithms:

Genetic algorithms attempt to incorporate ideas of natural evolution. In general, genetic learning starts as
follows.
An initial population is created consisting of randomly generated rules. Each rule can be
represented by a string of bits. As a simple example, suppose that samples in a given
training set are described by two Boolean attributes, A1 and A2, and that there are two
classes, C1 and C2.
The rule "IF A1 AND NOT A2 THEN C2" can be encoded as the bit string "100," where the
two leftmost bits represent attributes A1 and A2, respectively, and the rightmost bit represents
the class.

Similarly, the rule "IF NOT A1 AND NOT A2 THEN C1" can be encoded as "001."
If an attribute has k values, where k > 2, then k bits may be used to encode the attribute's values.
Classes can be encoded in a similar fashion.
Based on the notion of survival of the fittest, a new population is formed to consist of
the fittest rules in the current population, as well as offspring of these rules.
Typically, the fitness of a rule is assessed by its classification accuracy on a set of
training samples.

 Offspring are created by applying genetic operators such as crossover and mutation. In crossover,
substrings from pairs of rules are swapped to form new pairs of rules. In mutation, randomly
selected bits in a rule’s string are inverted.
 The process of generating new populations based on prior populations of rules continues until a
population, P, evolves where each rule in P satisfies a pre specified fitness threshold.
 Genetic algorithms are easily parallelizable and have been used for classification as well as other
optimization problems. In data mining, they may be used to evaluate the fitness of other
algorithms.
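
A minimal Python sketch (illustrative only) of the two genetic operators described above, applied to rules encoded as bit strings such as "100" and "001"; the crossover point and mutation rate are assumptions.

import random

def crossover(rule_a, rule_b, point):
    """Swap the substrings of two encoded rules after the crossover point."""
    return rule_a[:point] + rule_b[point:], rule_b[:point] + rule_a[point:]

def mutate(rule, rate=0.1):
    """Invert each bit of a rule's string with probability `rate`."""
    return "".join(("1" if b == "0" else "0") if random.random() < rate else b for b in rule)

print(crossover("100", "001", 1))   # substrings after position 1 are exchanged
print(mutate("100"))                # randomly selected bits are inverted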


Fuzzy Set Approaches:

Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership
that a certain value has in a given category. Each category then represents a fuzzy set.
Fuzzy logic systems typically provide graphical tools to assist users in converting attribute
values to fuzzy truth values.
Fuzzy set theory is also known as possibility theory.
In regression-based prediction, the response variable is what we want to predict.

Linear Regression:

Straight-line regression analysis involves a response variable, y, and a single predictor variable, x.
It is the simplest form of regression, and models y as a linear function of x.
That is, y = b + wx
where the variance of y is assumed to be constant, and
b and w are regression coefficients specifying the Y-intercept and slope of the line.
 The regression coefficients, w and b, can also be thought of as weights, so that we can equivalently write y = w0 + w1x.
 These coefficients can be solved for by the method of least squares, which estimates the best-fitting straight line as the one that minimizes the error between the actual data and the estimate of the line.
 Let D be a training set consisting of values of predictor variable, x, for some population and their associated values for response variable, y. The training set contains |D| data points of the form (x1, y1), (x2, y2), ..., (x|D|, y|D|).
The regression coefficients can be estimated using this method with the following equations:

    w1 = Σ_{i=1}^{|D|} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{|D|} (xi − x̄)²
    w0 = ȳ − w1·x̄

where x̄ is the mean value of x1, x2, ..., x|D|, and ȳ is the mean value of y1, y2, ..., y|D|.
The coefficients w0 and w1 often provide good approximations to otherwise complicated regression equations.
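
A short Python sketch of estimating w1 and w0 by the least-squares equations above; the sample (x, y) values are illustrative assumptions.

def fit_line(xs, ys):
    """Estimate the slope w1 and intercept w0 of the best-fitting straight line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    w0 = y_bar - w1 * x_bar
    return w0, w1

# e.g. years of experience (x) vs. salary in $1000s (y); the numbers are illustrative
print(fit_line([3, 8, 9, 13], [30, 57, 64, 72]))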
Multiple Linear Regression:

It is an extension of straight-line regression so as to involve more than one predictor variable.
It allows response variable y to be modeled as a linear function of, say, n predictor
variables or attributes, A1, A2, ..., An, describing a tuple, X.
An example of a multiple linear regression model based on two predictor attributes or
variables, A1 and A2, is y = w0 + w1x1 + w2x2
where x1 and x2 are the values of attributes A1 and A2, respectively, in X.
Multiple regression problems are instead commonly solved with the use of statistical
software packages, such as SAS, SPSS, and S-Plus.
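
As noted above, multiple regression problems are usually solved with software; the sketch below (assuming NumPy is available, with made-up data) obtains w0, w1, and w2 for y = w0 + w1x1 + w2x2 by ordinary least squares.

import numpy as np

# Two predictor attributes A1, A2 and a response y (illustrative values)
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])
y = np.array([5.0, 4.0, 8.0, 12.0])

# Prepend a column of ones so that the intercept w0 is estimated as well
X1 = np.hstack([np.ones((X.shape[0], 1)), X])
w, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(w)   # [w0, w1, w2]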

SRI MANAKULA VINAYAGAR ENGINEERING COLLEGE


DEPARTMENT OF IT
U20ITT511 – DATA WAREHOUSING AND DATA MINING
UNIT – V: CLUSTERING

LECTURE NOTES
Clustering:Clustering analysis-Partitioning methods-Hierarchical methods-Density based
methods-Grid based methods-Model Based Clustering Methods –Clustering High Dimensional
Data-Constraint based Clustering Analysis-Introduction to Outlier analysis-Data Mining
Applications.

CLUSTER ANALYSIS
Clustering: The process of grouping a set of physical or abstract objects into classes of similar objects is
called clustering.
Cluster: A cluster is a collection of data objects that are similar to one another within the same cluster and
are dissimilar to the objects in other clusters.
Clustering is also called data segmentation in some applications because clustering partitions large data sets into
groups according to their similarity. Clustering can also be used for outlier detection, where outliers (values that
are “far away” from any cluster) may be more interesting than common cases.
Clustering is a challenging field of research in which its potential applications pose their own special
requirements. The following are typical requirements of clustering in data mining:
 Scalability: Many clustering algorithms work well on small data sets containing fewer than several
hundred data objects; however, a large database may contain millions of objects. Clustering on a sample
of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.
 Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based
(numerical) data. However, applications may require clustering other types of data, such as binary,
categorical (nominal), and ordinal data, or mixtures of these data types.
 Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on
Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find
spherical clusters with similar size and density. However, a cluster could be of any shape. It is important
to develop algorithms that can detect clusters of arbitrary shape.
 Minimal requirements for domain knowledge to determine input parameters: Many clustering
algorithms require users to input certain parameters in cluster analysis (such as the number of desired
clusters). The clustering results can be quite sensitive to input parameters. Parameters are often difficult to
determine, especially for data sets containing high-dimensional objects. This not only burdens users, but
it also makes the quality of clustering difficult to control.
 Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or
erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor
quality.
 Incremental clustering and insensitivity to the order of input records: Some clustering algorithms cannot
incorporate newly inserted data (i.e., database updates) into existing clustering structures and, instead, must
determine a new clustering from scratch. Some clustering algorithms are sensitive to the order of input data.
That is, given a set of data objects, such an algorithm may return dramatically different clustering depending on
the order of presentation of the input objects. It is important to develop incremental clustering algorithms and
algorithms that are insensitive to the order of input.

Types of Data in Cluster Analysis


Suppose that a data set to be clustered contains n objects, which may represent persons, houses, documents,
countries, and so on. Main memory-based clustering algorithms typically operate on either of the following two
data structures.
Data matrix (or object-by-variable structure): This represents n objects, such as persons, with p variables (also
called measurements or attributes), such as age, height, weight, gender, and so on. The structure is in the form of
a relational table, or n-by-p matrix (n objects × p variables):

        x11  x12  ...  x1p
        x21  x22  ...  x2p
        ...  ...  ...  ...
        xn1  xn2  ...  xnp                (5.1)
Dissimilarity matrix (or object-by-object structure): This stores a collection of proximities that are available
for all pairs of n objects. It is often represented by an n-by-n table:

        0
        d(2,1)   0
        d(3,1)   d(3,2)   0
        ...      ...      ...
        d(n,1)   d(n,2)   ...   ...   0        (5.2)

Where d(i, j) is the measured difference or dissimilarity between objects i and j. In general, d(i, j) is a
nonnegative number that is close to 0 when objects i and j are highly similar or "near" each other, and becomes
larger the more they differ. Since d(i, j) = d(j, i) and d(i, i) = 0, the matrix in (5.2) is shown in lower-triangular form.
The rows and columns of the data matrix represent different entities, while those of the dissimilarity
matrix represent the same entity. Thus, the data matrix is often called a two-mode matrix, whereas the
dissimilarity matrix is called a one-mode matrix. Many clustering algorithms operate on a dissimilarity matrix. If
the data are presented in the form of a data matrix, it can first be transformed into a dissimilarity matrix before
applying such clustering algorithms.
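
A small Python sketch (Euclidean distance assumed, data values illustrative) of transforming a data matrix of n objects into the n-by-n dissimilarity matrix described above.

import math

def dissimilarity_matrix(objects):
    """Return the n-by-n matrix with d(i, j) = Euclidean distance; d(i, i) = 0 and d(i, j) = d(j, i)."""
    n = len(objects)
    return [[math.dist(objects[i], objects[j]) for j in range(n)] for i in range(n)]

# Three objects described by p = 2 numeric variables
for row in dissimilarity_matrix([(1.0, 2.0), (4.0, 6.0), (1.5, 2.5)]):
    print(row)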

Major Clustering Methods:

 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Methods
Partitioning Methods:
A partitioning method constructs k partitions of the data, where each partition represents a cluster
and k <= n. That is, it classifies the data into k groups, which together satisfy the following
requirements:

Each group must contain at least one object, and each object must belong to exactly one group.

A partitioning method creates an initial partitioning. It then uses an iterative relocation technique
that attempts to improve the partitioning by moving objects from one group to another.

The general criterion of a good partitioning is that objects in the same cluster are close or related to
each other, whereas objects of different clusters are far apart or very different.

Hierarchical Methods:
A hierarchical method creates a hierarchical decomposition of the given set of data objects. A
hierarchical method can be classified as being either agglomerative or divisive, based on how the
hierarchical decomposition is formed.

 The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one or until a termination condition holds.
 The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in one cluster, or until a termination condition holds.

Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never
be undone. This rigidity is useful in that it leads to smaller computation costs by not having
to worry about a combinatorial number of different choices.

There are two approaches to improving the quality of hierarchical clustering:

 Perform careful analysis of object "linkages" at each hierarchical partitioning, such as in Chameleon, or
 Integrate hierarchical agglomeration and other approaches by first using a hierarchical agglomerative algorithm to group objects into microclusters, and then performing macroclustering on the microclusters using another clustering method such as iterative relocation.
Density-based methods:
 Most partitioning methods cluster objects based on the distance between objects. Such
methods can find only spherical-shaped clusters and encounter difficulty at discovering
clusters of arbitrary shapes.
 Other clustering methods have been developed based on the notion of density. Their
general idea is to continue growing the given cluster as long as the density in the
neighborhood exceeds some threshold; that is, for each data point within a given cluster,
the neighborhood of a given radius has to contain at least a minimum number of points.
Such a method can be used to filter out noise (outliers)and discover clusters of arbitrary
shape.
 DBSCAN and its extension, OPTICS, are typical density-based methods that grow clusters
according to a density-based connectivity analysis. DENCLUE is a method that clusters
objects based on the analysis of the value distributions of density functions.
Grid-Based Methods:
 Grid-based methods quantize the object space into a finite number of cells that form a
grid structure.

 All of the clustering operations are performed on the grid structure, i.e., on the quantized
space. The main advantage of this approach is its fast processing time, which is typically
independent of the number of data objects and dependent only on the number of cells in
each dimension in the quantized space.
 STING is a typical example of a grid-based method. Wave Cluster applies wavelet
transformation for clustering analysis and is both grid-based and density-based.

 Model-based methods hypothesize a model for each of the clusters and find the best fit of
the data to the given model.
 A model-based algorithm may locate clusters by constructing a density function that
reflects the spatial distribution of the data points.
 It also leads to a way of automatically determining the number of clusters based on
standard statistics, taking "noise" or outliers into account and thus yielding robust
clustering methods.
Tasks in Data Mining:
 Clustering High-Dimensional Data
 Constraint-Based Clustering
Clustering High-Dimensional Data:
 It is a particularly important task in cluster analysis because many applications require the analysis of
objects containing a large number of features or dimensions
 For example, text documents may contain thousands of terms or keywords as features, and DNA micro
array data may provide information on the expression levels of thousands of genes under hundreds of
conditions.
 Clustering high-dimensional data is challenging due to the curse of dimensionality. Many dimensions may not be relevant. As the number of dimensions increases, the data become increasingly sparse, so that the distance measurement between pairs of points becomes meaningless and the average density of points anywhere in the data is likely to be low. Therefore, a different clustering methodology needs to be developed for high-dimensional data.
 CLIQUE and PROCLUS are two influential subspace clustering methods, which search for clusters in subspaces of the data, rather than over the entire data space.

 Frequent pattern–based clustering, another clustering methodology, extracts distinct frequent patterns among subsets of dimensions that occur frequently. It uses such patterns to group objects and generate meaningful clusters.

Constraint-BasedClustering:
 It is a clustering approach that performs clustering by incorporation of user-specified or application-
oriented constraints.

 A constraint expresses a user’s expectation or describes properties of the desired clustering results, and
provides an effective means for communicating with the clustering process.

 Various kinds of constraints can be specified, either by a user or as per application requirements.

 Spatial clustering may be performed in the presence of obstacles and under user-specified constraints. In addition, semi-supervised clustering employs pairwise constraints in order to improve the quality of the resulting clustering.

Classical Partitioning Methods:


The most well-known and commonly used partitioning methods are
 The k-Means Method
 k-Medoids Method
Centroid-Based Technique: The K-Means Method:

The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters
so that the resulting intracluster similarity is high but the intercluster similarity is low.
Cluster similarity is measured in regard to the mean value of the objects in a cluster, which can be
viewed as the cluster's centroid or center of gravity.
The k-means algorithm proceeds as follows.
First, it randomly selects k of the objects, each of which initially represents a cluster mean or
center.
For each of the remaining objects, an object is assigned to the cluster to which it is the most
similar, based on the distance between the object and the cluster mean.
It then computes the new mean for each cluster.
This process iterates until the criterion function converges.

Typically, the square-error criterion is used, defined as

    E = Σ (i = 1..k) Σ (p ∈ Ci) |p − mi|²

where E is the sum of the square error for all objects in the data set, p is the point in space representing a
given object, and mi is the mean of cluster Ci.

The k-means partitioning algorithm:

The k-means algorithm for partitioning, where each cluster’s center is represented by the mean
value of the objects in the cluster.

Clustering of a set of objects based on the k-means method
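
A compact Python sketch of the k-means procedure just described (random initial centers, assignment of each object to its nearest mean, recomputation of the means); the fixed iteration count stands in for a convergence test and, like the sample points, is an assumption.

import math, random

def kmeans(points, k, iterations=20):
    """Plain k-means: assign each point to the closest mean, then recompute each cluster mean."""
    means = random.sample(points, k)                    # k randomly chosen initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, means[i]))
            clusters[nearest].append(p)
        means = [tuple(sum(coord) / len(c) for coord in zip(*c)) if c else means[i]
                 for i, c in enumerate(clusters)]       # keep the old mean if a cluster is empty
    return means, clusters

means, clusters = kmeans([(1, 1), (1.5, 2), (8, 8), (9, 9), (8.5, 9)], k=2)
print(means)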


The k-Medoids Method:

 The k-means algorithm is sensitive to outliers because an object with an extremely large value may
substantially distort the distribution of data. This effect is particularly exacerbated due to the use of the
square-error function.

 Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual objects to
represent the clusters, using one representative object per cluster. Each remaining object is clustered with the
representative object to which it is the most similar.

 The partitioning method is then performed based on the principle of minimizing the sum of the
dissimilarities between each object and its corresponding reference point. That is, an absolute-error criterion
is used, defined as

    E = Σ (j = 1..k) Σ (p ∈ Cj) |p − oj|
where E is the sum of the absolute error for all objects in the data set, p is the point in space representing a
given object in cluster Cj, and oj is the representative object of Cj.
 The initial representative objects are chosen arbitrarily. The iterative process of replacing representative
objects by non representative objects continues as long as the quality of the resulting clustering is improved.
 This quality is estimated using a cost function that measures the average dissimilarity between an object and
the representative object of its cluster.
 To determine whether a nonrepresentative object, orandom, is a good replacement for a current
representative object, oj, the following four cases are examined for each of the nonrepresentative objects.
Case 1:
p currently belongs to representative object, oj. If oj is replaced by orandom as a representative object
and p is closest to one of the other representative objects, oi, i ≠ j, then p is reassigned to oi.
Case 2:
p currently belongs to representative object, oj. If oj is replaced by orandom as a representative object
and p is closest to orandom, then p is reassigned to orandom.
Case 3:
p currently belongs to representative object, oi, i ≠ j. If oj is replaced by orandom as a representative
object and p is still closest to oi, then the assignment does not change.
Case 4:
p currently belongs to representative object, oi, i ≠ j. If oj is replaced by orandom as a representative
object and p is closest to orandom, then p is reassigned to orandom.

Four cases of the cost function for k-medoids clustering

The k-Medoids Algorithm:
The k-medoids algorithm for partitioning based on medoid or central objects.

The k-medoids method is more robust than k-means in the presence of noise and outliers, because a
medoid is less influenced by outliers or other extreme values than a mean. However, its processing is more
costly than the k-means method.
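
The Python sketch below (illustrative, not the PAM algorithm verbatim) captures the swap idea behind k-medoids: a representative object is replaced by a non-representative object whenever the swap lowers the total absolute-error cost; the initial medoid choice and the data are assumptions.

import math
from itertools import product

def total_cost(points, medoids):
    """Sum of dissimilarities between each object and its closest representative object."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def k_medoids(points, k):
    """Greedy swapping: accept any medoid/non-medoid swap that reduces the total cost."""
    medoids = points[:k]                       # arbitrary initial representative objects
    improved = True
    while improved:
        improved = False
        for m, o in product(list(medoids), points):
            if o in medoids:
                continue
            candidate = [o if x == m else x for x in medoids]
            if total_cost(points, candidate) < total_cost(points, medoids):
                medoids, improved = candidate, True
    return medoids

print(k_medoids([(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.0), (50.0, 50.0)], k=2))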

Hierarchical Clustering Methods:

 A hierarchical clustering method works by grouping data objects into a tree of clusters.

 The quality of a pure hierarchical clustering method suffers from its inability to perform adjustment once
a merge or split decision has been executed. That is, if a particular merge or split decision later turns out
to have been a poor choice, the method cannot backtrack and correct it.

 Hierarchical clustering methods can be further classified as either agglomerative or divisive, depending
on whether the hierarchical decomposition is formed in a bottom-up or top-down fashion.

Agglomerative hierarchical clustering:

 This bottom-up strategy starts by placing each object in its own cluster and then merges these atomic
clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain
termination conditions are satisfied.

 Most hierarchical clustering methods belong to this category. They differ only in their definition of
intercluster similarity.

Divisive hierarchical clustering:

 This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects
in one cluster.

 It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until it
satisfies certain termination conditions, such as a desired number of clusters is obtained or the diameter of
each cluster is within a certain threshold.
Constraint-Based Cluster Analysis:

Constraint-based clustering finds clusters that satisfy user-specified preferences or constraints.

Depending on the nature of the constraints, constraint-based clustering may adopt rather different
approaches.
There are a few categories of constraints.
 Constraints on individual objects:
We can specify constraints on the objects to be clustered. In a real estate application, for
example, one may like to spatially cluster only those luxury mansions worth over a million
dollars. This constraint confines the set of objects to be clustered. It can easily be handled
by preprocessing, after which the problem reduces to an instance of unconstrained clustering.
 Constraints on the selection of clustering parameters:
A user may like to set a desired range for each clustering parameter. Clustering parameters
are usually quite specific to the given clustering algorithm. Examples of parameters include k,
the desired number of clusters in a k-means algorithm, or ε (the radius) and the minimum
number of points in the DBSCAN algorithm. Although such user-specified parameters may
strongly influence the clustering results, they are usually confined to the algorithm itself.
Thus, their fine tuning and processing are usually not considered a form of constraint-based
clustering.
 Constraints on distance or similarity functions:

We can specify different distance or similarity functions for specific attributes of the objects
to be clustered, or different distance measures for specific pairs of objects. When clustering
sportsmen, for example, we may use different weighting schemes for height, body weight,
age, and skill level. Although this will likely change the mining results, it may not alter the
clustering process per se. However, in some cases, such changes may make the evaluation of
the distance function nontrivial, especially when it is tightly intertwined with the clustering
process.

 User-specified constraints on the properties of individual clusters:


A user may like to specify desired characteristics of the resulting clusters, which may
strongly influence the clustering process.
 Semi-supervised clustering based on partial supervision:
The quality of unsupervised clustering can be significantly improved using some weak form
of supervision. This may be in the form of pairwise constraints (i.e., pairs of objects labeled as
belonging to the same or different cluster). Such a constrained clustering process is called
semi-supervised clustering.

Outlier Analysis
 Very often, there exist data objects that do not comply with the general behavior or model of the data. Such
data objects, which are grossly different from or inconsistent with the remaining set of data, are called
outliers. Outliers can be caused by measurement or execution error.
 Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether. This,
however, could result in the loss of important hidden information because one person’s noise could be
another person’s signal. In other words, the outliers may be of particular interest, such as in the case of
fraud detection, where outliers may indicate fraudulent activity. Thus, outlier detection and analysis is an
interesting data mining task, referred to as outlier mining.
 Outlier mining has wide applications. As mentioned previously, it can be used in fraud detection, for
example, by detecting unusual usage of credit cards or telecommunication services. In addition, it is useful
in customized marketing for identifying the spending behavior of customers with extremely low or
extremely high incomes, or in medical analysis for finding unusual responses to various medical
treatments.
 Outlier mining can be described as follows: Given a set of n data points or objects and k, the expected
number of outliers, find the top k objects that are considerably dissimilar, exceptional, or inconsistent with
respect to the remaining data. The outlier mining problem can be viewed as two subproblems: (1) define
what data can be considered as inconsistent in a given data set, and (2) find an efficient method to mine the
outliers so defined.
 These can be categorized into four approaches: the statistical approach, the distance-based approach, the
density-based local outlier approach, and the deviation-based approach, each of which are studied here.
Notice that while clustering algorithms discard outliers as noise, they can be modified to include outlier
detection as a by-product of their execution. In general, users must check that each outlier discovered by
these approaches is indeed a “real” outlier.
Handling Outliers

 Applications that derive their data from measurements have an associated amount of noise, which can be
viewed as outliers.
 Alternately, outliers can be viewed as legitimate records having abnormal behavior. In general, clustering
techniques do not distinguish between the two: neither noise nor abnormalities fit into clusters.
Correspondingly, the preferable way to deal with outliers when partitioning the data is to keep one extra set of
outliers, so as not to pollute the actual clusters. There are multiple ways in which descriptive learning handles
outliers.
Outlier Detection

Outliers are generally defined as observations that are exceptionally far from the mainstream of the data. There is no strict
mathematical definition of what constitutes an outlier; determining whether an observation is an outlier is ultimately a
subjective exercise.

An outlier can be interpreted as a data value or observation that deviates greatly from the mean of a given sample or set
of data. An outlier may occur by chance, but it may also indicate a measurement error, or the given set of data may
have a heavy-tailed distribution.

Therefore, outlier detection can be defined as the process of detecting and then excluding outliers from a given
set of data. There are no standardized outlier identification methods because these are mostly dataset-dependent.
Outlier detection as a branch of data processing has many applications in data stream analysis.

Outlier detection over data streams poses its own problems and calls for specific techniques for streaming data in data
mining. Recent research also focuses on new outlier detection methods and outlier analysis.

Standard applications of outlier detection include fraud detection, public health, and sports; common approaches
include proximity-based approaches and angle-based approaches.

Outlier Detection Techniques

To identify outliers in a database, it is important to keep the context in mind and find the answer to the most
basic and relevant question: "Why should I find outliers?" The context will explain the meaning of your findings.

Remember two important questions about your database during Outlier Identification:

(i) What and how many features do I consider for outlier detection? (univariate / multivariate)

(ii) Can I assume a distribution (or distributions) of values for the features I have selected? (parametric / non-parametric)

There are four Outlier Detection techniques in general.

1. Numeric Outlier

The numeric outlier method is a simple, non-parametric outlier detection technique in a one-dimensional feature space.
Outliers are identified using the IQR (interquartile range). First, the first and third quartiles (Q1, Q3) are
calculated; an outlier is then a data point xi that falls outside the range [Q1 − k·IQR, Q3 + k·IQR], where IQR = Q3 − Q1.

Using the interquartile multiplier value k = 1.5, these limits are the typical upper and lower whiskers of a box plot.

This technique can be easily implemented on the KNIME Analytics platform using the Numeric Outliers node.
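
Independently of any particular platform, the IQR rule can be sketched in a few lines of Python (standard library only); the sample values and the k = 1.5 multiplier are illustrative.

import statistics

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)   # quartiles (Python 3.8+)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

print(iqr_outliers([10, 12, 11, 13, 12, 95]))   # -> [95]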

2. Z-Score

The Z-score technique assumes a Gaussian distribution of the data. Outliers are data points that lie in the tails of
the distribution and are therefore far from the mean.

The z-score of any data point can be calculated by the following expression, after rescaling the selected feature
of the dataset appropriately:

    z = (x − μ) / σ

where μ is the mean and σ is the standard deviation of the feature. When calculating the z-score for each sample, a
threshold must be specified. Some good rule-of-thumb thresholds are 2.5, 3, or 3.5 standard deviations.
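
A small Python sketch of z-score based flagging; the threshold value is an assumption and should be tuned to the data set at hand.

import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values whose |x - mean| / stdev exceeds the chosen threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)        # population standard deviation
    return [x for x in values
            if stdev > 0 and abs(x - mean) / stdev > threshold]

# e.g. zscore_outliers(feature_values, threshold=2.5) for a stricter cut-off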

3. DBSCAN

This outlier detection technique is based on the DBSCAN clustering method. DBSCAN is a non-parametric, density-
based method. Here, all data points are classified as core points, border points, or noise points.

4. Isolation forest

This non-parametric method is suitable for large datasets with one- or multi-dimensional features. The isolation
number is very important in this outlier detection technique: it is the number of splits required to isolate a
data point.
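
A hedged sketch using scikit-learn's IsolationForest (this assumes scikit-learn is installed; the contamination setting and the toy data are illustrative assumptions, not values prescribed by the notes).

import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([[10.0], [12.0], [11.0], [13.0], [12.0], [95.0]])
model = IsolationForest(contamination=0.2, random_state=0).fit(X)
print(model.predict(X))   # -1 marks points the forest isolates quickly (likely outliers), 1 marks inliers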

Outlier Detection Methods

Models for Outlier Detection Analysis

There are many approaches to detecting anomalies. Outlier detection models can be classified into the
following groups:

1. Extreme value analysis

Extreme value analysis is the most basic form of outlier detection and is suitable for 1-dimensional data. In this
approach, the largest or smallest values are considered outliers. The Z-test and the Student's t-test are excellent examples.

These are good heuristics for the initial analysis of data, but they are not of much value in multivariate settings.
Extreme value analysis is often used as a final step in interpreting the outputs of other outlier detection methods.

2. Linear Models

In this approach, the data is modeled by a lower-dimensional subspace using linear correlations. The
distance of each data point to a hyperplane that fits this subspace is calculated and used to
detect outliers. PCA (principal component analysis) is an example of a linear model for anomaly detection.

3. Probabilistic and Statistical Models

In this approach, probabilistic and statistical models assume specific distributions of data. Expectation-
maximization (EM) methods are used to estimate the parameters of the model. Finally, they calculate the
membership probability of each data point for the fitted distribution. Points with the lowest membership
probability are marked as outliers.

4. Proximity-based Models

In this approach, outliers are modeled as points isolated from the rest of the observations. Cluster analysis, density-
based analysis, and nearest-neighborhood analysis are the key approaches of this type.

5. Information-theoretic models

In this approach, outliers are detected as data points that increase the minimum code length required to describe the data set.
