DEPARTMENT OF IT
U20ITT511 – DATA WAREHOUSING & DATA
MINING (REGULATION-2020)
(For V Semester B.Tech IT)
UNIT-I: INTRODUCTION TO DATA WAREHOUSING PROCESSING
Prepared by
P. PRAVEENKUMAR, AP/IT
LECTURE NOTES
Data Warehouse: Data Warehouse-basic Concepts-Modeling-Design and Usage-Implementation-Data
generalization by Attribute-Oriented induction approach-Data Cube Computation methods.
DATA WAREHOUSE:
A data warehouse is a collection of data marts representing historical data from different operations in
the company. This data is stored in a structure optimized for querying and data analysis.
BASIC CONCEPTS:
CHARACTERISTICS OF DATAWAREHOUSE:
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in
support of management's decision making process.
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example,
"sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and
source B may have different ways of identifying a product, but in a data warehouse, there will be only a
single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3
months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a
transaction system, where often only the most recent data is kept. For example, a transaction system may
hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated
with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data
warehouse should never be altered.
Tier-1:
The bottom tier is a warehouse database server that is almost always a relational database system. Back-end
tools and utilities are used to feed data into the bottom tier from operational databases or other external
sources (such as customer profile information provided by external consultants). These tools and utilities
perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a
unified format), as well as load and refresh functions to update the data warehouse. The data are extracted
using application program interfaces known as gateways. A gateway is supported by the underlying DBMS
and allows client programs to generate SQL code to be executed at a server.
Tier-2:
The middle tier is an OLAP server that is typically implemented using either a relational OLAP (ROLAP)
model or a multidimensional OLAP (MOLAP) model.
Tier-3:
The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data
mining tools (e.g., trend analysis, prediction, and so on).
Benefits of data warehousing
Data warehouses are designed to perform well with aggregate queries running on
large amounts of data.
The structure of data warehouses is easier for end users to navigate, understand, and query against,
unlike relational databases, which are primarily designed to handle large numbers of transactions.
Data warehouses enable queries that cut across different segments of a
company's operation. E.g. production data could be compared against inventory
data even if they were originally stored in different databases with different
structures.
Queries that would be complex in very normalized databases could be easier to
build and maintain in data warehouses, decreasing the workload on transaction
systems.
Data warehousing is an efficient way to manage and report on data that is
from a variety of sources, non uniform and scattered throughout a company.
Data warehousing is an efficient way to manage demand for lots of information
from lots of users.
Data warehousing provides the capability to analyze large amounts of historical data.
Operational and Informational Data
Operational Data:
Focusing on transactional functions such as bank card withdrawals and deposits
Detailed
Updateable
Reflects current data
Informational Data:
Focusing on providing answers to problems posed by decision makers
Summarized
Non updateable
The major distinguishing features between OLTP and OLAP are summarized as follows.
1. Users and system orientation: An OLTP system is customer-oriented and is used for
transaction and query processing by clerks, clients, and information technology professionals. An
OLAP system is market-oriented and is used for data analysis by knowledge workers, including
managers, executives, and analysts.
2. Data contents: An OLTP system manages current data that, typically, are too detailed to be
easily used for decision making. An OLAP system manages large amounts of historical data,
provides facilities for summarization and aggregation, and stores and manages information at
different levels of granularity. These features make the data easier for use in informed decision
making.
3. Database design: An OLTP system usually adopts an entity-relationship (ER) data model and
an application oriented database design. An OLAP system typically adopts either a star or
snowflake model and a subject-oriented database design.
4. View: An OLTP system focuses mainly on the current data within an enterprise or department,
without referring to historical data or data in different organizations. In contrast, an OLAP
system often spans multiple versions of a database schema. OLAP systems also deal with
information that originates from different organizations, integrating information from many data
stores. Because of their huge volume, OLAP data are stored on multiple storage media.
5. Access patterns: The access patterns of an OLTP system consist mainly of short, atomic
transactions. Such a system requires concurrency control and recovery mechanisms. However,
accesses to OLAP systems are mostly read-only operations, although many could be complex
queries.
Multidimensional Data Model: The most popular data model for data warehouses is a multidimensional
model. This model can exist in the form of a star schema, a snowflake schema, or a fact constellation schema.
Let's have a look at each of these schema types.
Star schema: The most common modeling paradigm, in which the data warehouse contains a large
central fact table (containing the bulk of the data, with no redundancy) and a set of smaller dimension
tables, one for each dimension. The schema graph resembles a starburst, with the dimension tables
displayed in a radial pattern around the central fact table.
Snowflake schema: The snowflake schema is a variant of the star schema model, where
some dimension tables are normalized, thereby further splitting the data into additional
tables. The resulting schema graph forms a shape similar to a snowflake. The major
difference between the snowflake and star schema models is that the dimension tables of
the snowflake model may be kept in normalized form. Such a table is easy to maintain
and also saves storage space because a large dimension table can be extremely large
when the dimensional structure is included as columns.
Fact constellation: Sophisticated applications may require multiple fact tables to share
dimension tables. This kind of schema can be viewed as a collection of stars, and hence is
called a galaxy schema or a fact constellation.
Consolidation involves the aggregation of data that can be accumulated and computed in one or more
dimensions. For example, all sales offices are rolled up to the sales department or sales division to
anticipate sales trends.
The drill-down is a technique that allows users to navigate through the details. For instance,
users can view the sales by individual products that make up a region’s sales.
Slicing and dicing is a feature whereby users can take out (slicing) a specific set of data of the
OLAP cube and view (dicing) the slices from different viewpoints.
TYPES OF OLAP:
ROLAP tools do not use pre-calculated data cubes but instead pose the query to the standard
relational database and its tables in order to bring back the data required to answer the question.
MOLAP tools generally utilize a pre-calculated data set referred to as a data cube. The data is stored
in a multidimensional cube rather than in the relational database; therefore it requires the
pre-computation and storage of information in the cube - the operation known as processing. The data
cube contains all the possible answers to a given range of questions. MOLAP tools have a very fast
response time and the ability to quickly write back data into the data set.
HOLAP addresses the shortcomings of MOLAP and ROLAP by combining the capabilities of
both approaches. HOLAP tools can utilize both pre-calculated cubes and relational data sources.
For example, for some vendors, a HOLAP database will use relational tables to hold the larger
quantities of detailed data, and use specialized storage for at least some aspects of the smaller
quantities of more-aggregate or less-detailed data.
DATA WAREHOUSE MODELS:
From the architecture point of view, there are three data warehouse models: the enterprise
warehouse, the data mart, and the virtual warehouse.
Enterprise warehouse: An enterprise warehouse collects all of the information about subjects
spanning the entire organization. It provides corporate-wide data integration, usually from one or
more operational systems or external information providers.
Data mart: A data mart contains a subset of corporate-wide data that is of value to a
specific group of users. The scope is connected to specific, selected subjects. For example,
a marketing data mart may connect its subjects to customer, item, and sales. The data
contained in data marts tend to be summarized. Depending on the source of data, data
marts can be categorized into the following two classes:
(i).Independent data marts are sourced from data captured from one or more operational systems or
external information providers, or from data generated locally within a particular department or
geographic area.
(ii).Dependent data marts are sourced directly from enterprise data warehouses.
Virtual warehouse: A virtual warehouse is a set of views over operational databases. For
efficient query processing, only some of the possible summary views may be
materialized. A virtual warehouse is easy to build but requires excess capacity on
operational database servers.
Figure: A recommended approach for data warehouse development.
The warehouse design process can follow a top-down approach, a bottom-up approach, or a
combination of both.
The top-down approach starts with the overall design and planning. It is useful in cases where the
technology is mature and well known, and where the business problems that must be solved are
clear and well understood.
The bottom-up approach starts with experiments and prototypes. This is useful in the early stage of
business modeling and technology development. It allows an organization to move forward at
considerably less expense and to evaluate the benefits of the technology before making significant
commitments.
In the combined approach, an organization can exploit the planned and strategic nature of the top-
down approach while retaining the rapid implementation and opportunistic application of the
bottom-up approach.
Choose the measures that will populate each fact table record. Typical measures are numeric
additive quantities like dollars sold and units sold.
A CONCEPT HIERARCHY
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level,
more general concepts. Concept hierarchies allow data to be handled at varying levels of
abstraction.
1. Roll-up: The roll-up operation performs aggregation on a data cube, either by climbing up a concept
hierarchy for a dimension or by dimension reduction (e.g., rolling sales data up from the city level to the
country level).
2. Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more
detailed data. Drill-down can be realized by either stepping down a concept hierarchy for a
dimension or introducing additional dimensions. Figure shows the result of a drill-down operation
performed on the central cube by stepping down a concept hierarchy for time defined as day <
month < quarter < year. Drill-down occurs by descending the time hierarchy from the level of quarter
to the more detailed level of month.
3. Slice and dice: The slice operation performs a selection on one dimension of the given cube,
resulting in a subcube. Figure shows a slice operation where the sales data are selected from the
central cube for the dimension time using the criterion time = "Q2". The dice operation defines a
subcube by performing a selection on two or more dimensions.
4. Pivot (rotate): Pivot is a visualization operation which rotates the data axes in view in order to
provide an alternative presentation of the data. Figure shows a pivot operation where the item and
location axes in a 2-D slice are rotated.
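The following short Python sketch (an illustration added for clarity, not part of the original notes; it assumes the pandas library and a made-up sales table) shows how roll-up, drill-down, slice, dice, and pivot map onto ordinary group-by, filter, and pivot operations:

import pandas as pd

# Hypothetical fact table: sales amount by item, city, quarter, and month.
sales = pd.DataFrame({
    "item":    ["computer", "computer", "phone", "phone", "computer", "phone"],
    "city":    ["Chennai",  "Chennai",  "Delhi", "Delhi", "Delhi",    "Chennai"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q2", "Q1"],
    "month":   ["Jan", "Apr", "Feb", "May", "Jun", "Mar"],
    "amount":  [1000, 1500, 400, 700, 1200, 350],
})

# Roll-up: climb the time hierarchy (month -> quarter) by aggregating.
rollup = sales.groupby(["item", "city", "quarter"])["amount"].sum()

# Drill-down: step back down to the more detailed month level.
drilldown = sales.groupby(["item", "city", "quarter", "month"])["amount"].sum()

# Slice: select a single value on one dimension (time = "Q2").
slice_q2 = sales[sales["quarter"] == "Q2"]

# Dice: select on two or more dimensions at once.
dice = sales[(sales["quarter"] == "Q2") & (sales["city"] == "Delhi")]

# Pivot (rotate): swap the item and city axes of a 2-D view.
pivot = sales.pivot_table(index="item", columns="city", values="amount", aggfunc="sum")
print(pivot.T)  # rotated presentation of the same data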
In this section, we discuss the major processes of the data warehouse: extract (pull data from the
operational systems and bring it to the data warehouse), transform (convert the data into the
internal format and structure of the data warehouse), cleanse (make sure the data is of sufficient quality
to be used for decision making), and load (put the cleansed data into the data warehouse).
EXTRACT
Some of the data elements in the operational database can reasonably be expected to be useful in
decision making, but others are of less value for that purpose. For this reason, it is necessary to extract the
relevant data from the operational database before bringing it into the data warehouse. Many commercial tools
are available to help with the extraction process.
TRANSFORM
The operational databases developed can be based on any set of priorities, which keep changing with the
requirements. Therefore, those who develop a data warehouse based on these databases are typically faced
with inconsistencies among their data sources. The transformation process deals with rectifying any such
inconsistencies.
LOADING
Loading often implies physical movement of the data from the computer(s) storing the source database(s) to
the one that will store the data warehouse database, assuming it is different. This takes place immediately
after the extraction phase. The most common channel for data movement is a high-speed communication
link. Ex: Oracle Warehouse Builder is a tool from Oracle that provides the features to perform ETL tasks
on an Oracle data warehouse.
Cube Operation
Cube definition and computation in DMQL
define cube sales [item, city, year]: sum(sales_in_dollars)
compute cube sales
Transform it into a SQL-like language (with a new operator CUBE BY, introduced by Gray et al., 1996):
SELECT item, city, year, SUM(amount)
FROM SALES
CUBE BY item, city, year
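As an illustration of what CUBE BY computes, the Python sketch below (an assumption-laden example using pandas and a tiny hypothetical SALES table, not taken from the notes) materializes one group-by per subset of the dimensions, i.e., all 2^3 cuboids of the sales cube:

import itertools
import pandas as pd

sales = pd.DataFrame({
    "item":   ["computer", "phone", "computer", "phone"],
    "city":   ["Chennai", "Chennai", "Delhi", "Delhi"],
    "year":   [2022, 2022, 2023, 2023],
    "amount": [1000, 400, 1200, 700],
})

dims = ["item", "city", "year"]
cuboids = {}
# CUBE BY item, city, year computes one group-by per subset of the dimensions
# (2^3 = 8 cuboids, from the 3-D base cuboid down to the apex "all" cuboid).
for k in range(len(dims), -1, -1):
    for group in itertools.combinations(dims, k):
        if group:
            cuboids[group] = sales.groupby(list(group))["amount"].sum()
        else:
            cuboids[("ALL",)] = sales["amount"].sum()  # apex cuboid: grand total

for group, agg in cuboids.items():
    print(group, "->")
    print(agg, "\n")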
Computation
Partition arrays into chunks (a small subcube that fits in memory).
Use compressed sparse array addressing: (chunk_id, offset).
Compute aggregates in "multiway" fashion by visiting cube cells in the order that minimizes
the number of times each cell must be revisited, thereby reducing memory access and storage costs.
LECTURE NOTES
Data Mining: Introduction - Kinds of Data and Patterns - Major Issues in Data Mining - Data Objects
and Attribute Types - Statistical Descriptions of Data - Measuring Data Similarity and Dissimilarity.
Data Preprocessing: Overview - Data Cleaning - Data Integration - Data Reduction - Data Transformation
and Discretization.
DATA MINING
Data mining refers to extracting or mining knowledge from large amounts of data. The term is actually
a misnomer; a more appropriate name would have been "knowledge mining", which emphasizes mining
knowledge from large amounts of data.
It is the computational process of discovering patterns in large data sets involving methods at the
intersection of artificial intelligence, machine learning, statistics, and database systems.
The overall goal of the data mining process is to extract information from a data set and transform it
into an understandable structure for further use.
The key properties of data mining are
Automatic discovery of patterns
Prediction of likely outcomes
Creation of actionable information
Focus on large datasets and databases
Data mining derives its name from the similarities between searching for valuable business information
in a large database — for example, finding linked products in gigabytes of store scanner data — and
mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense
amount of material, or intelligently probing it to find exactly where the value resides. Given databases
of sufficient size and quality, data mining technology can generate new business opportunities by
providing these capabilities:
Automated prediction of trends and behaviors.
Data mining automates the process of finding predictive information in large databases. Questions that
traditionally required extensive hands-on analysis can now be answered directly from the data —
quickly. A typical example of a predictive problem is targeted marketing. Data mining uses data on
past promotional mailings to identify the targets most likely to maximize return on investment in future
mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and
identifying segments of a population likely to respond similarly to given events.
Automated discovery of previously unknown patterns.
Data mining tools sweep through databases and identify previously hidden patterns in one step. An
example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products
that are often purchased together. Other pattern discovery problems include detecting fraudulent credit
card transactions and identifying anomalous data that could represent data entry keying errors.
Tasks of Data Mining
A typical data mining system may have the following major components: a database, data warehouse, or
other information repository; a database or data warehouse server; a knowledge base; a data mining
engine; a pattern evaluation module; and a user interface.
MAJOR ISSUES IN DATA MINING
Interactive mining of knowledge at multiple levels of abstraction. - The data mining process needs to be
interactive because it allows users to focus the search for patterns, providing and refining data mining
requests based on returned results.
Incorporation of background knowledge. - Background knowledge can be used to guide the discovery
process and to express the discovered patterns. Background knowledge may be used to express the
discovered patterns not only in concise terms but at multiple levels of abstraction.
Data mining query languages and ad hoc data mining. - Data Mining Query language that allows the
user to describe ad hoc mining tasks, should be integrated with a data warehouse query language and
optimized for efficient and flexible data mining.
Presentation and visualization of data mining results. - Once the patterns are discovered, they need to be
expressed in high-level languages and visual representations. These representations should be easily
understandable by the users.
Handling noisy or incomplete data. - Data cleaning methods are required that can handle noise and
incomplete objects while mining the data regularities. If data cleaning methods are not available, the
accuracy of the discovered patterns will be poor.
Pattern evaluation. - This refers to the interestingness of the discovered patterns. The patterns discovered
should be interesting; patterns that merely represent common knowledge or lack novelty are not.
Efficiency and scalability of data mining algorithms. - In order to effectively extract the information
from huge amount of data in databases, data mining algorithm must be efficient and scalable.
DATA OBJECTS
Data objects are the essential part of a database. A data object represents an entity, and can be thought of
as a group of attributes of that entity. For example, a sales data object may represent customers, sales, or
purchases. When data objects are stored in a database, they are called data tuples.
ATTRIBUTE
The attribute is the property of the object. The attribute represents different features of the object.
Example of attribute:
In this example, Roll_No, Name, and Result are attributes of the object named as a student.
Roll_No Name Result
1 Ali Pass
2 Akbar Fail
3 Antony Pass
Types Of attributes
Binary
Nominal
Numeric
Interval-scaled
Ratio-scaled
Nominal data with example:
Nominal data take name-like (categorical) values rather than numeric values.
Example:
Attribute Value
Categorical data Lecturer, Assistant Professor, Professor
States New, Pending, Working, Complete, Finish
Colors Black, Brown, White, Red
Binary data with example:
Binary data have only two values/states.
Example:
Attribute Value
HIV detected Yes, No
Result Pass, Fail
The binary attribute is of two types:
Symmetric binary: both of the two states are equally valuable and carry the same weight (e.g., gender: male, female).
Asymmetric binary: the two states are not equally important (e.g., the positive and negative outcomes of an HIV test).
Ordinal data with example:
Ordinal data have values with a meaningful order or ranking among them.
Example:
Attribute Value
Grade A, B, C, D, F
BPS - Basic pay scale 16, 17, 18
Discrete data with example: Discrete data have a finite set of values. They can be in numerical form and
can also be in categorical form.
Example:
Attribute Value
Profession Teacher, Business Man, Peon, etc.
Postal Code 42200, 42300, etc.
Measuring the central tendency
Median: The median is the middle value of an ordered set of data; when there is an even number of
values, the median is taken as the average of the two middlemost values.
Mode: The mode for a set of data is the value that occurs most frequently in the set. Data sets with
one, two, or three modes are respectively called unimodal, bimodal, and trimodal.
Midrange: The average of the largest and smallest values in the set.
Dispersion of the data
Range: The range of the set is the difference between the largest (max()) and smallest (min())
values.
Quartiles: Quantiles are points taken at regular intervals of a data distribution, dividing it into
essentially equal-size consecutive sets; the quartiles are the three 4-quantiles (Q1, Q2, Q3) that split the
distribution into four equal parts.
Interquartile Range: The distance between the first and third quartiles is a simple measure of spread
that gives the range covered by the middle half of the data. This distance is called the interquartile
range (IQR)
Five-Number Summary: The five-number summary of a distribution consists of the median (Q2),
the quartiles Q1 and Q3, and the smallest and largest individual observations, written in the order of
Minimum, Q1, Median, Q3, Maximum.
Boxplots: Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the five-
number summary as follows:
Typically, the ends of the box are at the quartiles so that the box length is the interquartile
range.
The median is marked by a line within the box.
Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest
(Maximum) observations.
Variance: The variance of N observations, x1, x2, ..., xN, for a numeric attribute X is
σ² = (1/N) Σᵢ (xᵢ − x̄)², where x̄ is the mean value of the observations.
Standard Deviation: The standard deviation, σ, of the observations is the square root of the
variance, σ2.
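The following Python sketch (using NumPy on a small made-up list of values, added for illustration only) computes the central-tendency and dispersion measures described above:

import numpy as np

data = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])  # illustrative attribute values

median = np.median(data)                        # middle value (average of two middlemost here)
values, counts = np.unique(data, return_counts=True)
mode = values[counts.argmax()]                  # most frequent value (unimodal case)
midrange = (data.min() + data.max()) / 2        # average of smallest and largest values

data_range = data.max() - data.min()            # range = max() - min()
q1, q2, q3 = np.percentile(data, [25, 50, 75])  # quartiles
iqr = q3 - q1                                   # interquartile range
five_number = (data.min(), q1, q2, q3, data.max())  # Minimum, Q1, Median, Q3, Maximum

variance = data.var()                           # (1/N) * sum((x_i - mean)^2)
std_dev = data.std()                            # square root of the variance

print(median, mode, midrange, data_range, iqr, five_number, variance, std_dev)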
Graphical data presentation
Quantile plots: A quantile plot is a simple and effective way to have a first look at a univariate
data distribution
Quantile–Quantile plots: A quantile–quantile plot, or q-q plot, graphs the quantiles of one
univariate distribution against the corresponding quantiles of another.
Histograms: A chart of bars, where the height of each bar indicates the frequency of the values falling in that bucket.
Scatter plots: A scatter plot is one of the most effective graphical methods for determining if there
appears to be a relationship, pattern, or trend between two numeric attributes.
Data Mining is a process of discovering various models, summaries, and derived values from a
given collection of data.
The general experimental procedure adapted to data-mining problems involves the following steps:
1. State the problem and formulate the hypothesis
Most data-based modeling studies are performed in a particular application domain. Hence, domain-
specific knowledge and experience are usually necessary in order to come up with a meaningful problem
statement. Unfortunately, many application studies tend to focus on the data-mining technique at the
expense of a clear problem statement. In this step, a modeler usually specifies a set of variables for the
unknown dependency and, if possible, a general form of this dependency as an initial hypothesis. There
may be several hypotheses formulated for a single problem at this stage. The first step requires the
combined expertise of an application domain and a data-mining model. In practice, it usually means a
close interaction between the data-mining expert and the application expert. In successful data-mining
applications, this cooperation does not stop in the initial phase; it continues during the entire data-mining
process.
2. Collect the data
This step is concerned with how the data are generated and collected. In general, there are two distinct
possibilities. The first is when the data-generation process is under the control of an expert (modeler):
this approach is known as a designed experiment. The second possibility is when the expert cannot
influence the data- generation process: this is known as the observational approach. An observational
setting, namely, random data generation, is assumed in most data-mining applications. Typically, the
sampling distribution is completely unknown after data are collected, or it is partially and implicitly given
in the data-collection procedure. It is very important, however, to understand how data collection affects
its theoretical distribution, since such a priori knowledge can be very useful for modeling and, later, for
the final interpretation of results. Also, it is important to make sure that the data used for estimating a
model and the data used later for testing and applying a model come from the same, unknown, sampling
distribution. If this is not the case, the estimated model cannot be successfully used in a final application
of the results.
3. Preprocessing the data
In the observational setting, data are usually "collected" from the existing databases, data warehouses,
and data marts. Data preprocessing usually includes at least two common tasks:
i. Outlier detection (and removal) – Outliers are unusual data values that are not consistent with most
observations. Commonly, outliers result from measurement errors, coding and recording errors, and,
sometimes, are natural, abnormal values. Such non representative samples can seriously affect the model
produced later. There are two strategies for dealing with outliers:
a. Detect and eventually remove outliers as a part of the preprocessing phase, or
b. Develop robust modeling methods that are insensitive to outliers.
ii. Scaling, encoding, and selecting features – Data preprocessing includes several steps such as variable
scaling and different types of encoding. For example, one feature with the range [0, 1] and the other with
the range [−100, 1000] will not have the same weights in the applied technique; they will also influence
the final data-mining results differently. Therefore, it is recommended to scale them and bring both
features to the same weight for further analysis. Also, application-specific encoding methods usually
achieve dimensionality reduction by providing a smaller number of informative features for subsequent
data modeling.
4. Estimate the model
The selection and implementation of the appropriate data-mining technique is the main task in this phase.
This process is not straightforward; usually, in practice, the implementation is based on several models,
and selecting the best one is an additional task.
The dissimilarity between two objects i and j described by nominal attributes can be computed as
d(i, j) = (p − m) / p, where m is the number of matches (i.e., the number of attributes for which i and j are in the same
state), and p is the total number of attributes describing the objects.
Symmetric dissimilarity between binary attributes: d(i, j) = (r + s) / (q + r + s + t), where q is the number of
attributes that equal 1 for both objects, r the number that equal 1 for i but 0 for j, s the number that equal 0
for i but 1 for j, and t the number that equal 0 for both.
Distance measures for numeric attributes include:
o Manhattan distance: d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|
o Minkowski distance: d(i, j) = (|xi1 − xj1|^h + |xi2 − xj2|^h + ... + |xip − xjp|^h)^(1/h)
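A minimal Python sketch of these dissimilarity measures is given below; the helper names and example objects are invented purely for illustration:

def nominal_dissimilarity(i, j):
    """d(i, j) = (p - m) / p for objects described by nominal attributes."""
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)   # number of matching attributes
    return (p - m) / p

def minkowski(x, y, h):
    """Minkowski distance of order h; h=1 gives Manhattan, h=2 gives Euclidean."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

# Example objects described by three nominal attributes
print(nominal_dissimilarity(["red", "A", "yes"], ["red", "B", "yes"]))  # 1/3
# Example numeric objects
print(minkowski([1, 2], [4, 6], 1))  # Manhattan distance = 7
print(minkowski([1, 2], [4, 6], 2))  # Euclidean distance = 5.0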
Data Transformation:
In data transformation, the data are transformed or consolidated into forms appropriate for
mining.
Smoothing, which works to remove noise from the data. Such techniques include binning, regression, and clustering.
Data Reduction:
Data reduction techniques can be applied to obtain a reduced representation of the data set that is
much smaller in volume, yet closely maintains the integrity of the original data. That is, mining
on the reduced data set should be more efficient yet produce the same (or almost the same)
analytical results.
Strategies for data reduction include the following:
Data cube aggregation, where aggregation operations are applied to the data in the
construction of a data cube.
Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or
dimensions may be detected and removed.
Dimensionality reduction, where encoding mechanisms are used to reduce the dataset size.
Normalization, in which the attribute data are scaled so as to fall within a small specified range. In
z-score normalization, the values of an attribute A are normalized based on the mean Ā and standard
deviation σA of A: v' = (v − Ā) / σA.
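The following small Python sketch (with made-up income values, added as an illustration) shows min-max and z-score normalization as described above:

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: map values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score_normalize(values):
    """Z-score normalization: v' = (v - mean) / standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

incomes = [12000, 16000, 54000, 73600, 98000]   # hypothetical attribute values
print(min_max_normalize(incomes))
print(z_score_normalize(incomes))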
LECTURE NOTES
Association Rule Mining: Basic Concepts - Frequent Itemset Mining Methods: Apriori Algorithm -
A Pattern-Growth Approach for Mining Frequent Itemsets - Pattern Evaluation Methods - Mining
Multilevel, Multidimensional Space, Constraint-Based Frequent Pattern Mining
Apriori property
All nonempty subsets of a frequent itemset must also be frequent.
If an itemset I does not satisfy the minimum support threshold, min_sup, then I is not
frequent, i.e., support(I) < min_sup.
If an item A is added to the itemset I, then the resulting itemset (I ∪ A) cannot
occur more frequently than I.
Monotonic functions are functions that move in only one direction; this property is therefore called anti-monotonic:
If a set cannot pass a test, all of its supersets will fail the same test as well.
The property is monotonic in the context of failing a test.
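The sketch below is a simplified Python illustration of the level-wise Apriori search that uses this property to prune candidates; the transaction data and min_sup value are hypothetical:

from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise frequent itemset mining using the Apriori property."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= min_sup}
    all_frequent = set(frequent)
    k = 2
    while frequent:
        # Join step: candidate k-itemsets from frequent (k-1)-itemsets; prune any
        # candidate having an infrequent (k-1)-subset (the Apriori property).
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if support(c) >= min_sup}
        all_frequent |= frequent
        k += 1
    return all_frequent

data = [{"bread", "milk"}, {"bread", "diaper", "beer"}, {"milk", "diaper", "beer"},
        {"bread", "milk", "diaper"}, {"bread", "milk", "beer"}]
for itemset in sorted(apriori(data, min_sup=3), key=len):
    print(set(itemset))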
Mining multilevel association rules. Suppose we are given the task-relevant set of transactional data in
the table for sales in an AllElectronics store, showing the items purchased for each transaction. The
concept hierarchy for the items is shown in Figure 4. A concept hierarchy defines a sequence of
mappings from a set of low-level concepts to higher level, more general concepts. Data can be
generalized by replacing low-level concepts within the data by their higher-level concepts, or
ancestors, from a concept hierarchy.
Association rules generated from mining data at multiple levels of abstraction are called multiple-level or
multilevel association rules. Multilevel association rules can be mined efficiently using concept hierarchies
under a support-confidence framework. In general, a top-down strategy is employed, where counts are
accumulated for the calculation of frequent itemsets at each concept level, starting at the concept level 1
and working downward in the hierarchy toward the more specific concept levels, until no more frequent
itemsets can be found. For each level, any algorithm for discovering frequent itemsets may be used, such as
Apriori or its variations.
Using uniform minimum support for all levels (referred to as uniform support): The same minimum
support threshold is used when mining at each level of abstraction. For example, in Figure 5, a minimum
support threshold of 5% is used throughout (e.g., for mining from
“computer” down to “laptop computer”). Both “computer” and “laptop computer” are found to be
frequent, while “desktop computer” is not. When a uniform minimum support threshold is used, the search
procedure is simplified. The method is also simple in that users are required to specify only one minimum
support threshold. An Apriori-like optimization technique can be adopted, based on the knowledge that an
ancestor is a superset of its descendants: The search avoids examining itemsets containing any item whose
ancestors do not have minimum support.
Following the terminology used in multidimensional databases, each distinct predicate in a rule is
referred to as a dimension. Hence, the rule above is referred to as a single-dimensional or intradimensional
association rule because it contains a single distinct predicate (e.g., buys) with multiple occurrences
(i.e., the predicate occurs more than once within the rule). Such rules are commonly mined from
transactional data.
Considering each database attribute or warehouse dimension as a predicate, we can therefore mine association
rules containing multiple predicates, such as:
age(X, "20...29") ∧ occupation(X, "student") ⇒ buys(X, "laptop")
Association rules that involve two or more dimensions or predicates can be referred to as
multidimensional association rules. The rule above contains three predicates (age, occupation,
and buys), each of which occurs only once in the rule.
Hence, it has no repeated predicates. Multidimensional association rules with no repeated
predicates are called interdimensional association rules.
We can also mine multidimensional association rules with repeated predicates, which contain multiple
occurrences of some predicates. These rules are called hybrid-dimensional association
rules. An example of such a rule is the following, where the predicate buys is repeated:
age(X, "20...29") ∧ buys(X, "laptop") ⇒ buys(X, "HP printer")
Categorical attributes have a finite number of possible values, with no ordering among
the values (e.g., occupation, brand, color).
Categorical attributes are also called nominal attributes, because their values are "names of things".
Quantitative attributes are numeric and have an implicit ordering among values (e.g.,
age, income, price).
Techniques for mining multidimensional association rules can be categorized into two
basic approaches regarding the treatment of quantitative attributes.
Constraint Based Frequent Pattern Mining
A data mining process may uncover thousands of rules from a given set of data, most of which end up
being unrelated or uninteresting to the users. Often, users have a good sense of which "direction" of mining
may lead to interesting patterns and the "form" of the patterns or rules they would like to find. Thus, a
good heuristic is to have the users specify such intuition or expectations as constraints to confine the
search space. This strategy is known as constraint-based mining. The constraints can include the
following:
Knowledge type constraints: These specify the type of knowledge to be mined, such
as association or correlation.
Dimension/level constraints: These specify the desired dimensions (or attributes) of the data, or levels of the
concept hierarchies, to be used in mining.
Rule constraints: These specify the form of rules to be mined. Such constraints may be
expressed as metarules (rule templates), as the maximum or minimum number of predicates that
can occur in the rule antecedent or consequent, or as relationships among attributes, attribute
values, and/or aggregates.
LECTURE NOTES
Classification: Basic Concepts - Decision Tree Induction - Bayes Classification Methods - Rule-Based
Classification - Model Evaluation and Selection - Techniques to Improve Classification Accuracy -
Support Vector Machines - Classification Using Frequent Patterns - Other Classification Methods
CLASSIFICATION
The process of finding a model (or function) that describes and distinguishes data classes or concepts,
for the purpose of being able to use the model to predict the class of objects whose class label is
unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose
class label is known).
It predicts categorical class labels
It classifies data (constructs a model) based on the training set and the
values (class labels) in a classifying attribute and uses it in classifying new
data
Typical Applications
Credit approval
Target marketing
Medical diagnosis
Treatment effectiveness analysis
DECISION TREE INDUCTION
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node
denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class
label.
The topmost node in the tree is the root node.The following decision tree is for the concept buy_computer
that indicates whether a customer at a company is likely to buy a computer or not. Each internal node
represents a test on an attribute. Each leaf node represents a class.
Algorithm: Generate_decision_tree. Generate a decision tree from the training tuples of data partition D.
Input:
Data partition, D, which is a set of training tuples and their associated class labels;
attribute_list, the set of candidate attributes;
Attribute_selection_method, a procedure to determine the splitting criterion that "best" partitions the
data tuples into individual classes. This criterion consists of a splitting_attribute and, possibly, either a
split point or a splitting subset.
Output: A decision tree.
Method:
(1) create a node N;
(2) if tuples in D are all of the same class, C then
(3) return N as a leaf node labeled with the class C;
(4) if attribute_list is empty then
(5) return N as a leaf node labeled with the majority class in D; // majority voting
(6) apply Attribute_selection_method(D, attribute_list) to find the "best" splitting_criterion;
(7) label node N with splitting_criterion;
(8) if splitting_attribute is discrete-valued and multiway splits allowed then // not restricted to binary
trees
(9) attribute_list ← attribute_list − splitting_attribute; // remove splitting_attribute
(10) for each outcome j of splitting_criterion // partition the tuples and grow subtrees for each
partition
(11) let Dj be the set of data tuples in D satisfying outcome j; // a partition
(12) if Dj is empty then
(13) attach a leaf labeled with the majority class in D to node N;
(14) else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
endfor
(15) return N;
o A is discrete-valued: The outcomes of the test at node N correspond directly to the known values of A,
and a branch is created for each known value of A. Because all of the tuples in a given partition have the
same value for A, A need not be considered in any future partitioning of the tuples. Therefore, it is
removed from attribute_list (steps 8 to 9).
o A is continuous-valued: In this case, the test at node N has two possible outcomes, corresponding
to the conditions A ≤ split_point and A > split_point, respectively, where split_point is the split-
point returned by Attribute_selection_method as part of the splitting criterion. (In practice, the
split-point, a, is often taken as the midpoint of two known adjacent values of A and therefore
may not actually be a pre-existing value of A from the training data.) Two branches are grown
from N and labeled according to the above outcomes. The tuples are partitioned such that D1
holds the subset of class-labeled tuples in D for which A ≤ split_point, while D2 holds the rest.
o A is discrete-valued and a binary tree must be produced (as dictated by the attribute selection
measure or algorithm being used): The test at node N is of the form “A ∈ SA?”. SA is the splitting
subset for A, returned by Attribute selection method as part of the splitting criterion. It is a subset
of the known values of A.
o If a given tuple has value aj of A and if aj ∈ SA, then the test at node N is satisfied. Two
branches are grown from N. By convention, the left branch out of N is labeled yes so that D1
corresponds to the subset of class-labeled tuples in D that satisfy the test. The right branch out of
N is labeled no so that D2 corresponds to the subset of class-labeled tuples from D that do not
satisfy the test.
The algorithm uses the same process recursively to form a decision tree for the tuples at each resulting
partition, Dj, of D (step 14).
The recursive partitioning stops only when any one of the following terminating conditions is true:
All of the tuples in partition D (represented at node N) belong to the same class (steps 2 and 3), or
There are no remaining attributes on which the tuples may be further partitioned (step 4). In this case,
majority voting is employed (step 5). This involves converting node N into a leaf and labeling it with the
most common class in D. Alternatively, the class distribution of the node tuples may be stored.
There are no tuples for a given branch, that is, a partition Dj is empty (step 12). In this case, a leaf is
created with the majority class in D (step 13).
The resulting decision tree is returned (step 15)
A decision tree for the concept buys computer, indicating whether a customer at AllElectronics is likely to
purchase a computer.
Each internal (non-leaf) node represents a test on an attribute. Each leaf node represents a class (either
buys computer = yes or buys computer = no).
A typical decision tree is shown in Figure. It represents the concept buys computer, that is, it predicts
whether a customer at AllElectronics is likely to purchase a computer. Internal nodes are denoted by
rectangles, and leaf nodes are denoted by ovals.
Some decision tree algorithms produce only binary trees (where each internal node branches to exactly two
other nodes), whereas others can produce non-binary trees
A path is traced from the root to a leaf node, which holds the class prediction for that tuple. Decision trees
can easily be converted to classification rules.
Most algorithms for decision tree induction also follow such a top-down approach, which starts with a
training set of tuples and their associated class labels. The training set is recursively partitioned into
smaller subsets as the tree is being built. A basic decision tree algorithm is summarized in Figure 3.2.
The algorithm is called with three parameters: D, attribute list, and Attribute selection method. Refer to D
as a data partition. Initially, it is the complete set of training tuples and their associated class labels.
The parameter attribute list is a list of attributes describing the tuples. Attribute selection method specifies a
heuristic procedure for selecting the attribute that “best” discriminates the given tuples according to class.
This procedure employs an attribute selection measure, such as information gain or the gini index.
The tree starts as a single node, N, representing the training tuples in D (step 1).
If the tuples in D are all of the same class, then node N becomes a leaf and is labeled with that class
(steps 2 and 3).
Note that steps 4 and 5 are terminating conditions. All of the terminating conditions are explained at the
end of the algorithm.
Otherwise, the algorithm calls Attribute selection method to determine the splitting criterion. The
splitting criterion tells us which attribute to test at node N by determining the “best” way to separate or
partition the tuples in D into individual classes (step 6).
The splitting criterion also tells us which branches to grow from node N with respect to the outcomes of
the chosen test.
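As a practical illustration (not the textbook algorithm itself), the following Python sketch uses scikit-learn's DecisionTreeClassifier on a tiny, hand-encoded buys_computer-style training set; the encoding and the data values are assumptions made only for this example:

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy buys_computer-style training set: [age_group, income, student, credit_rating]
# encoded as integers (e.g., age: 0=youth, 1=middle_aged, 2=senior).
X = [[0, 2, 0, 0], [0, 2, 0, 1], [1, 2, 0, 0], [2, 1, 0, 0], [2, 0, 1, 0],
     [2, 0, 1, 1], [1, 0, 1, 1], [0, 1, 0, 0], [0, 0, 1, 0], [2, 1, 1, 0]]
y = ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes"]

# "entropy" selects splits by information gain; "gini" would use the Gini index instead.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["age", "income", "student", "credit_rating"]))
print(tree.predict([[0, 1, 1, 0]]))   # class prediction for an unseen tuple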
BAYESIAN CLASSIFICATION
Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the
probability that a given tuple belongs to a particular class.Bayesian classification is based on Bayes’
theorem, described below.
Bayes’ Theorem:
Let X be a data tuple. In Bayesian terms, X is considered "evidence", and it is described by
measurements made on a set of n attributes.
Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
For classification problems, we want to determine P(H|X), the probability that the hypothesis H
holds given the "evidence" or observed data tuple X.
P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
Bayes' theorem is useful in that it provides a way of calculating the posterior probability,
P(H|X), from P(H), P(X|H), and P(X):
P(H|X) = P(X|H) P(H) / P(X)
A multilayer feed-forward neural network iteratively learns a set of weights for prediction of the class
label of tuples. It consists of an input layer, one or more hidden layers, and an output layer.
Example:
The inputs to the network correspond to the attributes measured for each training tuple. The inputs are
fed simultaneously into the units making up the input layer. These inputs pass through the input layer
and are then weighted and fed simultaneously to a second layer known as a hidden layer.
The outputs of the hidden layer units can be input to another hidden layer, and so on. The number of
hidden layers is arbitrary.
The weighted outputs of the last hidden layer are input to units making up the output layer,
which emits the network’s prediction for given tuples
Studies comparing classification algorithms have found a simple Bayesian classifier known as the naïve
Bayesian classifier to be comparable in performance with decision tree and selected neural network
classifiers. Bayesian classifiers have also exhibited high accuracy and speed when applied to large
databases.
Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the
values of the other attributes. This assumption is called class conditional independence. It is made to
simplify the computations involved and, in this sense, is considered “naïve.” Bayesian belief networks are
graphical models, which unlike naïve Bayesian classifiers allow the representation of dependencies among
subsets of attributes. Bayesian belief networks can also be used for classification.
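A minimal Python sketch of naïve Bayesian classification, using scikit-learn's GaussianNB on made-up numeric tuples, is shown below; the attribute values and class labels are illustrative assumptions:

from sklearn.naive_bayes import GaussianNB

# Two numeric attributes per tuple (hypothetical measurements) and a class label.
X = [[25, 40000], [35, 60000], [45, 80000], [20, 20000], [50, 95000], [23, 30000]]
y = ["no", "yes", "yes", "no", "yes", "no"]

# GaussianNB applies Bayes' theorem with the class-conditional independence
# assumption, modeling each attribute with a per-class normal distribution.
model = GaussianNB()
model.fit(X, y)

print(model.predict([[30, 55000]]))        # most probable class for a new tuple
print(model.predict_proba([[30, 55000]]))  # posterior probabilities P(H|X)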
There are three possible scenarios. Let A be the splitting attribute. A has v distinct values,
{a1, a2, ..., av}, based on the training data.
1 A is discrete-valued:
In this case, the outcomes of the test at node N correspond directly to the known
values of A.
A branch is created for each known value, aj, of A and labeled with that
value. A need not be considered in any future partitioning of the tuples.
2 A is continuous-valued:
In this case, the test at node N has two possible outcomes, corresponding to the conditions
A ≤ split_point and A > split_point, respectively,
where split_point is the split-point returned by Attribute_selection_method as part of the
splitting criterion.
3 A is discrete-valued and a binary tree must be produced:
(a) If A is discrete-valued. (b) If A is continuous-valued. (c) If A is discrete-valued and a binary
tree must be produced.
K-NEAREST-NEIGHBOR CLASSIFIER:
Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given
test tuple with training tuples that are similar to it.
The training tuples are described by n attributes. Each tuple represents a point in an n-
dimensional space. In this way, all of the training tuples are stored in an n-dimensional pattern
space. When given an unknown tuple, a k-nearest-neighbor classifier searches the pattern
space for the k training tuples that are closest to the unknown tuple. These k training tuples
are the k nearest neighbors of the unknown tuple.
For k-nearest-neighbor classification, the unknown tuple is assigned the most common class
among its k nearest neighbors.
When k = 1, the unknown tuple is assigned the class of the training tuple that is closest to it in
pattern space.
Nearest-neighbor classifiers can also be used for prediction, that is, to return a real-
valued prediction for a given unknown tuple.
In this case, the classifier returns the average value of the real-valued labels
associated with the k nearest neighbors of the unknown tuple.
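The following Python sketch (scikit-learn, with invented 2-D points) illustrates k-nearest-neighbor classification and the averaging variant used for numeric prediction:

from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = [[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8], [0.9, 1.0], [5.3, 5.1]]
y = ["A", "A", "B", "B", "A", "B"]

# k = 3: the unknown tuple is assigned the most common class among its
# 3 nearest neighbors in the pattern space (Euclidean distance by default).
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[1.1, 1.0], [5.0, 5.0]]))

# For numeric prediction, the regressor returns the average of the
# real-valued labels of the k nearest neighbors.
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X, [10.0, 12.0, 50.0, 52.0, 11.0, 51.0])
print(reg.predict([[1.1, 1.0]]))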
RULE-BASED CLASSIFICATION
An IF-THEN rule is an expression of the form IF condition THEN conclusion. If the condition (the rule
antecedent) holds true for a given tuple, then the antecedent is satisfied and the rule covers the tuple.
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a decision tree.
Points to remember −
To extract a rule from a decision tree −
One rule is created for each path from the root to the leaf node.
To form a rule antecedent, each splitting criterion is logically ANDed.
The leaf node holds the class prediction, forming the rule consequent.
Rule Induction Using Sequential Covering Algorithm
The sequential covering algorithm can be used to extract IF-THEN rules from the training data. We do not
need to generate a decision tree first. In this algorithm, each rule for a given class covers many of the
tuples of that class.
Some sequential covering algorithms are AQ, CN2, and RIPPER. As per the general strategy, the
rules are learned one at a time. Each time a rule is learned, the tuples covered by that rule are removed,
and the process continues for the rest of the tuples. (This contrasts with decision trees, where the path to
each leaf corresponds to a rule.)
Genetic algorithms attempt to incorporate ideas of natural evolution. In general, geneticlearning starts as
follows.
An initial population is created consisting of randomly generated rules. Each rule can be
represented by a string of bits. As a simple example, suppose that samples in a given
training set are described by two Boolean attributes, A1 and A2, and that there are two
classes,C1 andC2.
The rule "IF A1 AND NOT A2 THEN C2" can be encoded as the bit string "100", where the
two leftmost bits represent attributes A1 and A2, respectively, and the rightmost bit represents
the class.
Similarly, the rule "IF NOT A1 AND NOT A2 THEN C1" can be encoded as "001".
If an attribute has k values, where k > 2, then k bits may be used to encode the attribute's values.
Classes can be encoded in a similar fashion.
Based on the notion of survival of the fittest, a new population is formed to consist of
the fittest rules in the current population, as well as offspring of these rules.
Typically, the fitness of a rule is assessed by its classification accuracy on a set of
training samples.
Offspring are created by applying genetic operators such as crossover and mutation. In crossover,
substrings from pairs of rules are swapped to form new pairs of rules. In mutation, randomly
selected bits in a rule’s string are inverted.
The process of generating new populations based on prior populations of rules continues until a
population, P, evolves where each rule in P satisfies a pre specified fitness threshold.
Genetic algorithms are easily parallelizable and have been used for classification as well as other
optimization problems. In data mining, they may be used to evaluate the fitness of other
algorithms.
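The toy Python sketch below illustrates the bit-string encoding, crossover, mutation, and survival-of-the-fittest loop described above; the fitness measure, sample tuples, and population size are simplifying assumptions made only for this example:

import random

random.seed(1)

# Each rule over Boolean attributes A1, A2 and a class is a 3-bit string:
# two leftmost bits for A1, A2 and the rightmost bit for the class, e.g.
# "IF A1 AND NOT A2 THEN C2" -> "100"; "IF NOT A1 AND NOT A2 THEN C1" -> "001".
def random_rule():
    return "".join(random.choice("01") for _ in range(3))

def fitness(rule, samples):
    """Toy fitness: fraction of training samples the rule classifies correctly."""
    hits = 0
    for a1, a2, cls in samples:
        if int(rule[0]) == a1 and int(rule[1]) == a2:   # antecedent matches
            hits += int(rule[2]) == cls                 # consequent correct
    return hits / len(samples)

def crossover(r1, r2):
    """Swap substrings of two parent rules to form two offspring."""
    point = random.randint(1, 2)
    return r1[:point] + r2[point:], r2[:point] + r1[point:]

def mutate(rule):
    """Invert one randomly selected bit of the rule string."""
    i = random.randrange(3)
    return rule[:i] + ("1" if rule[i] == "0" else "0") + rule[i + 1:]

samples = [(1, 0, 1), (1, 0, 1), (0, 0, 0), (0, 1, 0)]   # (A1, A2, class) training tuples
population = [random_rule() for _ in range(6)]
for _ in range(20):                                      # evolve a few generations
    population.sort(key=lambda r: fitness(r, samples), reverse=True)
    parents = population[:2]                             # survival of the fittest
    children = list(crossover(*parents)) + [mutate(p) for p in parents]
    population = parents + children
print(population[0], fitness(population[0], samples))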
Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership
that a certain value has in a given category. Each category then represents a fuzzy set.
Fuzzy logic systems typically provide graphical tools to assist users in converting attribute
values to fuzzy truth values.
Fuzzy set theory is also known as possibility theory.
PREDICTION: LINEAR REGRESSION
The response variable is what we want to predict.
Linear Regression: Straight-line regression models the response variable y as a linear function of a single
predictor variable x, y = w0 + w1 x. The coefficients can be estimated by the method of least squares:
w1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²  and  w0 = ȳ − w1 x̄,
where x̄ is the mean value of x1, x2, ..., x|D|, and ȳ is the mean value of y1, y2, ..., y|D|.
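A short Python sketch of these least-squares formulas, using NumPy and a made-up experience/salary data set, is given below:

import numpy as np

# Toy data: years of experience (x) vs. salary in thousands (y).
x = np.array([3.0, 8.0, 9.0, 13.0, 3.0, 6.0, 11.0, 21.0, 1.0, 16.0])
y = np.array([30.0, 57.0, 64.0, 72.0, 36.0, 43.0, 59.0, 90.0, 20.0, 83.0])

# Least-squares estimates of the coefficients in y = w0 + w1 * x:
#   w1 = sum((xi - x_mean)(yi - y_mean)) / sum((xi - x_mean)^2),  w0 = y_mean - w1 * x_mean
x_mean, y_mean = x.mean(), y.mean()
w1 = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
w0 = y_mean - w1 * x_mean

print(f"y = {w0:.2f} + {w1:.2f} * x")
print("predicted salary for 10 years of experience:", w0 + w1 * 10)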
LECTURE NOTES
Clustering: Cluster Analysis - Partitioning Methods - Hierarchical Methods - Density-Based
Methods - Grid-Based Methods - Model-Based Clustering Methods - Clustering High-Dimensional
Data - Constraint-Based Cluster Analysis - Introduction to Outlier Analysis - Data Mining
Applications.
CLUSTER ANALYSIS
Clustering: The process of grouping a set of physical or abstract objects into classes of similar objects is
called clustering.
Cluster: A cluster is a collection of data objects that are similar to one another within the same cluster and
are dissimilar to the objects in other clusters.
Clustering is also called data segmentation in some applications because clustering partitions large data sets into
groups according to their similarity. Clustering can also be used for outlier detection, where outliers (values that
are “far away” from any cluster) may be more interesting than common cases.
Clustering is a challenging field of research in which its potential applications pose their own special
requirements. The following are typical requirements of clustering in data mining:
Scalability: Many clustering algorithms work well on small data sets containing fewer than several
hundred data objects; however, a large database may contain millions of objects. Clustering on a sample
of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.
Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based
(numerical) data. However, applications may require clustering other types of data, such as binary,
categorical (nominal), and ordinal data, or mixtures of these data types.
Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on
Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find
spherical clusters with similar size and density. However, a cluster could be of any shape. It is important
to develop algorithms that can detect clusters of arbitrary shape.
Minimal requirements for domain knowledge to determine input parameters: Many clustering
algorithms require users to input certain parameters in cluster analysis (such as the number of desired
clusters). The clustering results can be quite sensitive to input parameters. Parameters are often difficult to
determine, especially for data sets containing high-dimensional objects. This not only burdens users, but
it also makes the quality of clustering difficult to control.
Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or
erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor
quality.
Incremental clustering and insensitivity to the order of input records: Some clustering algorithms cannot
incorporate newly inserted data (i.e., database updates) into existing clustering structures and, instead, must
determine a new clustering from scratch. Some clustering algorithms are sensitive to the order of input data.
That is, given a set of data objects, such an algorithm may return dramatically different clustering depending on
the order of presentation of the input objects. It is important to develop incremental clustering algorithms and
algorithms that are insensitive to the order of input.
Data matrix (or object-by-attribute structure): This represents n objects with p attributes, stored as an
n-by-p matrix (n objects × p attributes):
(5.1)
x11 ... x1f ... x1p
... ... ... ... ...
xi1 ... xif ... xip
... ... ... ... ...
xn1 ... xnf ... xnp
Dissimilarity matrix (or object-by-object structure): This stores a collection of proximities that are available
for all pairs of n objects. It is often represented by an n-by-n table:
(5.2)
0
d(2,1)  0
d(3,1)  d(3,2)  0
...     ...     ...
d(n,1)  d(n,2)  ...   0
Where d(i, j) is the measured difference or dissimilarity between objects i and j. In general, d(i, j) is a
nonnegative number that is close to 0 when objects i and j are highly similar or "near" each other, and becomes
larger the more they differ. Since d(i, j) = d(j, i) and d(i, i) = 0, we have the matrix shown in (5.2).
The rows and columns of the data matrix represent different entities, while those of the dissimilarity
matrix represent the same entity. Thus, the data matrix is often called a two-mode matrix, whereas the
dissimilarity matrix is called a one-mode matrix. Many clustering algorithms operate on a dissimilarity matrix. If
the data are presented in the form of a data matrix, it can first be transformed into a dissimilarity matrix before
applying such clustering algorithms.
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Methods
Partitioning Methods:
A partitioning method constructs k partitions of the data, where each partition represents a cluster
and k <= n. That is, it classifies the data into k groups, which together satisfy the following
requirements:
Each group must contain at least one object, and each object must belong to exactly one group.
A partitioning method creates an initial partitioning. It then uses an iterative relocation technique
that attempts to improve the partitioning by moving objects from one group to another.
The general criterion of a good partitioning is that objects in the same cluster are close or related to
each other, whereas objects of different clusters are far apart or very different.
Hierarchical Methods:
A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed.
The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one or until a termination condition holds.
The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in one cluster, or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs by not having to worry about a combinatorial number of different choices.
Grid-Based Methods:
A grid-based method quantizes the object space into a finite number of cells that form a grid structure. All of the clustering operations are performed on the grid structure, i.e., on the quantized
space. The main advantage of this approach is its fast processing time, which is typically
independent of the number of data objects and dependent only on the number of cells in
each dimension in the quantized space.
STING is a typical example of a grid-based method. WaveCluster applies wavelet transformation for clustering analysis and is both grid-based and density-based.
Model-Based Methods:
Model-based methods hypothesize a model for each of the clusters and find the best fit of
the data to the given model.
A model-based algorithm may locate clusters by constructing a density function that
reflects the spatial distribution of the data points.
It also leads to a way of automatically determining the number of clusters based on
standard statistics, taking "noise" or outliers into account and thus yielding robust
clustering methods.
Tasks in Data Mining:
Clustering High-Dimensional Data
Constraint-Based Clustering
Clustering High-Dimensional Data:
It is a particularly important task in cluster analysis because many applications require the analysis of
objects containing a large number of features or dimensions
For example, text documents may contain thousands of terms or keywords as features, and DNA micro
array data may provide information on the expression levels of thousands of genes under hundreds of
conditions.
Clustering high-dimensional data is challenging due to the curse of dimensionality. Many dimensions may not be relevant. As the number of dimensions increases, the data become increasingly sparse, so that the distance measurement between pairs of points becomes meaningless and the average density of points anywhere in the data is likely to be low. Therefore, a different clustering methodology needs to be developed for high-dimensional data.
CLIQUE and PROCLUS are two influential subspace clustering methods, which search for clusters in subspaces of the data, rather than over the entire data space.
Constraint-Based Clustering:
It is a clustering approach that performs clustering by incorporation of user-specified or application-
oriented constraints.
A constraint expresses a user’s expectation or describes properties of the desired clustering results, and
provides an effective means for communicating with the clustering process.
Various kinds of constraints can be specified, either by a user or as per application requirements.
Spatial clustering is employed in the presence of obstacles and under user-specified constraints. In addition, semi-supervised clustering employs pairwise constraints in order to improve the quality of the resulting clustering.
The k-Means Method:
The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low.
Cluster similarity is measured in regard to the mean value of the objects in a cluster, which can be
viewed as the cluster’s centroid or center of gravity.
The k-means algorithm proceeds as follows.
First, it randomly selects k of the objects, each of which initially represents a cluster mean or
center.
For each of the remaining objects, an object is assigned to the cluster to which it is the most
similar, based on the distance between the object and the cluster mean.
It then computes the new mean for each cluster.
This process iterates until the criterion function converges.
The k-means partitioning algorithm:
The k-means algorithm for partitioning, where each cluster’s center is represented by the mean
value of the objects in the cluster.
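The algorithm box from the source is not reproduced here; the following is a minimal Python/NumPy sketch of the k-means procedure described above (random initial means, assignment to the nearest mean, and recomputation of the means until convergence). The sample data and the value of k are illustrative only.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly select k of the objects as the initial cluster means.
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster with the nearest mean.
        dist = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 3: compute the new mean of each cluster.
        new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(k)])
        # Stop when the means no longer move (the criterion function has converged).
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels, means

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
labels, means = k_means(X, k=2)
print(labels, means)
```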
The k-means algorithm is sensitive to outliers because an object with an extremely large value may
substantially distort the distribution of data. This effect is particularly exacerbated due to the use of the
square-error function.
Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual objects to
represent the clusters, using one representative object per cluster. Each remaining object is clustered with the
representative object to which it is the most similar.
The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. That is, an absolute-error criterion is used, defined as

$$E = \sum_{j=1}^{k} \sum_{p \in C_j} |p - o_j|$$

where E is the sum of the absolute error for all objects in the data set, p is the point in space representing a given object in cluster Cj, and oj is the representative object of Cj.
The initial representative objects are chosen arbitrarily. The iterative process of replacing representative
objects by non representative objects continues as long as the quality of the resulting clustering is improved.
This quality is estimated using a cost function that measures the average dissimilaritybetween an object and
the representative object of its cluster.
To determine whether a nonrepresentative object, orandom, is a good replacement for a current representative object, oj, the following four cases are examined for each of the nonrepresentative objects.
Case 1:
p currently belongs to representative object oj. If oj is replaced by orandom as a representative object and p is closest to one of the other representative objects oi, i ≠ j, then p is reassigned to oi.
Case 2:
p currently belongs to representative object oj. If oj is replaced by orandom as a representative object and p is closest to orandom, then p is reassigned to orandom.
Case 3:
p currently belongs to representative object oi, i ≠ j. If oj is replaced by orandom as a representative object and p is still closest to oi, then the assignment does not change.
Case 4:
p currently belongs to representative object oi, i ≠ j. If oj is replaced by orandom as a representative object and p is closest to orandom, then p is reassigned to orandom.
The k-Medoids Algorithm:
The k-medoids algorithm for partitioning based on medoid or central objects.
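The algorithm figure itself is not reproduced here; the sketch below is a simple PAM-style k-medoids loop in Python, based on the absolute-error criterion and the swap test described above. The data values are illustrative, and a production implementation would cache distances rather than recompute them.

```python
import numpy as np

def total_cost(X, medoid_idx):
    # Absolute-error E: sum of dissimilarities between each object and its nearest medoid.
    dist = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return dist.min(axis=1).sum()

def k_medoids(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))  # arbitrary initial medoids
    improved = True
    while improved:
        improved = False
        for m in range(k):
            for o_random in range(len(X)):
                if o_random in medoids:
                    continue
                candidate = medoids.copy()
                candidate[m] = o_random                      # try replacing medoid m with o_random
                if total_cost(X, candidate) < total_cost(X, medoids):
                    medoids = candidate                      # keep the swap if it lowers the cost
                    improved = True
    # Final assignment of each object to its nearest medoid.
    dist = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return dist.argmin(axis=1), medoids

X = np.array([[1.0, 1.0], [1.2, 1.1], [8.0, 8.0], [8.2, 8.1], [50.0, 50.0]])  # last point is an outlier
print(k_medoids(X, k=2))
```

Because the representatives are actual objects (medoids), the extreme point at (50, 50) distorts the result far less than it would distort a mean-based centroid.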
The k-medoids method is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. However, its processing is more costly than the k-means method.
Hierarchical Clustering Methods:
A hierarchical clustering method works by grouping data objects into a tree of clusters.
The quality of a pure hierarchical clustering method suffers from its inability to perform adjustment once a merge or split decision has been executed. That is, if a particular merge or split decision later turns out to have been a poor choice, the method cannot backtrack and correct it.
Hierarchical clustering methods can be further classified as either agglomerative or divisive, depending
on whether the hierarchical decomposition is formed in a bottom-up or top-down fashion.
This bottom-up strategy starts by placing each object in its own cluster and then merges these atomic
clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain
termination conditions are satisfied.
Most hierarchical clustering methods belong to this category. They differ only in their definition of
intercluster similarity.
This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster.
It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until certain termination conditions are satisfied, such as a desired number of clusters being obtained or the diameter of each cluster falling within a certain threshold.
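As a brief illustration of the agglomerative (bottom-up) approach, the following sketch uses SciPy's hierarchical clustering routines on a small made-up data set; the single-link criterion and the distance threshold used to cut the tree are arbitrary choices for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Small illustrative data set.
X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 9.0]])

# Agglomerative clustering: each object starts in its own cluster and the
# closest pairs of clusters are merged step by step (single-link here).
Z = linkage(X, method="single", metric="euclidean")

# Cut the resulting tree so that merges are accepted only while the
# inter-cluster distance stays below the threshold, giving flat labels.
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)
```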
Constraint-Based Cluster Analysis:
We can specify different distance or similarity functions for specific attributes of the objects to be clustered, or different distance measures for specific pairs of objects. When clustering sportsmen, for example, we may use different weighting schemes for height, body weight, age, and skill level. Although this will likely change the mining results, it may not alter the clustering process per se. However, in some cases, such changes may make the evaluation of the distance function nontrivial, especially when it is tightly intertwined with the clustering process.
Outlier Analysis
Very often, there exist data objects that do not comply with the general behavior or model of the data. Such
data objects, which are grossly different from or inconsistent with the remaining set of data, are called
outliers. Outliers can be caused by measurement or execution error.
Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether. This,
however, could result in the loss of important hidden information because one person’s noise could be
another person’s signal. In other words, the outliers may be of particular interest, such as in the case of
fraud detection, where outliers may indicate fraudulent activity. Thus, outlier detection and analysis is an
interesting data mining task, referred to as outlier mining.
Outlier mining has wide applications. As mentioned previously, it can be used in fraud detection, for
example, by detecting unusual usage of credit cards or telecommunication services. In addition, it is useful
in customized marketing for identifying the spending behavior of customers with extremely low or
extremely high incomes, or in medical analysis for finding unusual responses to various medical
treatments.
Outlier mining can be described as follows: Given a set of n data points or objects and k, the expected
number of outliers, find the top k objects that are considerably dissimilar, exceptional, or inconsistent with
respect to the remaining data. The outlier mining problem can be viewed as two subproblems: (1) define
what data can be considered as inconsistent in a given data set, and (2) find an efficient method to mine the
outliers so defined.
Outlier detection methods can be categorized into four approaches: the statistical approach, the distance-based approach, the density-based local outlier approach, and the deviation-based approach, each of which is studied here.
Notice that while clustering algorithms discard outliers as noise, they can be modified to include outlier
detection as a by-product of their execution. In general, users must check that each outlier discovered by
these approaches is indeed a “real” outlier.
Handling Outliers
Applications that derive their data from measurements have an associated amount of noise, which can be
viewed as outliers.
Alternately, outliers can be viewed as legitimate records having abnormal behavior. In general, clustering
techniques do not distinguish between the two: neither noise nor abnormalities fit into clusters.
Correspondingly, a preferable way to deal with outliers when partitioning the data is to keep them in a separate set, so that they do not pollute the actual clusters. There are multiple ways in which descriptive learning can handle outliers.
Outlier Detection
Outliers are generally defined as observations that lie exceptionally far from the mainstream of the data. There is no strict mathematical definition of what constitutes an outlier; determining whether an observation is an outlier is ultimately a subjective exercise.
An outlier can be interpreted as a data value or observation that deviates greatly from the mean of a given sample or set of data. An outlier may occur by chance, but it may also indicate a measurement error, or the given set of data may follow a heavy-tailed distribution.
Therefore, outlier detection can be defined as the process of detecting and then excluding outliers from a given set of data. There is no single standardized outlier identification method, because the choice of method is largely dataset-dependent. Outlier detection, as a branch of data processing, has many applications in data stream analysis.
This section considers the problem of outlier detection over data streams and the specific techniques used to detect outliers in streaming data in data mining. It also touches on recent research on outlier detection methods and outlier analysis, on standard applications of outlier detection such as fraud detection, public health, and sports, and on different approaches such as proximity-based and angle-based approaches.
To identify outliers in a database, it is important to keep the context in mind and find the answer to the most basic and relevant question: "Why should I find outliers?" The context will explain the meaning of your findings.
Remember two important questions about your database during outlier identification:
(i) Which and how many features do I consider for outlier detection? (univariate / multivariate)
(ii) Can I assume a distribution of values for the features I have selected? (parametric / non-parametric)
1. Numeric Outlier
A numerical outlier is a simple, non-parametric outlier detection technique in a one-dimensional feature space. Outliers are detected using the IQR (interquartile range): the first and third quartiles (Q1, Q3) are calculated, and an outlier is a data point xi that falls outside the range [Q1 − k·IQR, Q3 + k·IQR].
Using the interquartile multiplier value k = 1.5, these limits are the typical upper and lower whiskers of a box plot.
This technique can be easily implemented on the KNIME Analytics Platform using the Numeric Outliers node.
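A minimal NumPy sketch of this IQR rule (with the conventional multiplier k = 1.5 and made-up sample values) is shown below; KNIME itself is not needed for the computation.

```python
import numpy as np

x = np.array([2.0, 2.1, 1.9, 2.3, 2.0, 2.2, 9.5])   # 9.5 is an injected outlier

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
k = 1.5                                   # interquartile multiplier (box-plot whiskers)
lower, upper = q1 - k * iqr, q3 + k * iqr

outliers = x[(x < lower) | (x > upper)]   # points outside [Q1 - k*IQR, Q3 + k*IQR]
print(lower, upper, outliers)
```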
2. Z-Score
The z-score technique assumes a Gaussian distribution of the data. Outliers are the data points that lie in the tails of the distribution and are therefore far from the mean.
After the selected feature of the dataset has been appropriately scaled, the z-score of a data point xi can be calculated as

$$z_i = \frac{x_i - \mu}{\sigma}$$

where μ and σ are the mean and standard deviation of that feature.
When calculating the z-score for each sample, a threshold must be specified. Some good rule-of-thumb thresholds are 2.5, 3, or 3.5 standard deviations.
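The following sketch applies the z-score rule with a threshold of 3 standard deviations to a one-dimensional sample; the data are randomly generated for illustration, with one extreme value injected.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(10.0, 0.5, size=50), [25.0]])  # 25.0 is an injected outlier

z = (x - x.mean()) / x.std()      # z-score of each sample
threshold = 3.0                   # common rule-of-thumb limit
print(x[np.abs(z) > threshold])   # samples flagged as outliers
```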
3. DBSCAN
This outlier detection technique is based on the DBSCAN clustering method. DBSCAN is a non-parametric, density-based outlier detection method. Here, all data points are classified as core points, border points, or noise points.
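A short sketch using scikit-learn's DBSCAN implementation is given below; points labeled -1 are the noise points, which can be treated as outliers. The eps and min_samples values are illustrative and would normally be tuned to the data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(30, 2)),
               rng.normal(5, 0.3, size=(30, 2)),
               [[10.0, 10.0]]])                    # isolated point far from both clusters

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
outliers = X[labels == -1]       # noise points: neither core points nor border points
print(outliers)
```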
4. Isolation Forest
This non-parametric method is suitable for large datasets in a one- or multi-dimensional feature space. The isolation number is central to this outlier detection technique: it is the number of splits required to isolate a data point.
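The sketch below uses scikit-learn's IsolationForest; points that need only a few random splits to isolate receive the label -1 (outliers). The contamination fraction is an assumed value for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               [[8.0, 8.0], [-9.0, 7.0]]])        # two injected anomalies

iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = iso.predict(X)          # +1 for inliers, -1 for points that are easy to isolate
print(X[labels == -1])
```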
There are many approaches to detecting anomalies. Outlier detection models can be classified into the following groups:
1. Extreme Value Analysis
Extreme value analysis is the most basic form of outlier detection and is suitable for one-dimensional data. In this approach, the largest or smallest values are considered outliers. The Z-test and the Student's t-test are classic examples.
These are good heuristics for the initial analysis of data, but they are not of much value in multivariate settings.
Extreme value analysis is often used as a final step in interpreting the outputs of other outlier detection methods.
2. Linear Models
In this approach, the data are projected onto a lower-dimensional subspace using linear correlations. The distance of each data point to a hyperplane that fits this subspace is then calculated and used to detect outliers. PCA (principal component analysis) is an example of a linear model for anomaly detection.
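As a sketch of this linear-model idea, the code below fits a PCA to data lying close to a line and uses the reconstruction error (the distance of each point from the fitted lower-dimensional subspace) as an outlier score; the percentile threshold is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Points that lie close to a 1-dimensional line in 2-D space, plus one off-line point.
t = rng.normal(0, 1, size=100)
X = np.column_stack([t, 2 * t + rng.normal(0, 0.05, size=100)])
X = np.vstack([X, [[0.0, 5.0]]])                 # far from the line

pca = PCA(n_components=1).fit(X)
X_proj = pca.inverse_transform(pca.transform(X)) # projection onto the fitted subspace
score = np.linalg.norm(X - X_proj, axis=1)       # distance to the subspace = outlier score

threshold = np.percentile(score, 99)
print(X[score > threshold])
```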
3. Probabilistic and Statistical Models
In this approach, probabilistic and statistical models assume specific distributions for the data. Expectation-maximization (EM) methods are used to estimate the parameters of the model. Finally, the probability of membership of each data point in the estimated distribution is calculated, and the points with the lowest probability of membership are marked as outliers.
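A small sketch of this idea using scikit-learn's GaussianMixture (which is fitted with EM) is given below; the number of components and the fraction of points flagged are assumed values for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(150, 2)),
               rng.normal(6, 1, size=(150, 2)),
               [[20.0, -20.0]]])                  # a point unlikely under either component

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_prob = gmm.score_samples(X)                   # log-likelihood of each point under the fit

# Flag the points with the lowest probability of membership as outliers.
threshold = np.percentile(log_prob, 1)
print(X[log_prob < threshold])
```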
4. Proximity-based Models
In this approach, outliers are modeled as points that are isolated from the other observations. Cluster analysis, density-based analysis, and nearest-neighbor analysis are the key approaches of this type.
5. Information-theoretical models
In this approach, outliers are detected as data points that increase the minimum code length needed to describe the data set.