The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon,
a data warehouse is a subject-oriented, integrated, time-variant, and non-volatile
collection of data. This data helps analysts make informed decisions in an organization.
An operational database undergoes frequent changes on a daily basis on account of the
transactions that take place. Suppose a business executive wants to analyze previous
feedback on some data, such as a product, a supplier, or any consumer data. The
executive will have no such data available to analyze, because the previous data has
been overwritten by subsequent transactions.
A data warehouse provides us generalized and consolidated data in a multidimensional
view. Along with this generalized and consolidated view of data, a data warehouse also
provides us Online Analytical Processing (OLAP) tools. These tools help us in the
interactive and effective analysis of data in a multidimensional space. This analysis
results in data generalization and data mining.
Data mining functions such as association, clustering, classification, and prediction can
be integrated with OLAP operations to enhance interactive mining of knowledge at
multiple levels of abstraction. That is why the data warehouse has now become an
important platform for data analysis and online analytical processing.
Understanding a Data Warehouse
A data warehouse is a database, which is kept separate from the organization's
operational database.
There is no frequent updating done in a data warehouse.
It possesses consolidated historical data, which helps the organization analyze
its business.
A data warehouse helps executives to organize, understand, and use their data to
make strategic decisions.
Data warehouse systems help in the integration of a diversity of application systems.
A data warehouse system helps in consolidated historical data analysis.
Data warehouse information is used in the following sectors −
Financial services
Banking services
Consumer goods
Retail sectors
Controlled manufacturing
Types of Data Warehouse
Information processing, analytical processing, and data mining are the three types of
data warehouse applications that are discussed below −
Information Processing − A data warehouse allows us to process the data stored in
it. The data can be processed by means of querying, basic statistical analysis, and
reporting using crosstabs, tables, charts, or graphs.
Analytical Processing − A data warehouse supports analytical processing of the
information stored in it. The data can be analyzed by means of basic OLAP
operations, including slice-and-dice, drill down, drill up, and pivoting.
Data Mining − Data mining supports knowledge discovery by finding hidden
patterns and associations, constructing analytical models, performing
classification and prediction. These mining results can be presented using the
visualization tools.
Sr.No.  Data Warehouse (OLAP)                                                                   Operational Database (OLTP)
2       OLAP systems are used by knowledge workers such as executives, managers, and analysts.  OLTP systems are used by clerks, DBAs, or database professionals.
12      The database size is from 100 GB to 100 TB.                                             The database size is from 100 MB to 100 GB.
To integrate heterogeneous databases, we have the following two approaches −
Query-driven Approach
Update-driven Approach
Query-Driven Approach
This is the traditional approach to integrate heterogeneous databases. This approach
was used to build wrappers and integrators on top of multiple heterogeneous databases.
These integrators are also known as mediators.
Process of Query-Driven Approach
When a query is issued on the client side, a metadata dictionary translates the query
into an appropriate form for the individual heterogeneous sites involved.
Now these queries are mapped and sent to the local query processor.
The results from heterogeneous sites are integrated into a global answer set.
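As an illustration, the mediator flow above can be sketched in a few lines of Python. The two in-memory "sites", their local attribute names, and the feedback query are all invented for the example; real mediators sit in front of live heterogeneous databases.

```python
# A minimal sketch of the query-driven (mediator) approach, using two
# hypothetical in-memory "sites" in place of real heterogeneous databases.
# All names here (sites, translation rules, the query itself) are illustrative.

# Each site stores customer feedback under its own local schema.
SITE_A = [{"cust": "C1", "score": 4}, {"cust": "C2", "score": 2}]
SITE_B = [{"customer_id": "C3", "rating": 5}]

# Wrapper/translation rules: map the global attribute names onto each
# site's local attribute names (the role of the metadata dictionary).
TRANSLATIONS = {
    "site_a": {"customer": "cust", "feedback": "score"},
    "site_b": {"customer": "customer_id", "feedback": "rating"},
}

def query_site(rows, mapping, min_feedback):
    """Local query processor: apply the translated filter at one site."""
    out = []
    for row in rows:
        if row[mapping["feedback"]] >= min_feedback:
            # Translate the local record back into the global schema.
            out.append({"customer": row[mapping["customer"]],
                        "feedback": row[mapping["feedback"]]})
    return out

def mediator(min_feedback):
    """Integrate per-site results into a single global answer set."""
    answers = []
    answers += query_site(SITE_A, TRANSLATIONS["site_a"], min_feedback)
    answers += query_site(SITE_B, TRANSLATIONS["site_b"], min_feedback)
    return answers

print(mediator(4))  # customers whose feedback score is at least 4
```

Note how every query pays the translation and integration cost at run time; this is exactly the inefficiency the next section's update-driven approach avoids.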
Disadvantages
Query-driven approach needs complex integration and filtering processes.
This approach is very inefficient.
It is very expensive for frequent queries.
This approach is also very expensive for queries that require aggregations.
Update-Driven Approach
This is an alternative to the traditional approach. Today's data warehouse systems follow
the update-driven approach rather than the traditional approach discussed earlier. In the
update-driven approach, the information from multiple heterogeneous sources is
integrated in advance and stored in a warehouse. This information is available for
direct querying and analysis.
Advantages
This approach has the following advantages −
This approach provides high performance.
The data is copied, processed, integrated, annotated, summarized, and
restructured in a semantic data store in advance.
Query processing does not require an interface to process data at local sources.
From the architecture point of view, there are three data warehouse models −
Virtual Warehouse
Data mart
Enterprise Warehouse
Virtual Warehouse
The view over an operational database is known as a virtual warehouse. It is easy
to build a virtual warehouse. However, building a virtual warehouse requires excess
capacity on the operational database servers.
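As a hedged sketch of this idea, the following Python uses an in-memory SQLite database as the operational store and defines the "warehouse" purely as a database view; the sales table and its figures are made up for illustration.

```python
# A minimal sketch of a virtual warehouse: a view defined directly over
# an operational database (here an in-memory SQLite database).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (item TEXT, qty INTEGER, region TEXT)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [("soap", 5, "west"), ("soap", 2, "west"), ("oil", 4, "east")])

# The "warehouse" is only a view; no data is copied, so analytical queries
# consume capacity on the operational server itself.
con.execute("""CREATE VIEW sales_summary AS
               SELECT item, SUM(qty) AS total_qty FROM sales GROUP BY item""")

print(con.execute("SELECT * FROM sales_summary ORDER BY item").fetchall())
# → [('oil', 4), ('soap', 7)]
```

Because the view is evaluated against the live tables on every query, this is cheap to build but shares the excess-capacity caveat noted above.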
Data Mart
Data mart contains a subset of organization-wide data. This subset of data is valuable to
specific groups of an organization.
In other words, we can claim that data marts contain data specific to a particular group.
For example, the marketing data mart may contain data related to items, customers, and
sales. Data marts are confined to subjects.
Points to remember about data marts −
Windows-based or Unix/Linux-based servers are used to implement data marts.
They are implemented on low-cost servers.
The implementation cycle of a data mart is measured in short periods of time, i.e., in
weeks rather than months or years.
The life cycle of a data mart may be complex in the long run, if its planning and design
are not organization-wide.
Data marts are small in size.
Data marts are customized by department.
The source of a data mart is a departmentally structured data warehouse.
Data marts are flexible.
Enterprise Warehouse
An enterprise warehouse collects all the information and the subjects spanning an
entire organization.
It provides us enterprise-wide data integration.
The data is integrated from operational systems and external information
providers.
This information can vary from a few gigabytes to hundreds of gigabytes, terabytes
or beyond.
Load Manager
This component performs the operations required for the extract and load process.
The size and complexity of the load manager varies between specific solutions from one
data warehouse to another.
Load Manager Architecture
The load manager performs the following functions −
Extract the data from the source systems.
Fast-load the extracted data into a temporary data store.
Perform simple transformations into a structure similar to the one in the data
warehouse.
Extract Data from Source
The data is extracted from the operational databases or the external information
providers. Gateways are the application programs that are used to extract data. A
gateway is supported by the underlying DBMS and allows a client program to generate
SQL to be executed at a server. Open Database Connectivity (ODBC) and Java Database
Connectivity (JDBC) are examples of gateways.
Fast Load
In order to minimize the total load window, the data needs to be loaded into the
warehouse in the fastest possible time.
The transformations affect the speed of data processing.
It is more effective to load the data into a relational database prior to applying
transformations and checks.
Gateway technology proves to be unsuitable, since gateways tend not to be performant
when large data volumes are involved.
Simple Transformations
While loading, it may be required to perform simple transformations. After this has been
completed, we are in a position to do the complex checks. Suppose we are loading EPOS
sales transactions; we need to perform the following checks:
Strip out all the columns that are not required within the warehouse.
Convert all the values to required data types.
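The two checks above can be sketched as follows; the EPOS record layout (column names, date format, which columns the warehouse keeps) is assumed purely for illustration.

```python
# A small sketch of the "simple transformations" step at load time,
# assuming a hypothetical EPOS sales record layout.
from datetime import datetime

KEEP = ("txn_id", "item", "qty", "amount", "sold_at")  # columns the warehouse needs

def transform(raw_row):
    # 1. Strip out columns that are not required within the warehouse.
    row = {k: v for k, v in raw_row.items() if k in KEEP}
    # 2. Convert values to the required data types.
    row["qty"] = int(row["qty"])
    row["amount"] = float(row["amount"])
    row["sold_at"] = datetime.strptime(row["sold_at"], "%d-%b-%y")
    return row

raw = {"txn_id": "T42", "item": "soap", "qty": "3", "amount": "5.97",
       "sold_at": "03-Aug-13", "till_operator": "E77"}  # operator not needed
print(transform(raw))
```

The complex consistency checks described later would run only after every row has passed through this cheap per-row step.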
Warehouse Manager
A warehouse manager is responsible for the warehouse management process. It consists
of third-party system software, C programs, and shell scripts.
The size and complexity of warehouse managers varies between specific solutions.
Warehouse Manager Architecture
A warehouse manager includes the following −
Detailed Information
Detailed information is not kept online; rather, it is aggregated to the next level of detail
and then archived to tape. The detailed information part of the data warehouse keeps
the detailed information in the starflake schema. Detailed information is loaded into the
data warehouse to supplement the aggregated data.
The following diagram shows a pictorial impression of where detailed information is
stored and how it is used.
Note − If detailed information is held offline to minimize disk storage, we should make
sure that the data has been extracted, cleaned up, and transformed into starflake schema
before it is archived.
Summary Information
Summary information is a part of the data warehouse that stores predefined
aggregations. These aggregations are generated by the warehouse manager. Summary
information must be treated as transient. It changes on an ongoing basis in order to
respond to changing query profiles.
The points to note about summary information are as follows −
Summary information speeds up the performance of common queries.
It increases the operational cost.
It needs to be updated whenever new data is loaded into the data warehouse.
It may not have been backed up, since it can be generated fresh from the detailed
information.
Metadata
Metadata is simply defined as data about data. The data that is used to represent other
data is known as metadata. For example, the index of a book serves as metadata for
the contents of the book. In other words, we can say that metadata is the summarized
data that leads us to the detailed data.
In terms of data warehouse, we can define metadata as follows −
Metadata is a road-map to data warehouse.
Metadata in data warehouse defines the warehouse objects.
Metadata acts as a directory. This directory helps the decision support system to
locate the contents of a data warehouse.
Metadata Repository
Metadata repository is an integral part of a data warehouse system. It contains the
following metadata −
Business metadata − It contains the data ownership information, business
definition, and changing policies.
Operational metadata − It includes currency of data and data lineage. Currency
of data refers to the data being active, archived, or purged. Lineage of data means
history of data migrated and transformation applied on it.
Data for mapping from operational environment to data warehouse − This
metadata includes the source databases and their contents, data extraction, data
partitioning, cleaning, transformation rules, data refresh, and purging rules.
The algorithms for summarization − It includes dimension algorithms, data on
granularity, aggregation, summarizing, etc.
2. Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or
functions are applied on the extracted data to convert it into a single standard format. It
may involve following processes/tasks:
Filtering – loading only certain attributes into the data warehouse.
Cleaning – filling up the NULL values with some default values, mapping U.S.A,
United States, and America into USA, etc.
Joining – joining multiple attributes into one.
Splitting – splitting a single attribute into multiple attributes.
Sorting – sorting tuples on the basis of some attribute (generally a key attribute).
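A minimal sketch of these transformation tasks, assuming made-up customer records and an illustrative country-name mapping:

```python
# A hedged sketch of the transformation tasks listed above; the input
# records and the rule choices are illustrative, not a fixed standard.
COUNTRY_MAP = {"U.S.A": "USA", "United States": "USA", "America": "USA"}

records = [
    {"id": 2, "name": "Lee, Ana", "country": "America", "fax": None},
    {"id": 1, "name": "Roy, Ben", "country": "U.S.A", "fax": None},
]

transformed = []
for rec in records:
    # Filtering: load only the attributes the warehouse needs (drop "fax").
    row = {"id": rec["id"], "country": rec["country"]}
    # Cleaning: map variant country spellings onto one standard value.
    row["country"] = COUNTRY_MAP.get(row["country"], row["country"])
    # Splitting: split the single "name" attribute into two attributes.
    last, first = [p.strip() for p in rec["name"].split(",")]
    row["last_name"], row["first_name"] = last, first
    transformed.append(row)

# Sorting: sort tuples on the key attribute.
transformed.sort(key=lambda r: r["id"])
print(transformed)
```

Joining would work the other way around (concatenating several attributes into one); it is omitted here only to keep the sketch short.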
3. Loading:
The third and final step of the ETL process is loading. In this step, the transformed data
is finally loaded into the data warehouse. Sometimes the data is loaded into the data
warehouse very frequently, and sometimes after longer but regular intervals. The rate
and period of loading solely depend on the requirements and vary from system to
system.
The ETL process can also use the pipelining concept, i.e., as soon as some data is
extracted, it can be transformed, and during that period some new data can be extracted.
And while the transformed data is being loaded into the data warehouse, the already
extracted data can be transformed. The block diagram of the pipelining of the ETL
process is shown below:
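The pipelining idea can be sketched with threads and queues; the source rows and the transformation rule below are stand-ins, not part of any particular ETL tool.

```python
# A minimal sketch of pipelined ETL: extraction, transformation, and
# loading run concurrently, connected by queues. The data is made up.
import queue, threading

extracted, transformed = queue.Queue(), queue.Queue()
warehouse = []  # stands in for the warehouse load target
DONE = object()  # sentinel marking the end of the stream

def extract():
    for value in [1, 2, 3, 4]:           # pretend rows from a source system
        extracted.put(value)
    extracted.put(DONE)

def transform():
    while (item := extracted.get()) is not DONE:
        transformed.put(item * 10)       # a stand-in transformation rule
    transformed.put(DONE)

def load():
    while (item := transformed.get()) is not DONE:
        warehouse.append(item)

threads = [threading.Thread(target=f) for f in (extract, transform, load)]
for t in threads: t.start()
for t in threads: t.join()
print(warehouse)  # → [10, 20, 30, 40]
```

While one row sits in the transform stage, the extractor is already fetching the next, which is exactly the overlap the text describes.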
ETL Tools − The most commonly used ETL tools are Sybase, Oracle Warehouse Builder,
CloverETL, and MarkLogic.
Data Cube
A data cube helps us represent data in multiple dimensions. It is defined by dimensions
and facts. The dimensions are the entities with respect to which an enterprise preserves
the records.
Illustration of Data Cube
Suppose a company wants to keep track of sales records with the help of a sales data
warehouse with respect to time, item, branch, and location. These dimensions allow the
company to keep track of monthly sales and the branches at which the items were sold.
There is a table associated with each dimension, known as a dimension table. For
example, the "item" dimension table may have attributes such as item_name, item_type,
and item_brand.
The following table represents the 2-D view of Sales Data for a company with respect to
time, item, and location dimensions.
But here in this 2-D table, we have records with respect to time and item only. The sales
for New Delhi are shown with respect to time, and item dimensions according to type of
items sold. If we want to view the sales data with one more dimension, say, the location
dimension, then the 3-D view would be useful. The 3-D view of the sales data with respect
to time, item, and location is shown in the table below −
The above 3-D table can be represented as 3-D data cube as shown in the following
figure −
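As a rough sketch, such a cube can be held as a mapping from (time, item, location) coordinates to a measure; the sales figures below are invented.

```python
# A sketch of the sales cube above as a mapping keyed by the three
# dimensions (time, item, location); the figures are invented.
from collections import defaultdict

cube = defaultdict(int)
sales = [  # (time, item, location, units_sold)
    ("Q1", "phone", "New Delhi", 25),
    ("Q1", "phone", "Mumbai", 18),
    ("Q2", "laptop", "New Delhi", 7),
]
for time, item, location, units in sales:
    cube[(time, item, location)] += units

# A cell of the cube is addressed by one member of each dimension.
print(cube[("Q1", "phone", "New Delhi")])  # → 25
```

Fixing the location coordinate and varying only time and item reproduces the 2-D view discussed above; letting all three vary gives the 3-D cube.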
A data warehouse is never static; it evolves as the business expands. As the business
evolves, its requirements keep changing and therefore a data warehouse must be
designed to accommodate these changes. Hence a data warehouse system needs to be
flexible.
Ideally there should be a delivery process to deliver a data warehouse. However, data
warehouse projects normally suffer from various issues that make it difficult to complete
tasks and deliverables in the strict and ordered fashion demanded by the waterfall
method. Most of the time, the requirements are not understood completely. The
architectures, designs, and build components can be completed only after gathering and
studying all the requirements.
Delivery Method
The delivery method is a variant of the joint application development approach adopted
for the delivery of a data warehouse. We have staged the data warehouse delivery
process to minimize risks. The approach that we will discuss here does not reduce the
overall delivery time-scales but ensures the business benefits are delivered incrementally
through the development process.
Note − The delivery process is broken into phases to reduce the project and delivery
risk.
The following diagram explains the stages in the delivery process −
IT Strategy
Data warehouses are strategic investments that require a business process to generate
benefits. An IT strategy is required to procure and retain funding for the project.
Business Case
The objective of the business case is to estimate the business benefits that should be
derived from using a data warehouse. These benefits may not be quantifiable, but the
projected benefits need to be clearly stated. If a data warehouse does not have a clear
business case, then the business tends to suffer from credibility problems at some stage
during the delivery process. Therefore, in data warehouse projects, we need to
understand the business case for investment.
Education and Prototyping
Organizations experiment with the concept of data analysis and educate themselves on
the value of having a data warehouse before settling on a solution. This is addressed by
prototyping, which helps in understanding the feasibility and benefits of a data
warehouse. A prototyping activity on a small scale can promote the educational process
as long as −
The prototype addresses a defined technical objective.
The prototype can be thrown away after the feasibility concept has been shown.
The activity addresses a small subset of eventual data content of the data
warehouse.
The activity timescale is non-critical.
The following points are to be kept in mind to produce an early release and deliver
business benefits.
Identify the architecture that is capable of evolving.
Focus on business requirements and technical blueprint phases.
Limit the scope of the first build phase to the minimum that delivers business
benefits.
Understand the short-term and medium-term requirements of the data
warehouse.
Business Requirements
To provide quality deliverables, we should make sure the overall requirements are
understood. If we understand the business requirements for both short-term and
medium-term, then we can design a solution to fulfil short-term requirements. The short-
term solution can then be grown to a full solution.
The following aspects are determined in this stage −
The business rule to be applied on data.
The logical model for information within the data warehouse.
The query profiles for the immediate requirement.
The source systems that provide this data.
Technical Blueprint
This phase needs to deliver an overall architecture satisfying the long-term
requirements. It also delivers the components that must be implemented in the short
term to derive any business benefit. The blueprint needs to identify the following.
Extending Scope
In this phase, the data warehouse is extended to address a new set of business
requirements. The scope can be extended in two ways −
By loading additional data into the data warehouse.
By introducing new data marts using the existing information.
Note − This phase should be performed separately, since it involves substantial efforts
and complexity.
Requirements Evolution
From the perspective of delivery process, the requirements are always changeable. They
are not static. The delivery process must support this and allow these changes to be
reflected within the system.
This issue is addressed by designing the data warehouse around the use of data within
business processes, as opposed to the data requirements of existing queries.
The architecture is designed to change and grow to match the business needs. The
process operates as a pseudo-application development process, in which the new
requirements are continually fed into the development activities and partial
deliverables are produced. These partial deliverables are fed back to the users and then
reworked, ensuring that the overall system is continually updated to meet the business
needs.
The extracted data needs to be cleaned and checked for consistency −
within itself.
with other data within the same data source.
with the data in other source systems.
with the existing data present in the warehouse.
Transforming involves converting the source data into a structure. Structuring the data
increases the query performance and decreases the operational cost. The data contained
in a data warehouse must be transformed to support performance requirements and
control the ongoing operational costs.
Partition the Data
Partitioning the data optimizes the hardware performance and simplifies the
management of the data warehouse. Here we partition each fact table into multiple
separate partitions.
Aggregation
Aggregation is required to speed up common queries. Aggregation relies on the fact
that most common queries will analyze a subset or an aggregation of the detailed data.
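The idea can be sketched as follows: an invented detail table is summarized once, and a common query then reads one summary cell instead of scanning the detail rows.

```python
# A small sketch of why stored aggregations speed up common queries:
# the summary is computed once (by the warehouse manager), then reused.
from collections import defaultdict

detail = [("2013-08-03", "soap", 5), ("2013-08-03", "soap", 2),
          ("2013-09-03", "soap", 4)]  # invented (date, item, qty) facts

# Precomputed summary: total quantity per item, generated at load time.
summary = defaultdict(int)
for _date, item, qty in detail:
    summary[item] += qty

# A common query now reads one summary cell instead of scanning detail.
print(summary["soap"])  # → 11
```

The trade-off, as noted under Summary Information above, is that the summary must be refreshed whenever new detail data is loaded.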
Backup and Archive the Data
In order to recover the data in the event of data loss, software failure, or hardware
failure, it is necessary to keep regular backups. Archiving involves removing the old data
from the system in a format that allows it to be quickly restored whenever required.
For example, in a retail sales analysis data warehouse, it may be required to keep data
for 3 years, with the latest 6 months of data being kept online. In such a scenario, there
is often a requirement to be able to do month-on-month comparisons for this year and
last year. In this case, we require some data to be restored from the archive.
Query Management Process
This process performs the following functions −
manages the queries.
helps speed up the execution time of queries.
directs the queries to their most effective data sources.
ensures that all the system sources are used in the most effective way.
monitors actual query profiles.
The information generated in this process is used by the warehouse management
process to determine which aggregations to generate. This process does not generally
operate during the regular load of information into the data warehouse.
OLAP SERVER
An Online Analytical Processing (OLAP) server is based on the multidimensional data
model. It allows managers and analysts to get an insight into the information through
fast, consistent, and interactive access to information. This chapter covers the types of
OLAP, the operations on OLAP, and the differences between OLAP, statistical databases,
and OLTP.
Types of OLAP Servers
We have four types of OLAP servers −
Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional views
of data. With multidimensional data stores, the storage utilization may be low if the data
set is sparse. Therefore, many MOLAP servers use two levels of data storage
representation to handle dense and sparse data sets.
Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability
of ROLAP and the faster computation of MOLAP. HOLAP servers allow storing large data
volumes of detailed information. The aggregations are stored separately in a MOLAP
store.
Specialized SQL Servers
Specialized SQL servers provide advanced query language and query processing support
for SQL queries over star and snowflake schemas in a read-only environment.
OLAP Operations
Since OLAP servers are based on multidimensional view of data, we will discuss OLAP
operations in multidimensional data.
Here is the list of OLAP operations −
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −
By climbing up a concept hierarchy for a dimension
By dimension reduction
Slice
The slice operation selects one particular dimension from a given cube and provides a
new sub-cube. Consider the following diagram that shows how slice works.
Here slice is performed for the dimension "time" using the criterion time = "Q1".
It forms a new sub-cube by selecting a single value along that dimension.
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Consider the following diagram that shows the dice operation.
The dice operation on the cube based on the following selection criteria involves three
dimensions.
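A hedged sketch of these operations on a small invented cube, held as a dict keyed by (time, item, location); dimension names and figures are illustrative.

```python
# Roll-up, slice, and dice on a toy cube keyed by (time, item, location).
from collections import defaultdict

cube = {
    ("Q1", "phone", "Delhi"): 25, ("Q1", "phone", "Mumbai"): 18,
    ("Q1", "laptop", "Delhi"): 9, ("Q2", "phone", "Delhi"): 11,
}

def roll_up(cube):
    """Aggregate away the location dimension (dimension reduction)."""
    out = defaultdict(int)
    for (time, item, _loc), units in cube.items():
        out[(time, item)] += units
    return dict(out)

def slice_(cube, time):
    """Select one value of one dimension, e.g. time = "Q1"."""
    return {k: v for k, v in cube.items() if k[0] == time}

def dice(cube, times, items):
    """Select on two (or more) dimensions at once."""
    return {k: v for k, v in cube.items() if k[0] in times and k[1] in items}

print(roll_up(cube))             # units per (time, item), over all locations
print(slice_(cube, "Q1"))        # the Q1 sub-cube
print(dice(cube, {"Q1"}, {"phone"}))
```

Drill-down would be the inverse of roll_up, which requires keeping the detailed cells around rather than only the aggregate.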
OLAP vs OLTP
Sr.No.  Data Warehouse (OLAP)                                                                   Operational Database (OLTP)
2       OLAP systems are used by knowledge workers such as executives, managers, and analysts.  OLTP systems are used by clerks, DBAs, or database professionals.

ROLAP architecture includes the following components −
Database server
ROLAP server
Front-end tool

MOLAP architecture includes the following components −
Database server
MOLAP server
Front-end tool

Sr.No.  MOLAP                                           ROLAP
4       Maintains a separate database for data cubes.   It may not require space other than that available in the data warehouse.
A schema is a logical description of the entire database. It includes the name and
description of records of all record types, including all associated data items and
aggregates. Much like a database, a data warehouse also requires maintaining a schema.
A database uses the relational model, while a data warehouse uses the Star, Snowflake,
and Fact Constellation schemas. In this chapter, we will discuss the schemas used in a
data warehouse.
Star Schema
Each dimension in a star schema is represented with only one-dimension table.
This dimension table contains the set of attributes.
The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location.
There is a fact table at the center. It contains the keys to each of the four dimensions.
The fact table also contains the attributes, namely dollars sold and units sold.
Note − Each dimension has only one dimension table and each table holds a set of
attributes. For example, the location dimension table contains the attribute set
{location_key, street, city, province_or_state, country}. This constraint may cause data
redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian
province of British Columbia. The entries for such cities may cause data redundancy
along the attributes province_or_state and country.
Snowflake Schema
Some dimension tables in the Snowflake schema are normalized.
The normalization splits up the data into additional tables.
Unlike the Star schema, the dimension tables in a Snowflake schema are normalized.
For example, the item dimension table in the Star schema is normalized and split into
two dimension tables, namely the item and supplier tables.
Now the item dimension table contains the attributes item_key, item_name, type,
brand, and supplier_key.
The supplier key is linked to the supplier dimension table. The supplier dimension
table contains the attributes supplier_key and supplier_type.
Note − Due to normalization in the Snowflake schema, the redundancy is reduced and
therefore it becomes easy to maintain and saves storage space.
Fact Constellation Schema
A fact constellation has multiple fact tables. It is also known as galaxy schema.
The following diagram shows two fact tables, namely sales and shipping.
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier (supplier key,
supplier type))
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city (city key, city, province or state,
country))
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state,country)
define cube shipping [time, item, shipper, from location, to location]:
Points to Note
The detailed information remains available online.
The number of physical tables is kept relatively small, which reduces the operating
cost.
This technique is suitable where a mix of data dipping into recent history and data
mining through the entire history is required.
This technique is not useful where the partitioning profile changes on a regular
basis, because repartitioning will increase the operation cost of the data warehouse.
Partition on a Different Dimension
The fact table can also be partitioned on the basis of dimensions other than time, such
as product group, region, supplier, or any other dimension. Let's have an example.
Suppose a market function has been structured into distinct regional departments, for
example on a state-by-state basis. If each region wants to query information captured
within its region, it would prove more effective to partition the fact table into regional
partitions. This will cause the queries to speed up because they do not require scanning
information that is not relevant.
Points to Note
The query does not have to scan irrelevant data, which speeds up the query
process.
This technique is not appropriate where the dimensions are likely to change in the
future. So, it is worth determining that the dimension does not change in the future.
If the dimension changes, then the entire fact table would have to be repartitioned.
Note − We recommend performing the partition only on the basis of the time dimension,
unless you are certain that the suggested dimension grouping will not change within the
life of the data warehouse.
Partition by Size of Table
When there is no clear basis for partitioning the fact table on any dimension, then we
should partition the fact table on the basis of its size. We can set a predetermined size
as a critical point. When the table exceeds the predetermined size, a new table partition
is created.
Points to Note
This partitioning is complex to manage.
It requires metadata to identify what data is stored in each partition.
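A minimal sketch of this scheme follows; the row source and the size threshold are illustrative, and the metadata here is just a row-to-partition map.

```python
# Size-based partitioning: a new partition is opened whenever the
# current one reaches a predetermined critical size. Values are made up.
MAX_ROWS = 3  # stands in for the predetermined size threshold

partitions = [[]]           # list of table partitions
metadata = {}               # records which partition each row lives in

for row_id in range(8):     # pretend fact rows arriving over time
    if len(partitions[-1]) >= MAX_ROWS:
        partitions.append([])          # size exceeded: open a new partition
    partitions[-1].append(row_id)
    metadata[row_id] = len(partitions) - 1  # metadata: row -> partition index

print(len(partitions))      # → 3 partitions for 8 rows
print(metadata[7])          # → 2 (the last row is in the third partition)
```

The metadata map is what makes this scheme workable at all: without it there is no way to route a query to the partition holding the relevant rows.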
Partitioning Dimensions
If a dimension contains a large number of entries, then it is required to partition the
dimensions. Here we have to check the size of the dimension.
Consider a large design that changes over time. If we need to store all the variations in
order to apply comparisons, that dimension may be very large. This would definitely
affect the response time.
Round Robin Partitions
In the round-robin technique, when a new partition is needed, the old one is archived.
Metadata is used to allow a user access tool to refer to the correct table partition.
This technique makes it easy to automate table management facilities within the data
warehouse.
Vertical Partition
Vertical partitioning splits the data vertically. The following image depicts how vertical
partitioning is done.
Vertical partitioning can be performed in the following two ways −
Normalization
Row Splitting
Normalization
Normalization is the standard relational method of database organization. In this
method, the rows are collapsed into a single row, hence reducing space. Take a look at
the following tables that show how normalization is performed.
Table before Normalization
Product_id  Qty  Value  sales_date  Store_id  Store_name  Location  Region
30          5    3.67   3-Aug-13    16
35          4    5.33   3-Sep-13    16
40          5    2.50   3-Sep-13    64        san         Mumbai    S
45          7    5.66   3-Sep-13    16
Row Splitting
Row splitting tends to leave a one-to-one map between partitions. The motive of row
splitting is to speed up access to a large table by reducing its size.
Note − While using vertical partitioning, make sure that there is no requirement to
perform a major join operation between two partitions.
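Row splitting can be sketched as follows; the wide table, its column names, and the choice of "hot" columns are invented for illustration.

```python
# A sketch of row splitting: one wide table is split vertically into two
# tables sharing the same key, keeping a one-to-one map between them.
wide = [
    {"id": 1, "name": "sunny", "location": "Mumbai", "notes": "long text..."},
    {"id": 2, "name": "san", "location": "Delhi", "notes": "more text..."},
]

HOT = ("id", "name", "location")   # frequently queried columns
# Split each row: hot columns in one partition, the rest (plus the key)
# in the other, so the two partitions can be re-joined on "id" if needed.
hot_part = [{k: r[k] for k in HOT} for r in wide]
cold_part = [{k: r[k] for k in r if k not in HOT or k == "id"} for r in wide]

print(hot_part[0])   # the smaller, faster-to-scan partition
print(cold_part[0])  # → {'id': 1, 'notes': 'long text...'}
```

This illustrates the note above: queries touching only hot columns never join, but a query needing both halves pays a join on the shared key.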
Identify Key to Partition
It is very crucial to choose the right partition key. Choosing a wrong partition key will
lead to reorganizing the fact table. Let's have an example. Suppose we want to partition
the following table.
Account_Txn_Table
transaction_id
account_id
transaction_type
value
transaction_date
region
branch_name
We can choose to partition on any key. The two possible keys could be
region
transaction_date
Suppose the business is organized into 30 geographical regions and each region has a
different number of branches. That will give us 30 partitions, which is reasonable. This
partitioning is good enough because our requirements capture has shown that a vast
majority of queries are restricted to the user's own business region.
If we partition by transaction_date instead of region, then the latest transaction from
every region will be in one partition. Now the user who wants to look at data within his
own region has to query across multiple partitions.
Hence it is worth determining the right partitioning key.
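The trade-off between the two candidate keys can be sketched as follows; the transactions and region names below are illustrative.

```python
# Sketch comparing the two candidate partition keys for the
# Account_Txn_Table example. Rows are reduced to (region,
# transaction_date) pairs; the data is illustrative.
from collections import defaultdict

transactions = [
    ("north", "2013-09-01"), ("north", "2013-09-02"),
    ("south", "2013-09-01"), ("east", "2013-09-02"),
]

def partition(txns, key_index):
    """Group transactions into partitions by the chosen key."""
    parts = defaultdict(list)
    for txn in txns:
        parts[txn[key_index]].append(txn)
    return parts

by_region = partition(transactions, 0)  # one partition per region
by_date = partition(transactions, 1)    # one partition per date

# A user querying only the "north" region hits a single partition under
# region-based partitioning, but every date partition that holds a
# "north" row under date-based partitioning.
region_partitions_hit = {r for r, _ in transactions if r == "north"}
date_partitions_hit = {d for r, d in transactions if r == "north"}
```

With region as the key the query stays inside one partition; with transaction_date it must scan every partition that contains rows for that region.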
What is Metadata?
Metadata is simply defined as data about data. The data that is used to represent other
data is known as metadata. For example, the index of a book serves as metadata for
the contents of the book. In other words, we can say that metadata is the summarized
data that leads us to detailed data. In terms of a data warehouse, we can define metadata
as follows.
Metadata is the road-map to a data warehouse.
Metadata in a data warehouse defines the warehouse objects.
Metadata acts as a directory. This directory helps the decision support system to
locate the contents of a data warehouse.
Note − In a data warehouse, we create metadata for the data names and definitions of
a given data warehouse. Additional metadata is also created for time-stamping any
extracted data and recording the source of the extracted data.
Categories of Metadata
Metadata can be broadly categorized into three categories −
Business Metadata − It has the data ownership information, business definition,
and changing policies.
Technical Metadata − It includes database system names, table and column
names and sizes, data types and allowed values. Technical metadata also includes
structural information such as primary and foreign key attributes and indices.
Operational Metadata − It includes currency of data and data lineage. Currency
of data means whether the data is active, archived, or purged. Lineage of data
means the history of data migrated and transformation applied on it.
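The three categories can be illustrated with a small metadata record for one warehouse table; the field names below are illustrative, not a standard schema.

```python
# A sketch of business, technical, and operational metadata for a single
# warehouse table. Field names are illustrative.
metadata = {
    "business": {                       # ownership, definition, policies
        "owner": "sales department",
        "definition": "daily point-of-sale transactions",
        "change_policy": "owner approval required",
    },
    "technical": {                      # names, types, structure
        "table": "sales_fact",
        "columns": {"product_id": "INTEGER", "value": "DECIMAL(8,2)"},
        "primary_key": ["product_id"],
    },
    "operational": {                    # currency and lineage
        "currency": "active",           # active, archived, or purged
        "lineage": ["extracted from OLTP source", "values converted to USD"],
    },
}

def data_is_active(meta):
    """Currency of data: is the table active rather than archived/purged?"""
    return meta["operational"]["currency"] == "active"
```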
Role of Metadata
Metadata plays a very important role in a data warehouse. The role of metadata is
different from that of the warehouse data itself, yet it is equally important. The
various roles of metadata are explained below.
Metadata acts as a directory.
This directory helps the decision support system to locate the contents of the data
warehouse.
Metadata helps in decision support system for mapping of data when data is
transformed from operational environment to data warehouse environment.
Metadata helps in summarization between current detailed data and highly
summarized data.
Metadata also helps in summarization between lightly detailed data and highly
summarized data.
Metadata is used for query tools.
Metadata is used in extraction and cleansing tools.
Metadata is used in reporting tools.
Metadata is used in transformation tools.
Metadata plays an important role in loading functions.
The following diagram shows the roles of metadata.
Metadata Repository
Metadata repository is an integral part of a data warehouse system. It has the following
metadata −
Definition of data warehouse − It includes the description of structure of data
warehouse. The description is defined by schema, view, hierarchies, derived data
definitions, and data mart locations and contents.
Business metadata − It contains the data ownership information, business
definition, and changing policies.
Operational Metadata − It includes currency of data and data lineage. Currency
of data means whether the data is active, archived, or purged. Lineage of data
means the history of data migrated and transformation applied on it.
Data for mapping from operational environment to data warehouse − It
includes the source databases and their contents, data extraction, data partition
cleaning, transformation rules, data refresh and purging rules.
Algorithms for summarization − It includes dimension algorithms, data on
granularity, aggregation, summarizing, etc.
Challenges for Metadata Management
The importance of metadata cannot be overstated. Metadata helps in driving the
accuracy of reports, validates data transformation, and ensures the accuracy of
calculations. Metadata also enforces the definition of business terms to business end-
users. With all these uses of metadata, it also has its challenges. Some of the challenges
are discussed below.
Metadata in a big organization is scattered across the organization. This metadata
is spread in spreadsheets, databases, and applications.
Metadata could be present in text files or multimedia files. To use this data for
information management solutions, it has to be correctly defined.
There are no industry-wide accepted standards. Data management solution
vendors have narrow focus.
There are no easy and accepted methods of passing metadata.
Given below are the issues to be taken into account while determining the functional
split −
The structure of the department may change.
The products might switch from one department to another.
The merchant could query the sales trend of other products to analyze what is
happening to the sales.
Note − We need to determine the business benefits and technical feasibility of using a
data mart.
Identify User Access Tool Requirements
We need data marts to support user access tools that require internal data structures.
The data in such structures is outside the control of the data warehouse, but needs to
be populated and updated on a regular basis.
There are some tools that can populate directly from the source system, but some
cannot. Therefore, additional requirements outside the scope of the tools need to be
identified for the future.
Note − In order to ensure consistency of data across all access tools, the data should not
be directly populated from the data warehouse, rather each tool must have its own data
mart.
Identify Access Control Issues
There should be privacy rules to ensure the data is accessed by authorized users only.
For example, a data warehouse for a retail banking institution ensures that all the
accounts belong to the same legal entity. Privacy laws can force you to totally prevent
access to information that is not owned by the specific bank.
Data marts allow us to build a complete wall by physically separating data segments
within the data warehouse. To avoid possible privacy problems, the detailed data can be
removed from the data warehouse. We can create a data mart for each legal entity and
load it via the data warehouse, with detailed account data.
Designing Data Marts
Data marts should be designed as a smaller version of the starflake schema within the
data warehouse and should match the database design of the data warehouse. This helps
in maintaining control over database instances.
The summaries are data marted in the same way as they would have been designed
within the data warehouse. Summary tables help to utilize all dimension data in the
starflake schema.
Cost of Data Marting
The cost measures for data marting are as follows −
Hardware and Software Cost
Network Access
Time Window Constraints
Hardware and Software Cost
Although data marts are created on the same hardware, they require some additional
hardware and software. To handle user queries, additional processing power and disk
storage are required. If detailed data and the data mart exist within the data warehouse,
then we would face additional cost to store and manage the replicated data.
Note − Data marting is more expensive than aggregations, therefore it should be used
as an additional strategy and not as an alternative strategy.
Network Access
A data mart could be on a different location from the data warehouse, so we should
ensure that the LAN or WAN has the capacity to handle the data volumes being
transferred within the data mart load process.
Time Window Constraints
The extent to which a data mart loading process will eat into the available time window
depends on the complexity of the transformations and the data volumes being shipped.
The determination of how many data marts are possible depends on −
Network capacity.
Time window available
Volume of data being transferred
Mechanisms being used to insert data into a data mart
Hardware failure
Running out of space on certain key disks
A process dying
A process returning an error
CPU usage exceeding an 80% threshold
Internal contention on database serialization points
Buffer cache hit ratios exceeding or falling below the threshold
A table reaching the maximum of its size
Excessive memory swapping
A table failing to extend due to lack of space
Disk exhibiting I/O bottlenecks
Usage of temporary or sort areas reaching a certain threshold
Any other database shared memory usage
The most important thing about events is that they should be capable of executing on
their own. Event packages define the procedures for the predefined events. The code
associated with each event is known as event handler. This code is executed whenever
an event occurs.
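A minimal sketch of this event/handler pattern follows; the event names, payloads, and responses are illustrative.

```python
# Minimal sketch of an event package: predefined events map to handler
# functions (the "event handler" code) that run whenever the event
# occurs. Event names, payloads, and responses are illustrative.
handlers = {}

def on(event_name):
    """Register the decorated function as the handler for an event."""
    def register(fn):
        handlers[event_name] = fn
        return fn
    return register

@on("cpu_usage_exceeded")
def handle_cpu(event):
    return f"paging operator: CPU at {event['usage']}%"

@on("disk_space_low")
def handle_disk(event):
    return f"archiving old partitions on {event['disk']}"

def fire(event_name, **payload):
    """Execute the handler associated with an occurring event."""
    return handlers[event_name](payload)

response = fire("cpu_usage_exceeded", usage=85)
```

Each handler is capable of executing on its own as soon as its event fires, which is the key property the text highlights.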
System and Database Manager
The system manager and the database manager may be two separate pieces of software,
but they do the same job. The objective of these tools is to automate certain processes
and to simplify the execution of others. The criteria for choosing the system and
database managers are as follows −
Scheduling
Backup data tracking
Database awareness
Backups are taken only to protect against data loss. Following are the important points
to remember −
The backup software will keep some form of database of where and when the
piece of data was backed up.
The backup recovery manager must have a good front-end to that database.
The backup recovery software should be database aware.
Being aware of the database, the software can then be addressed in database
terms, and will not perform backups that would not be viable.
Process managers are responsible for maintaining the flow of data both into and out of
the data warehouse. There are three different types of process managers −
Load manager
Warehouse manager
Query manager
Data Warehouse Load Manager
Load manager performs the operations required to extract and load the data into the
database. The size and complexity of a load manager varies between specific solutions
from one data warehouse to another.
Load Manager Architecture
The load manager performs the following functions −
Extract data from the source system.
Fast load the extracted data into temporary data store.
Perform simple transformations into structure similar to the one in the data
warehouse.
Extract Data from Source
The data is extracted from the operational databases or the external information
providers. Gateways are the application programs that are used to extract data. A
gateway is supported by the underlying DBMS and allows the client program to generate
SQL to be executed at a server. Open Database Connectivity (ODBC) and Java Database
Connectivity (JDBC) are examples of gateways.
Fast Load
In order to minimize the total load window, the data needs to be loaded into the
warehouse in the fastest possible time.
Transformations affect the speed of data processing.
It is more effective to load the data into a relational database prior to applying
transformations and checks.
Gateway technology is not suitable, since it is inefficient when large data
volumes are involved.
Simple Transformations
While loading, it may be required to perform simple transformations. After completing
the simple transformations, we can do complex checks. Suppose we are loading the EPOS
sales transactions; we need to perform the following checks −
Strip out all the columns that are not required within the warehouse.
Convert all the values to required data types.
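These two checks can be sketched as follows; the column names and extra EPOS fields are illustrative.

```python
# Sketch of the load manager's simple transformations on one EPOS sales
# record: strip the columns not required within the warehouse and
# convert the remaining values to the required data types. Column names
# are illustrative.
from decimal import Decimal

REQUIRED_COLUMNS = {"product_id": int, "qty": int, "value": Decimal}

def simple_transform(raw_record):
    """Keep only required columns and cast each to its warehouse type."""
    return {col: cast(raw_record[col]) for col, cast in REQUIRED_COLUMNS.items()}

raw = {"product_id": "30", "qty": "5", "value": "3.67",
       "till_operator": "op-7", "receipt_no": "99812"}  # extra EPOS columns
clean = simple_transform(raw)
```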
Warehouse Manager
The warehouse manager is responsible for the warehouse management process. It
consists of third-party system software, C programs, and shell scripts. The size and
complexity of a warehouse manager varies between specific solutions.
Warehouse Manager Architecture
A warehouse manager includes the following −
User access
Data load
Data movement
Query generation
User Access
We need to first classify the data and then classify the users on the basis of the data
they can access.
Data Classification
The following two approaches can be used to classify the data −
Data can be classified according to its sensitivity. Highly-sensitive data is classified
as highly restricted and less-sensitive data is classified as less restricted.
Data can also be classified according to the job function. This restriction allows
only specific users to view particular data. Here we restrict the users to view only
that part of the data in which they are interested and are responsible for.
There are some issues with the second approach. To understand this, let's take an
example. Suppose you are building the data warehouse for a bank. Consider that the
data being stored in the data warehouse is the transaction data for all the accounts. The
question here is, who is allowed to see the transaction data? The solution lies in
classifying the data according to the function.
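A minimal sketch of classification by job function follows, with hypothetical roles and column names.

```python
# Sketch of classifying data by job function: each role sees only the
# part of the transaction data it is responsible for. Roles and column
# names are illustrative.
ROLE_COLUMNS = {
    "teller": {"account_id", "balance"},
    "auditor": {"account_id", "balance", "transaction_history"},
}

def visible_part(record, role):
    """Return only the columns this job function is allowed to see."""
    allowed = ROLE_COLUMNS[role]
    return {col: val for col, val in record.items() if col in allowed}

txn = {"account_id": "A-17", "balance": 1200,
       "transaction_history": ["deposit 500", "withdraw 100"]}
teller_view = visible_part(txn, "teller")
auditor_view = visible_part(txn, "auditor")
```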
User classification
The following approaches can be used to classify the users −
Users can be classified as per the hierarchy of users in an organization, i.e., users
can be classified by departments, sections, groups, and so on.
Users can also be classified according to their role, with people grouped across
departments based on their role.
Classification on basis of Department
Let's take an example of a data warehouse where the users are from the sales and
marketing departments. We can have security by a top-to-down company view, with
access centered on the different departments. But there could be some restrictions on
users at different levels. This structure is shown in the following diagram.
But if each department accesses different data, then we should design the security access
for each department separately. This can be achieved by departmental data marts. Since
these data marts are separated from the data warehouse, we can enforce separate
security restrictions on each data mart. This approach is shown in the following figure.
Audit Requirements
Auditing is a subset of security and a costly activity. Auditing can cause heavy overheads
on the system. To complete an audit in time, we require more hardware; therefore, it is
recommended that wherever possible, auditing should be switched off. Audit
requirements can be categorized as follows −
Connections
Disconnections
Data access
Data change
Note − For each of the above-mentioned categories, it is necessary to audit success,
failure, or both. From a security perspective, the auditing of failures is especially
important because it can highlight unauthorized or fraudulent access.
Network Requirements
Network security is as important as other securities. We cannot ignore the network
security requirement. We need to consider the following issues −
Is it necessary to encrypt data before transferring it to the data warehouse?
Are there restrictions on which network routes the data can take?
These restrictions need to be considered carefully. Following are the points to remember
−
The process of encryption and decryption will increase overheads. It will require
more processing power and processing time.
The cost of encryption can be high if the system is already heavily loaded,
because the cost of encryption is borne by the source system.
Data Movement
There exist potential security implications while moving the data. Suppose we need to
transfer some restricted data as a flat file to be loaded. When the data is loaded into the
data warehouse, the following questions are raised −
Data classification
User classification
Network requirements
Data movement and storage requirements
All auditable actions
Impact of Security on Design
Security affects the application code and the development timescales. Security affects
the following areas −
Application development
Database design
Testing
Application Development
Security affects the overall application development and it also affects the design of the
important components of the data warehouse such as load manager, warehouse
manager, and query manager. The load manager may require checking code to filter
records and place them in different locations. More transformation rules may also be
required to hide certain data. There may also be requirements for extra metadata to
handle any extra objects.
To create and maintain extra views, the warehouse manager may require extra code to
enforce security. Extra checks may have to be coded into the data warehouse to prevent
it from being fooled into moving data into a location where it should not be available.
The query manager requires changes to handle any access restrictions. The query
manager will need to be aware of all extra views and aggregations.
Database design
The database layout is also affected because when security measures are implemented,
there is an increase in the number of views and tables. Adding security increases the size
of the database and hence increases the complexity of the database design and
management. It will also add complexity to the backup management and recovery plan.
Testing
Testing the data warehouse is a complex and lengthy process. Adding security to the
data warehouse also affects the testing time complexity. It affects the testing in the
following two ways −
It will increase the time required for integration and system testing.
There is added functionality to be tested which will increase the size of the testing
suite.
A data warehouse is a complex system and it contains a huge volume of data. Therefore,
it is important to back up all the data so that it is available for recovery in the future
as required. In this chapter, we will discuss the issues in designing the backup strategy.
Backup Terminologies
Before proceeding further, you should know some of the backup terminologies discussed
below.
Complete backup − It backs up the entire database at the same time. This backup
includes all the database files, control files, and journal files.
Partial backup − As the name suggests, it does not create a complete backup of
the database. Partial backups are very useful in large databases because they allow
a strategy whereby various parts of the database are backed up in a round-robin
fashion on a day-to-day basis, so that the whole database is effectively backed up
once a week.
Cold backup − Cold backup is taken while the database is completely shut down.
In multi-instance environment, all the instances should be shut down.
Hot backup − Hot backup is taken when the database engine is up and running.
The requirements of a hot backup vary from RDBMS to RDBMS.
Online backup − It is quite similar to hot backup.
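The round-robin partial backup strategy above can be sketched as follows; the part names are illustrative.

```python
# Sketch of the round-robin partial backup strategy: the database is
# split into seven parts and one part is backed up each day, so the
# whole database is effectively backed up once a week. Part names are
# illustrative.
PARTS = ["part_mon", "part_tue", "part_wed", "part_thu",
         "part_fri", "part_sat", "part_sun"]

def part_for_day(day_number):
    """Pick the part to back up on a given day (day 0 starts the cycle)."""
    return PARTS[day_number % len(PARTS)]

week = [part_for_day(d) for d in range(7)]  # one full cycle
```

Over any seven consecutive days each part is backed up exactly once, which keeps the nightly backup window small.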
Hardware Backup
It is important to decide which hardware to use for the backup. The speed of processing
the backup and restore depends on the hardware being used, how the hardware is
connected, bandwidth of the network, backup software, and the speed of server's I/O
system. Here we will discuss some of the hardware choices that are available and their
pros and cons. These choices are as follows −
Tape Technology
Disk Backups
Tape Technology
The tape choice can be categorized as follows −
Tape media
Standalone tape drives
Tape stackers
Tape silos
Tape Media
There exist several varieties of tape media. Some tape media standards are listed in the
table below −
Tape Media Capacity I/O rates
DLT 40 GB 3 MB/s
8 mm 14 GB 1 MB/s
Tape Stackers
A tape stacker is a device that loads multiple tapes into a single tape drive. The stacker
dismounts the current tape when it has finished with it and loads the next tape; hence
only one tape is available at a time to be accessed. The price and the capabilities may
vary, but the common ability is that they can perform unattended backups.
Tape Silos
Tape silos provide large store capacities. Tape silos can store and manage thousands of
tapes. They can integrate multiple tape drives. They have the software and hardware to
label and store the tapes they store. It is very common for the silo to be connected
remotely over a network or a dedicated link. We should ensure that the bandwidth of
the connection is up to the job.
Disk Backups
Methods of disk backups are −
Disk-to-disk backups
Mirror breaking
These methods are used in the OLTP system. These methods minimize the database
downtime and maximize the availability.
Disk-to-Disk Backups
Here the backup is taken on the disk rather than on the tape. Disk-to-disk backups are done for
the following reasons −
Networker − Legato
ADSM − IBM
Omniback II − HP
Alexandria − Sequent
Fixed queries
Ad hoc queries
Fixed Queries
Fixed queries are well defined. Following are the examples of fixed queries −
Regular reports
Canned queries
Common aggregations
Tuning the fixed queries in a data warehouse is the same as in a relational database
system. The only difference is that the amount of data to be queried may be different. It
is good to store the most successful execution plan while testing fixed queries. Storing
these execution plans will allow us to spot changing data size and data skew, as these
will cause the execution plan to change.
Note − We cannot do much on the fact table, but while dealing with dimension tables or
the aggregations, the usual collection of SQL tweaking, storage mechanisms, and access
methods can be used to tune these queries.
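A minimal sketch of storing execution plans and detecting a change follows; it assumes the plan text has been captured elsewhere, for example via the RDBMS's EXPLAIN facility.

```python
# Sketch of storing the most successful execution plan per fixed query
# and flagging when it changes; a plan change can signal growing data
# volumes or data skew. The plan text is illustrative and would normally
# come from the RDBMS's EXPLAIN facility.
stored_plans = {}

def check_plan(query_name, current_plan):
    """Record the plan; return True if it differs from the stored one."""
    previous = stored_plans.get(query_name)
    stored_plans[query_name] = current_plan
    return previous is not None and previous != current_plan

changed_first = check_plan("daily_sales_report", "INDEX SCAN on sales_date_idx")
changed_later = check_plan("daily_sales_report", "FULL TABLE SCAN on sales_fact")
```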
Ad hoc Queries
To understand ad hoc queries, it is important to know the ad hoc users of the data
warehouse. For each user or group of users, you need to know the following −
Unit testing
Integration testing
System testing
Unit Testing
In unit testing, each component is separately tested.
Each module, i.e., procedure, program, SQL Script, Unix shell is tested.
This test is performed by the developer.
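A minimal sketch of a unit test for a single module follows, using a hypothetical monetary rounding procedure as the module under test.

```python
# Sketch of a unit test: a single module (here, a hypothetical monetary
# rounding procedure) is tested in isolation by the developer.
def round_to_cents(value):
    """Module under test: round a monetary value to two decimal places."""
    return round(value + 1e-9, 2)  # tiny epsilon sidesteps float ties

def test_round_to_cents():
    assert round_to_cents(3.675) == 3.68
    assert round_to_cents(5.0) == 5.0
    assert round_to_cents(0) == 0

test_round_to_cents()
```

In practice each procedure, program, SQL script, and shell script would get its own test like this before integration testing begins.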
Integration Testing
In integration testing, the various modules of the application are brought together
and then tested against a number of inputs.
It is performed to test whether the various components work well after integration.
System Testing
In system testing, the whole data warehouse application is tested together.
The purpose of system testing is to check whether the entire system works
correctly together or not.
System testing is performed by the testing team.
Since the size of the whole data warehouse is very large, it is usually only possible to
perform minimal system testing before the test plan can be enacted.
Test Schedule
First of all, the test schedule is created in the process of developing the test plan. In this
schedule, we predict the estimated time required for the testing of the entire data
warehouse system.
There are different methodologies available to create a test schedule, but none of them
is perfect, because the data warehouse is very complex and large. Also, the data
warehouse system is evolving in nature. One may face the following issues while creating
a test schedule −
A simple problem may involve a very large query that can take a day or more to
complete, i.e., the query does not complete in the desired time scale.
There may be hardware failures such as losing a disk or human errors such as
accidentally deleting a table or overwriting a large table.
Note − Due to the above-mentioned difficulties, it is recommended to always double
the amount of time you would normally allow for testing.
Testing Backup Recovery
Testing the backup recovery strategy is extremely important. Here is the list of scenarios
for which this testing is needed −
Media failure
Loss or damage of table space or data file
Loss or damage of redo log file
Loss or damage of control file
Instance failure
Loss or damage of archive file
Loss or damage of table
Failure during data movement
Testing Operational Environment
There are a number of aspects that need to be tested. These aspects are listed below.
Security − A separate security document is required for security testing. This
document contains a list of disallowed operations and the tests devised for each.
Scheduler − Scheduling software is required to control the daily operations of a
data warehouse. It needs to be tested during system testing. The scheduling
software requires an interface with the data warehouse, through which the
scheduler can control overnight processing and the management of aggregations.
Disk Configuration − Disk configuration also needs to be tested to identify I/O
bottlenecks. The test should be performed multiple times with different
settings.
Management Tools − It is required to test all the management tools during
system testing. Here is the list of tools that need to be tested.
o Event manager
o System manager
o Database manager
o Configuration manager
o Backup recovery manager
Testing the Database
The database is tested in the following three ways −
Testing the database manager and monitoring tools − To test the database
manager and the monitoring tools, they should be used in the creation, running,
and management of a test database.
Testing database features − Here is the list of features that we have to test −
o Querying in parallel
o Create index in parallel
o Data load in parallel
Testing database performance − Query execution plays a very important role in
data warehouse performance measures. There are sets of fixed queries that need
to be run regularly and they should be tested. To test ad hoc queries, one should
go through the user requirement document and understand the business
completely. Take time to test the most awkward queries that the business is likely
to ask against different index and aggregation strategies.
Testing the Application
All the managers should be integrated correctly and work together to ensure that
the end-to-end load, index, aggregate, and queries work as per expectations.
Each function of each manager should work correctly.
It is also necessary to test the application over a period of time.
Week-end and month-end tasks should also be tested.
Logistic of the Test
The aim of the system test is to test all of the following areas −
Scheduling software
Day-to-day operational procedures
Backup recovery strategy
Management and scheduling tools
Overnight processing
Query performance
Note − The most important point is to test the scalability. Failure to do so will leave us
with a system design that does not work when the system grows.
Unit 2. Data Mining
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
Data Mining Applications
Data mining is highly useful in the following domains −
Fraud Detection
Data mining is also used in the fields of credit card services and telecommunications to
detect fraud. In fraudulent telephone calls, it helps to find the destination of the call,
duration of the call, time of the day or week, etc. It also analyzes patterns that deviate
from expected norms.
Descriptive
Classification and Prediction
Descriptive Function
The descriptive function deals with the general properties of data in the database. Here
is the list of descriptive functions −
Class/Concept Description
Mining of Frequent Patterns
Mining of Associations
Mining of Correlations
Mining of Clusters
Class/Concept Description
Class/Concept refers to the data to be associated with classes or concepts. For
example, in a company, the classes of items for sale include computers and printers, and
concepts of customers include big spenders and budget spenders. Such descriptions of
a class or a concept are called class/concept descriptions. These descriptions can be
derived in the following two ways −
Data Characterization − This refers to summarizing the data of the class under
study. The class under study is called the Target Class.
Data Discrimination − It refers to the mapping or classification of a class with
some predefined group or class.
Mining of Frequent Patterns
Frequent patterns are those patterns that occur frequently in transactional data. Here is
the list of kinds of frequent patterns −
Frequent Item Set − It refers to a set of items that frequently appear together,
for example, milk and bread.
Frequent Subsequence − A sequence of patterns that occur frequently, such as
purchasing a camera followed by a memory card.
Frequent Sub Structure − Substructure refers to different structural forms, such
as graphs, trees, or lattices, which may be combined with item-sets or
subsequences.
Mining of Associations
Associations are used in retail sales to identify patterns that are frequently purchased
together. This function refers to uncovering the relationships among data and
determining association rules.
For example, a retailer generates an association rule that shows that 70% of the time
milk is sold with bread and only 30% of the time biscuits are sold with bread.
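A rule like the one above can be derived from transaction baskets by computing confidence; the baskets below are illustrative and constructed to yield 70% and 30%.

```python
# Sketch of deriving an association rule from transaction baskets:
# confidence(bread -> milk) is the fraction of bread-containing baskets
# that also contain milk. The baskets are illustrative.
baskets = [
    {"bread", "milk"}, {"bread", "milk"}, {"bread", "milk"},
    {"bread", "milk"}, {"bread", "milk"}, {"bread", "milk"},
    {"bread", "milk"}, {"bread", "biscuits"}, {"bread", "biscuits"},
    {"bread", "biscuits"},
]

def confidence(antecedent, consequent, baskets):
    """Estimate P(consequent | antecedent) over the transaction baskets."""
    with_ante = [b for b in baskets if antecedent in b]
    both = [b for b in with_ante if consequent in b]
    return len(both) / len(with_ante)

milk_conf = confidence("bread", "milk", baskets)         # 70% of the time
biscuit_conf = confidence("bread", "biscuits", baskets)  # 30% of the time
```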
Mining of Correlations
It is a kind of additional analysis performed to uncover interesting statistical correlations
between associated attribute-value pairs or between two item sets, to analyze whether
they have a positive, negative, or no effect on each other.
Mining of Clusters
Cluster refers to a group of similar objects. Cluster analysis refers to forming groups of
objects that are very similar to each other but are highly different from the objects in
other clusters.
Classification and Prediction
Classification is the process of finding a model that describes the data classes or
concepts. The purpose is to be able to use this model to predict the class of objects
whose class label is unknown. This derived model is based on the analysis of sets of
training data. The derived model can be presented in the following forms −
We can specify a data mining task in the form of a data mining query.
This query is input to the system.
A data mining query is defined in terms of data mining task primitives.
Note − These primitives allow us to communicate in an interactive manner with the data
mining system. Here is the list of Data Mining Task Primitives −
Database Attributes
Data Warehouse dimensions of interest
Kind of knowledge to be mined
It refers to the kind of functions to be performed. These functions are −
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis
Background knowledge
Background knowledge allows data to be mined at multiple levels of abstraction.
Concept hierarchies are one example of such background knowledge.
Interestingness measures and thresholds for pattern evaluation
This is used to evaluate the patterns discovered by the process of knowledge discovery.
There are different interestingness measures for different kinds of knowledge.
Representation for visualizing the discovered patterns
This refers to the form in which discovered patterns are to be displayed. These
representations may include the following −
Rules
Tables
Charts
Graphs
Decision Trees
Cubes
1. Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data
from the collection.
Cleaning in case of Missing values.
Cleaning noisy data, where noise is a random or variance error.
Cleaning with Data discrepancy detection and Data transformation tools.
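The cleaning step can be sketched as follows: fill missing values, then smooth noisy readings with a 3-point moving average (one common smoothing choice; the data and fill value are illustrative).

```python
# Sketch of data cleaning: replace missing values, then smooth noise
# with a simple 3-point moving average. Data and fill value are
# illustrative.
def clean(values, fill=0.0):
    """Replace missing (None) entries, then smooth each point."""
    filled = [fill if v is None else v for v in values]
    smoothed = []
    for i in range(len(filled)):
        window = filled[max(0, i - 1): i + 2]   # up to 3 neighbours
        smoothed.append(sum(window) / len(window))
    return smoothed

noisy = [1.0, None, 1.2, 9.0, 1.1]   # None is missing, 9.0 is noise
cleaned = clean(noisy)
```

The missing entry is filled and the 9.0 spike is damped toward its neighbours, which is the effect noise smoothing aims for.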
2. Data Integration: Data integration is defined as combining heterogeneous data from
multiple sources into a common source (data warehouse).
Data integration using Data Migration tools.
Data integration using Data Synchronization tools.
Data integration using the ETL (Extract-Transform-Load) process.
3. Data Selection: Data selection is defined as the process where data relevant to the
analysis is decided and retrieved from the data collection.
Data selection using Neural network.
Data selection using Decision Trees.
Data selection using Naive bayes.
Data selection using Clustering, Regression, etc.
4. Data Transformation: Data transformation is defined as the process of transforming
data into the form required by the mining procedure.
Data transformation is a two-step process:
Data Mapping: Assigning elements from source base to destination to capture
transformations.
Code generation: Creation of the actual transformation program.
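The two steps can be sketched as a declarative data mapping plus a generated transform function; the field names are illustrative.

```python
# Sketch of the two-step transformation: a declarative mapping from
# source fields to destination fields (data mapping), then "code
# generation" as building the actual transform from that mapping.
# Field names are illustrative.
MAPPING = {
    "cust_name": ("customer_name", str.strip),  # source -> (dest, converter)
    "amt": ("amount", float),
}

def generate_transform(mapping):
    """Step 2: turn the mapping into an executable transformation."""
    def transform(source_row):
        return {dest: convert(source_row[src])
                for src, (dest, convert) in mapping.items()}
    return transform

transform = generate_transform(MAPPING)
row = transform({"cust_name": "  Ada  ", "amt": "12.50"})
```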
5. Data Mining: Data mining is defined as the application of clever techniques to extract
potentially useful patterns.
Transforms task relevant data into patterns.
Decides purpose of model using classification or characterization.
6. Pattern Evaluation: Pattern evaluation is defined as identifying truly interesting
patterns representing knowledge based on given interestingness measures.
Find the interestingness score of each pattern.
Use summarization and visualization to make the data understandable to the user.
7. Knowledge Representation: Knowledge representation is defined as the technique that
utilizes visualization tools to represent data mining results.
Generate reports.
Generate tables.
Generate discriminant rules, classification rules, characterization rules, etc.
Note:
KDD is an iterative process where evaluation measures can be enhanced, mining can be
refined, new data can be integrated and transformed in order to get different and more
appropriate results.
Preprocessing of databases consists of Data cleaning and Data Integration.
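The seven KDD steps above can be sketched end-to-end as a small pipeline. This is a minimal illustration, not a standard API: the records, the region lookup table, the noise cutoff of 1000, and the support threshold of 2 are all invented for the example.

```python
# Minimal sketch of the KDD steps on a tiny in-memory dataset (illustrative only).

records = [
    {"id": 1, "age": 25, "spend": 120.0},
    {"id": 2, "age": None, "spend": 80.0},   # missing value
    {"id": 3, "age": 40, "spend": 9999.0},   # noisy outlier
    {"id": 4, "age": 31, "spend": 95.0},
]

# 1. Data cleaning: drop records with missing values or implausible noise.
clean = [r for r in records if r["age"] is not None and r["spend"] < 1000]

# 2. Data integration: merge a second (assumed) source keyed on id.
region = {1: "north", 3: "south", 4: "north"}
integrated = [dict(r, region=region.get(r["id"], "unknown")) for r in clean]

# 3. Data selection: keep only the attributes relevant to the analysis.
selected = [{"age": r["age"], "spend": r["spend"]} for r in integrated]

# 4. Data transformation: min-max normalize spend into [0, 1].
lo = min(r["spend"] for r in selected)
hi = max(r["spend"] for r in selected)
for r in selected:
    r["spend_norm"] = (r["spend"] - lo) / (hi - lo) if hi > lo else 0.0

# 5. Data mining: a trivial "pattern" -- mean spend of the under-35 group.
young = [r["spend"] for r in selected if r["age"] < 35]
pattern = {"young_avg_spend": sum(young) / len(young)}

# 6. Pattern evaluation: keep the pattern only if it clears a support threshold.
interesting = pattern if len(young) >= 2 else {}

# 7. Knowledge presentation: render the result as a small report.
print("Report:", interesting)
```

Each stage here is deliberately tiny; in practice steps 1–4 are performed with dedicated cleaning, migration, and ETL tooling, as the list above notes.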
Disadvantages
This approach has the following disadvantages −
The Query Driven Approach needs complex integration and filtering processes.
It is very inefficient and very expensive for frequent queries.
This approach is expensive for queries that require aggregations.
Update-Driven Approach
Today's data warehouse systems follow update-driven approach rather than the
traditional approach discussed earlier. In the update-driven approach, the information
from multiple heterogeneous sources is integrated in advance and stored in a
warehouse. This information is available for direct querying and analysis.
Advantages
This approach has the following advantages −
This approach provides high performance.
The data can be copied, processed, integrated, annotated, summarized and
restructured in the semantic data store in advance.
Query processing does not require interface with the processing at local sources.
From Data Warehousing (OLAP) to Data Mining (OLAM)
Online Analytical Mining (OLAM) integrates Online Analytical Processing (OLAP) with data
mining and mining knowledge in multidimensional databases. Here is the diagram that shows
the integration of both OLAP and OLAM −
Importance of OLAM
OLAM is important for the following reasons −
High quality of data in data warehouses − The data mining tools are required
to work on integrated, consistent, and cleaned data. These steps are very costly in
the preprocessing of data. The data warehouses constructed by such
preprocessing are valuable sources of high quality data for OLAP and data mining
as well.
Available information processing infrastructure surrounding data
warehouses − Information processing infrastructure refers to accessing,
integration, consolidation, and transformation of multiple heterogeneous
databases, web-accessing and service facilities, reporting and OLAP analysis tools.
OLAP−based exploratory data analysis − Exploratory data analysis is required
for effective data mining. OLAM provides the facility for data mining on various subsets
of data and at different levels of abstraction.
Online selection of data mining functions − Integrating OLAP with multiple data
mining functions and online analytical mining provides users with the flexibility to
select desired data mining functions and swap data mining tasks dynamically.
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
Data Mining Engine
The data mining engine is essential to the data mining system. It consists of a set of
functional modules that perform the following functions −
Characterization
Association and Correlation Analysis
Classification
Prediction
Cluster analysis
Outlier analysis
Evolution analysis
Knowledge Base
This is the domain knowledge. This knowledge is used to guide the search or evaluate
the interestingness of the resulting patterns.
Knowledge Discovery
Some people treat data mining the same as knowledge discovery, while others view data
mining as an essential step in the process of knowledge discovery. Here is the list of steps
involved in the knowledge discovery process −
Data Cleaning
Data Integration
Data Selection
Data Transformation
Data Mining
Pattern Evaluation
Knowledge Presentation
User interface
The user interface is the module of the data mining system that enables communication
between the user and the data mining system. The user interface allows the following
functionalities −
Database Technology
Statistics
Machine Learning
Information Science
Visualization
Other Disciplines
Apart from these, a data mining system can also be classified based on the kind of (a)
databases mined, (b) knowledge mined, (c) techniques utilized, and (d) applications
adapted.
Classification Based on the Databases Mined
We can classify a data mining system according to the kind of databases mined. Database
system can be classified according to different criteria such as data models, types of data,
etc. And the data mining system can be classified accordingly.
For example, if we classify a database according to the data model, then we may have a
relational, transactional, object-relational, or data warehouse mining system.
Classification Based on the kind of Knowledge Mined
We can classify a data mining system according to the kind of knowledge mined. It means
the data mining system is classified on the basis of functionalities such as −
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Outlier Analysis
Evolution Analysis
Classification Based on the Techniques Utilized
We can classify a data mining system according to the kind of techniques used. We can
describe these techniques according to the degree of user interaction involved or the
methods of analysis employed.
Classification Based on the Applications Adapted
We can classify a data mining system according to the applications adapted. These
applications are as follows −
Finance
Telecommunications
DNA
Stock Markets
E-mail
Integrating a Data Mining System with a DB/DW System
If a data mining system is not integrated with a database or a data warehouse system,
then there will be no system to communicate with. This scheme is known as the
non-coupling scheme. In this scheme, the main focus is on data mining design and on
developing efficient and effective algorithms for mining the available data sets.
The list of Integration Schemes is as follows −
No Coupling − In this scheme, the data mining system does not utilize any of the
database or data warehouse functions. It fetches the data from a particular source
and processes that data using some data mining algorithms. The data mining
result is stored in another file.
Loose Coupling − In this scheme, the data mining system may use some of the
functions of the database and data warehouse system. It fetches the data from the
data repository managed by these systems and performs data mining on that
data. It then stores the mining result either in a file or in a designated place in a
database or in a data warehouse.
Semi−tight Coupling − In this scheme, the data mining system is linked with a
database or a data warehouse system and in addition to that, efficient
implementations of a few data mining primitives can be provided in the database.
Tight coupling − In this coupling scheme, the data mining system is smoothly
integrated into the database or data warehouse system. The data mining
subsystem is treated as one functional component of an information system.
Data Mining - Query Language
The Data Mining Query Language (DMQL) was proposed by Han, Fu, Wang, et al. for the
DBMiner data mining system. The Data Mining Query Language is actually based on the
Structured Query Language (SQL). Data Mining Query Languages can be designed to
support ad hoc and interactive data mining. This DMQL provides commands for
specifying primitives. The DMQL can work with databases and data warehouses as well.
DMQL can be used to define data mining tasks. In particular, we examine how to define
data warehouses and data marts in DMQL.
Syntax for Task-Relevant Data Specification
Here is the syntax of DMQL for specifying task-relevant data −
use database database_name
or
use data warehouse data_warehouse_name
Discrimination
The syntax for Discrimination is −
mine comparison [as pattern_name]
for target_class where target_condition
{versus contrast_class_i where contrast_condition_i}
analyze measure(s)
For example, a user may define big spenders as customers who purchase items that cost
$100 or more on average, and budget spenders as customers who purchase items that cost
less than $100 on average. The mining of discriminant descriptions for customers from
each of these categories can be specified in DMQL as −
mine comparison as purchaseGroups
for bigSpenders where avg(I.price) ≥ $100
versus budgetSpenders where avg(I.price) < $100
analyze count
Association
The syntax for Association is −
mine associations [ as {pattern_name} ]
{matching {metapattern} }
For Example −
mine associations as buyingHabits
matching P(X: customer, W) ∧ Q(X, Y) ⇒ buys(X, Z)
where X is a key of the customer relation; P and Q are predicate variables; and W, Y, and Z
are object variables.
Classification
The syntax for Classification is −
mine classification [as pattern_name]
analyze classifying_attribute_or_dimension
For example, to mine patterns classifying customer credit rating, where the classes are
determined by the attribute credit_rating, the task can be specified as −
mine classification as classifyCustomerCreditRating
analyze credit_rating
Prediction
The syntax for prediction is −
mine prediction [as pattern_name]
analyze prediction_attribute_or_dimension
{set {attribute_or_dimension_i= value_i}}
Syntax for Concept Hierarchy Specification
To specify concept hierarchies, use the following syntax −
use hierarchy <hierarchy> for <attribute_or_dimension>
We use different syntaxes to define different types of hierarchies, such as −
- schema hierarchies
define hierarchy time_hierarchy on date as [date, month, quarter, year]
- set-grouping hierarchies
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior
- operation-derived hierarchies
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)}
:= cluster(default, age, 5) < all(age)
- rule-based hierarchies
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
Classification
Prediction
Classification models predict categorical class labels, while prediction models predict
continuous-valued functions. For example, we can build a classification model to
categorize bank loan applications as either safe or risky, or a prediction model to predict
the expenditures in dollars of potential customers on computer equipment given their
income and occupation.
What is classification?
Following are the examples of cases where the data analysis task is Classification −
A bank loan officer wants to analyze the data in order to know which customers
(loan applicants) are risky and which are safe.
A marketing manager at a company needs to analyze whether a customer with a given
profile will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the
categorical labels. These labels are risky or safe for loan application data and yes or no
for marketing data.
What is prediction?
Following are the examples of cases where the data analysis task is Prediction −
Suppose the marketing manager needs to predict how much a given customer will spend
during a sale at his company. In this example, we are interested in predicting a numeric value.
Therefore the data analysis task is an example of numeric prediction. In this case, a model
or a predictor will be constructed that predicts a continuous-valued-function or ordered
value.
Note − Regression analysis is a statistical methodology that is most often used for
numeric prediction.
How Does Classification Work?
With the help of the bank loan application that we have discussed above, let us
understand the working of classification. The Data Classification process includes two
steps −
Input:
Data partition, D, which is a set of training tuples
and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute selection method, a procedure to determine the
splitting criterion that best partitions the data
tuples into individual classes. This criterion includes a
splitting_attribute and either a splitting point or splitting subset.
Output:
A Decision Tree
Method
create a node N;
for each outcome j of the splitting criterion −
let Dj be the set of data tuples in D satisfying outcome j;
if Dj is empty then
attach a leaf labeled with the majority
class in D to node N;
else
attach the node returned by Generate_
decision_tree(Dj, attribute_list) to node N;
end for
return N;
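The pseudocode above can be fleshed out as a small recursive sketch. It keeps the same structure (a leaf labeled with the majority class when a partition is empty or no attributes remain), but the attribute-selection step is simplified to "take the first remaining attribute", a stand-in assumption for a real measure such as information gain; the loan data is hypothetical.

```python
from collections import Counter

def majority_class(D):
    # Majority class label among the tuples in D.
    return Counter(label for _, label in D).most_common(1)[0][0]

def generate_decision_tree(D, attribute_list):
    # Node N is returned as a dict.  Tuples are (attribute_dict, class_label).
    labels = {label for _, label in D}
    if len(labels) == 1:                 # all tuples in the same class -> leaf
        return {"leaf": labels.pop()}
    if not attribute_list:               # no attributes left -> majority leaf
        return {"leaf": majority_class(D)}
    A = attribute_list[0]                # simplistic Attribute_selection_method
    node = {"split_on": A, "branches": {}}
    for value in {t[0][A] for t in D}:   # one branch per outcome of the split
        Dj = [t for t in D if t[0][A] == value]
        if not Dj:                       # empty partition -> majority leaf in D
            node["branches"][value] = {"leaf": majority_class(D)}
        else:
            node["branches"][value] = generate_decision_tree(
                Dj, [a for a in attribute_list if a != A])
    return node

# Hypothetical loan data: income level -> safe/risky label.
data = [({"income": "high"}, "safe"), ({"income": "low"}, "risky"),
        ({"income": "high"}, "safe")]
tree = generate_decision_tree(data, ["income"])
print(tree)
```

Swapping the first-attribute choice for an information-gain or Gini-based selection turns this sketch into the classic induction algorithm.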
Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to noise
or outliers. The pruned trees are smaller and less complex.
Tree Pruning Approaches
There are two approaches to prune a tree −
Pre-pruning − The tree is pruned by halting its construction early.
Post-pruning - This approach removes a sub-tree from a fully grown tree.
Cost Complexity
The cost complexity is measured by the following two parameters −
Input:
D, a data set of class-labeled tuples,
Att_vals, the set of all attributes and their possible values.
repeat
Rule = Learn_One_Rule(D, Att_vals, c);
remove tuples covered by Rule from D;
until termination condition;
Rule Pruning
A rule is pruned due to the following reason −
The assessment of quality is made on the original set of training data. The rule
may perform well on the training data but less well on subsequent data. That is why
rule pruning is required.
A rule is pruned by removing a conjunct. The rule R is pruned if the pruned version
of R has greater quality, as assessed on an independent set of tuples.
FOIL is one simple and effective method for rule pruning. For a given rule R,
FOIL_Prune = (pos − neg) / (pos + neg)
where pos and neg are the numbers of positive and negative tuples covered by R, respectively.
Note − This value will increase with the accuracy of R on the pruning set. Hence, if the
FOIL_Prune value is higher for the pruned version of R, then we prune R.
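The FOIL_Prune measure can be computed directly from the coverage counts; the counts below are hypothetical.

```python
def foil_prune(pos, neg):
    # FOIL_Prune = (pos - neg) / (pos + neg), where pos and neg are the
    # numbers of positive and negative tuples covered by rule R.
    return (pos - neg) / (pos + neg)

# Hypothetical coverage on a pruning set: the pruned version of R covers
# slightly fewer positives but far fewer negatives.
original = foil_prune(pos=80, neg=40)   # 40 / 120
pruned   = foil_prune(pos=75, neg=15)   # 60 / 90
# Higher FOIL_Prune for the pruned version -> prefer the pruned rule.
print(pruned > original)
```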
Here we will discuss other classification methods such as Genetic Algorithms, Rough Set
Approach, and Fuzzy Set Approach.
Genetic Algorithms
The idea of the genetic algorithm is derived from natural evolution. In a genetic algorithm,
first of all, an initial population is created. This initial population consists of randomly
generated rules. We can represent each rule by a string of bits.
For example, in a given training set, the samples are described by two Boolean attributes
such as A1 and A2. And this given training set contains two classes such as C1 and C2.
We can encode the rule IF A1 AND NOT A2 THEN C2 into a bit string 100. In this bit
representation, the two leftmost bits represent the attribute A1 and A2, respectively.
Likewise, the rule IF NOT A1 AND NOT A2 THEN C1 can be encoded as 001.
Note − If an attribute has K values, where K > 2, then K bits can be used to encode the
attribute's values. The classes are encoded in the same manner.
Points to remember −
Based on the notion of survival of the fittest, a new population is formed that
consists of the fittest rules in the current population, as well as offspring of these
rules.
The fitness of a rule is assessed by its classification accuracy on a set of training
samples.
The genetic operators such as crossover and mutation are applied to create
offspring.
In crossover, substrings from pairs of rules are swapped to form new pairs of
rules.
In mutation, randomly selected bits in a rule's string are inverted.
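The crossover and mutation operators on the bit-string encoding above can be sketched as follows; the crossover point and the random seed are arbitrary choices for the illustration.

```python
import random

random.seed(0)

# Encoding from the example: bits (A1, A2) plus one class bit.
# IF A1 AND NOT A2 THEN C2 -> "100"; IF NOT A1 AND NOT A2 THEN C1 -> "001"
population = ["100", "001"]

def crossover(r1, r2, point=1):
    # Swap the substrings after the crossover point to form a new pair of rules.
    return r1[:point] + r2[point:], r2[:point] + r1[point:]

def mutate(rule):
    # Invert one randomly selected bit in the rule's string.
    i = random.randrange(len(rule))
    flipped = "1" if rule[i] == "0" else "0"
    return rule[:i] + flipped + rule[i + 1:]

child1, child2 = crossover(*population)
print(child1, child2)          # "101" and "000"
mutant = mutate("100")         # "100" with exactly one bit inverted
print(sum(a != b for a, b in zip(mutant, "100")))  # 1
```

A full genetic algorithm would evaluate each string's classification accuracy as its fitness and keep the fittest strings plus such offspring for the next generation, as the points above describe.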
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object
forming a separate group. It keeps on merging the objects or groups that are close to
one another. It keeps doing so until all of the groups are merged into one or until the
termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the
objects in the same cluster. In each successive iteration, a cluster is split up into smaller
clusters. This is done until each object is in its own cluster or the termination condition holds.
This method is rigid, i.e., once a merging or splitting is done, it can never be undone.
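The bottom-up procedure can be sketched with single-linkage merging on a handful of 1-D points; the data, the linkage choice, and the termination condition (stop at three clusters) are assumptions for the illustration.

```python
# Minimal agglomerative (bottom-up) sketch: single-linkage on 1-D points.
points = [1.0, 1.2, 5.0, 5.1, 9.0]
clusters = [[p] for p in points]        # start: each object is its own group

def single_link(c1, c2):
    # Distance between the closest members of the two groups.
    return min(abs(a - b) for a in c1 for b in c2)

while len(clusters) > 3:                # assumed termination: 3 clusters left
    # Find the closest pair of groups and merge them (this is irreversible,
    # which is exactly the rigidity noted in the text).
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
    clusters[i].extend(clusters.pop(j))

print(sorted(sorted(c) for c in clusters))
```

Running the loop until one group remains would produce the full merge hierarchy (dendrogram) rather than a fixed partition.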
Approaches to Improve Quality of Hierarchical Clustering
Here are the two approaches that are used to improve the quality of hierarchical
clustering −
Perform careful analysis of object linkages at each hierarchical partitioning.
Integrate hierarchical agglomeration by first using a hierarchical agglomerative
algorithm to group objects into micro-clusters, and then performing macro-
clustering on the micro-clusters.
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing the
given cluster as long as the density in the neighborhood exceeds some threshold, i.e., for
each data point within a given cluster, the radius of a given cluster has to contain at least
a minimum number of points.
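The density condition described here (a neighborhood of a given radius must contain a minimum number of points, as in DBSCAN's core-point test) can be sketched as follows; the radius, threshold, and points are illustrative assumptions.

```python
points = [(1.0, 1.0), (1.1, 1.0), (1.0, 1.2), (8.0, 8.0)]
EPS, MIN_PTS = 0.5, 3   # neighborhood radius and density threshold (assumed)

def neighbors(p, pts, eps=EPS):
    # All points within Euclidean distance eps of p (including p itself).
    return [q for q in pts
            if ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5 <= eps]

# A point is dense enough (a "core" point) if its eps-neighborhood contains
# at least MIN_PTS points; density-based clusters grow from such points.
core = [p for p in points if len(neighbors(p, points)) >= MIN_PTS]
print(core)
```

The isolated point (8.0, 8.0) fails the density test, which is how this family of methods separates clusters from noise.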
Grid-based Method
In this, the objects together form a grid. The object space is quantized into finite number
of cells that form a grid structure.
Advantages
The major advantage of this method is fast processing time.
It is dependent only on the number of cells in each dimension in the quantized
space.
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data for a
given model. This method locates the clusters by clustering the density function. It
reflects spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters
based on standard statistics, taking outlier or noise into account. It therefore yields
robust clustering methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user or application-
oriented constraints. A constraint refers to the user expectation or the properties of
desired clustering results. Constraints provide us with an interactive way of
communication with the clustering process. Constraints can be specified by the user or
the application requirement.
Precision
Recall
F-score
Precision
Precision is the percentage of retrieved documents that are in fact relevant to the query.
Precision can be defined as −
Precision= |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall
Recall is the percentage of documents that are relevant to the query and were in fact
retrieved. Recall is defined as −
Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
F-score
F-score is the commonly used trade-off. An information retrieval system often needs to
trade off precision for recall or vice versa. F-score is defined as the harmonic mean of recall and
precision as follows −
F-score = (recall × precision) / ((recall + precision) / 2)
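The three measures can be computed directly from the relevant and retrieved document sets; the document identifiers below are hypothetical.

```python
def precision_recall_f(relevant, retrieved):
    # Precision = |Relevant ∩ Retrieved| / |Retrieved|
    # Recall    = |Relevant ∩ Retrieved| / |Relevant|
    # F-score   = harmonic mean of precision and recall.
    hit = len(relevant & retrieved)
    p = hit / len(retrieved)
    r = hit / len(relevant)
    f = (2 * p * r) / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical query result: 4 documents retrieved, 5 actually relevant.
relevant  = {"d1", "d2", "d3", "d4", "d5"}
retrieved = {"d1", "d2", "d6", "d7"}
p, r, f = precision_recall_f(relevant, retrieved)
print(p, r, round(f, 3))   # 0.5 0.4 0.444
```

Note how the F-score sits between precision and recall and is pulled toward the smaller of the two, which is why it is used as a single trade-off figure.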
Retail Industry
Data mining has great application in the retail industry because it collects large amounts
of data on sales, customer purchasing history, goods transportation, consumption,
and services. It is natural that the quantity of data collected will continue to expand
rapidly because of the increasing ease, availability and popularity of the web.
Data mining in the retail industry helps in identifying customer buying patterns and trends,
which leads to improved quality of customer service and good customer retention and
satisfaction. Here is the list of examples of data mining in the retail industry −
Design and Construction of data warehouses based on the benefits of data mining.
Multidimensional analysis of sales, customers, products, time and region.
Analysis of effectiveness of sales campaigns.
Customer Retention.
Product recommendation and cross-referencing of items.
Telecommunication Industry
Today the telecommunication industry is one of the fastest-growing industries, providing
various services such as fax, pager, cellular phone, internet messenger, images, e-mail,
web data transmission, etc. Due to the development of new computer and
communication technologies, the telecommunication industry is rapidly expanding. This
is the reason why data mining has become very important in helping to understand the
business.
Data mining in the telecommunication industry helps in identifying telecommunication
patterns, catching fraudulent activities, making better use of resources, and improving the
quality of service. Here is the list of examples for which data mining improves telecommunication
services −
Multidimensional Analysis of Telecommunication data.
Fraudulent pattern analysis.
Identification of unusual patterns.
Multidimensional association and sequential patterns analysis.
Mobile Telecommunication services.
Use of visualization tools in telecommunication data analysis.
What Is an Attribute?
An attribute is a data field, representing a characteristic or feature of a data object. The nouns attribute,
dimension, feature, and variable are often used interchangeably in the literature. The term dimension is
commonly used in data warehousing. Machine learning literature tends to use the term feature, while
statisticians prefer the term variable. Data mining and database professionals commonly use the term
attribute, and we do here as well. Attributes describing a customer object can include, for example,
customer ID, name, and address. Observed values for a given attribute are known as observations. A set
of attributes used to describe a given object is called an attribute vector (or feature vector). The
distribution of data involving one attribute (or variable) is called univariate. A bivariate distribution
involves two attributes, and so on.
The type of an attribute is determined by the set of possible values the attribute can
have: nominal, binary, ordinal, or numeric. In the following subsections, we introduce each type.
1. Nominal Attributes
Nominal means “relating to names.” The values of a nominal attribute are symbols or
names of things. Each value represents some kind of category, code, or state, and so
nominal attributes are also referred to as categorical. The values do not have any
meaningful order. In computer science, the values are also known as enumerations.
Although we said that the values of a nominal attribute are symbols or “names of
things,” it is possible to represent such symbols or “names” with numbers. With hair
color, for instance, we can assign a code of 0 for black, 1 for brown, and so on. Another
example is customer ID, with possible values that are all numeric. However, in such
cases, the numbers are not intended to be used quantitatively. That is, mathematical
operations on values of nominal attributes are not meaningful.
Example 2.1 Nominal attributes. Suppose that hair color and marital status are two attributes describing
person objects. In our application, possible values for hair color are black, brown, blond, red, auburn,
gray, and white. The attribute marital status can take on the values single, married, divorced, and
widowed. Both hair color and marital status are nominal attributes.
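A short sketch of why numeric codes for a nominal attribute must not be used quantitatively, using the hair-color values from Example 2.1 (the particular code assignment is arbitrary):

```python
# Hair-color codes as in the text: numbers that merely name categories.
hair_code = {"black": 0, "brown": 1, "blond": 2, "red": 3,
             "auburn": 4, "gray": 5, "white": 6}

a, b = hair_code["black"], hair_code["brown"]
# Equality tests are meaningful for nominal attributes...
print(a == b)          # False: different categories
# ...but arithmetic is not: the "average" of black (0) and brown (1)
# is 0.5, which names no hair color at all.
print((a + b) / 2)     # 0.5 -- a meaningless quantity for nominal data
```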
2. Binary Attributes
A binary attribute is a nominal attribute with only two categories or states: 0 or 1,
where 0 typically means that the attribute is absent, and 1 means that it is present.
Binary attributes are referred to as Boolean if the two states correspond to true and
false.
Example 2.2 Binary attributes. Given the attribute smoker describing a patient object, 1
indicates that the patient smokes, while 0 indicates that the patient does not. Similarly,
suppose the patient undergoes a medical test that has two possible outcomes. The
attribute medical test is binary, where a value of 1 means the result of the test for the
patient is positive, while 0 means the result is negative.
A binary attribute is symmetric if both of its states are equally valuable and
carry the same weight; that is, there is no preference on which outcome should be
coded as 0 or 1. One such example could be the attribute gender having the states male
and female.
A binary attribute is asymmetric if the outcomes of the states are not equally
important, such as the positive and negative outcomes of a medical test for HIV. By
convention, we code the most important outcome, which is usually the rarest one, by 1
(e.g., HIV positive) and the other by 0 (e.g., HIV negative).
3. Ordinal Attributes
An ordinal attribute is an attribute with possible values that have a meaningful order or
ranking among them, but the magnitude between successive values is not known.
Ordinal attributes are useful for registering subjective assessments of qualities that
cannot be measured objectively; thus ordinal attributes are often used in surveys for
ratings. In one survey, participants were asked to rate how satisfied they were as
customers. Customer satisfaction had the following ordinal categories: 0: very
dissatisfied, 1: somewhat dissatisfied, 2: neutral, 3: satisfied, and 4: very satisfied.
Example 2.3 Ordinal attributes. Suppose that drink size corresponds to the size of drinks
available at a fast-food restaurant. This ordinal attribute has three possible values:
small, medium, and large. The values have a meaningful sequence (which corresponds
to increasing drink size); however, we cannot tell from the values how much bigger, say,
a large is than a medium.
4. Numeric Attributes
A numeric attribute is quantitative; that is, it is a measurable quantity, represented in
integer or real values. Numeric attributes can be interval-scaled or ratio-scaled.
i. Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of equal-size units. The values of
interval-scaled attributes have order and can be positive, 0, or negative. Thus, in
addition to providing a ranking of values, such attributes allow us to compare and
quantify the difference between values.
Example 2.4 Interval-scaled attributes.
A temperature attribute is interval-scaled. Suppose that we have the outdoor
temperature value for a number of different days, where each day is an object. By
ordering the values, we obtain a ranking of the objects with respect to temperature. In
addition, we can quantify the difference between values. For example, a temperature of
20◦C is five degrees higher than a temperature of 15◦C. Calendar dates are another
example. For instance, the years 2002 and 2010 are eight years apart.
ii. Ratio-Scaled Attributes
For data preprocessing to be successful, it is essential to have an overall picture of your data.
Basic statistical descriptions can be used to identify properties of the data and highlight which
data values should be treated as noise or outliers.
In this section, we look at various ways to measure the central tendency of data. Suppose that
we have some attribute X, like salary, which has been recorded for a set of objects. Let
x1,x2,...,xN be the set of N observed values or observations for X. Here, these values may also be
referred to as the data set (for X). If we were to plot the observations for salary, where would
most of the values fall? This gives us an idea of the central tendency of the data. Measures of
central tendency include the mean, median, mode, and midrange. The most common and
effective numeric measure of the “center” of a set of data is the (arithmetic) mean. Let
x1, x2, ..., xN be a set of N values or observations, such as for some numeric attribute X, like
salary. The mean of this set of values is
mean = (x1 + x2 + ··· + xN) / N
This corresponds to the built-in aggregate function average (avg() in SQL) provided in relational
database systems.
Although the mean is the single most useful quantity for describing a data set, it is not always
the best way of measuring the center of the data. A major problem with the mean is its sensitivity to
extreme (e.g., outlier) values. Even a small number of extreme values can corrupt the mean. For
example, the mean salary at a company may be substantially pushed up by that of a few highly paid
managers. Similarly, the mean score of a class in an exam could be pulled down quite a bit by a few very
low scores. To offset the effect caused by a small number of extreme values, we can instead use the
trimmed mean, which is the mean obtained after chopping off values at the high and low extremes. For
example, we can sort the values observed for salary and remove the top and bottom 2% before
computing the mean. We should avoid trimming too large a portion (such as 20%) at both ends, as this
can result in the loss of valuable information.
For skewed (asymmetric) data, a better measure of the center of data is the median, which is the middle
value in a set of ordered data values. It is the value that separates the higher half of a data set from the
lower half. In probability and statistics, the median generally applies to numeric data; however, we may
extend the concept to ordinal data. Suppose that a given data set of N values for an attribute X is sorted
in increasing order. If N is odd, then the median is the middle value of the ordered set. If N is even, then
the median is not unique; it is the two middlemost values and any value in between. If X is a numeric
attribute in this case, by convention, the median is taken as the average of the two middlemost values.
For grouped data, the median can be approximated by interpolation as
median ≈ L1 + ((N/2 − (Σfreq)l) / freqmedian) × width
where L1 is the lower boundary of the median interval, N is the number of values in the entire
data set, (Σfreq)l is the sum of the frequencies of all of the intervals that are lower than the
median interval, freqmedian is the frequency of the median interval, and width is the
width of the median interval.
The mode is another measure of central tendency. The mode for a set of data is the value that occurs
most frequently in the set. Therefore, it can be determined for qualitative and quantitative attributes. It
is possible for the greatest frequency to correspond to several different values, which results in more
than one mode. Data sets with one, two, or three modes are respectively called unimodal, bimodal, and
trimodal. In general, a data set with two or more modes is multimodal. At the other extreme, if each
data value occurs only once, then there is no mode.
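The mean, trimmed mean, median, and mode discussed above can be computed with Python's standard statistics module; the salary list and the 10% trimming fraction are hypothetical.

```python
from statistics import mean, median, multimode

salary = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # hypothetical, in $1000s

print(mean(salary))            # pulled up by the extreme value 110
print(median(salary))          # middle of the ordered data (average of 52 and 56)
print(multimode(salary))       # most frequent values: [52, 70] -> bimodal

def trimmed_mean(xs, frac=0.1):
    # Mean after chopping off frac of the values at each extreme.
    xs = sorted(xs)
    k = int(len(xs) * frac)
    return mean(xs[k:len(xs) - k]) if k else mean(xs)

print(trimmed_mean(salary))    # less affected by the extreme 110
```

With this data the mean (58) exceeds the median (54), the typical signature of a distribution skewed by a few high values.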
2. Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and
Interquartile Range
We now look at measures to assess the dispersion or spread of numeric data. The measures include
range, quantiles, quartiles, percentiles, and the interquartile range. The five-number summary, which
can be displayed as a boxplot, is useful in identifying outliers. Variance and standard deviation also
indicate the spread of a data distribution.
To start off, let’s study the range, quantiles, quartiles, percentiles, and the interquartile range as
measures of data dispersion.
The range of the set is the difference between the largest (max()) and smallest (min()) values. Suppose
that the data for attribute X are sorted in increasing numeric order. Imagine that we can pick certain
data points so as to split the data distribution into equal-size consecutive sets. These data points are
called quantiles.
Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially
equal-size consecutive sets. The kth q-quantile for a given data distribution is the value x such that
at most k/q of the data values are less than x and at most (q − k)/q of the data values are more
than x, where k is an integer such that 0 < k < q. There are q − 1 q-quantiles.
The 2-quantile is the data point dividing the lower and upper halves of the data distribution. It
corresponds to the median.
The 4-quantiles are the three data points that split the data distribution into four equal parts; each part
represents one-fourth of the data distribution. They are more commonly referred to as quartiles.
The 100-quantiles are more commonly referred to as percentiles; they divide the data distribution into
100 equal-sized consecutive sets. The median, quartiles, and percentiles are the most widely used forms
of quantiles.
The distance between the first and third quartiles is a simple measure of spread that gives the range
covered by the middle half of the data. This distance is called the interquartile range (IQR) and is
defined as IQR = Q3 − Q1.
No single numeric measure of spread (e.g., IQR) is very useful for describing skewed distributions.
In a symmetric distribution, the median (and other measures of central tendency) splits the data into
equal-size halves. This does not occur for skewed distributions. Therefore, it is more informative to also
provide the two quartiles Q1 and Q3, along with the median. A common rule of thumb for identifying
suspected outliers is to single out values falling at least 1.5 × IQR above the third quartile or below the
first quartile. Because Q1, the median, and Q3 together contain no information about the endpoints
(e.g., tails) of the data, a fuller summary of the shape of a distribution can be obtained by providing the
lowest and highest data values as well. This is known as the five-number summary. The five-number
summary of a distribution consists of the median (Q2), the quartiles Q1 and Q3, and the smallest and
largest individual observations, written in the order of Minimum, Q1, Median, Q3, Maximum.
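The five-number summary and the 1.5 × IQR rule of thumb can be combined into a small outlier check. A minimal sketch, with invented data and quartiles again computed as Tukey's hinges:

```python
import statistics

def five_number_summary(data):
    """(Minimum, Q1, Median, Q3, Maximum), quartiles from the sorted halves."""
    xs = sorted(data)
    n = len(xs)
    q1 = statistics.median(xs[: n // 2])
    q3 = statistics.median(xs[(n + 1) // 2 :])
    return min(xs), q1, statistics.median(xs), q3, max(xs)

def suspected_outliers(data):
    """Values falling more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    return [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

# 95 lies far above Q3 + 1.5 * IQR, so it is singled out
outliers = suspected_outliers([4, 8, 15, 21, 21, 24, 25, 28, 95])
```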
Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the five-number
summary as follows: Typically, the ends of the box are at the quartiles so that the box length is the
interquartile range. The median is marked by a line within the box. Two lines (called whiskers) outside
the box extend to the smallest (Minimum) and largest (Maximum) observations.
Variance and standard deviation are measures of data dispersion. They indicate how spread out a data
distribution is. A low standard deviation means that the data observations tend to be very close to the
mean, while a high standard deviation indicates that the data are spread out over a large range of
values.
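With Python's statistics module, these two spread measures are one-liners (the sample data are invented):

```python
import statistics

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
variance = statistics.pvariance(data)  # population variance: mean squared deviation
std_dev = statistics.pstdev(data)      # population standard deviation: sqrt of variance
```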
In this section, we study graphic displays of basic statistical descriptions. These include quantile plots,
quantile–quantile plots, histograms, and scatter plots. Such graphs are helpful for the visual inspection
of data, which is useful for data preprocessing. The first three of these show univariate distributions
(i.e., data for one attribute), while scatter plots show bivariate distributions (i.e., involving two
attributes).
Quantile Plot
A quantile plot is a simple and effective way to have a first look at a univariate data distribution. First, it
displays all of the data for the given attribute to assess both the overall behavior and unusual
occurrences. Second, it plots quantile information. Let xi, for i = 1 to N, be the data sorted in increasing order so that x1 is the smallest observation and xN is the largest for some ordinal or numeric attribute X. Each observation, xi, is paired with a percentage, fi, which indicates that approximately fi × 100% of the data are below the value xi. We say “approximately” because there may not be a value with exactly a fraction, fi, of the data below xi. Note that the 0.25 quantile corresponds to quartile Q1, the 0.50 quantile is the median, and the 0.75 quantile is Q3.
These numbers increase in equal steps of 1/N, ranging from 1/(2N) (which is slightly above 0) to 1 − 1/(2N) (which is slightly below 1). On a quantile plot, xi is graphed against fi. This allows us to compare
different distributions based on their quantiles. For example, given the quantile plots of sales data for
two different time periods, we can compare their Q1, median, Q3, and other fi values at a glance.
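The pairing of each sorted value with its fraction fi = (i − 0.5)/N can be sketched as follows (the data values are invented for illustration):

```python
def quantile_plot_points(data):
    """Pair each sorted value x_i with f_i = (i - 0.5) / N, for i = 1..N."""
    xs = sorted(data)
    n = len(xs)
    return [((i - 0.5) / n, x) for i, x in enumerate(xs, start=1)]

# each (f_i, x_i) pair is one point on the quantile plot;
# with N = 5, the fractions step from 0.1 up to 0.9 in steps of 1/N
points = quantile_plot_points([40, 56, 43, 52, 49])
```

Reading off the point with fi = 0.5 gives the median of the data.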
Histograms
Histograms (or frequency histograms) are at least a century old and are widely used. “Histos” means
pole or mast, and “gram” means chart, so a histogram is a chart of poles. Plotting histograms is a
graphical method for summarizing the distribution of a given attribute, X. If X is nominal, such as
automobile model or item type, then a pole or vertical bar is drawn for each known value of X. The
height of the bar indicates the frequency (i.e., count) of that X value. The resulting graph is more
commonly known as a bar chart.
If X is numeric, the term histogram is preferred. The range of values for X is partitioned into disjoint
consecutive subranges. The subranges, referred to as buckets or bins, are disjoint subsets of the data
distribution for X. The range of a bucket is known as the width. Typically, the buckets are of equal width.
For
example, a price attribute with a value range of $1 to $200 (rounded up to the nearest dollar) can be
partitioned into subranges 1 to 20, 21 to 40, 41 to 60, and so on. For each subrange, a bar is drawn with
a height that represents the total count of items observed within the subrange.
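The equal-width bucketing described above can be sketched as follows; the $20 width and the 1-to-200 range follow the running price example, while the individual prices are invented:

```python
def bucket_counts(prices, start=1, width=20):
    """Count values per equal-width bucket (1-20, 21-40, 41-60, ...);
    each bucket is keyed by its lower bound."""
    counts = {}
    for p in prices:
        low = start + ((p - start) // width) * width
        counts[low] = counts.get(low, 0) + 1
    return counts

# 5 and 18 fall in bucket 1-20, 21 and 35 in 21-40, 44 in 41-60, 170 in 161-180
counts = bucket_counts([5, 18, 21, 35, 44, 170])
```

Each bucket's count is the height of the corresponding bar in the histogram.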
A scatter plot is one of the most effective graphical methods for determining if there appears to be a
relationship, pattern, or trend between two numeric attributes. To construct a scatter plot, each pair of
values is treated as a pair of coordinates in an algebraic sense and plotted as points in the plane.
The scatter plot is a useful method for providing a first look at bivariate data to see clusters of points
and outliers, or to explore the possibility of correlation relationships. Two attributes, X, and Y, are
correlated if one attribute implies the other. Correlations can be positive, negative, or null
(uncorrelated). If the plotted points pattern slopes from lower left to upper right, this means that the
values of X increase as the values of Y increase, suggesting a positive correlation. If the pattern of
plotted points slopes from upper left to lower right, the values of X increase as the values of Y decrease,
suggesting a negative correlation.
Data Visualization
Data visualization aims to communicate data clearly and effectively through graphical representation.
Data visualization has been used extensively in many applications—for example, at work for reporting,
managing business operations, and tracking progress of tasks. More popularly, we can take advantage of
visualization techniques to discover data relationships that are otherwise not easily observable by
looking at the raw data. Nowadays, people also use data visualization to create fun and interesting
graphics.
1. Pixel-Oriented Visualization Techniques
A simple way to visualize the value of a dimension is to use a pixel where the color of the pixel reflects
the dimension’s value. For a data set of m dimensions, pixel-oriented techniques create m windows on
the screen, one for each dimension. The m dimension values of a record are mapped to m pixels at the
corresponding positions in the windows. The colors of the pixels reflect the corresponding values.
Inside a window, the data values are arranged in some global order shared by all windows. The global
order may be obtained by sorting all data records in a way that’s meaningful for the task at hand.
In pixel-oriented techniques, data records can also be ordered in a query-dependent way. For example,
given a point query, we can sort all records in descending order of similarity to the point query. Filling a
window by laying out the data records in a linear way may not work well for a wide window. The first
pixel in a row is far away from the last pixel in the previous row, though they are next to each other in
the global order. Moreover, a pixel is next to the one above it in the window, even though the two are
not next to each other in the global order. To solve this problem, we can lay out the data records in a
space-filling curve to fill the windows. A space-filling curve is a curve with a range that covers the entire
n-dimensional unit hypercube. Since the visualization windows are 2-D, we can use any 2-D space-filling
curve.
AllElectronics maintains a customer information table, which consists of four dimensions: income, credit
limit, transaction volume, and age. Can we analyze the correlation between income and the other
attributes by visualization? We can sort all customers in income-ascending order, and use this order to
lay out the customer data in the four visualization windows, as shown in Figure 2.10. The pixel colors are
chosen so that the smaller the value, the lighter the shading. Using pixel-based visualization, we can
easily observe the following: credit limit increases as income increases; customers whose income is in
the middle range are more likely to purchase more from AllElectronics; there is no clear correlation
between income and age.
2. Geometric Projection Visualization Techniques
A drawback of pixel-oriented visualization techniques is that they cannot help us much in understanding
the distribution of data in a multidimensional space. For example, they do not show whether there is a
dense area in a multidimensional subspace. Geometric projection techniques help users find interesting
projections of multidimensional data sets. The central challenge the geometric projection techniques try
to address is how to visualize a high-dimensional space on a 2-D display.
A scatter plot displays 2-D data points using Cartesian coordinates. A third dimension can be added
using different colors or shapes to represent different data points. Through this visualization, we can see
that points of types “+” and “×” tend to be colocated.
For data sets with more than four dimensions, scatter plots are usually ineffective. The scatter-plot
matrix technique is a useful extension to the scatter plot. For an n-dimensional data set, a scatter-plot
matrix is an n × n grid of 2-D scatter plots that provides a visualization of each dimension with every
other dimension.
The scatter-plot matrix becomes less effective as the dimensionality increases. Another popular
technique, called parallel coordinates, can handle higher dimensionality. To visualize n-dimensional data
points, the parallel coordinates technique draws n equally spaced axes, one for each dimension, parallel
to one of the display axes.
A major limitation of the parallel coordinates technique is that it cannot effectively show a data set of
many records. Even for a data set of several thousand records, visual clutter and overlap often reduce
the readability of the visualization and make the patterns hard to find.
3. Icon-Based Visualization Techniques
Icon-based visualization techniques use small icons to represent multidimensional data values. We look
at two popular icon-based techniques: Chernoff faces and stick figures.
Chernoff faces were introduced in 1973 by statistician Herman Chernoff. They display multidimensional
data of up to 18 variables (or dimensions) as a cartoon human face (Figure 2.17). Chernoff faces help
reveal trends in the data. Components of the face, such as the eyes, ears, mouth, and nose, represent
values of the dimensions by their shape, size, placement, and orientation. For example, dimensions can
be mapped to the following facial characteristics: eye size, eye spacing, nose length, nose width, mouth
curvature, mouth width, mouth openness, pupil size, eyebrow slant, eye eccentricity, and head
eccentricity.
Chernoff faces make use of the ability of the human mind to recognize small differences in facial
characteristics and to assimilate many facial characteristics at once.
Viewing large tables of data can be tedious. By condensing the data, Chernoff faces make the data easier
for users to digest. In this way, they facilitate visualization of regularities and irregularities present in the
data, although their power in relating multiple relationships is limited. Another limitation is that specific
data values are not shown. Furthermore, facial features vary in perceived importance. This means that
the similarity of two faces (representing two multidimensional data points) can vary depending on the
order in which dimensions are assigned to facial characteristics.
Therefore, this mapping should be carefully chosen. Eye size and eyebrow slant have been found to be
important. Asymmetrical Chernoff faces were proposed as an extension to the original technique. Since
a face has vertical symmetry (along the y-axis), the left and right side of a face are identical, which
wastes space. Asymmetrical Chernoff faces double the number of facial characteristics, thus allowing up
to 36 dimensions to be displayed.
4. Hierarchical Visualization Techniques
Hierarchical visualization techniques partition all dimensions into subsets (i.e., subspaces). The
subspaces are visualized in a hierarchical manner. “Worlds-within-Worlds,” also known as n-Vision, is a
representative hierarchical visualization method. Suppose we want to visualize a 6-D data set, where the
dimensions are F,X1,...,X5. We want to observe how dimension F changes with respect to the other
dimensions. We can first fix the values of dimensions X3,X4,X5 to some selected values, say, c3,c4,c5.
We can then visualize F,X1,X2 using a 3-D plot, called a world. The position of the origin of the inner
world is located at the point(c3,c4,c5) in the outer world, which is another 3-D plot using dimensions
X3,X4,X5. A user can interactively change, in the outer world, the location of the origin of the inner
world. The user then views the resulting changes of the inner world. Moreover, a user can vary the
dimensions used in the inner world and the outer world. Given more dimensions, more levels of worlds
can be used, which is why the method is called “worlds-within-worlds.”
5. Visualizing Complex Data and Relations
In early days, visualization techniques were mainly for numeric data. Recently, more and more non-
numeric data, such as text and social networks, have become available. Visualizing and analyzing such
data attracts a lot of interest.
There are many new visualization techniques dedicated to these kinds of data. For example, many
people on the Web tag various objects such as pictures, blog entries, and product reviews. A tag cloud is
a visualization of statistics of user-generated tags. Often, in a tag cloud, tags are listed alphabetically or
in a user-preferred order. The importance of a tag is indicated by font size or color. Figure 2.21 shows a
tag cloud for visualizing the popular tags used in a Web site.
Tag clouds are often used in two ways. First, in a tag cloud for a single item, we can use the size of a tag
to represent the number of times that the tag is applied to this item by different users. Second, when
visualizing the tag statistics on multiple items, we can use the size of a tag to represent the number of
items that the tag has been applied to, that is, the popularity of the tag.
In addition to complex data, complex relations among data entries also raise challenges for
visualization. For example, Figure 2.22 uses a disease influence graph to visualize the correlations
between diseases. The nodes in the graph are diseases, and the size of each node is proportional to the
prevalence of the corresponding disease. Two nodes are linked by an edge if the corresponding diseases
have a strong correlation. The width of an edge is proportional to the strength of the correlation pattern
of the two corresponding diseases.
Measuring Data Similarity and Dissimilarity
In data mining applications, such as clustering, outlier analysis, and nearest-neighbor classification, we
need ways to assess how alike or unalike objects are in comparison to one another. For example, a store
may want to search for clusters of customer objects, resulting in groups of customers with similar
characteristics (e.g., similar income, area of residence, and age). Such information can then be used for
marketing.
A cluster is a collection of data objects such that the objects within a cluster are similar to one another
and dissimilar to the objects in other clusters. Outlier analysis also employs clustering-based techniques
to identify potential outliers as objects that are highly dissimilar to others. Knowledge of object
similarities can also be used in nearest-neighbor classification schemes where a given object (e.g., a
patient) is assigned a class label (relating to, say, a diagnosis) based on its similarity toward other objects
in the model.
This section presents similarity and dissimilarity measures, which are referred to as measures of
proximity. Similarity and dissimilarity are related. A similarity measure for two objects, i and j, will
typically return the value 0 if the objects are unalike. The higher the similarity value, the greater the
similarity between objects. (Typically, a value of 1 indicates complete similarity, that is, the objects are
identical.) A dissimilarity measure works the opposite way. It returns a value of 0 if the objects are the
same (and therefore, far from being dissimilar). The higher the dissimilarity value, the more dissimilar
the two objects are.
Data matrix (or object-by-attribute structure): This structure stores the n data objects in the form of a relational table, or n-by-p matrix (n objects × p attributes). Each row corresponds to an object. As part of our notation, we may use f to index through the p attributes.
Dissimilarity matrix (or object-by-object structure): This structure stores a collection of proximities that
are available for all pairs of n objects. It is often represented by an n-by-n table:
where d(i, j) is the measured dissimilarity or “difference” between objects i and j. In general, d(i, j) is a
non-negative number that is close to 0 when objects i and j are highly similar or “near” each other, and
becomes larger the more they differ. Note that d(i, i) = 0; that is, the difference between an object and
itself is 0. Furthermore, d(i, j) = d(j, i). (For readability, we do not show the d(j, i) entries; the matrix is
symmetric.)
A data matrix is made up of two entities or “things,” namely rows (for objects) and columns (for
attributes). Therefore, the data matrix is often called a two-mode matrix. The dissimilarity matrix
contains one kind of entity (dissimilarities) and so is called a one-mode matrix. Many clustering and
nearest-neighbor algorithms operate on a dissimilarity matrix. Data in the form of a data matrix can be
transformed into a dissimilarity matrix before applying such algorithms.
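This transformation can be sketched as follows, assuming Euclidean distance as the measure d(i, j); any other dissimilarity measure could be substituted, and the two-dimensional objects are invented:

```python
import math

def dissimilarity_matrix(data):
    """Build the symmetric n-by-n matrix with d(i, i) = 0 and d(i, j) = d(j, i)."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            d[i][j] = d[j][i] = math.dist(data[i], data[j])  # Euclidean distance
    return d

# objects 0 and 2 are identical, so d(0, 2) = 0; d(0, 1) is the 3-4-5 distance
m = dissimilarity_matrix([(0, 0), (3, 4), (0, 0)])
```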
A nominal attribute can take on two or more states. For example, map color is a nominal attribute that
may have, say, five states: red, yellow, green, pink, and blue.
Let the number of states of a nominal attribute be M. The states can be denoted by letters, symbols, or
a set of integers, such as 1, 2,..., M. Notice that such integers are used just for data handling and do not
represent any specific ordering.
The dissimilarity between two objects i and j can be computed based on the ratio of mismatches: d(i, j) = (p − m)/p, where m is the number of matches (i.e., the number of attributes for which i and j are in the same state), and p is the total number of attributes describing the objects. Weights can be assigned to increase the effect of m or to assign greater weight to the matches in attributes having a larger number of states.
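The ratio of mismatches can be sketched directly from the definition; the three-attribute objects below are invented for illustration:

```python
def nominal_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p, where m counts matching attribute states
    and p is the total number of attributes."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

# 2 of the 3 states match, so d = (3 - 2) / 3
d = nominal_dissimilarity(["red", "sedan", "manual"], ["red", "coupe", "manual"])
```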
3. Proximity Measures for Binary Attributes
A binary attribute has only two states: 0 and 1, where 0 means that the attribute is absent, and 1
means that it is present (Section 2.1.3). Given the attribute smoker describing a patient, for instance, 1
indicates that the patient smokes, while 0 indicates that the patient does not. Treating binary attributes
as if they are numeric can be misleading. Therefore, methods specific to binary data are necessary for
computing dissimilarity.
One approach involves computing a dissimilarity matrix from the given binary data. If all binary
attributes are thought of as having the same weight, we have the 2 × 2 contingency table of Table 2.3,
where q is the number of attributes that equal 1 for both objects i and j, r is the number of attributes
that equal 1 for object i but equal 0 for object j, s is the number of attributes that equal 0 for object i but
equal 1 for object j, and t is the number of attributes that equal 0 for both objects i and j. The total
number of attributes is p, where p = q + r + s + t.
Dissimilarity that is based on symmetric binary attributes is called symmetric binary dissimilarity. If objects i and j are described by symmetric binary attributes, then the dissimilarity between i and j is d(i, j) = (r + s)/(q + r + s + t).
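Counting q, r, s, and t from the contingency table and applying the formula can be sketched as follows (the 0/1 vectors are invented):

```python
def symmetric_binary_dissimilarity(x, y):
    """d(i, j) = (r + s) / (q + r + s + t) over 0/1 attribute vectors."""
    q = sum(1 for a, b in zip(x, y) if (a, b) == (1, 1))  # both 1
    r = sum(1 for a, b in zip(x, y) if (a, b) == (1, 0))  # 1 in i, 0 in j
    s = sum(1 for a, b in zip(x, y) if (a, b) == (0, 1))  # 0 in i, 1 in j
    t = sum(1 for a, b in zip(x, y) if (a, b) == (0, 0))  # both 0
    return (r + s) / (q + r + s + t)

# here q = 1, r = 1, s = 1, t = 1, so d = 2 / 4
d = symmetric_binary_dissimilarity([1, 0, 1, 0], [1, 1, 0, 0])
```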
Cosine Similarity
Applications using such structures include information retrieval, text document clustering, biological
taxonomy, and gene feature mapping. The traditional distance measures that we have studied in this
chapter do not work well for such sparse numeric data. For example, two term-frequency vectors may
have many 0 values in common, meaning that the corresponding documents do not share many words,
but this does not make them similar. We need a measure that will focus on the words that the two documents do have in common, and the occurrence frequency of such words. In other words, we need a measure for numeric data that ignores zero-matches. Cosine similarity is a measure of similarity that can be used to compare documents or, say, give a ranking of documents with respect to a given vector of query words. Let x and y be two vectors for comparison. Then sim(x, y) = (x · y)/(||x|| ||y||), where ||x|| is the Euclidean norm of vector x.
DATA PREPROCESSING
Data have quality if they satisfy the requirements of the intended use. There are many factors
comprising data quality, including accuracy, completeness, consistency, timeliness, believability, and
interpretability.
Accuracy, completeness, and consistency. Inaccurate, incomplete, and inconsistent data are
commonplace properties of large real-world databases and data warehouses. There are many possible
reasons for inaccurate data (i.e., having incorrect attribute values). The data collection instruments used
may be faulty. There may have been human or computer errors occurring at data entry. Users may
purposely submit incorrect data values for mandatory fields when they do not wish to submit personal
information (e.g., by choosing the default value “January 1” displayed for birthday). This is known as
disguised missing data. Errors in data transmission can also occur. There may be technology limitations
such as limited buffer size for coordinating synchronized data transfer and consumption. Incorrect data
may also result from inconsistencies in naming conventions or data codes, or inconsistent formats for
input fields (e.g., date). Duplicate tuples also require data cleaning.
Incomplete data can occur for a number of reasons. Attributes of interest may not always be available,
such as customer information for sales transaction data. Other data may not be included simply because
they were not considered important at the time of entry. Relevant data may not be recorded due to a
misunderstanding or because of equipment malfunctions. Data that were inconsistent with other
recorded data may have been deleted. Furthermore, the recording of the data history or modifications
may have been overlooked. Missing data, particularly for tuples with missing values for some attributes,
may need to be inferred.
Timeliness also affects data quality. Suppose that you are overseeing the distribution of monthly sales
bonuses to the top sales representatives at AllElectronics. Several sales representatives, however, fail to
submit their sales records on time at the end of the month. There are also a number of corrections and
adjustments that flow in after the month’s end. For a period of time following each month, the data
stored in the database are incomplete. However, once all of the data are received, they are correct. The fact
that the month-end data are not updated in a timely fashion has a negative impact on the data quality.
Two other factors affecting data quality are believability and interpretability. Believability reflects how
much the data are trusted by users, while interpretability reflects how easily the data are understood.
Suppose that a database, at one point, had several errors, all of which have since been corrected. The
past errors, however, had caused many problems for sales department users, and so they no longer
trust the data. The data also use many accounting codes, which the sales department does not know
how to interpret. Even though the database is now accurate, complete, consistent, and timely, sales
department users may regard it as of low quality due to poor believability and interpretability.
Major Tasks in Data Preprocessing
1. data cleaning,
2. data integration,
3. data reduction,
4. data transformation.
i. Data cleaning
routines work to “clean” the data by filling in missing values, smoothing noisy
data, identifying or removing outliers, and resolving inconsistencies. If users believe the data
are dirty, they are unlikely to trust the results of any data mining that has been applied.
Furthermore, dirty data can cause confusion for the mining procedure, resulting in
unreliable output. Although most mining routines have some procedures for dealing with
incomplete or noisy data, they are not always robust. Instead, they may concentrate on
avoiding overfitting the data to the function being modeled. Therefore, a useful
preprocessing step is to run your data through some data cleaning routines.
ii. Data integration,
Getting back to your task at AllElectronics, suppose that you would like to include data from
multiple sources in your analysis. This would involve integrating multiple databases, data
cubes, or files (i.e., data integration). Yet some attributes representing a given concept may
have different names in different databases, causing inconsistencies and redundancies. For
example, the attribute for customer identification may be referred to as customer id in one
data store and cust id in another. Naming inconsistencies may also occur for attribute
values. For example, the same first name could be registered as “Bill” in one database,
“William” in another, and “B.” in a third. Furthermore, you suspect that some attributes may
be inferred from others (e.g., annual revenue). Having a large amount of redundant data
may slow down or confuse the knowledge discovery process. Clearly, in addition to data
cleaning, steps must be taken to help avoid redundancies during data integration. Typically,
data cleaning and data integration are performed as a preprocessing step when preparing
data for a data warehouse. Additional data cleaning can be performed to detect and remove
redundancies that may have resulted from data integration.
iii. Data reduction
Data reduction obtains a reduced representation of the data set that is much smaller in
volume, yet produces the same (or almost the same) analytical results. Data reduction
strategies include dimensionality reduction and numerosity reduction. In dimensionality
reduction, data encoding schemes are applied so as to obtain a reduced or “compressed”
representation of the original data. Examples include data compression techniques (e.g.,
wavelet transforms and principal components analysis), attribute subset selection (e.g.,
removing irrelevant attributes), and attribute construction (e.g., where a small set of more
useful attributes is derived from the original set).
In numerosity reduction, the data are replaced by alternative, smaller representations using
parametric models (e.g., regression or log-linear models) or nonparametric models (e.g.,
histograms, clusters, sampling, or data aggregation).
iv. Data Transformation
Getting back to your data, you have decided, say, that you would like to use a distance-based mining algorithm for your analysis, such as neural networks, nearest-neighbor classifiers, or clustering. Such methods provide better results if the data to be analyzed have been
normalized, that is, scaled to a smaller range such as [0.0, 1.0]. Your customer data, for
example, contain the attributes age and annual salary. The annual salary attribute usually
takes much larger values than age. Therefore, if the attributes are left unnormalized, the
distance measurements taken on annual salary will generally outweigh distance
measurements taken on age.
Discretization and concept hierarchy generation can also be useful, where raw data values
for attributes are replaced by ranges or higher conceptual levels. For example, raw values
for age may be replaced by higher-level concepts, such as youth, adult, or senior.
Discretization and concept hierarchy generation are powerful tools for data mining in that
they allow data mining at multiple abstraction levels. Normalization, data discretization, and
concept hierarchy generation are forms of data transformation.
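Both transformations can be sketched in a few lines: min-max normalization into [0.0, 1.0], and a concept-hierarchy style discretization of age. The cut points 20 and 60 are illustrative assumptions, not prescribed values:

```python
def min_max_normalize(values, new_lo=0.0, new_hi=1.0):
    """Linearly rescale values into [new_lo, new_hi]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo for v in values]

def discretize_age(age):
    """Map a raw age to a higher-level concept (hypothetical cut points)."""
    if age < 20:
        return "youth"
    if age < 60:
        return "adult"
    return "senior"

# after normalization, age and annual salary live on the same [0, 1] scale
scaled = min_max_normalize([20, 30, 40, 60])
```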
Data Cleaning
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data
cleansing) routines attempt to fill in missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data. In this section, you will study basic methods
for data cleaning.
Missing Values
Imagine that you need to analyze AllElectronics sales and customer data. Let’s look at the
following methods.
1. Ignore the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification). This method is not very effective, unless the tuple
contains several attributes with missing values. It is especially poor when the percentage of
missing values per attribute varies considerably. By ignoring the tuple, we do not make use
of the remaining attributes’ values in the tuple. Such data could have been useful to the task
at hand.
2. Fill in the missing value manually: In general, this approach is time consuming and may
not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by
the same constant such as a label like “Unknown” or −∞. If missing values are replaced by,
say, “Unknown,” then the mining program may mistakenly think that they form an
interesting concept, since they all have a value in common—that of “Unknown.” Hence,
although this method is simple, it is not foolproof.
4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value: Measures of central tendency indicate the “middle” value of a data distribution. For normal (symmetric) data distributions, the mean can be used, while skewed data distributions should employ the median.
5. Use the attribute mean or median for all samples belonging to the same class as the
given tuple: For example, if classifying customers according to credit risk, we may replace
the missing value with the mean income value for customers in the same credit risk category
as that of the given tuple. If the data distribution for a given class is skewed, the median
value is a better choice.
6. Use the most probable value to fill in the missing value: This may be determined with
regression, inference-based tools using a Bayesian formalism, or decision tree induction.
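Method 4 can be sketched as follows, with None marking a missing value; the income figures are invented:

```python
import statistics

def fill_missing(values, strategy="mean"):
    """Replace None entries using the mean (symmetric data) or median (skewed data)."""
    present = [v for v in values if v is not None]
    fill = statistics.mean(present) if strategy == "mean" else statistics.median(present)
    return [fill if v is None else v for v in values]

incomes = [30, None, 50, 40]
filled = fill_missing(incomes, strategy="mean")  # the missing entry becomes 40
```

Method 5 is the same idea applied per class: compute the fill value only from tuples in the same class as the given tuple.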
Noisy Data
Noise is a random error or variance in a measured variable.
Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that
is, the values around it. The sorted values are distributed into a number of “buckets,” or
bins. Because binning methods consult the neighborhood of values, they perform local
smoothing. In smoothing by bin means, each value in a bin is replaced by the mean value of
the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each
original value in this bin is replaced by the value 9.
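Smoothing by bin means can be sketched as follows, using equal-frequency bins of three values; with the data below, the first bin (4, 8, 15) is replaced by its mean, 9, as in the example above:

```python
def smooth_by_bin_means(values, bin_size):
    """Partition the sorted data into equal-frequency bins and replace each
    value by the mean of its bin (local smoothing)."""
    xs = sorted(values)
    smoothed = []
    for start in range(0, len(xs), bin_size):
        bin_vals = xs[start : start + bin_size]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(data, bin_size=3)  # bins -> 9.0, 22.0, 29.0
```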
Regression: Data smoothing can also be done by regression, a technique that conforms data
values to a function. Linear regression involves finding the “best” line to fit two attributes
(or variables) so that one attribute can be used to predict the other. Multiple linear
regression is an extension of linear regression, where more than two attributes are involved
and the data are fit to a multidimensional surface.
Outlier analysis: Outliers may be detected by clustering, for example, where similar values
are organized into groups, or “clusters.” Intuitively, values that fall outside of the set of
clusters may be considered outliers.
The first step in data cleaning as a process is discrepancy detection. Discrepancies can be
caused by several factors, including poorly designed data entry forms that have many
optional fields, human error in data entry, deliberate errors (e.g., respondents not wanting
to divulge information about themselves), and data decay (e.g., outdated addresses).
Discrepancies may also arise from inconsistent data representations and inconsistent use
of codes. Other sources of discrepancies include errors in instrumentation devices that
record data and system errors. Errors can also occur when the data are (inadequately)
used for purposes other than originally intended. There may also be inconsistencies due to
data integration. As a starting point, use any knowledge you may already have regarding
properties of the data. Such knowledge or “data about data” is referred to as metadata.
As a data analyst, you should be on the lookout for the inconsistent use of codes and any
inconsistent data representations (e.g., “2010/12/25” and “25/12/2010” for date). Field
overloading is another error source that typically results when developers squeeze new
attribute definitions into unused (bit) portions of already defined attributes (e.g., an
unused bit of an attribute that has a value range that uses only, say, 31 out of 32 bits).
The data should also be examined regarding unique rules, consecutive rules, and null
rules. A unique rule says that each value of the given attribute must be different from all
other values for that attribute. A consecutive rule says that there can be no missing values
between the lowest and highest values for the attribute, and that all values must also be
unique (e.g., as in check numbers). A null rule specifies the use of blanks, question marks,
special characters, or other strings that may indicate the null condition (e.g., where a value
for a given attribute is not available), and how such values should be handled.
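The three rule types can be checked mechanically. The sketch below is illustrative only; the column values and null markers are hypothetical, and it assumes an integer-coded attribute such as check numbers:

```python
# Minimal sketch of checking the unique, consecutive, and null rules
# on a single attribute column of string-encoded integers.
NULL_MARKERS = frozenset({"", "?", "NULL"})

def check_rules(column):
    known = [v for v in column if v not in NULL_MARKERS]
    report = {
        "null_count": len(column) - len(known),      # null rule: count null markers
        "unique_ok": len(known) == len(set(known)),  # unique rule: no duplicates
    }
    nums = sorted(int(v) for v in known)
    # consecutive rule: no gaps between the lowest and highest values
    report["consecutive_ok"] = nums == list(range(nums[0], nums[-1] + 1))
    return report

print(check_rules(["101", "102", "?", "104"]))
# {'null_count': 1, 'unique_ok': True, 'consecutive_ok': False}
```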
There are a number of different commercial tools that can aid in the discrepancy
detection step. Data scrubbing tools use simple domain knowledge (e.g., knowledge of
postal addresses and spell-checking) to detect errors and make corrections in the data.
These tools rely on parsing and fuzzy matching techniques when cleaning data from
multiple sources. Data auditing tools find discrepancies by analyzing the data to discover
rules and relationships, and detecting data that violate such conditions. They are variants
of data mining tools.
Commercial tools can assist in the data transformation step. Data migration tools allow
simple transformations to be specified such as to replace the string “gender” by “sex.” ETL
(extraction/transformation/loading) tools allow users to specify transforms through a
graphical user interface (GUI). These tools typically support only a restricted set of
transforms so that, often, we may also choose to write custom scripts for this step of the
data cleaning process.
The two-step process of discrepancy detection and data transformation (to correct
discrepancies) iterates. This process, however, is error-prone and time consuming. Some
transformations may introduce more discrepancies. Some nested discrepancies may only
be detected after others have been fixed. For example, a typo such as “20010” in a year
field may only surface once all date values have been converted to a uniform format.
Transformations are often done as a batch process while the user waits without feedback.
Only after the transformation is complete can the user go back and check that no new
anomalies have been mistakenly created. Typically, numerous iterations are required
before the user is satisfied. Any tuples that cannot be automatically handled by a given
transformation are typically written to a file without any explanation regarding the
reasoning behind their failure. As a result, the entire data cleaning process also suffers
from a lack of interactivity.
New approaches to data cleaning emphasize increased interactivity. Potter’s Wheel, for
example, is a publicly available data cleaning tool that integrates discrepancy detection
and transformation. Users gradually build a series of transformations by composing and
debugging individual transformations, one step at a time, on a spreadsheet-like interface.
The transformations can be specified graphically or by providing examples. Results are
shown immediately on the records that are visible on the screen. The user can choose to
undo the transformations, so that transformations that introduced additional errors can be
“erased.” The tool automatically performs discrepancy checking in the background on the
latest transformed view of the data. Users can gradually develop and refine
transformations as discrepancies are found, leading to more effective and efficient data
cleaning.
Data Integration
Data mining often requires data integration—the merging of data from multiple data
stores. Careful integration can help reduce and avoid redundancies and inconsistencies in
the resulting data set. This can help improve the accuracy and speed of the subsequent
data mining process. The semantic heterogeneity and structure of data pose great
challenges in data integration. How can we match schema and objects from different
sources? This is the essence of the entity identification problem.
It is likely that your data analysis task will involve data integration, which combines data
from multiple sources into a coherent data store, as in data warehousing. These sources
may include multiple databases, data cubes, or flat files. There are a number of issues to
consider during data integration. Schema integration and object matching can be tricky.
This is referred to as the entity identification problem. For example, how can the data
analyst or the computer be sure that customer_id in one database and cust_number in
another refer to the same attribute? Examples of metadata for each attribute include the
name, meaning, data type, and range of values permitted for the attribute, and null rules
for handling blank, zero, or null values (Section 3.2).
Such metadata can be used to help avoid errors in schema integration. The metadata may
also be used to help transform the data (e.g., where data codes for pay type in one
database may be “H” and “S” but 1 and 2 in another). When matching attributes from one
database to another during integration, special attention must be paid to the structure of
the data. This is to ensure that any attribute functional dependencies and referential
constraints in the source system match those in the target system. For example, in one
system, a discount may be applied to the order, whereas in another system it is applied to
each individual line item within the order. If this is not caught before integration, items in
the target system may be improperly discounted.
Data Value Conflict Detection and Resolution
Data integration also involves the detection and resolution of data value conflicts. For
example, for the same real-world entity, attribute values from different sources may differ.
This may be due to differences in representation, scaling, or encoding. For instance, a
weight attribute may be stored in metric units in one system and British imperial units in
another. For a hotel chain, the price of rooms in different cities may involve not only
different currencies but also different services (e.g., free breakfast) and taxes. When
exchanging information between schools, for example, each school may have its own
curriculum and grading scheme. One university may adopt a quarter system, offer three
courses on database systems, and assign grades from A+ to F, whereas another may adopt a
semester system, offer two courses on databases, and assign grades from 1 to 10. It is
difficult to work out precise course-to-grade transformation rules between the two
universities, making information exchange difficult. Attributes may also differ on the
abstraction level, where an attribute in one system is recorded at, say, a lower abstraction
level than the “same” attribute in another. For example, the total sales in one database may
refer to one branch of AllElectronics, while an attribute of the same name in another
database may refer to the total sales for AllElectronics stores in a given region.
Data Reduction
Data reduction techniques can be applied to obtain a reduced representation of the data set
that is much smaller in volume, yet closely maintains the integrity of the original data. That
is, mining on the reduced data set should be more efficient yet produce the same (or almost
the same) analytical results.
Numerosity reduction techniques replace the original data volume by alternative, smaller
forms of data representation. These techniques may be parametric or nonparametric. For
parametric methods, a model is used to estimate the data, so that typically only the data
parameters need to be stored, instead of the actual data. Regression and log-linear models
are examples. Nonparametric methods for storing reduced representations of the data
include histograms, clustering, sampling, and data cube aggregation.
a. Wavelet Transforms
The discrete wavelet transform (DWT) is a linear signal processing technique that, when
applied to a data vector X, transforms it to a numerically different vector, X′, of wavelet
coefficients. The two vectors are of the same length. When applying this technique to data
reduction, we consider each tuple as an n-dimensional data vector, that is, X = (x1,x2,...,xn),
depicting n measurements made on the tuple from n database attributes.
The general procedure for applying a discrete wavelet transform uses a hierarchical
pyramid algorithm that halves the data at each iteration, resulting in fast computational
speed. The method is as follows:
1. The length, L, of the input data vector must be an integer power of 2. This condition can
be met by padding the data vector with zeros as necessary (L ≥ n).
2. Each transform involves applying two functions. The first applies some data smoothing,
such as a sum or weighted average. The second performs a weighted difference, which acts
to bring out the detailed features of the data.
3. The two functions are applied to pairs of data points in X, that is, to all pairs of
measurements (x2i, x2i+1). This results in two data sets of length L/2. In general, these
represent a smoothed or low-frequency version of the input data and the high-frequency
content of it, respectively.
4. The two functions are recursively applied to the data sets obtained in the previous loop,
until the resulting data sets obtained are of length 2.
5. Selected values from the data sets obtained in the previous iterations are designated the
wavelet coefficients of the transformed data.
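Assuming the Haar wavelet, whose smoothing and differencing functions are the pairwise average and pairwise half-difference, the pyramid algorithm above can be sketched as:

```python
# Sketch of the hierarchical pyramid algorithm with Haar wavelet
# functions. The input length must be a power of 2 (pad with zeros
# otherwise), and the data is halved at each iteration.
def haar_dwt(x):
    coeffs = []
    while len(x) > 1:
        smooth = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]  # low-frequency half
        detail = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]  # high-frequency detail
        coeffs = detail + coeffs   # keep the detail coefficients from each level
        x = smooth                 # recurse on the smoothed half
    return x + coeffs              # overall average first, then details

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

The original vector can be reconstructed exactly from all coefficients; data reduction comes from storing only the largest coefficients and treating the rest as zero.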
Principal components analysis (PCA; also called the Karhunen-Loeve, or K-L, method)
searches for k n-dimensional orthogonal vectors that can best be used to represent the
data, where k ≤ n. The original data are thus projected onto a much smaller space,
resulting in dimensionality reduction. Unlike attribute subset selection (Section 3.4.4),
which reduces the attribute set size by retaining a subset of the initial set of attributes,
PCA “combines” the essence of attributes by creating an alternative, smaller set of
variables. The initial data can then be projected onto this smaller set.
PCA often reveals relationships that were not previously suspected and thereby allows
interpretations that would not ordinarily result. The basic procedure is as follows:
1. The input data are normalized, so that each attribute falls within the same range. This
step helps ensure that attributes with large domains will not dominate attributes with
smaller domains.
2. PCA computes k orthonormal vectors that provide a basis for the normalized input data.
These are unit vectors that each point in a direction perpendicular to the others. These
vectors are referred to as the principal components. The input data are a linear
combination of the principal components.
3. The principal components are sorted in order of decreasing “significance” or strength.
They essentially serve as a new set of axes for the data, with the first axis showing the
most variance among the data.
4. Because the components are sorted in decreasing order of “significance,” the data size
can be reduced by eliminating the weaker components, that is, those with low variance.
Using the strongest principal components, it should be possible to reconstruct a good
approximation of the original data.
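The four steps can be sketched with NumPy via an eigendecomposition of the covariance matrix; the function name and toy data are illustrative, not a production implementation:

```python
# Sketch of the PCA procedure: normalize, find orthonormal components,
# sort by variance, and project onto the strongest k components.
import numpy as np

def pca_reduce(X, k):
    Xn = (X - X.mean(axis=0)) / X.std(axis=0)  # step 1: normalize each attribute
    cov = np.cov(Xn, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # step 2: orthonormal basis vectors
    order = np.argsort(eigvals)[::-1]          # step 3: sort by decreasing significance
    components = eigvecs[:, order[:k]]         # step 4: keep the top-k components
    return Xn @ components                     # project onto the smaller space

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
reduced = pca_reduce(X, 1)
print(reduced.shape)  # (5, 1)
```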
Data sets for analysis may contain hundreds of attributes, many of which may be
irrelevant to the mining task or redundant. For example, if the task is to classify customers
based on whether or not they are likely to purchase a popular new CD at AllElectronics
when notified of a sale, attributes such as the customer’s telephone number are likely to
be irrelevant, unlike attributes such as age or music taste. Although it may be possible for a
domain expert to pick out some of the useful attributes, this can be a difficult and
time-consuming task, especially when the data’s behavior is not well known. (Hence, a
reason behind its analysis!) Leaving out relevant attributes or keeping irrelevant attributes
may be detrimental, causing confusion for the mining algorithm employed. This can result
in discovered patterns of poor quality. In addition, the added volume of irrelevant or
redundant attributes can slow down the mining process.
Attribute subset selection reduces the data set size by removing irrelevant or redundant
attributes (or dimensions). The goal of attribute subset selection is to find a minimum set
of attributes such that the resulting probability distribution of the data classes is as close as
possible to the original distribution obtained using all attributes. Mining on a reduced set
of attributes has an additional benefit: It reduces the number of attributes appearing in the
discovered patterns, helping to make the patterns easier to understand.
1. Stepwise forward selection: The procedure starts with an empty set of attributes as the
reduced set. The best of the original attributes is determined and added to the
reduced set. At each subsequent iteration or step, the best of the remaining original
attributes is added to the set.
2. Stepwise backward elimination: The procedure starts with the full set of attributes. At each
step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The two methods can be
combined so that, at each step, the procedure selects the best attribute and removes the
worst from among the remaining attributes.
4. Decision tree induction: Decision tree algorithms (e.g., ID3, C4.5, and CART) were
originally intended for classification. Decision tree induction constructs a flowchartlike
structure where each internal (nonleaf) node denotes a test on an attribute, each branch
corresponds to an outcome of the test, and each external (leaf) node denotes a class
prediction. At each node, the algorithm chooses the “best” attribute to partition the data
into individual classes.
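The stepwise forward selection procedure described above can be sketched as follows; the `score` function stands in for whatever subset-quality measure is used (e.g., the accuracy of a classifier trained on the subset), and the attribute names and weights are hypothetical:

```python
# Sketch of greedy stepwise forward selection: repeatedly add the
# single attribute that most improves the subset score.
def forward_select(attributes, score, k):
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy score: each attribute contributes a fixed weight to the subset.
weights = {"age": 3, "income": 2, "phone": 0}
score = lambda subset: sum(weights[a] for a in subset)
print(forward_select(["age", "income", "phone"], score, 2))
# ['age', 'income']
```

Stepwise backward elimination is the mirror image: start from the full set and repeatedly remove the attribute whose removal hurts the score least.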
Parametric Data Reduction Regression and log-linear models can be used to approximate
the given data. In (simple) linear regression, the data are modeled to fit a straight line. For
example, a random variable, y (called a response variable), can be modeled as a linear
function of another random variable, x (called a predictor variable), with the equation
y = wx + b,
where the variance of y is assumed to be constant. In the context of data mining, x and y
are numeric database attributes. The coefficients, w and b (called regression coefficients)
specify the slope of the line and the y-intercept, respectively. These coefficients can be
solved for by the method of least squares, which minimizes the error between the actual
line separating the data and the estimate of the line. Multiple linear regression is an
extension of (simple) linear regression, which allows a response variable, y, to be modeled
as a linear function of two or more predictor variables
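The least-squares coefficients for simple linear regression have a closed form, which can be sketched on toy data (the values below are hypothetical and chosen to lie exactly on a line):

```python
# Sketch of fitting y = w*x + b by the method of least squares.
def least_squares(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # w = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - w * mean_x   # the fitted line passes through the point of means
    return w, b

w, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])   # data on the line y = 2x + 1
print(w, b)  # 2.0 1.0
```

For parametric data reduction, only the two coefficients need to be stored in place of the original (x, y) pairs.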
e. Histograms
Histograms use binning to approximate data distributions and are a popular form of data
reduction. A histogram for an attribute, A, partitions the
data distribution of A into disjoint subsets, referred to as buckets or bins. If each bucket
represents only a single attribute–value/frequency pair, the buckets are called singleton
buckets. Often, buckets instead represent continuous ranges for the given attribute.
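As an illustrative sketch (the bucket count and price values are hypothetical), an equal-width histogram, where each bucket covers a range of the same size, can be built as:

```python
# Sketch of an equal-width histogram: each bucket spans an equal-size
# continuous range of the attribute, and only counts are stored.
def equal_width_histogram(values, num_buckets):
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # clamp so the maximum value falls in the last bucket
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return counts

print(equal_width_histogram([1, 1, 5, 5, 8, 10, 12, 14, 15, 15], 3))
# [4, 2, 4]
```

An equal-frequency (equidepth) histogram instead chooses bucket boundaries so that each bucket holds roughly the same number of values.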
f. Clustering
Clustering techniques consider data tuples as objects. They partition the objects into
groups, or clusters, so that objects within a cluster are “similar” to one another and
“dissimilar” to objects in other clusters. Similarity is commonly defined in terms of how
“close” the objects are in space, based on a distance function. The “quality” of a cluster
may be represented by its diameter, the maximum distance between any two objects in
the cluster. Centroid distance is an alternative measure of cluster quality and is defined as
the average distance of each cluster object from the cluster centroid (denoting the
“average object,” or average point in space for the cluster). Figure 3.3 showed a 2-D plot of
customer data with respect to customer locations in a city. Three data clusters are visible.
In data reduction, the cluster representations of the data are used to replace the actual
data. The effectiveness of this technique depends on the data’s nature.
In data transformation, the data are transformed or consolidated into forms appropriate
for mining. Strategies for data transformation include the following:
1. Smoothing, which works to remove noise from the data. Techniques include binning,
regression, and clustering.
2. Attribute construction (or feature construction), where new attributes are constructed
and added from the given set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual
total amounts. This step is typically used in constructing a data cube for data analysis at
multiple abstraction levels.
4. Normalization, where the attribute data are scaled so as to fall within a smaller range,
such as −1.0 to 1.0, or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by
interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The
labels, in turn, can be recursively organized into higher-level concepts, resulting in a
concept hierarchy for the numeric attribute. Figure 3.12 shows a concept hierarchy for the
attribute price. More than one concept hierarchy can be defined for the same attribute to
accommodate the needs of various users.
6. Concept hierarchy generation for nominal data, where attributes such as street can be
generalized to higher-level concepts, like city or country. Many hierarchies for nominal
attributes are implicit within the database schema and can be automatically defined at the
schema definition level.
Data discretization and concept hierarchy generation are also forms of data reduction. The
raw data are replaced by a smaller number of interval or concept labels. This simplifies the
original data and makes the mining more efficient. The resulting patterns mined are
typically easier to understand. Concept hierarchies are also useful for mining at multiple
abstraction levels.
The measurement unit used can affect the data analysis. For example, changing
measurement units from meters to inches for height, or from kilograms to pounds for
weight, may lead to very different results. In general, expressing an attribute in smaller
units will lead to a larger range for that attribute, and thus tend to give such an attribute
greater effect or “weight.” To help avoid dependence on the choice of measurement units,
the data should be normalized or standardized. This involves transforming the data to fall
within a smaller or common range such as [−1,1] or [0.0, 1.0]. (The terms standardize and
normalize are used interchangeably in data preprocessing, although in statistics, the latter
term also has other connotations.) Normalizing the data attempts to give all attributes an
equal weight. Normalization is particularly useful for classification algorithms involving
neural networks or distance measurements such as nearest-neighbor classification and
clustering.
If using the neural network backpropagation algorithm for classification mining (Chapter
9), normalizing the input values for each attribute measured in the training tuples will help
speed up the learning phase. For distance-based methods, normalization helps
prevent attributes with initially large ranges (e.g., income) from outweighing attributes
with initially smaller ranges (e.g., binary attributes). It is also useful when given no prior
knowledge of the data.
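As a sketch, min-max normalization to [0.0, 1.0] and z-score standardization can be written as follows; the income values are hypothetical:

```python
# Sketch of two common normalization schemes.
def min_max(values):
    """Scale values linearly so min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center on the mean and scale by the standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

incomes = [30000, 50000, 70000, 90000]
print(min_max(incomes))  # [0.0, 0.333..., 0.666..., 1.0]
print(z_score(incomes))  # zero mean, unit standard deviation
```

Min-max normalization preserves the relationships among the original values but is sensitive to outliers at the extremes; z-score standardization is preferred when the minimum and maximum are unknown or outlier-dominated.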
Discretization by Binning and by Clustering
Binning is a top-down splitting technique based on a specified number of bins: the sorted
values are partitioned into bins (e.g., equal-width or equal-frequency), and each bin's
values are replaced by the bin mean or median; applied recursively, this produces a
concept hierarchy for the attribute. Clustering can also be used to generate a concept
hierarchy for A by following either a top-
down splitting strategy or a bottom-up merging strategy, where each cluster forms a node
of the concept hierarchy. In the former, each initial cluster or partition may be further
decomposed into several subclusters, forming a lower level of the hierarchy. In the latter,
clusters are formed by repeatedly grouping neighboring clusters in order to form higher-
level concepts.
We now look at data transformation for nominal data. In particular, we study concept
hierarchy generation for nominal attributes. Nominal attributes have a finite (but possibly
large) number of distinct values, with no ordering among the values. Examples include
geographic location, job category, and item type. Manual definition of concept hierarchies
can be a tedious and time-consuming task for a user or a domain expert. Fortunately,
many hierarchies are implicit within the database schema and can be automatically
defined at the schema definition level. The concept hierarchies can be used to transform
the data into multiple levels of granularity. For example, data mining patterns regarding
sales may be found relating to specific regions or countries, in addition to individual branch
locations. We study four methods for the generation of concept hierarchies for nominal
data, as follows.
1. Specification of a partial ordering of attributes explicitly at the schema level by users
or experts: A concept hierarchy can be defined by specifying a partial or total ordering of
the attributes at the schema level, for example, street < city < province_or_state < country
for location.
2. Specification of a portion of a hierarchy by explicit data grouping: In a large database,
it is unrealistic to define an entire hierarchy by explicit value enumeration, but a user may
specify explicit groupings for a small portion of intermediate-level data, such as
"{Alberta, Saskatchewan, Manitoba} ⊂ prairies_Canada."
3. Specification of a set of attributes, but not of their partial ordering: A user may specify
a set of attributes forming a concept hierarchy, but omit to explicitly state their partial
ordering. The system can then try to automatically generate the attribute ordering so as to
construct a meaningful concept hierarchy. “Without knowledge of data semantics, how
can a hierarchical ordering for an arbitrary set of nominal attributes be found?” Consider
the observation that since higher-level concepts generally cover several subordinate
lower-level concepts, an attribute defining a high concept level (e.g., country) will usually
contain a smaller number of distinct values than an attribute defining a lower concept
level (e.g., street). Based on this observation, a concept hierarchy can be automatically
generated based on the number of distinct values per attribute in the given attribute set.
The attribute with the most distinct values is placed at the lowest hierarchy level. The
lower the number of distinct values an attribute has, the higher it is in the generated
concept hierarchy.
4. Specification of only a partial set of attributes: Sometimes a user can be careless when
defining a hierarchy, or have only a vague idea about what should be included in a
hierarchy. Consequently, the user may have included only a small subset of the relevant
attributes in the hierarchy specification. For example, instead of including all of the
hierarchically relevant attributes for location, the user may have specified only street and
city. To handle such partially specified hierarchies, it is important to embed data semantics
in the database schema so that attributes with tight semantic connections can be pinned
together. In this way, the specification of one attribute may trigger a whole group of
semantically tightly linked attributes to be “dragged in” to form a complete hierarchy.
Users, however, should have the option to override this feature, as necessary.
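The distinct-value observation in method 3 above can be sketched as a small routine; the sample rows are hypothetical:

```python
# Sketch of ordering nominal attributes into a concept hierarchy by
# their number of distinct values: fewest distinct values -> highest level.
def order_hierarchy(rows, attributes):
    return sorted(attributes, key=lambda a: len({row[a] for row in rows}))

rows = [
    {"street": "5 Oak", "city": "Delhi",  "country": "India"},
    {"street": "9 Elm", "city": "Mumbai", "country": "India"},
    {"street": "2 Ash", "city": "Delhi",  "country": "India"},
]
print(order_hierarchy(rows, ["street", "city", "country"]))
# ['country', 'city', 'street']  (top of the hierarchy first)
```

As the text notes, this is a heuristic: an attribute like weekday has only seven distinct values yet belongs near the bottom of a time hierarchy, so users should be able to correct the generated ordering.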
Unit 4. Classification and association rule mining
1. Classification basics,
supervised Vs unsupervised learning,
and Prediction.
2. Issues Regarding Classification and Prediction.
3. Classification by Decision Tree Introduction:
what is decision tree?
Algorithm for Decision Tree Induction,
Attribute Selection Measure,
Extracting Classification Rules from Trees,
Approaches to Determine the Final Tree Size,
Enhancements to basic decision tree induction.
4. Association rule mining :
Basics,
Mining single-dimensional Boolean association rules from transactional databases,
Mining multilevel association rules from transactional databases,
Mining multidimensional association rules from transactional databases and data
warehouses.
What Is Classification?
A bank loans officer needs analysis of her data to learn which loan applicants are “safe” and
which are “risky” for the bank. A marketing manager at AllElectronics needs data analysis to help guess
whether a customer with a given profile will buy a new computer. A medical researcher wants to
analyze breast cancer data to predict which one of three specific treatments a patient should receive. In
each of these examples, the data analysis task is classification, where a model or classifier is constructed
to predict class (categorical) labels, such as “safe” or “risky” for the loan application data; “yes” or “no”
for the marketing data; or “treatment A,” “treatment B,” or “treatment C” for the medical data. These
categories can be represented by discrete values, where the ordering among values has no meaning. For
example, the values 1, 2, and 3 may be used to represent treatments A, B, and C, where there is no
ordering implied among this group of treatment regimes.
Suppose that the marketing manager wants to predict how much a given customer will spend
during a sale at AllElectronics. This data analysis task is an example of numeric prediction, where the
model constructed predicts a continuous-valued function, or ordered value, as opposed to a class label.
This model is a predictor. Regression analysis is a statistical methodology that is most often used for
numeric prediction; hence the two terms tend to be used synonymously, although other methods for
numeric prediction exist. Classification and numeric prediction are the two major types of prediction
problems. This chapter focuses on classification.
In the first step, a classifier is built describing a predetermined set of data classes or concepts. This is the
learning step (or training phase), where a classification algorithm builds the classifier by analyzing or
“learning from” a training set made up of database tuples and their associated class labels. A tuple, X, is
represented by an n-dimensional attribute vector, X = (x1, x2,..., xn), depicting n measurements made on
the tuple from n database attributes, respectively, A1, A2,..., An. Each tuple, X, is assumed to belong to
a predefined class as determined by another database attribute called the class label attribute.
The class label attribute is discrete-valued and unordered. It is categorical (or nominal) in that each
value serves as a category or class. The individual tuples making up the training set are referred to as
training tuples and are randomly sampled from the database under analysis. In the context of
classification, data tuples can be referred to as samples, examples, instances, data points, or objects.
Because the class label of each training tuple is provided, this step is also known as supervised learning
(i.e., the learning of the classifier is “supervised” in that it is told to which class each training tuple
belongs).
It contrasts with unsupervised learning (or clustering), in which the class label of each training tuple is
not known, and the number or set of classes to be learned may not be known in advance.
For example, if we did not have the loan decision data available for the training set, we could use
clustering to try to determine “groups of like tuples,” which may correspond to risk groups within the
loan application data. Clustering is the topic of Chapters 10 and 11. This first step of the classification
process can also be viewed as the learning of a mapping or function, y = f (X), that can predict the
associated class label y of a given tuple X. In this view, we wish to learn a mapping or function that
separates the data classes. Typically, this mapping is represented in the form of classification rules,
decision trees, or mathematical formulae. In our example, the mapping is represented as classification
rules that identify loan applications as being either safe or risky (Figure 8.1a). The rules can be used to
categorize future data tuples, as well as provide deeper insight into the data contents. They also provide
a compressed data representation.
“What about classification accuracy?” In the second step, the model is used for classification. First, the predictive accuracy
of the classifier is estimated. If we were to use the training set to measure the classifier’s accuracy, this
estimate would likely be optimistic, because the classifier tends to overfit the data (i.e., during learning
it may incorporate some particular anomalies of the training data that are not present in the general
data set overall). Therefore, a test set is used, made up of test tuples and their associated class labels.
They are independent of the training tuples, meaning that they were not used to construct the classifier.
The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly
classified by the classifier. The associated class label of each test tuple is compared with the learned
classifier’s class prediction for that tuple. Section 8.5 describes several methods for estimating classifier
accuracy. If the accuracy of the classifier is considered acceptable, the classifier can be used to classify
future data tuples for which the class label is not known. (Such data are also referred to in the machine
learning literature as “unknown” or “previously unseen” data.) For example, the classification rules
learned in Figure 8.1(a) from the analysis of data from previous loan applications can be used to approve
or reject new or future loan applicants.
During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning, developed a
decision tree algorithm known as ID3 (Iterative Dichotomiser). This work expanded on earlier work on
concept learning systems, described by E. B. Hunt, J. Marin, and P. T. Stone. Quinlan later presented
C4.5 (a successor of ID3), which became a benchmark to which newer supervised learning algorithms
are often compared. In 1984, a group of statisticians (L. Breiman, J. Friedman, R. Olshen, and C. Stone)
published the book Classification and Regression Trees (CART), which described the generation of binary
decision trees. ID3 and CART were invented independently of one another at around the same time, yet
follow a similar approach for learning decision trees from training tuples. These two cornerstone
algorithms spawned a flurry of work on decision tree induction. ID3, C4.5, and CART adopt a greedy (i.e.,
nonbacktracking) approach in which decision trees are constructed in a top-down recursive divide-and-
conquer manner.
Most algorithms for decision tree induction also follow a top-down approach, which starts with a
training set of tuples and their associated class labels. The training set is recursively partitioned into
smaller subsets as the tree is being built. A basic decision tree algorithm is summarized in Figure 8.3. At
first glance, the algorithm may appear long, but fear not! It is quite straightforward. The strategy is as
follows. The algorithm is called with three parameters: D, attribute list, and Attribute selection method.
We refer to D as a data partition. Initially, it is the complete set of training tuples and their associated
class labels. The parameter attribute list is a list of attributes describing the tuples.
Attribute selection method specifies a heuristic procedure for selecting the attribute that “best”
discriminates the given tuples according to class. This procedure employs an attribute selection measure
such as information gain or the Gini index. Whether the tree is strictly binary is generally driven by the
attribute selection measure. Some attribute selection measures, such as the Gini index, enforce the
resulting tree to be binary. Others, like information gain, do not, therein allowing multiway splits (i.e.,
two or more branches to be grown from a node).
The splitting criterion returned by this procedure consists of a splitting attribute and, possibly, either a split-point or a splitting subset.
Algorithm: Generate_decision_tree. Generate a decision tree from the training tuples of data partition D.
Input: D, a data partition; attribute_list; Attribute_selection_method.
Output: A decision tree.
Method:
(1) create a node N;
(2) if tuples in D are all of the same class, C, then
(3) return N as a leaf node labeled with the class C;
(4) if attribute_list is empty then
(5) return N as a leaf node labeled with the majority class in D; // majority voting
(6) apply Attribute_selection_method(D, attribute_list) to find the "best" splitting_criterion;
(7) label node N with splitting_criterion;
(8) if splitting_attribute is discrete-valued and multiway splits allowed then // not restricted to binary trees
(9) attribute_list ← attribute_list − splitting_attribute; // remove splitting attribute
(10) for each outcome j of splitting_criterion // partition the tuples and grow subtrees for each partition
(11) let Dj be the set of data tuples in D satisfying outcome j; // a partition
(12) if Dj is empty then
(13) attach a leaf labeled with the majority class in D to node N;
(14) else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N; endfor
(15) return N;
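The recursive strategy above can be sketched in Python. This is a minimal illustration of the top-down divide-and-conquer pattern, not a faithful implementation of C4.5 or CART; the dictionary-based tree representation and the pluggable `select_attribute` parameter (standing in for Attribute_selection_method) are assumptions made for brevity, and only discrete-valued attributes with multiway splits are handled.

```python
from collections import Counter

def majority_class(labels):
    """Most frequent class label among the given tuples (majority voting)."""
    return Counter(labels).most_common(1)[0][0]

def generate_decision_tree(rows, labels, attribute_list, select_attribute):
    """Top-down recursive divide-and-conquer tree construction.

    rows: list of dicts mapping attribute name -> value (the partition D)
    labels: the class label of each tuple in D
    select_attribute: heuristic returning the "best" splitting attribute
    """
    # Steps 2-3: all tuples in the same class -> leaf labeled with that class
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # Steps 4-5: no attributes left -> leaf labeled by majority voting
    if not attribute_list:
        return {"leaf": majority_class(labels)}
    # Step 6: pick the splitting attribute using the selection measure
    attr = select_attribute(rows, labels, attribute_list)
    node = {"split_on": attr, "branches": {}}
    # Step 9: remove the discrete-valued splitting attribute for subtrees
    remaining = [a for a in attribute_list if a != attr]
    # Steps 10-14: grow one subtree per observed outcome of the split
    # (iterating observed values only, so the empty-partition case of
    # steps 12-13 cannot arise here)
    for value in set(r[attr] for r in rows):
        pairs = [(r, c) for r, c in zip(rows, labels) if r[attr] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [c for _, c in pairs]
        node["branches"][value] = generate_decision_tree(
            sub_rows, sub_labels, remaining, select_attribute)
    return node
```

Any attribute selection measure (information gain, gain ratio, Gini index) can be plugged in as `select_attribute` without changing the recursive skeleton.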
An attribute selection measure is a heuristic for selecting the splitting criterion that “best” separates a
given data partition, D, of class-labeled training tuples into individual classes. If we were to split D into
smaller partitions according to the outcomes of the splitting criterion, ideally each partition would be
pure (i.e., all the tuples that fall into a given partition would belong to the same class). Conceptually, the
“best” splitting criterion is the one that most closely results in such a scenario. Attribute selection
measures are also known as splitting rules because they determine how the tuples at a given node are
to be split. The attribute selection measure provides a ranking for each attribute describing the given
training tuples. The attribute having the best score for the measure is chosen as the splitting attribute
for the given tuples. If the splitting attribute is continuous-valued or if we are restricted to binary trees,
then, respectively, either a split point or a splitting subset must also be determined as part of the
splitting criterion.
The tree node created for partition D is labeled with the splitting criterion, branches are grown for each
outcome of the criterion, and the tuples are partitioned accordingly. This section describes three
popular attribute selection measures—information gain, gain ratio, and Gini index. The notation used
herein is as follows. Let D, the data partition, be a training set of class-labeled tuples. Suppose the class
label attribute has m distinct values defining m distinct classes, Ci (for i = 1,..., m). Let Ci,D be the set of
tuples of class Ci in D. Let |D| and |Ci,D| denote the number of tuples in D and Ci,D, respectively.
Information Gain
ID3 uses information gain as its attribute selection measure. This measure is based on pioneering work
by Claude Shannon on information theory, which studied the value or “information content” of
messages. Let node N represent or hold the tuples of partition D. The attribute with the highest
information gain is chosen as the splitting attribute for node N. This attribute minimizes the information
needed to classify the tuples in the resulting partitions and reflects the least randomness or “impurity”
in these partitions. Such an approach minimizes the expected number of tests needed to classify a given
tuple and guarantees that a simple (but not necessarily the simplest) tree is found.
Info(D) = − Σ_{i=1}^{m} p_i log_2(p_i)
where pi is the nonzero probability that an arbitrary tuple in D belongs to class Ci and is estimated by
|Ci,D|/|D|. A log function to the base 2 is used, because the information is encoded in bits. Info(D) is
just the average amount of information needed to identify the class label of a tuple in D. Note that, at
this point, the information we have is based solely on the proportions of tuples of each class. Info(D) is
also known as the entropy of D.
Information gain is defined as the difference between the original information requirement (i.e., based
on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A),
where the expected information still required after partitioning D into v partitions by attribute A is
Info_A(D) = Σ_{j=1}^{v} (|D_j|/|D|) × Info(D_j). That is,
Gain(A) = Info(D) − Info_A(D)
In other words, Gain(A) tells us how much would be gained by branching on A. It is the expected
reduction in the information requirement caused by knowing the value of A. The attribute A with the
highest information gain, Gain(A), is chosen as the splitting attribute at node N. This is equivalent to
saying that we want to partition on the attribute A that would do the “best classification,” so that the
amount of information still required to finish classifying the tuples is minimal (i.e., minimum InfoA (D)).
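The entropy and gain computations described above can be written compactly. This is a sketch under the stated definitions; the function names `info` and `info_gain` are illustrative choices, and the attribute is assumed discrete-valued.

```python
import math
from collections import Counter

def info(labels):
    """Info(D) = -sum(p_i * log2(p_i)): the entropy of a partition,
    where p_i is estimated by the class proportions |C_i,D| / |D|."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D) for a discrete attribute A.

    values: the value of attribute A for each tuple; labels: class labels.
    Info_A(D) weights the entropy of each partition D_j by |D_j| / |D|.
    """
    n = len(labels)
    partitions = {}
    for v, c in zip(values, labels):
        partitions.setdefault(v, []).append(c)
    info_a = sum(len(part) / n * info(part) for part in partitions.values())
    return info(labels) - info_a
```

An attribute that perfectly separates the classes yields pure partitions (Info_A(D) = 0), so its gain equals the full entropy Info(D); an attribute whose partitions mirror the original class mix yields a gain of zero.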
Gain Ratio
The information gain measure is biased toward tests with many outcomes. That is, it prefers to select
attributes having a large number of values. For example, consider an attribute that acts as a unique
identifier such as product ID. A split on product ID would result in a large number of partitions (as many
as there are values), each one containing just one tuple. Because each partition is pure, the information
required to classify data set D based on this partitioning would be Info_{product ID}(D) = 0. Therefore, the
information gained by partitioning on this attribute is maximal. Clearly, such a partitioning is useless for
classification.
C4.5, a successor of ID3, uses an extension to information gain known as gain ratio, which attempts to
overcome this bias. It applies a kind of normalization to information gain using a “split information”
value defined analogously with Info(D) as
SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j|/|D|) × log_2(|D_j|/|D|)
This value represents the potential information generated by splitting the training data set, D,
into v partitions, corresponding to the v outcomes of a test on attribute A. Note that, for each outcome,
it considers the number of tuples having that outcome with respect to the total number of tuples in D. It
differs from information gain, which measures the information with respect to classification that is
acquired based on the same partitioning.
The gain ratio is defined as,
GainRatio(A) = Gain(A) / SplitInfo_A(D)
The attribute with the maximum gain ratio is selected as the splitting attribute. Note, however, that as
the split information approaches 0, the ratio becomes unstable. A constraint is added to avoid this,
whereby the information gain of the test selected must be large—at least as great as the average gain
over all tests examined.
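The normalization can be sketched as follows. The function names are illustrative; `gain` is assumed to have been computed separately (e.g., by an information gain routine), and the guard against a zero split information stands in for the constraint described above rather than C4.5's exact rule.

```python
import math

def split_info(values):
    """SplitInfo_A(D) = -sum((|D_j|/|D|) * log2(|D_j|/|D|)) over the
    v outcomes of attribute A: the potential information of the split."""
    n = len(values)
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(gain, values):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D).

    As SplitInfo_A(D) approaches 0 the ratio becomes unstable; a zero
    split information is guarded against here as a simple illustration.
    """
    si = split_info(values)
    return gain / si if si > 0 else float("inf")
```

Note that a many-valued attribute such as product ID has a large SplitInfo_A(D), which shrinks its gain ratio and counteracts the bias of plain information gain.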
Gini Index
The Gini index is used in CART. Using the notation previously described, the Gini index measures the
impurity of D, a data partition or set of training tuples, as
Gini(D) = 1 − Σ_{i=1}^{m} p_i^2
where pi is the probability that a tuple in D belongs to class Ci and is estimated by |Ci,D|/|D|. The sum
is computed over m classes. The Gini index considers a binary split for each attribute. Let’s first consider
the case where A is a discrete-valued attribute having v distinct values, {a1, a2,..., av }, occurring in D. To
determine the best binary split on A, we examine all the possible subsets that can be formed using
known values of A.
Each subset, SA, can be considered as a binary test for attribute A of the form “A ∈ SA?” Given a tuple,
this test is satisfied if the value of A for the tuple is among the values listed in SA. If A has v possible
values, then there are 2^v possible subsets. For example, if income has three possible values, namely
{low, medium, high}, then the possible subsets are {low, medium, high}, {low, medium}, {low, high},
{medium, high}, {low}, {medium}, {high}, and {}. We exclude the full set, {low, medium, high}, and the
empty set from consideration since, conceptually, they do not represent a split. Therefore, there are 2^v
− 2 possible ways to form two partitions of the data, D, based on a binary split on A.
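The Gini measure and the enumeration of candidate binary tests can be sketched as below. The function names are illustrative, and `candidate_subsets` simply lists the 2^v − 2 nonempty, proper subsets described above rather than scoring them as CART would.

```python
from collections import Counter
from itertools import combinations

def gini(labels):
    """Gini(D) = 1 - sum(p_i^2), with p_i estimated by |C_i,D| / |D|."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def candidate_subsets(values):
    """The 2^v - 2 nonempty, proper subsets S_A defining binary tests
    'A in S_A?'; the full set and the empty set are excluded because
    they do not represent a split."""
    vals = sorted(set(values))
    subsets = []
    for size in range(1, len(vals)):           # sizes 1 .. v-1
        for subset in combinations(vals, size):
            subsets.append(set(subset))
    return subsets
```

For the three-valued income attribute this yields the six subsets listed above; each subset and its complement induce the same two-way partition of D, so in practice only half of them need to be evaluated.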
Other attribute selection measures consider multivariate splits (i.e., where the partitioning of tuples is
based on a combination of attributes, rather than on a single attribute). The CART system, for example,
can find multivariate splits based on a linear combination of attributes. Multivariate splits are a form of
attribute (or feature) construction, where new attributes are created based on the existing ones.
Tree Pruning
When a decision tree is built, many of the branches will reflect anomalies in the training data due to
noise or outliers. Tree pruning methods address this problem of overfitting the data. Such methods
typically use statistical measures to remove the least-reliable branches. An unpruned tree and a pruned
version of it are shown in Figure 8.6. Pruned trees tend to be smaller and less complex and, thus, easier
to comprehend. They are usually faster and better at correctly classifying independent test data (i.e., of
previously unseen tuples) than unpruned trees.
In the prepruning approach, a tree is “pruned” by halting its construction early (e.g., by deciding not to
further split or partition the subset of training tuples at a given node). Upon halting, the node becomes a
leaf. The leaf may hold the most frequent class among the subset tuples or the probability distribution
of those tuples.
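A prepruning decision of the kind described above often amounts to a simple stopping test evaluated at each node during construction. The following sketch uses illustrative thresholds (`min_tuples`, `max_depth`) that are assumptions, not values from the text; real systems typically base the decision on a statistical measure of the split's significance instead.

```python
def should_stop(labels, depth, max_depth=5, min_tuples=10):
    """Prepruning check: halt construction and make this node a leaf when
    the partition is pure, too small, or the tree has grown too deep.
    The thresholds are illustrative, not prescribed by the text."""
    pure = len(set(labels)) == 1          # no further split can help
    too_small = len(labels) < min_tuples  # partition judged unreliable
    too_deep = depth >= max_depth         # cap on tree complexity
    return pure or too_small or too_deep
```

When the check fires, the new leaf is labeled with the most frequent class among the partition's tuples, or it stores the class probability distribution, exactly as described above.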