
COURSE OBJECTIVES
1. Students should understand the value of historical data and data mining in solving real-world
problems.
2. Students should become proficient with the basic supervised and unsupervised learning algorithms
commonly used in data mining.
3. Students should develop the skill of using data mining for solving real-world problems.
Unit-I
Data Warehousing: Introduction, Delivery Process, Data warehouse Architecture, Data
Preprocessing: Data cleaning, Data Integration and transformation, Data reduction. Data
warehouse Design: Data warehouse schema, Partitioning strategy Data warehouse
Implementation, Data Marts, Meta Data, Example of a Multidimensional Data model.
Introduction to Pattern Warehousing.
-------------------
Data Warehousing
A data warehouse is a relational database that is designed for query and analysis rather than for
transaction processing. A data warehouse combines data from multiple, usually varied, sources into one
comprehensive and easily manipulated database. It usually contains historical data derived from
transaction data, but it can include data from other sources. It separates the analysis workload from the
transaction workload and enables an organization to consolidate data from several sources. Data
warehouses are commonly used by companies to analyze trends over time.
In addition to a relational database, a data warehouse environment includes an extraction,
transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP)
engine, client analysis tools, and other applications that manage the process of gathering data
and delivering it to business users.
“A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data
in support of management’s decision-making process.”
Understanding a Data Warehouse
➢ A data warehouse is a database, which is kept separate from the organization's
operational database.
➢ There is no frequent updating done in a data warehouse.
➢ It possesses consolidated historical data, which helps the organization to analyze its
business.
➢ A data warehouse helps executives to organize, understand, and use their data to take
strategic decisions.
➢ Data warehouse systems help in integrating a diversity of application systems.
➢ A data warehouse system helps in consolidated historical data analysis.

Data Warehouse Features


The key features of a data warehouse are discussed below −
Subject Oriented − A data warehouse is subject oriented because it provides information around
a subject rather than the organization's ongoing operations. These subjects can be product,
customers, suppliers, sales, revenue, etc. A data warehouse does not focus on the ongoing
operations, rather it focuses on modelling and analysis of data for decision making.


Integrated − A data warehouse is constructed by integrating data from heterogeneous sources


such as relational databases, flat files, etc. This integration enhances the effective analysis of
data.
Time Variant − Data collected in a data warehouse is identified with a particular time period. The
data in a data warehouse provides information from the historical point of view.
Non-volatile − Non-volatile means that previous data is not erased when new data is added. A
data warehouse is kept separate from the operational database, and therefore frequent
changes in the operational database are not reflected in the data warehouse.
Note − A data warehouse does not require transaction processing, recovery, and concurrency
controls, because it is physically stored separately from the operational database.

Data Warehouse Applications


As discussed before, a data warehouse helps business executives to organize, analyze, and use
their data for decision making. A data warehouse serves as an important part of a plan-execute-assess
“closed-loop” feedback system for enterprise management. Data warehouses are widely
used in the following fields −
• Financial services
• Banking services
• Consumer goods
• Retail sectors
• Controlled manufacturing
Need for developing a data warehouse
▪ Provides an integrated and total view of the enterprise.
▪ Makes the organization's current and historical information easily available for decision
making.
▪ Makes decision support transactions possible without hampering the operational systems.
▪ Provides consistent organizational information.
▪ Provides flexible and interactive sources of strategic information.
▪ End user creation of reports: The creation of reports directly by end users is much easier to
accomplish in a BI environment.
▪ Dynamic presentation through dashboards: Managers want access to an interactive display
of up-to-date critical management data.
▪ Drill-down capability
▪ Metadata creation: This will make report creation much simpler for the end-user

Delivery Process
The delivery method is a variant of the joint application development approach adopted for the
delivery of a data warehouse. We have staged the data warehouse delivery process to minimize
risks. The approach that we will discuss here does not reduce the overall delivery timescales but
ensures the business benefits are delivered incrementally through the development process.
Note − The delivery process is broken into phases to reduce the project and delivery risk.
The following diagram explains the stages in the delivery process −


Figure 1: Stages in the delivery process

Data Warehouse Architectures


Data warehouses and their architectures vary depending upon the specifics of an organization's
situation. Three common architectures are:
• Data Warehouse Architecture: Basic
• Data Warehouse Architecture with a Staging Area
• Data Warehouse Architecture with a Staging Area and Data Marts

Figure 2: Data Warehouse Architecture


Figure 3: Three-tier architecture of data warehouse

Data warehouse system and its components:


Data warehousing is typically used by larger companies analyzing large sets of data for enterprise
purposes. The data warehouse architecture is based on a relational database management system
server that functions as the central repository for informational data. Operational data and processing
are completely separated from data warehouse processing. This central information store is surrounded
by a number of key components designed to make the entire environment functional and manageable
for both the operational source systems and the end users. It is mainly created to support analyses and
queries that need extensive searching on a large scale.

Figure 4: Data Warehouse Components


Operational Source System
Operational systems are tuned for known transactions and workloads, while the workload is not
known a priori in a data warehouse. Traditionally, database systems used for transaction
processing store the transaction data of the organization's business. They generally work with
one record at a time and do not store the history of the information.
Data Staging Area
The data staging area is where a set of ETL processes extracts data from the source systems as
soon as it arrives and converts it into an integrated structure and format. It consists of the
following processes:


• Data is extracted from the source systems, stored, cleaned, and transformed before being
loaded into the data warehouse.
• Removing unwanted data from operational databases.
• Converting to common data names and definitions.
• Establishing defaults for missing data accommodating source data definition changes.
Data Presentation Area
The data presentation area consists of the target physical machines on which the data warehouse data is
organized and stored for direct querying by end users, report writers and other applications. It is
the place where cleaned, transformed data is stored in a dimensionally structured warehouse
and made available for analysis purposes.
Data Access Tools
End user data access tools are any clients of the data warehouse. An end user access tool can be
as complex as a sophisticated data mining or modeling application.

Why preprocessing?
1. Real world data are generally
➢ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data
➢ Noisy: containing errors or outliers
➢ Inconsistent: containing discrepancies in codes or names
2. Tasks in data preprocessing
➢ Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies.
➢ Data integration: using multiple databases, data cubes, or files.
➢ Data transformation: normalization and aggregation.
➢ Data reduction: reducing the volume but producing the same or similar analytical results.
➢ Data discretization: part of data reduction, replacing numerical attributes with nominal
ones.
Steps Involved in Data Preprocessing:
1. Data Cleaning/Cleansing
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning/cleansing routines
attempt to fill in missing values, smooth out noise while identifying outliers, and correct
inconsistencies in the data. Data can be noisy, having incorrect attribute values, for the following reasons:
➢ The data collection instruments used may be faulty
➢ Human or computer errors may have occurred at data entry
➢ Errors in data transmission can also occur
“Dirty” data can cause confusion for the mining procedure. Although most mining routines have
some procedures for dealing with incomplete or noisy data, these are not always robust. Therefore, a
useful data preprocessing step is to run the data through some data cleaning/cleansing routines.
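As a small illustration, missing numeric values can be filled in with the attribute mean directly in SQL. This is only a sketch: the customer table and annual_income column are hypothetical, and the mean is just one of several possible fill strategies.

-- cleaned view in which missing annual_income values are replaced by the overall average
CREATE VIEW customer_clean AS
SELECT customer_id,
       COALESCE(annual_income,
                (SELECT AVG(annual_income) FROM customer)) AS annual_income
FROM customer;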
2. Data Integration
Data integration is part of the data analysis task; it combines data from multiple sources into
a coherent data store, as in data warehousing. These sources may include multiple databases,
data cubes, or flat files. A key issue to be considered in data integration is schema integration.
How can real-world entities from multiple data sources be ‘matched up’? This is referred to as the entity
identification problem. For example, how can a data analyst be sure that customer_id in one
database and cust_number in another refer to the same entity? The answer is metadata.
Databases and data warehouses typically have metadata. Simply put, metadata is data about data.


Metadata is used to help avoid errors in schema integration. Another important issue is
redundancy. An attribute may be redundant if it can be derived from another table. Inconsistencies
in attribute or dimension naming can also cause redundancies in the resulting data set.
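For example, once the metadata confirms that customer_id in one source and cust_number in another identify the same entity, the two sources can be merged with a join. The table and column names below are hypothetical.

-- integrate customer records from two source systems after entity identification
SELECT a.customer_id, a.customer_name, b.credit_limit
FROM crm_customers a
JOIN billing_accounts b
  ON a.customer_id = b.cust_number;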
3. Data Transformation
Data are transformed into forms appropriate for mining. Data transformation involves the
following:
1. In normalization, the attribute data are scaled to fall within a small specified range,
such as -1.0 to 1.0, or 0.0 to 1.0 (see the sketch after this list).
2. Smoothing works to remove noise from the data. Such techniques include binning,
clustering, and regression.
3. In aggregation, summary or aggregation operations are applied to the data. For example,
daily sales data may be aggregated so as to compute monthly and annual total amounts. This
step is typically used in constructing a data cube for analysis of the data at multiple
granularities.
4. In generalization of the data, low-level or primitive/raw data are replaced by higher-level
concepts through the use of concept hierarchies. For example, a categorical attribute such as
street may be generalized to a higher-level concept such as city or country. Similarly, the values
of numeric attributes may be mapped to higher-level concepts; for example, age may be mapped
to young, middle-aged, or senior.
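A sketch of min-max normalization in SQL using window functions: each value v is scaled to (v - min) / (max - min), which maps the attribute into the range 0 to 1. The sales table and revenue column are hypothetical.

-- scale revenue into the range [0, 1]
SELECT revenue,
       (revenue - MIN(revenue) OVER ()) * 1.0 /
       NULLIF(MAX(revenue) OVER () - MIN(revenue) OVER (), 0) AS revenue_norm
FROM sales;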
4. Data Reduction
Complex data analysis and mining on huge amounts of data may take a very long time, making
such analysis impractical or infeasible. Data reduction techniques obtain a reduced representation
of the data set that is much smaller in volume, yet preserves the integrity of the original data
and produces the same (or almost the same) analytical results. Strategies for data reduction include the following:
I. In Data Cube Aggregation, aggregation operations are applied to the data in the construction
of a data cube.
II. In Dimension Reduction, irrelevant, weakly relevant, or redundant attributes or dimensions
may be detected and removed.
III. In Data Compression, encoding mechanisms are used to reduce data set size. The methods
used for data compression include the Wavelet Transform and Principal Component Analysis.
IV. In Numerosity Reduction, data is replaced or estimated by alternative and smaller data
representations such as parametric models (which store only the model parameters instead of
the actual data, e.g. Regression and Log-Linear Models) or non-parametric methods (e.g.
Clustering, Sampling, and the use of histograms); a sampling sketch follows this list.
V. In Discretization and Concept Hierarchy Generation, raw data values for attributes are
replaced by ranges or higher conceptual levels. Concept hierarchies allow the mining of data at
multiple levels of abstraction and are powerful tools for data mining.
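As a simple numerosity-reduction sketch, many database products can draw a random sample directly in SQL. TABLESAMPLE is part of the SQL standard but its exact syntax varies by product (the BERNOULLI form below follows PostgreSQL), and the sales table is hypothetical.

-- analyze an approximately 10 percent random sample instead of the full table
SELECT product, SUM(revenue) AS sample_revenue
FROM sales TABLESAMPLE BERNOULLI (10)
GROUP BY product;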

Design of Data Warehouse


Design Methods
Bottom-up design
This architecture makes the data warehouse more of a virtual reality than a physical reality. The
bottom-up approach starts with the extraction of data from the operational databases into the
staging area, where it is processed and consolidated into data marts for specific business processes.
The bottom-up approach reverses the positions of the data warehouse and the data marts: the data
marts are built first and can then be integrated to create a comprehensive data warehouse.


Top-down design
The data flow in the top-down OLAP environment begins with data extraction from the
operational data sources. The top-down approach is designed using a normalized enterprise data
model. Results are obtained more quickly if it is implemented in iterations; otherwise it is a
time-consuming process and the failure risk is relatively high.
Hybrid design
The hybrid approach aims to combine the speed and user orientation of the bottom-up approach
with the enterprise-wide integration of the top-down approach. To consolidate the various data
models and facilitate the extract-transform-load process, data warehouses often make use of an
operational data store, the information from which is parsed into the actual data warehouse. The
hybrid approach begins with an ER diagram of a data mart and gradually extends the data marts
to build out the enterprise model in a consistent, linear fashion. It provides rapid development
within an enterprise architecture framework.
Dimension and Measures
A data warehouse model consists of dimensions and measures. Dimensional modelling is a logical
design technique used for data warehouses, and dimensional models underlie the data analysis
provided by many of the commercial OLAP products available in the market today. For example,
a time dimension could show you the breakdown of sales by year, quarter, month, day and hour.
Measures are numeric representations of a set of facts that have occurred. The most common
measures of data dispersion are the range, the five-number summary (based on quartiles), the inter-
quartile range, and the standard deviation. Examples of measures include amount of sales,
number of credit hours, store profit percentage, sum of operating expenses, number of past-due
accounts and so forth.
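A sketch of computing the common dispersion measures for a numeric measure in SQL. PERCENTILE_CONT and STDDEV are widely available (e.g. Oracle, PostgreSQL) but not universal, and the sales(amount) table is hypothetical.

-- five-number summary, range, inter-quartile range and standard deviation of amount
SELECT MIN(amount)                                          AS min_amount,
       PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount) AS q1,
       PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY amount) AS median_amount,
       PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) AS q3,
       MAX(amount)                                          AS max_amount,
       MAX(amount) - MIN(amount)                            AS value_range,
       PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount)
         - PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount) AS iqr,
       STDDEV(amount)                                       AS std_dev
FROM sales;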

Data Marts
A data mart is a specialized system that brings together the data needed for a department or
related applications. A data mart is a simple form of a data warehouse that is focused on a single
subject (or functional area), such as educational, sales, operations, collections, finance or
marketing data. The sources may include internal operational systems, a central data warehouse,
or external data. It is a small warehouse designed at the department level.
Dependent, Independent or Stand-alone, and Hybrid Data Marts
The three basic types of data marts are dependent, independent (or stand-alone), and hybrid. The
categorization is based primarily on the data source that feeds the data mart.
Dependent data marts: Data comes from the central data warehouse and is copied into a separate
physical data store.
Independent data marts: Standalone systems built by drawing data directly from operational
or external data sources, or both. An independent data mart focuses exclusively on one subject
area and has its own separate physical data store.
Hybrid data marts: Can draw data from operational systems or data warehouses.
Dependent Data Marts
A dependent data mart allows you to unite your organization's data in one data warehouse.
This gives you the usual advantages of centralization. Figure 5 shows a dependent data mart.


Figure 5: Dependent Data Mart
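A dependent data mart can often be populated directly from the central warehouse with a create-table-as-select. The warehouse and mart table names below are hypothetical.

-- build a departmental sales mart from the central warehouse tables
CREATE TABLE mart_marketing_sales AS
SELECT d.year, d.month, p.category, SUM(f.revenue) AS total_revenue
FROM   wh_sales_fact  f
JOIN   wh_time_dim    d ON f.time_key    = d.time_key
JOIN   wh_product_dim p ON f.product_key = p.product_key
GROUP BY d.year, d.month, p.category;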


Independent or Stand-alone Data Marts
An independent data mart is created without the use of a central data warehouse. This can
be desirable for smaller groups within an organization. Figure 6 shows an independent data mart.

Figure 6: Independent Data Marts


Hybrid Data Marts
A hybrid data mart allows you to combine input from sources other than a data warehouse.
This could be useful for many situations, especially when you need ad hoc integration, such as
after a new group or product is added to the organization. It provides rapid development within
an enterprise architecture framework. Figure 7 shows a hybrid data mart.

Figure 7: Hybrid Data Mart


What is Metadata?
Metadata is simply defined as data about data. The data that is used to represent other data is
known as metadata. For example, the index of a book serves as a metadata for the contents in
the book. In other words, we can say that metadata is the summarized data that leads us to
detailed data. In terms of data warehouse, we can define metadata as follows.
• Metadata is the road-map to a data warehouse.
• Metadata in a data warehouse defines the warehouse objects.
• Metadata acts as a directory. This directory helps the decision support system to locate
the contents of a data warehouse.
Metadata includes the following:
1. The location and descriptions of warehouse systems and components.
2. Names, definitions, structures, and content of the data warehouse and end-user views.
3. Identification of authoritative data sources.
4. Integration and transformation rules used to populate data.
5. Integration and transformation rules used to deliver information to end-user analytical
tools.
6. Subscription information for information delivery to analysis subscribers.
7. Metrics used to analyze warehouse usage and performance.
8. Security authorizations, access control lists, etc.
Categories of Metadata
Metadata can be broadly categorized into three categories −
Business Metadata − It has the data ownership information, business definition, and changing
policies.
Technical Metadata − It includes database system names, table and column names and sizes,
data types and allowed values. Technical metadata also includes structural information such as
primary and foreign key attributes and indices.
Operational Metadata − It includes currency of data and data lineage. Currency of data means
whether the data is active, archived, or purged. Lineage of data means the history of data
migrated and transformation applied on it.

Figure 8: Categories of Metadata

Conceptual Modeling of Data Warehouses


A concept hierarchy for a dimension or attribute may also be defined by discretizing or grouping its
values, resulting in a set-grouping hierarchy. A conceptual data model identifies the highest-level
relationships between the different entities. Features of a conceptual data model include:


• Includes the important entities and the relationships among them.


• No attribute is specified.
• No primary key is specified.
Conceptual Data Model

Figure 9: Conceptual data model

Data Warehousing Schemas


From the figure above, we can see that the only information shown by the conceptual data
model is the entities that describe the data and the relationships between those entities; no other
information is shown. There may be more than one concept hierarchy for a given attribute or
dimension, based on different users' viewpoints. The common data warehouse schemas are:
➢ Star Schema
➢ Snowflake Schema
➢ Fact Constellation
Star Schema
• Consists of a set of relations known as dimension tables (DT) and a fact table (FT).
• A single large central fact table and one table for each dimension.
• The fact table's primary key is a composite of the foreign keys referencing the dimension
tables.
• Every dimension table is related to one or more fact tables.
• Every fact points to one tuple in each of the dimensions and has additional attributes.
• Does not capture hierarchies directly.

Figure 10: Star Schema Example
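A minimal star schema sketch in SQL, assuming hypothetical time, product and store dimensions around a sales fact table whose primary key is composed of the dimension foreign keys:

CREATE TABLE time_dim    (time_key    INTEGER PRIMARY KEY, full_date DATE, month INTEGER, year INTEGER);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name VARCHAR(100), category VARCHAR(50));
CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, store_name VARCHAR(100), city VARCHAR(50));

-- central fact table referencing each dimension
CREATE TABLE sales_fact (
    time_key    INTEGER REFERENCES time_dim(time_key),
    product_key INTEGER REFERENCES product_dim(product_key),
    store_key   INTEGER REFERENCES store_dim(store_key),
    units_sold  INTEGER,
    revenue     DECIMAL(12,2),
    PRIMARY KEY (time_key, product_key, store_key)
);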

Snowflake Schema
• Variant of the star schema model.
• Used to remove low-cardinality attributes from the dimension tables into separate tables.
• A single, large and central fact table and one or more tables for each dimension.


• Dimension tables are normalized, splitting the dimension table data into additional tables;
this may affect performance because more joins need to be performed.
• Query performance may be degraded because of the additional joins (delay in processing).

Figure 11: Snowflake Schema Example

Fact Constellation:
• As its name implies, it is shaped like a constellation of stars (i.e. star schemas).
• Multiple fact tables share dimension tables.
• This schema is viewed as a collection of stars and is hence also called a galaxy schema or fact
constellation.
• The solution is very flexible; however, it may be hard to manage and support.
• Sophisticated applications require such a schema.

Figure 12: Fact Constellation Example

Overview of Partitioning in Data Warehouses


Data warehouses often contain very large tables and require techniques both for managing these
large tables and for providing good query performance across them. An important tool for
achieving this, as well as enhancing data access and improving overall application performance
is partitioning.
Partitioning offers support for very large tables and indexes by letting you decompose them into
smaller and more manageable pieces called partitions. This support is especially important for
applications that access tables and indexes with millions of rows and many gigabytes of data.


Partitioned tables and indexes facilitate administrative operations by enabling these operations
to work on subsets of data. For example, you can add a new partition, reorganize an existing
partition, or drop a partition with minimal to zero interruption to a read-only application.
When adding or creating a partition, you have the option of deferring the segment creation until
the data is first inserted, which is particularly valuable when installing applications that have a
large footprint.
Why is it necessary to Partition?
Partitioning is important for the following reasons −
• For easy management,
• To assist backup/recovery,
• To enhance performance.
For Easy Management
The fact table in a data warehouse can grow up to hundreds of gigabytes in size. This huge size
of fact table is very hard to manage as a single entity. Therefore, it needs partitioning.
To Assist Backup/Recovery
If we do not partition the fact table, then we have to load the complete fact table with all the
data. Partitioning allows us to load only as much data as is required on a regular basis. It reduces
the time to load and also enhances the performance of the system.
Note − To cut down on the backup size, all partitions other than the current partition can be
marked as read-only. We can then put these partitions into a state where they cannot be
modified. Then they can be backed up. It means only the current partition is to be backed up.
To Enhance Performance
By partitioning the fact table into sets of data, the query procedures can be enhanced. Query
performance is enhanced because now the query scans only those partitions that are relevant.
It does not have to scan the whole data.
Horizontal Partitioning
There are various ways in which a fact table can be partitioned. In horizontal partitioning, we
have to keep in mind the requirements for manageability of the data warehouse.
Partitioning by Time into Equal Segments
In this partitioning strategy, the fact table is partitioned on the basis of time period. Here each
time period represents a significant retention period within the business. For example, if the user
queries for month to date data then it is appropriate to partition the data into monthly
segments. We can reuse the partitioned tables by removing the data in them.
Partition by Time into Different-sized Segments
This kind of partitioning is done where the aged data is accessed infrequently. It is implemented as
a set of small partitions for relatively current data and a larger partition for inactive data.

Figure 13: Partitioning by Time Example
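A sketch of range partitioning a fact table by time, in Oracle-style syntax; the table is hypothetical, with monthly partitions for current data and one large partition for older, rarely accessed data.

-- monthly partitions for recent data, a single large partition for history
CREATE TABLE sales_fact_part (
    sale_date   DATE,
    product_key INTEGER,
    revenue     DECIMAL(12,2)
)
PARTITION BY RANGE (sale_date) (
    PARTITION p_history VALUES LESS THAN (DATE '2023-01-01'),
    PARTITION p_2023_01 VALUES LESS THAN (DATE '2023-02-01'),
    PARTITION p_2023_02 VALUES LESS THAN (DATE '2023-03-01'),
    PARTITION p_future  VALUES LESS THAN (MAXVALUE)
);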


Vertical Partition
Vertical partitioning splits the data vertically. The following image depicts how vertical
partitioning is done.

Figure 14: Vertical Partition Example

Vertical partitioning can be performed in the following two ways −


• Normalization
• Row Splitting

Multidimensional Data Model


Data warehouses are generally based on a multidimensional data model. The multidimensional
data model provides a framework that is intuitive and efficient and that allows data to be viewed and
analyzed at the desired level of detail with good performance. The multidimensional model
starts with the examination of the factors affecting decision-making processes; these are generally
organization-specific facts, for example sales, shipments, hospital admissions, surgeries, and so
on. An instance of a fact corresponds to an event that occurred. For example, every single
sale or shipment carried out is an event. Each fact is described by the values of a set of relevant
measures that provide a quantitative description of the event. For example, sales receipts,
shipment amounts, and product costs are measures.
The multidimensional data model is an integral part of On-Line Analytical Processing, or OLAP.
Because OLAP is on-line, it must provide answers quickly; analysts pose iterative queries during
interactive sessions, not in batch jobs that run overnight. And because OLAP is also analytic, the
queries are complex. Dimension tables support changing the attributes of the dimension without
changing the underlying fact table. The multidimensional data model is designed to solve
complex queries in real time. The multidimensional data model is important because it enforces
simplicity.
The multidimensional data model is composed of logical cubes, measures, dimensions,
hierarchies, levels, and attributes. The simplicity of the model is inherent because it defines
objects that represent real-world business entities. Analysts know which business measures they
are interested in examining, which dimensions and attributes make the data meaningful, and
how the dimensions of their business are organized into levels and hierarchies. Figure shows the
relationships among the logical objects.


Figure 15: Logical Multidimensional Model

Aggregates
A data warehouse stores a huge amount of data, which makes analysis difficult. This
is the basic reason why selection and aggregation are required to examine a specific part of the data.
Aggregations are the means by which information can be divided so that queries can be run on the
aggregated part rather than the whole set of data. They are pre-calculated summaries derived from
the most granular fact table. Aggregation is a process in which information is gathered and expressed
in a summary form, for purposes such as statistical analysis. A common aggregation purpose is to get
more information about particular groups based on specific variables such as age, profession, or
income. The information about such groups can then be used, for example, for web site personalization.
Tables are always changing along with the needs of the users, so it is important to define the
aggregations according to what summary tables might be of use.
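A sketch of a pre-calculated aggregate (summary) table derived from a granular fact table; the table names are hypothetical and the summary would be refreshed periodically.

-- monthly summary aggregate built from the detailed fact table
CREATE TABLE monthly_sales_agg AS
SELECT d.year, d.month, f.product_key,
       SUM(f.revenue)    AS total_revenue,
       SUM(f.units_sold) AS total_units
FROM   sales_fact f
JOIN   time_dim   d ON f.time_key = d.time_key
GROUP BY d.year, d.month, f.product_key;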
Pattern Warehouse
The patterns a data mining system discovers are stored in a Pattern Warehouse. Just as a data
warehouse stores data, the Pattern Warehouse stores patterns - it is an information repository
that stores relationships between data items, but not the data. While data items are stored in
data warehouse, we use the Pattern Warehouse to store the patterns and relationships among
them.
The Pattern Warehouse is represented as a set of "pattern-tables" within a traditional relational
database. This solves several potential issues regarding user access rights, security control, multi-
user access, etc. But obviously, we need a language to access and query the contents of Pattern
Warehouses. SQL may be considered an obvious first candidate for this, but when SQL was
designed over 30 years ago, data mining was not a major issue. SQL was designed to access data
stored in databases. We need pattern-oriented languages to access Pattern Warehouses storing
various types of exact and inexact patterns. Often, it is very hard to access these patterns with
SQL.
Hence a Pattern Warehouse cannot be conveniently queried in a direct way using a relational
query language. Not only are some patterns not easily stored in a simple tabular format, but by
just looking up influence factors in pattern-tables we may get incorrect results. We need a
"pattern-kernel" that consistently manages and merges patterns. The pattern-kernel forms the
heart of the Pattern Query Language (PQL). PQL does for the decision-support space what
SQL does for the data space. While SQL relies on the relational algebra, PQL uses the "pattern
algebra". PQL was designed to access Pattern Warehouses just as SQL was designed to access
databases. PQL was designed to be very similar to SQL. It allows knowledge-based queries just as
SQL allows data based queries.
-----------------------------***-----------------------------


-------------------------------------------------------------------------------------------------
Unit-II
OLAP Systems: Basic concepts, OLAP queries, Types of OLAP servers, OLAP operations etc.
Data Warehouse Hardware and Operational Design: Security, Backup and Recovery
-------------------------------------------------------------------
OLAP System:
Basic Concepts-
OLAP (Online Analytical Processing) is the technology that supports the multidimensional view of data
behind many Business Intelligence (BI) applications. OLAP provides fast, steady and proficient
access and a powerful technology for data discovery, including capabilities to handle complex queries,
analytical calculations, and predictive “what if” scenario planning.
OLAP is a category of software technology that enables analysts, managers and executives to gain
insight into data through fast, consistent, interactive access in a wide variety of possible views of
information that has been transformed from raw data to reflect the real dimensionality of the
enterprise as understood by the user. OLAP enables end-users to perform ad hoc analysis of data
in multiple dimensions, thereby providing the insight and understanding they need for better
decision making.

Characteristics of OLAP System


The need for more intensive decision support prompted the introduction of a new generation of
tools, generally used to analyze information where a huge amount of historical data is stored.
Those new tools, called online analytical processing (OLAP) tools, create an advanced data analysis
environment that supports decision making, business modeling, and operations research.
Its four main characteristics are:
1. Multidimensional data analysis techniques
2. Advanced database support
3. Easy to use end user interfaces
4. Support for client/server architecture.
1. Multidimensional Data Analysis Techniques:
Multidimensional analysis is inherently representative of an actual business model. The most
distinctive characteristic of modern OLAP tools is their capacity for multidimensional analysis
(for example actual vs budget). In multidimensional analysis, data are processed and viewed as
part of a multidimensional structure. This type of data analysis is particularly attractive to business
decision makers because they tend to view business data as data that are related to other business
data.
2. Advanced Database Support:
➢ For efficient decision support, OLAP tools must have advanced data access features, such
as access to many different kinds of DBMSs, flat files, and internal and external data
sources.


➢ Access to aggregated data warehouse data as well as to the detail data found in operational
databases.
➢ Advanced data navigation features such as drill-down and roll-up.
➢ Rapid and consistent query response times.
➢ The ability to map end-user requests, expressed in either business or model terms, to the
appropriate data source and then to the proper data access language (usually SQL).
➢ Support for very large databases. As already explained the data warehouse can easily and
quickly grow to multiple gigabytes and even terabytes.
3. Easy-to-Use End-User Interface:
Advanced OLAP features become more useful when access to them is kept simple. OLAP tools
have equipped their sophisticated data extraction and analysis tools with easy-to-use graphical
interfaces. Many of the interface features are “borrowed” from previous generations of data
analysis tools that are already familiar to end users. This familiarity makes OLAP easily accepted
and readily used.
4. Client/Server Architecture:
Conforming the system to the principles of client/server architecture provides a framework within
which new systems can be designed, developed, and implemented. The client/server environment
enables an OLAP system to be divided into several components that define its architecture. Those
components can then be placed on the same computer, or they can be distributed among several
computers. Thus, OLAP is designed to meet ease-of-use requirements while keeping the system
flexible.
Motivation for using OLAP
I) Understanding and improving sales: For an enterprise that has many products and uses a number
of channels for selling the products, OLAP can assist in finding the most popular products and the
most popular channels. In some cases it may be possible to find the most profitable customers.
II) Understanding and reducing costs of doing business: Improving sales is one aspect of
improving a business, the other aspect is to analyze costs and to control them as much as possible
without affecting sales. OLAP can assist in analyzing the costs associated with sales.
Guidelines for OLAP Implementation
Following are a number of guidelines for successful implementation of OLAP. The guidelines are,
somewhat similar to those presented for data warehouse implementation.
1. Vision: The OLAP team must, in consultation with the users, develop a clear vision for the
OLAP system. This vision including the business objectives should be clearly defined, understood,
and shared by the stakeholders.
2. Senior management support: The OLAP project should be fully supported by the senior
managers. Since a data warehouse may have been developed already, this should not be difficult.
3. Selecting an OLAP tool: The OLAP team should familiarize themselves with the ROLAP and
MOLAP tools available in the market. Since tools are quite different, careful planning may be
required in selecting a tool that is appropriate for the enterprise. In some situations, a combination
of ROLAP and MOLAP may be most effective.


4. Corporate strategy: The OLAP strategy should fit in with the enterprise strategy and business
objectives. A good fit will result in the OLAP tools being used more widely.
5. Focus on the users: The OLAP project should be focused on the users. Users should, in
consultation with the technical professional, decide what tasks will be done first and what will be
done later. Attempts should be made to provide each user with a tool suitable for that person’s skill
level and information needs. A good GUI user interface should be provided to non-technical users.
The project can only be successful with the full support of the users.
6. Joint management: The OLAP project must be managed by both the IT and business
professionals. Many other people should be involved in supplying ideas. An appropriate committee
structure may be necessary to channel these ideas.
7. Review and adapt: As noted earlier, organizations evolve and so must the OLAP systems.
Regular reviews of the project may be required to ensure that the project is meeting the current
needs of the enterprise.
OLTP vs. OLAP
Table 2.1: Difference between OLAP and OLTP


Figure 2.1: OLAP vs OLTP


OLAP Queries
OLAP queries are complex queries that:
➢ Touch large amounts of data
➢ Discover patterns and trends in the data
➢ Typically expensive queries that take long time
➢ Also called decision-support queries

Example-I:

Query Syntax: SELECT … GROUP BY ROLLUP (grouping_column_reference_list);

Example-II:

SELECT Time, Location, product, sum(revenue) AS Profit FROM sales GROUP BY ROLLUP
(Time, Location, product);

The Query calculates the standard aggregate values specified in the GROUP BY clause. Then, it
creates progressively higher-level subtotals, moving from right to left through the list of grouping
columns. Finally, it creates a grand total.
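For reference, ROLLUP (Time, Location, product) is shorthand for the grouping sets below, which is exactly what produces the progressively higher-level subtotals and the final grand total:

-- equivalent formulation using GROUPING SETS (standard SQL)
SELECT Time, Location, product, SUM(revenue) AS Profit
FROM sales
GROUP BY GROUPING SETS (
    (Time, Location, product),
    (Time, Location),
    (Time),
    ()
);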

OLAP Servers
Online Analytical Processing (OLAP) servers are based on the multidimensional data model. They
allow managers and analysts to gain insight into information through fast, consistent, and
interactive access to that information.
Types of OLAP Servers
We have four types of OLAP servers −
➢ Relational OLAP (ROLAP)
➢ Multidimensional OLAP (MOLAP)


➢ Hybrid OLAP (HOLAP)


➢ Specialized SQL Servers
Relational OLAP
ROLAP servers are placed between relational back-end server and client front-end tools. To store
and manage warehouse data, ROLAP uses relational or extended-relational DBMS.
ROLAP includes the following −
➢ Implementation of aggregation navigation logic
➢ Optimization for each DBMS back end
➢ Additional tools and services
➢ Can handle large amounts of data
➢ Performance can be slow
Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional views of data.
➢ Multidimensional data stores
➢ The storage utilization may be low if the data set is sparse.
➢ MOLAP server uses two levels of data storage representation to handle dense and sparse
data sets.
Hybrid OLAP
Hybrid OLAP technologies attempt to combine the advantages of MOLAP and ROLAP. HOLAP offers
the higher scalability of ROLAP and the faster computation of MOLAP. HOLAP servers allow large
volumes of detailed data to be stored in a relational store, while the aggregations are stored
separately in a MOLAP store.
Specialized SQL Servers
Specialized SQL servers provide advanced query language and query processing support for SQL
queries over star and snowflake schemas in a read-only environment.
OLAP Operations:
Four types of analytical operations in OLAP are:
1. Roll-up
2. Drill-down
3. Slice and dice
4. Pivot (rotate)
1) Roll-up:
Roll-up is also known as "consolidation" or "aggregation." The Roll-up operation can be
performed in 2 ways:
1. Reducing dimensions
2. Climbing up concept hierarchy. Concept hierarchy is a system of grouping things based on
their order or level.
Consider the following diagram:


Figure 2.2: Roll-up operation

• In this example, the cities New Jersey and Los Angeles are rolled up into the country USA.
• The sales figures of New Jersey and Los Angeles are 440 and 1560 respectively; they
become 2000 after the roll-up.
• In this aggregation process, data in the location hierarchy moves up from city to country.
• In the roll-up process, one or more dimensions may also be removed; in this example, the
Quarter dimension is removed (see the query sketch below).
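The same roll-up can be expressed as a query that aggregates city-level sales up to the country level; the sales table and columns here are hypothetical.

-- roll up sales from city to country by climbing the location hierarchy
SELECT country, SUM(sales_amount) AS total_sales
FROM   sales
GROUP BY country;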

2) Drill-down:
In drill-down, data is fragmented into smaller parts. It is the opposite of the roll-up process. It can
be done via:
• Moving down the concept hierarchy
• Adding a dimension

Figure 2.3: Drill-down operation


Consider the diagram above:
• Quarter Q1 is drilled down to the months January, February, and March, and the
corresponding sales figures are also shown.
• In this example, the month level is added, making the time dimension more detailed.


3) Slice:
Here, one dimension is selected and a new sub-cube is created.
The following diagram explains how the slice operation is performed:

Figure 2.4: Slice operation


• Dimension Time is Sliced with Q1 as the filter.
• A new cube is created altogether.

Dice:
This operation is similar to slice. The difference is that dice selects two or more dimensions,
resulting in the creation of a sub-cube.

Figure 2.5: Dice operation
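Slice and dice correspond to filtering the cube on one or more dimensions; a sketch against a hypothetical sales fact table:

-- slice: fix the time dimension to quarter Q1
SELECT location, product, SUM(revenue) AS revenue
FROM   sales
WHERE  quarter = 'Q1'
GROUP BY location, product;

-- dice: restrict two dimensions (time and location) to selected values
SELECT quarter, location, product, SUM(revenue) AS revenue
FROM   sales
WHERE  quarter IN ('Q1', 'Q2') AND location IN ('Delhi', 'Mumbai')
GROUP BY quarter, location, product;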


4) Pivot
In pivot, the data axes are rotated to provide an alternative presentation of the data.
In the following example, the pivot is based on item types.

Figure 2.6: Pivot operation


Hardware and operational design
Server Hardware
Two main hardware architectures
➢ Symmetric Multi-Processing (SMP)
➢ Massively Parallel Processing (MPP)
- SMP machine is a set of tightly coupled CPUs that share memory and disk.
- MPP machine is a set of loosely coupled CPUs, each of which has its own memory and disk.
Symmetric Multi-Processing (SMP)
An SMP machine is a set of CPUs that share memory and disk. This is sometimes called a shared-
everything environment. The CPUs in an SMP machine are all equal, so a process can run on any CPU
in the machine and can run on different CPUs at different times.


Figure 2.7: Symmetric Multi-Processing (SMP)

Scalability of SMP machines


The length of the communication bus connecting the CPUs is a natural limit. As the bus gets longer,
inter-processor communication becomes slower; each extra CPU imposes an extra bandwidth
load on the bus, increases memory contention, and so on.
Massively Parallel Processing (MPP)
Made up of many loosely coupled nodes linked together by a high-speed connection. Each node
has its own memory, and the disks are not shared. Most MPP systems allow a disk to be dual
connected between two nodes.

Figure 2.8: Massively Parallel Processing (MPP)


Shared Nothing Environment
a) Nothing is shared directly. Many different models of MPP


➢ Nodes with two CPUs


-One CPU is dedicated to handling I/O
-Nodes that are genuine SMP machines
➢ Single CPU nodes
-Some will be dedicated to handling I/O and others to processing
b) Level of shared nothing varies from vendor to vendor
Example: IBM SP/2 is a fully shared-nothing system
Virtual Shared Disk (VSD)
➢ An extra layer of software to be added to allow disks to be shared across nodes
➢ System will suffer overheads if an I/O is issued and data has to be shipped from node to
node
Distributed Lock Manager
➢ MPP machines require this to maintain the integrity of the distributed resources
➢ Track which node's memory holds the current copy of each piece of information
➢ Getting the data rapidly from node to node

Network Hardware
Network Architecture
➢ Sufficient bandwidth to supply the data feed and user requirements
Impact to design
➢ User access via WAN – impacts the design of Query Manager
➢ Source system data transfer
➢ Data extractions
Example: consider the problem of getting the data from the source systems.
The network may not get the data to the warehouse system early enough to allow it to be loaded, transformed,
processed and backed up within the overnight time window.
Guideline
➢ Ensure that the network architecture and bandwidth are capable of supporting the data
transfer and any data extractions in an acceptable time.
➢ The transfer of data to be loaded must be complete quickly enough to allow the rest of the
overnight processing to complete.

Client Hardware
Client Management
➢ Those responsible for client machine management will need to know the requirements for
that machine to access the data warehouse system.
➢ Details such as the network protocols supported on the server, and the server's Internet
address, will need to be supplied.
➢ If multiple access paths to the server system exist this information needs to be relayed to
those responsible for the client systems.
-During node failover, users may need to access a different machine address.
Client Tools


➢ The tool should not be allowed to affect the basic design of the warehouse itself.
➢ Multiple tools will be used against the data warehouse.
➢ Should be thoroughly tested and trialed to ensure that they are suitable for the users.
➢ Testing of the tools should ideally be performed in parallel with the data warehouse design:
✓ Usability issues to be exposed,
✓ Drive out any requirements that the tool will place on the data warehouse

Disk Technology:
RAID Technology
Redundant Array of Inexpensive Disks
➢ The purpose of RAID technology is to provide resilience against disk failure, so that the
loss of an individual disk does not mean loss of data.
➢ Striping is a technique in which data is spread across multiple disks.
➢ RAID levels 0, 1 and 5 are commercially viable and thus widely available
Table 2.2: RAID Level with Descriptions

Figure 2.9: RAID Technology


What is Data Warehouse Security?


Data warehouses pull data from many different sources, and warehouses have many moving parts.
Security issues arise every time data moves from one place to another. Data warehouse security
entails taking the necessary steps to protect information so that only authorized personnel can
access it.
Data warehouse security should involve the following:
➢ Strict user access control so that users of the warehouse can only access data they need to
do their jobs.
➢ Taking steps to secure networks on which data is held.
➢ Carefully moving data and considering the security implications involved in all data
movement.
Physical Security Practices
1. Restricting and controlling physical access to data warehouses has been made easier thanks to
tech innovations like biometric readers, anti-tailgating systems and other physical access control
mechanisms. These might look excessive and like an expenditure overhead, but they play a crucial role in
ensuring the integrity and safety of precious enterprise data.
2. Imparting information about security protocols and ensuring all the personnel in the proximity
of the data warehouse religiously obey and adhere to these rules is one of the keys to success. It’s
understandable that an employee can be used by intruders to gain access, but if the employee in
question is ardently following the specified guidelines it makes a world of difference.
Software-Based Security Measures
Data Encryption
Data encryption is one of the primary safeguards against data theft. All data should be encrypted
using algorithms like AES (advanced encryption standard) or FIPS 140-2 certified software for
data encryption, whether it’s in the transactional database or the data warehouse. Some proponents
would argue that data encryption adversely affects the performance and data access speed of data
centers, but considering the alternative, it is preferred.
Data Segmenting and Partitioning
Data encryption although an added security measure can be quite cumbersome if applied without
segmenting and partitioning. Segmenting and partitioning entail classifying or splitting data into
sensitive and non-sensitive information. After going through partitioning the data should be
accordingly encrypted and put into separate tables ready for consumption.
Securing On-The-Move Data
Securing data at rest and securing data in transit are two different ball games. Here, data in transit means
the data being relayed from transactional databases to the data warehouse in real time.
These transactional databases can be anywhere geographically; therefore using protective
protocols such as SSL or TLS is highly recommended. Cloud-based data warehouses nowadays
provide a secure tunnel between the database and the cloud storage, which should
be leveraged.
Trusted Witness Server
As mentioned earlier, hackers and intruders nowadays have become as skilled and sophisticated
as the security measures they are up against. Implementing a trusted witness server is akin to hiring


a watchdog that avidly and quite tenaciously keeps vigil on your data access points. It can detect
an unwarranted and suspicious attempt at accessing data and generate an alert immediately. This
allows the people responsible for the data warehouse security to stop the intruders dead in their
tracks.

Backup and recovery of Data Warehouse


Backup and recovery are among the most important tasks for an administrator, and data
warehouses are no different. However, because of the sheer size of the database, data warehouses
introduce new challenges for an administrator in the backup and recovery area.
Data warehouses are unique in that the data can come from a myriad of sources and is
transformed before finally being inserted into the database; but above all, they can be very large.
Managing the recovery of a large data warehouse can be a daunting task and traditional OLTP
backup and recovery strategies may not meet the needs of a data warehouse.
Strategies and Best Practices for Backup and Recovery
Devising a backup and recovery strategy can be a daunting task. And when you have hundreds of
gigabytes of data that must be protected and recovered in the case of a failure, the strategy can be
very complex.
The following best practices can help you implement your warehouse's backup and recovery
strategy:
• Best Practice A: Use ARCHIVELOG Mode
• Best Practice B: Use RMAN
• Best Practice C: Use Read-Only Tablespaces
• Best Practice D: Plan for NOLOGGING Operations
• Best Practice E: Not All Tablespaces are Equally Important
Best Practice-A: Use ARCHIVELOG Mode
Archived redo logs are crucial for recovery when no data can be lost, because they constitute a
record of changes to the database. Oracle can be run in either of two modes:
• ARCHIVELOG -- Oracle archives the filled online redo log files before reusing them in
the cycle.
• NOARCHIVELOG -- Oracle does not archive the filled online redo log files before reusing
them in the cycle.
Running the database in “ARCHIVELOG” mode has the following benefits:
• The database can be completely recovered from both instance and media failure.
• The user can perform backups while the database is open and available for use.
• Oracle supports multiplexed archive logs to avoid any possible single point of failure on
the archive logs
• The user has more recovery options, such as the ability to perform tablespace-point-in-time
recovery (TSPITR).
• Archived redo logs can be transmitted and applied to the physical standby database, which
is an exact replica of the primary database.
Running the database in “NOARCHIVELOG” mode has the following consequences:


• The user can only back up the database while it is completely closed after a clean shutdown.
• Typically, the only media recovery option is to restore the whole database, which causes
the loss of all transactions since the last backup.
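A sketch of how an administrator might switch an Oracle database into ARCHIVELOG mode, run from SQL*Plus as SYSDBA; exact steps can vary by version and configuration.

-- enable ARCHIVELOG mode so filled redo logs are archived before being reused
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
ALTER DATABASE ARCHIVELOG;
ALTER DATABASE OPEN;
-- verify the current log mode
SELECT log_mode FROM v$database;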
Best Practice-B: Use RMAN
There are many reasons to adopt RMAN. Some of the reasons to integrate RMAN into your backup
and recovery strategy are that it offers:
• Extensive reporting
• Incremental backups
• Downtime free backups
• Backup and restore validation
• Backup and restore optimization
• Easily integrates with media managers
• Block media recovery
• Archive log validation and management
• Corrupt block detection
[Under Case study]
Best Practice C: Use Read-Only Tablespaces
Best Practice D: Plan for NOLOGGING Operation
Best Practice E: Not All Tablespaces are Equally Important

-----------------------------***-----------------------------


-------------------------------------------------------------------------------------------------
Unit-III
Introduction to Data & Data Mining: Data Types, Quality of data, Data Preprocessing, Similarity
measures, Summary statistics, Data distributions, Basic data mining tasks, Data Mining V/s
Knowledge discovery in databases. Issues in Data mining. Introduction to Fuzzy sets and Fuzzy
logic.

Introduction to Data
In computing, data is information that has been translated into a form that is efficient for movement
or processing. Relative to today's computers and transmission media, data is information converted
into binary digital form. It is acceptable for data to be used as a singular subject or a plural subject.
Raw data is a term used to describe data in its most basic digital format.
Computers represent data, including video, images, sounds and text, as binary values using patterns
of just two numbers: 1 and 0. A bit is the smallest unit of data, and represents just a single value.
A byte is eight binary digits long. Storage and memory are measured in megabytes and gigabytes.
Growth of the web and smartphones over the past decade led to a surge in digital data creation.
Data now includes text, audio and video information, as well as log and web activity records. Much
of that is unstructured data.
In general, data is any set of characters that is gathered and translated for some purpose, usually
analysis. If data is not put into context, it is of no use to a human or a computer.
There are multiple types of data. Some of the more common types of data include the following:
➢ Single character
➢ Boolean (true or false)
➢ Text (string)
➢ Number (integer or floating-point)
➢ Picture
➢ Sound
➢ Video
In a computer's storage, data is a series of bits (binary digits) that have the value one or zero. Data
is processed by the CPU, which uses logical operations to produce new data (output) from source
data (input).

Introduction to Data Mining


Data Mining (DM) is processing data to identify patterns and establish relationships. DM is the
process of analyzing data from different perspectives and summarizing it into useful information.
This information can be used in decision making. DM is the extraction of hidden predictive
information from large amounts of data stored in the data warehouse for useful information, using
technology with great potential to help companies focus on the most important information.
Data mining tools predict future trends and behaviors, allowing businesses to make proactive,
knowledge-driven decisions. The automated, prospective analyses offered by data mining move
beyond the analyses of past events provided by retrospective tools typical of decision support
systems. Data mining tools can answer business questions that traditionally were too time
consuming to resolve. They scour databases for hidden patterns, finding predictive information
that experts may miss because it lies outside their expectations.
“Data mining is the identification or extraction of relationships and patterns from data using computational algorithms to reduce, model, understand, or analyze data.”
Data Mining Functionalities
Different kind of patterns can be discovered depending on the data mining task in use. There are
mainly two types of data mining tasks:
1. Descriptive Data Mining Tasks
2. Predictive Data Mining Tasks
Descriptive mining tasks characterize the common properties of the existing data. Predictive mining tasks perform inference on the existing data in order to make predictions.
Types of data that can be mined
1. Data stored in the database
A database system, commonly called a database management system (DBMS), stores data that are related to one another in some way. It also has a set of software programs that are used
to manage data and provide easy access to it. These software programs serve a lot of purposes,
including defining structure for database, making sure that the stored information remains secured
and consistent, and managing different types of data access, such as shared, distributed, and
concurrent.
2. Data warehouse
A data warehouse is a single data storage location that collects data from multiple sources and then
stores it in the form of a unified plan. When data is stored in a data warehouse, it undergoes
cleaning, integration, loading, and refreshing. Data stored in a data warehouse is organized in
several parts.
3. Transactional data
Transactional database stores records that are captured as transactions. These transactions include
flight booking, customer purchase, click on a website, and others. Every transaction record has a
unique ID. It also lists all those items that made it a transaction.
4. Other types of data


We have a lot of other types of data as well that are known for their structure, semantic meanings,
and versatility. They are used in a lot of applications. Here are a few of those data types: data
streams, engineering design data, sequence data, graph data, spatial data, multimedia data, and
more.

Quality of data
Data quality is a measure of the condition of data based on factors such as accuracy, completeness,
consistency, reliability and whether it's up to date.
Data quality enables you to cleanse and manage data while making it available across your
organization. High-quality data enables strategic systems to integrate all related data to provide a
complete view of the organization and the interrelationships within it. Data quality is an essential
characteristic that determines the reliability of decision-making.

Data is a valuable asset that must be managed as it moves through an organization. As information
sources are growing more numerous and diverse, and regulatory compliance initiatives more
focused, the need to integrate, access and reuse information from these disparate sources
consistently and trustfully is becoming critical.

Data Preprocessing
1. Real world data are generally:
Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
Noisy: containing errors or outliers
Inconsistent: containing discrepancies in codes or names
2. Tasks in data preprocessing
Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
Data integration: using multiple databases, data cubes, or files.
Data transformation: normalization and aggregation.
Data reduction: reducing the volume but producing the same or similar analytical results.
Data discretization: part of data reduction, replacing numerical attributes with nominal
ones.
Data Cleaning
Data cleaning is a technique that deals with detecting and removing inconsistencies and errors from the data in order to obtain better-quality data. Data cleaning is performed as a data preprocessing step
while preparing the data for a data warehouse. Good quality data requires passing a set of quality
criteria. Those criteria include: Accuracy, Integrity, Completeness, Validity, Consistency,
Uniformity, Density and Uniqueness.
Data Integration
Data Integration is a data preprocessing technique that takes data from one or more sources and maps it, field by field, onto a new data structure. The idea is to merge the data from multiple sources into a coherent data store. Data may be distributed over different databases or data warehouses, and it may be necessary to enhance the data with additional (external) data. Issues such as the entity identification problem must be resolved during integration.
Data Transformation
In data transformation, data are consolidated into forms appropriate for mining, for example by performing summary or aggregation operations. Data transformation involves the following:
➢ Data Smoothing
➢ Data aggregation
➢ Data Generalization
➢ Normalization
➢ Attribute Construction
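As a small illustration (toy values, not taken from the notes), the following base-R sketch shows two common normalization transformations used during data transformation:

# Toy attribute values to be normalized before mining
x <- c(200, 300, 400, 600, 1000)
min_max <- (x - min(x)) / (max(x) - min(x))   # min-max normalization: rescales to [0, 1]
z_score <- (x - mean(x)) / sd(x)              # z-score normalization: zero mean, unit standard deviation
rbind(min_max = round(min_max, 2), z_score = round(z_score, 2))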

Data Reduction
If the data set is very large, the task of data mining and analysis can take a long time, making the whole exercise impractical or infeasible. Data reduction is the transformation
of numerical or alphabetical digital information derived empirically or experimentally into a
corrected, ordered, and simplified form. The basic concept is the reduction of multitudinous
amounts of data down to the meaningful parts.
The data reduction strategies include:
➢ Data cube aggregation
➢ Dimensionality reduction
➢ Data discretization and concept hierarchy generation
➢ Attribute Subset Selection

Similarity measures
The similarity measure is a measure of how much alike two data objects are. A similarity measure in a data mining context is based on a distance whose dimensions represent the features of the objects. If this distance is small, there is a high degree of similarity; if the distance is large, the degree of similarity is low.
The similarity is subjective and is highly dependent on the domain and application. For example,
two fruits are similar because of color or size or taste. Care should be taken when calculating
distance across dimensions/features that are unrelated. The relative values of each element must
be normalized, or one feature could end up dominating the distance calculation. Similarity is typically measured in the range [0, 1].
Two main considerations about similarity:

• Similarity = 1 if X = Y (Where X, Y are two objects)


• Similarity = 0 if X ≠ Y

Distance or similarity measures are essential in solving many pattern recognition problems such as classification and clustering. As the name suggests, a similarity measure indicates how close two data objects (or distributions) are.
Similarity Measure
A numerical measure of how alike two data objects are; it typically falls between 0 (no similarity) and 1 (complete similarity).
Dissimilarity Measure
A numerical measure of how different two data objects are; it ranges from 0 (objects are alike) to ∞ (objects are completely different).
Proximity
Refers to a similarity or dissimilarity.
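A minimal base-R sketch (toy objects with assumed feature values) of turning a Euclidean distance into a similarity score in [0, 1]:

a <- c(1, 2, 3)               # two data objects described by three numeric features
b <- c(2, 2, 5)
d   <- sqrt(sum((a - b)^2))   # Euclidean distance between the two objects
sim <- 1 / (1 + d)            # 1 when the objects are identical, approaching 0 as distance grows
c(distance = d, similarity = sim)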

Summary statistics
Summary statistics summarize and provide information about your sample data. It tells you
something about the values in your data set. This includes where the average lies and whether your
data is skewed. Summary statistics fall into three main categories:

➢ Measures of location (also called central tendency).


➢ Measures of spread.
➢ Graphs/charts
Summary Statistics: Measures of location
Measures of location tell you where your data is centered, or where a trend lies. Common measures of location include:
➢ Mean (also called the arithmetic mean or average)
➢ Geometric mean (used for interest rates and other types of growth)
➢ Trimmed Mean (the mean with outliers excluded)
➢ Median (the middle of a data set)
Summary Statistics: Measures of spread
Measures of spread tell you (perhaps not surprisingly!) how spread out or varied your data set is.
This can be important information. For example, test scores that are in the 60-90 range might be
expected, while scores in the 20-70 range might indicate a problem. Range isn't the only measure of spread, though. Common measures of spread include:

➢ Range (how spread out your data is)
➢ Interquartile range (where the “middle fifty” percent of your data is)
➢ Quartiles (boundaries for the lowest, middle and upper quarters of data)
➢ Skewness (does your data have mainly low, or mainly high values?)
➢ Kurtosis (a measure of how much data is in the tails)
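The location and spread measures listed above can be computed directly in base R; a quick sketch with made-up test scores:

scores <- c(62, 68, 70, 71, 73, 75, 78, 81, 85, 90)   # hypothetical test scores
mean(scores)                 # arithmetic mean
exp(mean(log(scores)))       # geometric mean
mean(scores, trim = 0.1)     # trimmed mean (10% cut from each end)
median(scores)               # middle of the data set
range(scores)                # smallest and largest value
IQR(scores)                  # interquartile range (the "middle fifty")
quantile(scores)             # quartile boundaries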
Summary Statistics: Graphs and Charts
There are literally dozens of ways to display summary data using graphs or charts. Some of the
most common ones are listed below.
➢ Histogram
➢ Frequency Distribution Table
➢ Box plot
➢ Bar chart
➢ Scatter plot
➢ Pie chart

Data distributions
A data distribution is a function or a listing which shows all the possible values (or intervals) of
the data. It also (and this is important) tells you how often each value occurs. Often, the data in a
distribution will be ordered from smallest to largest, and graphs and charts allow you to easily
see both the values and the frequency with which they appear.
From a distribution you can calculate the probability of any one particular observation in the
sample space, or the likelihood that an observation will have a value which is less than (or greater
than) a point of interest.
Data distributions are used often in statistics. They are graphical methods of organizing and
displaying useful information. There are several types of data distributions like dot plots,
histograms, box plots, and tally charts. Here we will focus on dot plots and histograms

Dot Plots
Dot plots show numerical values plotted on a scale. Each dot represents one value in the set of
data. In the example below, the customer service ratings range from 0 to 9. The dots tell us
the frequency, or rate of occurrence, of customers who gave each rating. If you look at the 5 rating,
you can see that three customers gave that rating, and if you look at a score of 9, eight customers
gave that rating. We can also see that ratings were provided by fifty customers, one dot for each
customer.

Figure 1: Example of a dot plot

Now imagine that ratings were provided by five hundred customers. It would not be practical or
useful to have a distribution of five hundred dots. For this reason, dot plots are used for data that
have a relatively small number of values.

Histograms
Histograms display data in ranges, with each bar representing a range of numeric values. The
height of the bar tells you the frequency of values that fall within that range. In the example below,
the first bar represents black cherry trees that are between 60 and 65 feet in height. The bar goes
up to three, so there are three trees that are between 60 and 65 feet.

Figure 2: Example of a histogram


Histograms are an excellent way to display large amounts of data. If you have a set of data that
includes thousands of values, you can simply adjust the frequency interval to accommodate a
larger scale, rather than just 0-10.
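A minimal base-R example (simulated tree heights, not the data behind the figures) of producing a histogram and a simple dot-plot-style display:

set.seed(5)
heights <- rnorm(31, mean = 76, sd = 8)            # simulated black cherry tree heights (feet)
hist(heights, main = "Tree height", xlab = "Height (feet)")   # each bar counts values in a range
stripchart(heights, method = "stack", pch = 16)               # a basic dot-plot-like display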

Basic data mining tasks


The two "high-level" primary goals of data mining, in practice, are prediction and description.

1. Prediction involves using some variables or fields in the database to predict unknown or
future values of other variables of interest.
2. Description focuses on finding human-interpretable patterns describing the data.
The relative importance of prediction and description for particular data mining applications can
vary considerably. However, in the context of KDD, description tends to be more important than prediction. This is in contrast to pattern recognition and machine learning applications (such as speech recognition), where prediction is often the primary goal.
The goals of prediction and description are achieved by using the following primary data mining
tasks:
1. Classification is learning a function that maps (classifies) a data item into one of several
predefined classes.
2. Regression is learning a function which maps a data item to a real-valued prediction
variable.
3. Clustering is a common descriptive task where one seeks to identify a finite set of categories or clusters to describe the data.

• Closely related to clustering is the task of probability density estimation, which consists of techniques for estimating, from data, the joint multivariate probability density function of all the variables/fields in the database.

4. Summarization involves methods for finding a compact description for a subset of data.
5. Dependency Modeling consists of finding a model which describes significant
dependencies between variables. Dependency models exist at two levels:
1. The structural level of the model specifies (often graphically) which variables are locally dependent on each other, and
2. The quantitative level of the model specifies the strengths of the dependencies using some numerical scale.
6. Change and Deviation Detection focuses on discovering the most significant changes in the data from previously measured or normative values.

Data Mining V/s Knowledge discovery in databases


With the enormous amount of data stored in files, databases, and other repositories, it is
increasingly important, if not necessary, to develop powerful means for analysis and perhaps
interpretation of such data and for the extraction of interesting knowledge that could help in
decision-making.
Data Mining, also popularly known as Knowledge Discovery in Databases (KDD), refers to the
nontrivial extraction of implicit, previously unknown and potentially useful information from data
in databases. While data mining and knowledge discovery in databases (or KDD) are frequently
treated as synonyms, data mining is actually part of the knowledge discovery process. The
following figure (Figure 3) shows data mining as a step in an iterative knowledge discovery
process.

Figure 3: KDD process

The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge. The iterative process consists of the following steps:

• Data cleaning: Also known as data cleansing, it is a phase in which noisy data and irrelevant data are removed from the collection.
• Data integration: At this stage, multiple data sources, often heterogeneous, may be
combined in a common source.

• Data selection: At this step, the data relevant to the analysis is decided on and retrieved
from the data collection.
• Data transformation: Also known as data consolidation, it is a phase in which the selected
data is transformed into forms appropriate for the mining procedure.
• Data mining: It is the crucial step in which clever techniques are applied to extract patterns
potentially useful.
• Pattern evaluation: In this step, strictly interesting patterns representing knowledge are
identified based on given measures.
• Knowledge representation: Is the final phase in which the discovered knowledge is
visually represented to the user. This essential step uses visualization techniques to help
users understand and interpret the data mining results.
Data mining derives its name from the similarities between searching for valuable information in
a large database and mining rocks for a vein of valuable ore. Both imply either sifting through a
large amount of material or ingeniously probing the material to exactly pinpoint where the values
reside. It is, however, a misnomer, since mining for gold in rocks is usually called "gold mining"
and not "rock mining", thus by analogy, data mining should have been called "knowledge mining"
instead. Nevertheless, data mining became the accepted customary term, and very rapidly became a trend that even overshadowed more general terms such as knowledge discovery in databases (KDD), which describe a more complete process. Other similar terms referring to data mining are: data dredging,
knowledge extraction and pattern discovery.

Issues in Data mining


Data mining is not an easy task, as the algorithms used can get very complex and data is not
always available at one place. It needs to be integrated from various heterogeneous data sources.
These factors also create some issues. Here, we will discuss the major issues regarding −
• Mining Methodology and User Interaction
• Performance Issues
• Diverse Data Types Issues

The following diagram describes the major issues.

Figure 4: Issues in Data mining


Issues that need to be addressed by any serious data mining package are:
i. Uncertainty Handling
ii. Dealing with Missing Values
iii. Dealing with Noisy Data
iv. The Efficiency of Algorithms
v. Constraining the Knowledge Discovered to Only What Is Useful


vi. Incorporating Domain Knowledge
vii. Size and Complexity of Data
viii. Data Selection
ix. Understandability of Discovered Knowledge: Consistency between Data and Discovered
Knowledge
Data Mining System Classification
A data mining system can be classified according to the following criteria −
➢ Database Technology
➢ Statistics
➢ Machine Learning
➢ Information Science
➢ Visualization
➢ Other Disciplines

Figure 5: Data Mining System Classification


Data mining systems depend on databases to supply the raw input, and this raises problems, such as databases tending to be dynamic, incomplete, noisy, and large. Other problems arise
as a result of the inadequacy and irrelevance of the information stored. The difficulties in data
mining can be categorized as
a) Limited information
b) Noise or missing data
c) User interaction and prior knowledge
d) Uncertainty
e) Size, updates and irrelevant fields

Introduction to Fuzzy sets and Fuzzy logic:


The word fuzzy refers to things which are not clear or are vague. Any event, process, or function
that is changing continuously cannot always be defined as either true or false, which means that
we need to define such activities in a Fuzzy manner.
Fuzzy Logic resembles the human decision-making methodology. It deals with vague and imprecise information, and it is based on degrees of truth rather than the usual true/false (1/0) of Boolean logic, which is a gross oversimplification of real-world problems.
Take a look at the following diagram. It shows that in fuzzy systems, the values are indicated by
a number in the range from 0 to 1. Here 1.0 represents absolute truth and 0.0 represents absolute
falseness. The number which indicates the value in fuzzy systems is called the truth value.

Figure 6: Fuzzy System


In other words, we can say that fuzzy logic is not logic that is fuzzy, but logic that is used to
describe fuzziness. There can be numerous other examples like this with the help of which we can
understand the concept of fuzzy logic.
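A small sketch (the membership function below is a hypothetical example, not from the notes) of how a fuzzy set assigns a degree of truth in [0, 1] instead of a crisp true/false value:

# Membership function for the fuzzy set "tall": 0 below 160 cm, 1 above 180 cm, linear in between
tall <- function(height_cm) pmin(1, pmax(0, (height_cm - 160) / 20))
tall(c(150, 165, 175, 185))   # degrees of membership: 0.00 0.25 0.75 1.00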

A set is an unordered collection of different elements. It can be written explicitly by listing its
elements using the set bracket. If the order of the elements is changed or any element of a set is
repeated, it does not make any changes in the set.

Example:
A set of all positive integers.
A set of all the planets in the solar system.
A set of all the states in India.
A set of all the lowercase letters of the alphabet.
Mathematical Representation of a Set
Sets can be represented in two ways −
Roster or Tabular Form
In this form, a set is represented by listing all the elements comprising it. The elements are
enclosed within braces and separated by commas.
Following are the examples of set in Roster or Tabular Form −
• Set of vowels in English alphabet, A = {a,e,i,o,u}
• Set of odd numbers less than 10, B = {1,3,5,7,9}
Set Builder Notation


In this form, the set is defined by specifying a property that elements of the set have in
common. The set is described as A = {x:p(x)}
Example 1 − The set {a,e,i,o,u} is written as:

A = {x:x is a vowel in English alphabet}

Example 2 − The set {1,3,5,7,9} is written as:

B = {x:1 ≤ x < 10 and (x%2) ≠ 0}


If an element x is a member of any set S, it is denoted by x∈ S and if an element y is not a
member of set S, it is denoted by y∉ S.
Example − If S = {1,1.2,1.7,2},1 ∈ S but 1.5 ∉ S

Cardinality of a Set

Cardinality of a set S, denoted by |S|, is the number of elements of the set. The number is also referred to as the cardinal number. If a set has an infinite number of elements, its cardinality is ∞.
Example − |{1,4,3,5}| = 4,|{1,2,3,4,5,…}| = ∞
If there are two sets X and Y, |X| = |Y| denotes two sets X and Y having same cardinality. It occurs
when the number of elements in X is exactly equal to the number of elements in Y. In this case,
there exists a bijective function ‘f’ from X to Y.
|X| ≤ |Y| denotes that set X’s cardinality is less than or equal to set Y’s cardinality. It occurs when
the number of elements in X is less than or equal to that of Y. Here, there exists an injective
function ‘f’ from X to Y.
|X| < |Y| denotes that set X’s cardinality is less than set Y’s cardinality. It occurs when the number
of elements in X is less than that of Y. Here, the function ‘f’ from X to Y is injective function but
not bijective.
If |X| ≤ |Y| and |Y| ≤ |X|, then |X| = |Y|. The sets X and Y are commonly referred to as equivalent sets.
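A quick base-R illustration of membership and cardinality, using the sets from the examples above:

S <- c(1, 1.2, 1.7, 2)            # roster form of the set S
A <- c("a", "e", "i", "o", "u")   # vowels in the English alphabet
1 %in% S                          # TRUE:  1 is an element of S
1.5 %in% S                        # FALSE: 1.5 is not an element of S
length(A)                         # cardinality |A| = 5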
***

CS-703(B) Open Elective, Data Mining and Warehousing


-------------------------------------------------------------------------------------------------
UNIT-IV: Supervised Learning: Classification: Statistical-based algorithms, Distance-based
algorithms, Decision tree-based algorithms, neural network-based algorithms, Rule-based
algorithms, Probabilistic Classifiers
--------------------------------------------------------------------
Supervised Learning
Supervised learning, also known as supervised machine learning, is a subcategory of machine
learning and artificial intelligence. It is defined by its use of labeled datasets to train algorithms to classify data or predict outcomes accurately. In supervised learning, algorithms learn from
labeled data. After understanding the data, the algorithm determines which label should be given
to new data by associating patterns to the unlabeled new data.
Supervised learning can be divided into two categories:
I. Classification: The model finds classes in which to place its inputs.
II. Regression: The model finds outputs that are real variables.
Basically, supervised learning is learning in which we train the machine using data that is well labeled, meaning some data is already tagged with the correct answer. After that, the machine is provided with a new set of examples (data) so that the supervised learning algorithm analyses the training data (set of training examples) and produces a correct outcome from the labeled data.

Figure 1: Supervised Learning

Classification
Classification is a technique for determining which class the dependent variable belongs to, based on one or more independent variables. Classification is a data mining function that assigns items in a
collection to target categories or classes. The goal of classification is to accurately predict the target
class for each case in the data.
For example, a classification model could be used to identify loan applicants as low, medium, or
high credit risks.
There are various applications of classification algorithms as:
1. Medical Diagnosis
2. Image and pattern recognition
3. Fault detection
4. Financial market position etc.
There are two forms of data analysis that can be used for extracting models describing important
classes or to predict future data trends. These two forms are as follows −
i) Classification
ii) Prediction
Classification models predict categorical class labels; and prediction models predict continuous
valued functions. For example, we can build a classification model to categorize bank loan
applications as either safe or risky, or a prediction model to predict the expenditures in dollars of
potential customers on computer equipment given their income and occupation.
There are three main approaches to the classification problem:
1. The first approach divides the space defined by the data points into regions, and each region corresponds to a given class.
2. The second approach is to find the probability of an example belonging to each class.
3. The third approach is to find the probability of a class containing that example.

Statistical-based algorithms
Statistical Distribution-Based Outlier Detection: The statistical distribution-based approach to
outlier detection assumes a distribution or probability model for the given data set (e.g., a Normal
or Poisson distribution) and then identifies outliers with respect to the model using a discordancy
test. Application of the test requires knowledge of the data set parameters (such as the assumed data distribution), knowledge of distribution parameters such as the mean and variance, and the expected number of outliers.

A statistical discordancy test examines two hypotheses:


➢ A working hypothesis
➢ An alternative hypothesis
A working hypothesis, H, is a statement that the entire data set of n objects comes from an initial distribution model, F, that is,
H: oi ∈ F, where i = 1, 2, …, n
The hypothesis is retained if there is no statistically significant evidence supporting its rejection.
A discordancy test verifies whether an object, oi, is significantly large (or small) in relation to the
distribution F. Different test statistics have been proposed for use as a discordancy test, depending
on the available knowledge of the data.
Assuming that some statistic, T, has been chosen for discordancy testing, and the value of the
statistic for object oi is vi, then the distribution of T is constructed. Significance probability, SP
(vi) = Prob (T > vi), is evaluated. If SP (vi) is sufficiently small, then oi is discordant and the
working hypothesis is rejected.
An alternative hypothesis, H̄, which states that oi comes from another distribution model, G, is
adopted. The result is very much dependent on which model F is chosen because oi may be an
outlier under one model and a perfectly valid value under another. The alternative distribution is
very important in determining the power of the test, that is, the probability that the working
hypothesis is rejected when oi is really an outlier. There are different kinds of alternative
distributions.
Inherent alternative distribution:
In this case, the working hypothesis that all of the objects come from distribution F is rejected in favor of the alternative hypothesis that all of the objects arise from another distribution, G:
H̄: oi ∈ G, where i = 1, 2, …, n
F and G may be different distributions or differ only in parameters of the same distribution.
There are constraints on the form of the G distribution in that it must have potential to produce
outliers. For example, it may have a different mean or dispersion.
Mixture alternative distribution:
The mixture alternative states that discordant values are not outliers in the F population, but
contaminants from some other population, G. In this case, the alternative hypothesis is:
H̄: oi ∈ (1 − λ)F + λG, where i = 1, 2, …, n
Slippage alternative distribution:


This alternative states that all of the objects (apart from some prescribed small number) arise
independently from the initial model, F, with its given parameters, whereas the remaining objects
are independent observations from a modified version of F in which the parameters have been
shifted.
There are two basic types of procedures for detecting outliers:
Block procedures: In this case, either all of the suspect objects are treated as outliers or all of
them are accepted as consistent.
Consecutive procedures: Its main idea is that the object that is least likely to be an outlier is tested
first. If it is found to be an outlier, then more extreme values are also considered outliers; otherwise,
the next most extreme object is tested, and so on. This procedure tends to be more effective than
block procedures.
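A minimal sketch (simulated data; the z-score style test and thresholds are assumptions for illustration, not prescribed by the notes) of a discordancy test under a normal working hypothesis, using the significance probability SP(vi) = Prob(T > vi):

set.seed(1)
x <- c(rnorm(50, mean = 10, sd = 1), 25)   # 50 ordinary values plus one suspect value
z  <- abs(x - mean(x)) / sd(x)             # test statistic T: absolute z-score of each object
sp <- 1 - pnorm(z)                         # significance probability SP(v) = Prob(T > v)
x[sp < 0.001]                              # objects for which the working hypothesis is rejected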

Distance-Based Outlier Detection:


The notion of distance-based outliers was introduced to counter the main limitations imposed by
statistical methods. An object, o, in a data set, D, is a distance-based (DB) outlier with parameters
pct and dmin, that is, a DB (pct; dmin)-outlier, if at least a fraction, pct, of the objects in D lie at a
distance greater than dmin from o. In other words, rather than relying on statistical tests, we can
think of distance-based outliers as those objects that do not have enough neighbors, where
neighbors are defined based on distance from the given object. In comparison with statistical-based
methods, distance based outlier detection generalizes the ideas behind discordancy testing for
various standard distributions. Distance-based outlier detection avoids the excessive computation
that can be associated with fitting the observed distribution into some standard distribution and in
selecting discordancy tests.
For many discordancy tests, it can be shown that if an object, o, is an outlier according to the given
test, then o is also a DB (pct, dmin)-outlier for some suitably defined pct and dmin. For example,
if objects that lie three or more standard deviations from the mean are considered to be outliers, assuming a normal distribution, then this definition can be generalized by a DB(0.9988, 0.13σ)-outlier. Several efficient algorithms for mining distance-based outliers have been developed.
Index-based algorithm:

Given a data set, the index-based algorithm uses multidimensional indexing structures, such as R-trees or k-d trees, to search for neighbors of each object o within radius dmin around that object.
Let M be the maximum number of objects within the dmin-neighborhood of an outlier. Therefore,
once M+1 neighbors of object o are found, it is clear that o is not an outlier. This algorithm has a worst-case complexity of O(k·n²), where n is the number of objects in the data set and k is the dimensionality. The index-based algorithm scales well as k increases. However, this complexity
evaluation takes only the search time into account, even though the task of building an index in
itself can be computationally intensive.
Nested-loop algorithm:
The nested-loop algorithm has the same computational complexity as the index-based algorithm
but avoids index structure construction and tries to minimize the number of I/Os. It divides the
memory buffer space into two halves and the data set into several logical blocks. By carefully
choosing the order in which blocks are loaded into each half, I/O efficiency can be achieved.
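A simplified in-memory sketch of the DB(pct, dmin) definition (it ignores the block-wise I/O optimization of the nested-loop algorithm and the indexing structures; the data and parameter values are made up):

db_outliers <- function(data, pct, dmin) {
  d <- as.matrix(dist(data))        # pairwise Euclidean distances
  frac_far <- rowMeans(d > dmin)    # fraction of objects lying farther than dmin from each object
  which(frac_far >= pct)            # indices of DB(pct, dmin)-outliers
}
set.seed(2)
pts <- rbind(matrix(rnorm(100, sd = 0.5), ncol = 2),   # a dense cluster of 50 points
             c(5, 5))                                  # one isolated point
db_outliers(pts, pct = 0.95, dmin = 2)                 # flags the isolated point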

Decision tree-based algorithms


A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal
node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node
holds a class label. The topmost node in the tree is the root node.
The following decision tree is for the concept buy computer that indicates whether a customer at
a company is likely to buy a computer or not. Each internal node represents a test on an attribute.
Each leaf node represents a class shown in figure 2.

Figure 2: Decision Tree

The benefits of having a decision tree are as follows −


i. It does not require any domain knowledge.
ii. It is easy to comprehend.
iii. The learning and classification steps of a decision tree are simple and fast.

The Tree Induction Algorithm: (Decision Tree Induction Algorithm)


A machine learning researcher named J. Ross Quinlan developed a decision tree algorithm in 1980 known as ID3 (Iterative Dichotomiser). Later, he presented C4.5, which was the successor of ID3. ID3
and C4.5 adopt a greedy approach. In this algorithm, there is no backtracking; the trees are
constructed in a top-down recursive divide-and-conquer manner.
The algorithm needs three parameters: D (Data Partition), Attribute List, and Attribute Selection
Method. Initially, D is the entire set of training tuples and associated class labels. Attribute list is
a list of attributes describing the tuples. Attribute selection method specifies a heuristic procedure
for selecting the attribute that “best” discriminates the given tuples according to class.
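A short illustration (assuming the rpart package, which implements CART-style recursive partitioning rather than ID3/C4.5 exactly) of growing a top-down decision tree classifier on built-in data:

library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")   # greedy, top-down, recursive partitioning
print(fit)                                                  # shows the attribute test at each internal node
predict(fit, iris[c(1, 60, 120), ], type = "class")         # classify a few tuples by walking the tree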

Neural network-based algorithms


➢ A neural network is a set of connected input/output units in which each connection has a
weight associated with it.
➢ During the learning phase, the network learns by adjusting the weights so as to be able to
predict the correct class label of the input tuples.
➢ Neural network learning is also referred to as connectionist learning due to the connections
between units.
➢ Neural networks involve long training time.
➢ Back propagation learns by iteratively processing a data set of training tuples, comparing
the network’s prediction for each tuple with the actual known target value.
➢ The target value may be the known class label of the training tuple (for classification
problems) or a continuous value (for prediction).
➢ For each training tuple, the weights are modified so as to minimize the mean squared error between the network’s prediction and the actual target value. These modifications are made in the “backwards” direction, that is, from the output layer, through each hidden layer, down to the first hidden layer; hence the name back propagation.

➢ Although it is not guaranteed, in general the weights will eventually converge, and the
learning process stops.

Figure 3: Neural Network


Advantages:
➢ It includes their high tolerance of noisy data as well as their ability to classify patterns on
which they have not been trained.
➢ They can be used when user may have little knowledge of the relationships between
attributes and classes.
➢ They are well-suited for continuous-valued inputs and outputs, unlike most decision tree
algorithms.
➢ They have been successful on a wide array of real-world data, including handwritten
character recognition, pathology and laboratory medicine, and training a computer to
pronounce English text.
➢ Neural network algorithms are inherently parallel; parallelization techniques can be used
to speed up the computation process.
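A minimal sketch (assuming the nnet package is available; nnet trains a single-hidden-layer network with a quasi-Newton optimizer rather than plain back propagation, but it illustrates the same idea of learning connection weights from labeled tuples):

library(nnet)
set.seed(3)
net <- nnet(Species ~ ., data = iris, size = 4,      # 4 hidden units
            decay = 5e-4, maxit = 200, trace = FALSE)
predict(net, iris[c(1, 60, 120), ], type = "class")  # predicted class labels for a few tuples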

Rule-based algorithms:
IF-THEN Rules
Rule-based classifier makes use of a set of IF-THEN rules for classification.
Rule Format:
IF condition THEN conclusion
Let us consider a rule R1,
R1: IF age = youth AND student = yes
THEN buy_computer = yes
Points to remember −
• The IF part of the rule is called rule antecedent or precondition.
• The THEN part of the rule is called rule consequent.

• The antecedent part (the condition) consists of one or more attribute tests, and these tests are logically ANDed.
• The consequent part consists of class prediction.
Note − We can also write rule R1 as follows −
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.
Rule Extraction:
We can build a rule-based classifier by extracting IF-THEN rules from a decision tree.
Points to remember −
To extract a rule from a decision tree −
• One rule is created for each path from the root to the leaf node.
• To form a rule antecedent, each splitting criterion is logically ANDed.
• The leaf node holds the class prediction, forming the rule consequent.
Rule Induction Using Sequential Covering Algorithm:
The Sequential Covering Algorithm can be used to extract IF-THEN rules from the training data. We do not require generating a decision tree first. In this algorithm, each rule for a given class covers many of the tuples of that class.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by the rule are removed, and the process continues for the remaining tuples. (In contrast, when rules are extracted from a decision tree, the path to each leaf corresponds to a rule.)
The following is the sequential learning algorithm, where rules are learned for one class at a time. When learning a rule for class Ci, we would like the rule to cover all the tuples of class Ci and no tuples from any other class.
Algorithm: Sequential Covering
Input:
    D, a data set of class-labeled tuples;
    Att_vals, the set of all attributes and their possible values.
Output: A set of IF-THEN rules.
Method:
    Rule_set = { };    // initial set of rules learned is empty
    for each class c do
        repeat
            Rule = Learn_One_Rule(D, Att_vals, c);
            Remove tuples covered by Rule from D;
        until terminating condition;
        Rule_set = Rule_set + Rule;    // add the new rule to the rule set
    end for
    return Rule_set;
Rule Pruning
A rule is pruned due to the following reasons −
• The Assessment of quality is made on the original set of training data.
• The rule may perform well on training data but less well on subsequent data.
FOIL is one of the simple and effective methods for rule pruning. For a given rule R,
FOIL_Prune(R) = (pos − neg) / (pos + neg)
where pos and neg are the number of positive and negative tuples covered by R, respectively.
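A tiny helper (illustrative only; the counts are made up) for the FOIL_Prune measure defined above:

foil_prune <- function(pos, neg) (pos - neg) / (pos + neg)
foil_prune(pos = 40, neg = 5)    # a rule covering 40 positive and 5 negative tuples scores about 0.78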

Probabilistic Classifiers:
A Bayes classifier is a probabilistic model that is used for supervised learning. A Bayes classifier
is based on the idea that the role of a class is to predict the values of features for members of that
class. Examples are grouped in classes because they have common values for some of the features.
Such classes are often called natural kinds. The learning agent learns how the features depend on
the class and uses that model to predict the classification of a new example.
The simplest case is the naive Bayes classifier, which makes the independence assumption
that the input features are conditionally independent of each other given the classification. The
independence of the naive Bayes classifier is embodied in a belief network where the features are
the nodes, the target feature (the classification) has no parents, and the target feature is the only
parent of each input feature. This belief network requires the probability distributions P(X)
P(Y) for the target feature, or class, YY and P(Xi∣Y) P(Xi∣Y) for each input feature XiXi. For

example, the prediction is computed by conditioning on observed values for the input features and
querying the classification. Multiple target variables can be modeled and learned separately.

Figure 4: Belief network corresponding to a naive Bayes classifier

Learning a Bayes Classifier


To learn a classifier, the distributions P(Y) and P(Xi | Y) for each input feature can be learned from the data. Each conditional probability distribution P(Xi | Y) may be treated as a separate learning problem for each value of Y.
The simplest case is to use the maximum likelihood estimate (the empirical proportion in the training data as the probability), where P(Xi = xi | Y = y) is the number of cases where Xi = xi ∧ Y = y divided by the number of cases where Y = y.
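A minimal sketch (assuming the e1071 package is installed) of learning a naive Bayes classifier, i.e. estimating P(Y) and P(Xi | Y) from labeled data and querying the most probable class:

library(e1071)
nb <- naiveBayes(Species ~ ., data = iris)   # estimates P(Y) and P(Xi | Y) for each feature Xi
nb$apriori                                   # class distribution used to estimate the prior P(Y)
predict(nb, iris[c(1, 60, 120), ])           # most probable class for a few tuples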

-----------------------------***-----------------------------

CS-703(B) Open Elective, Data Mining and Warehousing


---------------------------------------------------------------------------------------------------------------------
UNIT-V: Clustering & Association Rule mining: Hierarchical algorithms, Partitional algorithms,
Clustering large databases – BIRCH, DBSCAN, CURE algorithms. Association rules: Parallel and
distributed algorithms such as Apriori and FP growth algorithms.
----------------------------------------------------------------------------------------------------------------------
Introduction:
Cluster Analysis in Data Mining
Cluster Analysis in Data Mining means that to find out the group of objects which are similar to
each other in the group but are different from the object in other groups.
In clustering, a group of different data objects is classified as similar objects. One group means a
cluster of data. Data sets are divided into different groups in the cluster analysis, which is based on
the similarity of the data. After the classification of data into various groups, a label is assigned to each group. This labeling helps in adapting to changes in the data.
➢ Clustering is the method of converting a group of abstract objects into classes of similar objects.
➢ Clustering is a method of partitioning a set of data or objects into a set of significant subclasses called clusters.
➢ It helps users to understand the structure or natural grouping in a data set and is used either as a stand-alone instrument to get a better insight into data distribution or as a pre-processing step for other algorithms.

Association Rule Mining:


Association rules are ‘if/then’ statements that help uncover relationships between seemingly
unrelated data in a relational database or other information repository. An example of an association
rule would be "If a customer buys a dozen eggs, he is 80% likely to also purchase milk." Association
rule mining is the data mining process of finding the rules that may govern associations and causal structures between sets of items.
Association rules analysis is a technique to uncover how items are associated to each other. There
are three common ways to measure association.
Measure 1: Support — the fraction of transactions that contain all the items in the rule: support(X ⇒ Y) = P(X ∪ Y).
Measure 2: Confidence — how often the rule has been found to be true: confidence(X ⇒ Y) = support(X ∪ Y) / support(X).
Measure 3: Lift — the ratio of the observed support to that expected if X and Y were independent: lift(X ⇒ Y) = confidence(X ⇒ Y) / support(Y).
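A worked example (toy transactions, values assumed for illustration) computing the three measures for the rule {bread} ⇒ {milk}:

trans <- list(c("milk", "bread"), c("bread", "butter"),
              c("milk", "bread", "butter"), c("milk", "eggs"))
supp_bread      <- mean(sapply(trans, function(t) "bread" %in% t))                 # 3/4
supp_milk       <- mean(sapply(trans, function(t) "milk" %in% t))                  # 3/4
supp_bread_milk <- mean(sapply(trans, function(t) all(c("bread", "milk") %in% t))) # 2/4
support    <- supp_bread_milk                 # 0.50
confidence <- supp_bread_milk / supp_bread    # about 0.67
lift       <- confidence / supp_milk          # about 0.89, i.e. slightly below independence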

The main applications of association rule mining include:
➢ Basket data analysis
➢ Cross marketing
➢ Catalog design
➢ Web Mining
➢ Medical Analysis
➢ Bioinformatics
➢ Network Analysis
➢ Programming Pattern Finding
➢ Clustering, Classification, etc.

Hierarchical algorithms:
Hierarchical clustering is an alternative approach to partitioning clustering for identifying groups in
the dataset. It does not require pre-specifying the number of clusters to be generated.
The result of hierarchical clustering is a tree-based representation of the objects, which is also
known as dendrogram. Observations can be subdivided into groups by cutting the dendrogram at a
desired similarity level.
R code to compute and visualize hierarchical clustering:
# Compute hierarchical clustering (factoextra provides fviz_dend; magrittr provides %>%)
library(magrittr)
library(factoextra)
res.hc <- USArrests %>%
  scale() %>%                      # scale the data
  dist(method = "euclidean") %>%   # compute the dissimilarity matrix
  hclust(method = "ward.D2")       # compute hierarchical clustering
# Visualize: cut the dendrogram into 4 groups and color by group
fviz_dend(res.hc, k = 4, cex = 0.5,   # cut in four groups, label size 0.5
          k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE,
          rect = TRUE)               # add a rectangle around each group

Figure 1: Hierarchical method

The method will create a hierarchical decomposition of a given set of data objects. Based on how
the hierarchical decomposition is formed, we can classify hierarchical methods. This method is
given as follows:

• Agglomerative Approach
• Divisive Approach

Agglomerative Approach is also known as the Bottom-up Approach. Here we begin with each object forming a separate group. It continues to merge the objects or groups that are close to one another.

Divisive Approach is also known as the Top-Down Approach. We begin with all the objects in the
same cluster. This method is rigid, i.e., it can never be undone once a fusion or division is completed.

Approaches to Improve Quality of Hierarchical Clustering

Here are the two approaches that are used to improve the quality of hierarchical clustering −
• Perform careful analysis of object linkages at each hierarchical partitioning.
• Integrate hierarchical agglomeration by first using a hierarchical agglomerative algorithm
to group objects into micro-clusters, and then performing macro-clustering on the micro-
clusters.
Partitional Algorithms:
Suppose we are given a database of ‘n’ objects, and the partitioning method constructs ‘k’ partitions of the data. Each partition will represent a cluster and k ≤ n. It means that it will classify the data into k
groups, which satisfy the following requirements −
• Each group contains at least one object.
• Each object must belong to exactly one group.

Points to remember −
• For a given number of partitions (say k), the partitioning method will create an initial
partitioning.
• Then it uses the iterative relocation technique to improve the partitioning by moving objects from one group to another.
Partitioning algorithms are clustering techniques that subdivide the data sets into a set of k groups,
where k is the number of groups pre-specified by the analyst.
There are different types of partitioning clustering methods. The most popular is the K-means
clustering (MacQueen 1967), in which, each cluster is represented by the center or means of the
data points belonging to the cluster. The K-means method is sensitive to outliers.
An alternative to k-means clustering is the K-medoids clustering or PAM (Partitioning Around
Medoids, Kaufman & Rousseeuw, 1990), which is less sensitive to outliers compared to k-means.
The following R codes show how to determine the optimal number of clusters and how to compute
k-means and PAM clustering in R.
# my_data: a numeric data frame, e.g. my_data <- scale(USArrests)
library(factoextra)
fviz_nbclust(my_data, kmeans, method = "gap_stat")   # choose the number of clusters
set.seed(123)
km.res <- kmeans(my_data, 3, nstart = 25)            # k-means with k = 3
fviz_cluster(km.res, data = my_data, ellipse.type = "convex",
             palette = "jco", ggtheme = theme_minimal())
# Compute and visualize PAM (k-medoids)
library(cluster)
pam.res <- pam(my_data, 3)
fviz_cluster(pam.res)

Figure 2: Partitional Clustering


Clustering Large Databases:
BIRCH Algorithms
BIRCH stands for Balanced Iterative Reducing and Clustering Using Hierarchies, which uses
hierarchical methods to cluster and reduce data.

BIRCH only needs to scan the data set in a single pass to perform clustering. The BIRCH algorithm
is more suitable for the case where the amount of data is large and the number of categories K is
relatively large. It runs very fast, and it only needs a single pass to scan the data set for clustering.
The BIRCH algorithm uses a tree structure to create a cluster. It is generally called the Clustering
Feature Tree (CF Tree). Each node of this tree is composed of several Clustering Features (CF).
The Clustering Feature tree structure is similar to a balanced B+ tree. From Figure 3 below, we can see what the clustering feature tree looks like.
Each node including leaf nodes has several CFs, and the CFs of internal nodes have pointers to child
nodes, and all leaf nodes are linked by a doubly linked list.
A CF Tree structure is given as below:

• Each non-leaf node has at most B entries.


• Each leaf node has at most L CF entries which satisfy threshold T, a maximum diameter of
radius.
• P (page size in bytes) is the maximum size of a node.
• Compact: each leaf node is a sub-cluster, not a data point.

Figure 3: CF Tree
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
➢ It is a scalable clustering method.
➢ Designed for very large data sets.
➢ Only one scan of the data is necessary.
➢ It is based on the notion of a CF (Clustering Feature) tree.
➢ A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering.
➢ A cluster of data points is represented by a triple of numbers (N, LS, SS), where N = number of items in the sub-cluster, LS = linear sum of the points, and SS = sum of the squares of the points.
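A small sketch (illustrative only) of the additivity of clustering features: a CF triple (N, LS, SS) is merged component-wise, and the centroid and radius of a sub-cluster can be derived from it without revisiting the points:

make_cf  <- function(points) list(N = nrow(points),        # points: one row per data point
                                  LS = colSums(points),    # linear sum of the points
                                  SS = sum(points^2))      # sum of squared values
merge_cf <- function(a, b) list(N = a$N + b$N, LS = a$LS + b$LS, SS = a$SS + b$SS)
centroid <- function(cf) cf$LS / cf$N
radius   <- function(cf) sqrt(cf$SS / cf$N - sum((cf$LS / cf$N)^2))
cf1 <- make_cf(matrix(c(1, 2, 2, 3), ncol = 2, byrow = TRUE))
cf2 <- make_cf(matrix(c(3, 4), ncol = 2))
centroid(merge_cf(cf1, cf2))   # centroid of the merged sub-cluster: (2, 3)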

DBSCAN Algorithm:
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a very famous density-
based clustering algorithm. Intuitively, the DBSCAN algorithm can find all the dense regions of the
sample points and treat these dense regions as clusters one by one.
The DBSCAN algorithm has the following characteristics:
➢ Density-based, robust to noise points which are away from the density core
➢ No need to know the number of clusters in advance
➢ Can create clusters of arbitrary shape
DBSCAN is usually suitable for cluster analysis of lower-dimensional data.

Figure 4: DBSCAN

Basic concept
The basic concepts of DBSCAN can be summarized by 1, 2, and 3.
1. Core idea: Based on density
2. Algorithm parameters: Neighborhood radius (eps/ɛ) and the minimum number of points (min_points).
These two algorithm parameters can describe what is dense in the cluster.
When the number of points within the neighborhood radius (eps) is greater than the minimum number
of points (min_points), it is dense.

Figure 5: DBSCAN parameter

3. Types of points: Core points, boundary points, and noise points


• The point where the number of sample points in the neighborhood radius (eps) is greater than or
equal to min points is called the core point.
• Points that do not belong to the core point but are within the neighborhood of a certain core point
are called boundary points.
• Noise points are neither core points nor boundary points.

Figure 6: DBSCAN points

DBSCAN Algorithm Steps:

The algorithm step of DBSCAN is divided into two steps:


1. Find the core point to form a temporary cluster:
Scan all sample points. If the number of points within a radius (eps) of a certain sample point is >= min_points, it will be included in the list of core points, and the points that are directly density-reachable from it will form a corresponding temporary cluster.
2. Combine temporary clusters to obtain clusters:
For each temporary cluster, check whether the point in it is a core point. If so, merge the temporary
cluster corresponding to the point with the current temporary cluster to obtain a new temporary
cluster.
This operation is repeated until each point in the current temporary cluster is either not in the core point list, or its directly density-reachable points are already in the temporary cluster; the temporary cluster is then upgraded to a cluster.
Continue to perform the same merge operation on the remaining temporary clusters until all the
temporary clusters are processed.
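A minimal sketch (assuming the dbscan package is installed; the data are simulated) where eps and min_points correspond to the neighborhood radius and minimum number of points described above:

library(dbscan)
set.seed(4)
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),   # dense region 1
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),   # dense region 2
           matrix(runif(20, min = -2, max = 5), ncol = 2))     # scattered noise
res <- dbscan(x, eps = 0.5, minPts = 5)   # neighborhood radius and minimum number of points
table(res$cluster)                        # cluster 0 collects the noise points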

CURE Algorithms
CURE (Clustering Using REpresentatives) is an efficient data clustering algorithm for large
databases. Compared with K-means clustering it is more robust to outliers and able to identify
clusters having non-spherical shapes and size variances.
• It is a hierarchical based clustering technique that adopts a middle ground between the centroid
based and the all-point extremes. Hierarchical clustering is a type of clustering that starts with a
single point cluster and moves to merge with another cluster, until the desired number of clusters
are formed.
• It is used for identifying the spherical and non-spherical clusters.
• It is useful for discovering groups and identifying interesting distributions in the underlying data.
• Instead of using one point centroid, as in most of data mining algorithms, CURE uses a set of
well-defined representative points, for efficiently handling the clusters and eliminating the
outliers.

Six steps in CURE algorithm:

Figure 7: Six steps in CURE algorithm


CURE Algorithm
• Input: k, the number of clusters
• Draw a random sample of the points
• Each point is its own cluster initially
• Each cluster stores a set of representative points and a mean
• Build a kd-tree of all clusters
• Use the tree to build a heap that stores u.closest for each cluster u
• While size(heap) > k:
  • Merge together the two closest clusters in the heap
  • Update the representative points in each cluster
  • Update the tree and the heap
Merging step:
• New mean is the mean of the means of the clusters being merged

• Select c well-scattered points based on their distance from the new mean
• Shrink each representative point closer to the mean
The two stages of CURE implementation are as follows:
1) Initialization in CURE
• Take a small sample of the data and cluster it in main memory using a hierarchical clustering algorithm.
• Select a small set of points from each cluster to be representative points. These points should be
chosen to be as far from one another as possible.
• Move each of the representative points a fixed fraction of the distance between its location and the
centroid of its cluster. The fraction could be about 20% or 30% of the original distance.
2) Completion of CURE Algorithm
• Once initialization is complete, we have to cluster the remaining points and output the final clusters.
• The final step of CURE is to merge two clusters if they have a pair of representative points,
one from each cluster, that are sufficiently close. The user may pick the distance threshold.
The merging step can be repeated until there are no more sufficiently close clusters.
• Each point P is brought from secondary storage and compared with the representative points.
We assign p to the cluster of the representative point that is closes to P.
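A minimal sketch of this final assignment pass is shown below, assuming each cluster's (already shrunk) representative points are stored in a dictionary keyed by cluster id; the names reps_per_cluster and assign_point are illustrative assumptions.

import numpy as np

def assign_point(p, reps_per_cluster):
    """Assign point p to the cluster whose closest representative point
    is nearest to p."""
    best_cluster, best_dist = None, float("inf")
    for cluster_id, reps in reps_per_cluster.items():
        d = np.min(np.linalg.norm(reps - p, axis=1))
        if d < best_dist:
            best_cluster, best_dist = cluster_id, d
    return best_cluster

reps_per_cluster = {
    "A": np.array([[0.2, 0.2], [0.8, 0.1]]),
    "B": np.array([[5.1, 5.0], [5.8, 4.9]]),
}
print(assign_point(np.array([0.5, 0.4]), reps_per_cluster))   # prints A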
The figure below illustrates the CURE algorithm:

Figure 8: Representation of partitioning and clustering

Association rules:
Association rule mining finds interesting associations and relationships among large sets of data
items. Such a rule shows how frequently an itemset occurs in a transaction. A typical example is Market
Basket Analysis.


Market Basket Analysis is one of the key techniques used by large retailers to uncover associations
between items. It allows retailers to identify relationships between the items that people frequently
buy together.
Association rules are "if-then" statements that help to show the probability of relationships
between data items within large data sets in various types of databases.
Association rules are usually required to satisfy a user-specified minimum support and a user-
specified minimum confidence at the same time. Association rule generation is usually split into
two separate steps: first, a minimum support threshold is applied to find all frequent itemsets in the
database; second, a minimum confidence threshold is applied to these frequent itemsets to form rules.
Association rule mining thus finds all sets of items (itemsets) that have support greater than the minimum
support and then uses those large itemsets to generate the desired rules that have confidence greater
than the minimum confidence. The lift of a rule is the ratio of its observed support to the support expected
if X and Y (the antecedent and consequent) were independent.
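As a small illustration of how support, confidence and lift relate, the following Python sketch computes all three for a rule {bread} -> {milk} over a toy transaction list (the transactions and the rule are made up for illustration):

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "butter", "milk"}, {"butter", "milk"}]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"bread"}, {"milk"}
sup_xy = support(X | Y)
confidence = sup_xy / support(X)           # P(Y | X)
lift = sup_xy / (support(X) * support(Y))  # observed / expected if independent
print(f"support={sup_xy:.2f} confidence={confidence:.2f} lift={lift:.2f}")

A lift close to 1 indicates that the two itemsets appear together about as often as expected under independence; values well above 1 indicate a positive association.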

Example

Figure 9: Association Rule Example

Parallel and Distributed Algorithms:


Parallel and distributed computing is expected to relieve current mining methods from the
sequential bottleneck, providing the ability to scale to massive datasets and improving response
time. Achieving good performance on today's multiprocessor systems is a non-trivial task. The
main challenges include minimizing synchronization and communication, balancing the workload,
finding a good data layout and data decomposition, and minimizing disk I/O, which is especially
important for data mining.
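One common data-parallel formulation is count distribution: each worker counts candidate itemsets on its own partition of the transaction database, and the partial counts are then summed to obtain global support counts. The following minimal sketch uses Python's multiprocessing module; the partition contents and the helper name local_counts are illustrative assumptions.

from collections import Counter
from itertools import combinations
from multiprocessing import Pool

# Hypothetical partitions of a transaction database, one per worker.
PARTITIONS = [
    [{"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"}],
    [{"bread", "milk", "butter"}, {"bread", "milk"}, {"butter"}],
]

def local_counts(transactions, size=2):
    """Count candidate itemsets of the given size within one partition."""
    counts = Counter()
    for t in transactions:
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1
    return counts

if __name__ == "__main__":
    with Pool(processes=len(PARTITIONS)) as pool:
        partials = pool.map(local_counts, PARTITIONS)   # map step
    global_counts = sum(partials, Counter())            # reduce step
    print(global_counts.most_common(3))

Only the small partial counters are communicated between processes, which keeps the synchronization and communication overhead mentioned above low.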


Apriori Algorithms:
The Apriori algorithm is a classical algorithm in data mining. It is used for mining frequent itemsets and
relevant association rules. It is devised to operate on a database containing a large number of transactions, for
instance, items bought by customers in a store.
1. Candidate itemsets are generated using only the large itemsets of the previous pass, without
considering the transactions in the database.
2. The large itemsets of the previous pass are joined with themselves to generate all itemsets whose
size is larger by one.
3. Each generated itemset that has a subset which is not large is deleted; the remaining itemsets
are the candidates (a sketch of this join-and-prune step follows below).
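A minimal Python sketch of this join-and-prune (candidate generation) step is given below; the function name apriori_gen and the sample 2-itemsets are illustrative assumptions.

from itertools import combinations

def apriori_gen(prev_large, k):
    """Join (k-1)-itemsets that share their first k-2 items, then prune
    candidates that contain an infrequent (k-1)-subset."""
    prev = sorted(prev_large)               # each itemset is a sorted tuple
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:k - 2] == b[:k - 2]:      # join condition
                cand = tuple(sorted(set(a) | set(b)))
                # prune: every (k-1)-subset must itself be large
                if all(sub in prev_large for sub in combinations(cand, k - 1)):
                    candidates.add(cand)
    return candidates

# e.g. large 2-itemsets -> candidate 3-itemsets
L2 = {("bread", "butter"), ("bread", "milk"), ("butter", "milk")}
print(apriori_gen(L2, 3))   # {('bread', 'butter', 'milk')}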

Figure 10: Apriori Example


The Apriori algorithm takes advantage of the fact that any subset of a frequent itemset must also be a
frequent itemset (the downward-closure, or Apriori, property). The algorithm can therefore reduce the
number of candidates being considered by only exploring itemsets whose support count meets the
minimum support count. Any candidate itemset that has an infrequent subset can be pruned.
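Putting the pieces together, the sketch below shows a level-wise Apriori loop that counts candidate supports against the transactions and keeps only the itemsets meeting the minimum support. It reuses the illustrative apriori_gen helper from the earlier sketch, and the toy transactions are made up for illustration.

from collections import Counter

def apriori(transactions, min_support):
    """Minimal level-wise Apriori loop with downward-closure pruning."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    large = {(item,) for item, c in item_counts.items() if c / n >= min_support}
    all_large, k = set(large), 2
    while large:
        candidates = apriori_gen(large, k)      # join + prune (sketched earlier)
        counts = Counter()
        for t in transactions:                  # one database scan per level
            tset = set(t)
            for cand in candidates:
                if set(cand) <= tset:
                    counts[cand] += 1
        large = {c for c in candidates if counts[c] / n >= min_support}
        all_large |= large
        k += 1
    return all_large

transactions = [{"bread", "milk"}, {"bread", "butter", "milk"},
                {"butter", "milk"}, {"bread", "butter", "milk"}]
print(sorted(apriori(transactions, min_support=0.5)))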

FP Growth Algorithms:
This algorithm is an improvement over the Apriori method. Frequent patterns are generated without
the need for candidate generation. The FP-Growth algorithm represents the database in the form of a tree
called a frequent pattern tree, or FP-tree.
This tree structure maintains the association between the itemsets. The database is fragmented
using one frequent item; each fragmented part is called a "pattern fragment". The itemsets of these
fragmented patterns are then analyzed, so the search for frequent itemsets is reduced
considerably.
The FP-Growth algorithm, proposed by Han, is an efficient and scalable method for mining the
complete set of frequent patterns by pattern-fragment growth, using an extended prefix-
tree structure, called the frequent-pattern tree (FP-tree), for storing compressed and crucial
information about frequent patterns.
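A minimal sketch of the first phase of FP-Growth is shown below: the FP-tree is built with two passes over the database (count the items, then insert each transaction's frequent items in descending frequency order so that common prefixes are shared). The class and function names are illustrative, not taken from Han's original implementation.

from collections import Counter, defaultdict

class FPNode:
    """Node of a frequent-pattern tree: an item, a count and child links."""
    def __init__(self, item=None, parent=None):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support_count):
    """Pass 1: count items.  Pass 2: insert each transaction's frequent
    items, sorted by descending frequency, into the prefix tree."""
    item_counts = Counter(item for t in transactions for item in t)
    frequent = {i: c for i, c in item_counts.items() if c >= min_support_count}

    root = FPNode()
    header_table = defaultdict(list)        # item -> nodes, used when mining
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:                  # shared prefixes compress the data
            if item not in node.children:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header_table[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, header_table

transactions = [{"bread", "milk"}, {"bread", "butter", "milk"},
                {"butter", "milk"}, {"bread", "butter"}]
root, header = build_fp_tree(transactions, min_support_count=2)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})

The mining phase then grows frequent patterns by following the header-table links and building conditional FP-trees, which is what avoids explicit candidate generation.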


Figure 11: FP Growth (Example-1)


Figure 12: FP Growth (Example-2)

Advantages of FP Growth Algorithm


1. This algorithm needs to scan the database only twice, whereas Apriori scans the
transactions once for every iteration.
2. No candidate generation (pairing of items) is performed in this algorithm, which makes it faster.
3. The database is stored in memory in a compact, compressed form.
4. It is efficient and scalable for mining both long and short frequent patterns.
Disadvantages of FP-Growth Algorithm
1. The FP-tree is more cumbersome and difficult to build than the Apriori candidate sets.
2. It may be expensive to construct.
3. When the database is large, the FP-tree may not fit in main memory.
