
DATA MINING AND DATA WAREHOUSING (18CS641)

Module 2

Syllabus
Data warehouse implementation & Data mining: Efficient Data Cube Computation: An Overview,
Indexing OLAP Data: Bitmap Index and Join Index, Efficient Processing of OLAP Queries, OLAP Server
Architectures: ROLAP versus MOLAP versus HOLAP. Introduction: What is Data Mining, Challenges,
Data Mining Tasks, Data: Types of Data, Data Quality, Data Preprocessing, Measures of Similarity and
Dissimilarity.

Efficient Data Cube computation

The following are general optimization techniques for the efficient computation of data cubes:

Sorting, hashing, and grouping − Sorting, hashing, and grouping operations should be applied to the
dimension attributes to reorder and cluster related tuples. In cube computation, aggregation is
performed on the tuples that share the same set of dimension values. Therefore, it is important to
exploit sorting, hashing, and grouping facilities to access and group such data, which facilitates the
evaluation of such aggregates.
For example, to compute total sales by branch, day, and item, it is more efficient to sort tuples or cells by
branch, then by day, and then group them by item name. Efficient implementations of such
operations on huge data sets have been extensively studied in the database research community.
These implementations can be extended to data cube computation. This technique can also be further
extended to perform shared-sorts (i.e., sharing sorting costs across multiple cuboids when sort-based
techniques are used), or to perform shared-partitions (i.e., sharing the partitioning cost across multiple
cuboids when hash-based algorithms are used).
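
A rough illustration of shared-sorts in Python (hypothetical fact tuples; real cube engines work on disk-resident partitions): a single sort on (branch, day) serves every group-by whose key is a prefix of the sort key:

from itertools import groupby
from operator import itemgetter

# Hypothetical fact tuples: (branch, day, item, sales).
facts = [
    ("B1", "Mon", "TV", 100), ("B1", "Mon", "PC", 50),
    ("B1", "Tue", "TV", 70),  ("B2", "Mon", "PC", 90),
]

# One shared sort on (branch, day) serves both group-bys below.
facts.sort(key=itemgetter(0, 1))

# (branch, day) cuboid: consecutive runs after the sort form the groups.
by_branch_day = {k: sum(t[3] for t in g)
                 for k, g in groupby(facts, key=itemgetter(0, 1))}

# (branch) cuboid reuses the same order, since branch is a prefix
# of the sort key; no second sort is needed.
by_branch = {k: sum(t[3] for t in g)
             for k, g in groupby(facts, key=itemgetter(0))}

print(by_branch_day)  # {('B1', 'Mon'): 150, ('B1', 'Tue'): 70, ('B2', 'Mon'): 90}
print(by_branch)      # {'B1': 220, 'B2': 90}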

Simultaneous aggregation and caching of intermediate results − In cube computation, it is efficient to
compute higher-level aggregates from previously computed lower-level aggregates, rather than from the
base fact table. Moreover, simultaneous aggregation from cached intermediate computation results can
reduce expensive disk input/output (I/O) operations.
For example, to compute sales by branch, we can use the intermediate results derived from the
computation of a lower-level cuboid such as sales by branch and day. This technique can be further
extended to perform amortized scans (i.e., computing as many cuboids as possible at the same time to
amortize disk reads).
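
A minimal sketch of this idea in Python (hypothetical figures): the (branch) cuboid is rolled up from a cached (branch, day) cuboid instead of rescanning the base fact table:

from collections import defaultdict

# Hypothetical, previously computed lower-level cuboid: sales by (branch, day).
sales_by_branch_day = {
    ("B1", "Mon"): 100, ("B1", "Tue"): 150,
    ("B2", "Mon"): 200, ("B2", "Tue"): 50,
}

# Higher-level (branch) cuboid, aggregated from the cached intermediate
# results rather than from the base fact table.
sales_by_branch = defaultdict(int)
for (branch, _day), amount in sales_by_branch_day.items():
    sales_by_branch[branch] += amount

print(dict(sales_by_branch))  # {'B1': 250, 'B2': 250}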

Aggregation from the smallest child when there exist multiple child cuboids − When there exist
several child cuboids, it is generally more efficient to compute the desired parent (i.e., more generalized)
cuboid from the smallest previously computed child cuboid.
The Apriori pruning method can be explored to compute iceberg cubes efficiently − The Apriori
property, in the context of data cubes, is stated as follows: If a given cell does not satisfy minimum
support, then no descendant of the cell (i.e., a more specialized or detailed cell) will satisfy minimum
support either. This property can be used to substantially reduce the computation of iceberg cubes.
The specification of an iceberg cube includes an iceberg condition, which is a constraint on the cells to be
materialized. A common iceberg condition is that the cells must satisfy a minimum support threshold,
such as a minimum count or sum. In this case, the Apriori property can be used to prune away the
exploration of the cell's descendants.
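
A hedged sketch of the pruning idea in Python (hypothetical tuples and threshold; practical iceberg-cube algorithms such as BUC organize the search far more carefully):

from collections import Counter

# Hypothetical base tuples: (city, item); iceberg condition: count >= 2.
rows = [("NY", "TV"), ("NY", "TV"), ("NY", "PC"), ("LA", "PC")]
MIN_SUP = 2

# Compute the coarser (city) cuboid first.
city_counts = Counter(city for city, _ in rows)

# Apriori pruning: a (city, item) cell can reach MIN_SUP only if its
# ancestor (city) cell does, so descendants of failing cells are skipped.
iceberg = {}
for city, item in set(rows):
    if city_counts[city] < MIN_SUP:      # no descendant can qualify
        continue
    count = sum(1 for r in rows if r == (city, item))
    if count >= MIN_SUP:
        iceberg[(city, item)] = count

print(iceberg)  # {('NY', 'TV'): 2}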

The compute cube Operator and the Curse of Dimensionality

One approach to cube computation extends SQL so as to include a compute cube operator. The compute
cube operator computes aggregates over all subsets of the dimensions specified in the operation. This can
require excessive storage space, especially for large numbers of dimensions. We start with an intuitive
look at what is involved in the efficient computation of data cubes.

Example 1: A data cube is a lattice of cuboids. Suppose that you want to create a data cube for
AllElectronics sales that contains the following: city, item, year, and sales in dollars. You want to be able
to analyze the data, with queries such as the following: "Compute the sum of sales, grouping by city and
item." "Compute the sum of sales, grouping by city." "Compute the sum of sales, grouping by item."
What is the total number of cuboids, or group-by's, that can be computed for this data cube? Taking the
three attributes city, item, and year as the dimensions for the data cube, and sales in dollars as the
measure, the total number of cuboids, or group-by's, that can be computed for this data cube is 2^3 = 8.
The possible group-by's are the following: {(city, item, year), (city, item), (city, year), (item, year),
(city), (item), (year), ()}, where () means that the group-by is empty (i.e., the dimensions are not
grouped). These group-by's form a lattice of cuboids for the data cube, as shown in Figure 1.

Figure 1: Data Cube
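
The 2^3 = 8 count is easy to verify by enumeration; this small Python sketch lists every group-by for the three dimensions above:

from itertools import combinations

dims = ["city", "item", "year"]

# Every subset of the dimensions is one cuboid (group-by); the empty
# subset is the apex cuboid, where nothing is grouped.
cuboids = [subset
           for r in range(len(dims) + 1)
           for subset in combinations(dims, r)]

print(len(cuboids))   # 8, i.e. 2**3
for c in cuboids:
    print(c or "()")  # () denotes the empty group-by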



What is a Data Cube?
Data that is grouped or combined into multidimensional matrices is called a data cube. The data cube
method has a few alternative names or variants, such as "multidimensional databases,"
"materialized views," and "OLAP (On-Line Analytical Processing)."
The general idea of this approach is to materialize certain expensive computations that are frequently
queried.

Indexing OLAP Data: Bitmap Index and Join Index


To facilitate efficient data accessing, most data warehouse systems support index structures and
materialized views (using cuboids), along with general methods for selecting which cuboids to
materialize. In this subsection, we examine how to index OLAP data by bitmap indexing and join
indexing.

Bitmap Indexes

Bitmap indexes are widely used in data warehousing environments. The environments typically have
large amounts of data and ad hoc queries, but a low level of concurrent DML transactions. For such
applications, bitmap indexing provides:

 Reduced response time for large classes of ad hoc queries


 Reduced storage requirements compared to other indexing techniques
 Dramatic performance gains even on hardware with a relatively small number of CPUs or a small
amount of memory
 Efficient maintenance during parallel DML and loads

Fully indexing a large table with a traditional B-tree index can be prohibitively expensive in terms of
space because the indexes can be several times larger than the data in the table. Bitmap indexes are
typically only a fraction of the size of the indexed data in the table.

A bitmap index is a special type of index built on a single key; however, it is designed to answer queries
on multiple keys quickly. The records must be arranged in sequential order before bitmap indexing is
applied. This makes it simple to fetch a particular record from a block, and it also becomes easy to
allocate records within the blocks of a file.

Bitmap Index Structure


The word 'bitmap' combines 'bit' and 'map'. A bit is the smallest unit of data in a computer system, and a
map is an organized arrangement. Thus, a bitmap is simply a mapping of bits in the form of an array. In a
relation, each distinct value of an attribute has its own bitmap, and each bitmap has enough bits to
number every record in the block.



Example
Bitmap indexing. In the AllElectronics data warehouse, suppose the dimension item at the top level has
four values (representing item types): “home entertainment,” “computer,” “phone,” and “security.”
Each value (e.g., “computer”) is represented by a bit vector in the item bitmap index table. Suppose that
the cube is stored as a relation table with 100,000 rows. Because the domain of item consists of four
values, the bitmap index table requires four bit vectors (or lists), each with 100,000 bits. Figure 2
shows a base (data) table containing the dimensions item and city, and its mapping to bitmap index tables
for each of the dimensions.

Figure 2: Bitmap Table
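
A minimal Python sketch of the structure (hypothetical item values, not the figure's exact data): one bit vector per distinct value, with bitwise operations answering multi-value queries:

# Hypothetical item column of the base table, one entry per record.
items = ["computer", "phone", "computer", "security", "home entertainment"]

# One bit vector per distinct value; bit i is 1 iff record i has that value.
bitmap = {v: [1 if x == v else 0 for x in items] for v in set(items)}

print(bitmap["computer"])  # [1, 0, 1, 0, 0]

# A query such as item = 'computer' OR item = 'phone' becomes a bitwise OR.
either = [a | b for a, b in zip(bitmap["computer"], bitmap["phone"])]
print(either)              # [1, 1, 1, 0, 0]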



Advantages –
 Efficient for insertion, deletion, and update operations.
 Faster retrieval of records.

Disadvantages –
 Only suitable for large tables.
 Bitmap indexing is time consuming.

Join Index
The join indexing method gained popularity from its use in relational database query processing. Join
indexes are designed to permit queries (join queries, in the case of multitable join indexes) to be resolved
by accessing the index instead of accessing, and possibly joining, their underlying base tables. Traditional
indexing maps the value in a given column to a list of rows having that value. In contrast, join indexing
registers the joinable rows of two relations from a relational database. For example, if two relations
R(RID, A) and S(B, SID) join on the attributes A and B, then the join index record contains the pair (RID,
SID), where RID and SID are record identifiers from the R and S relations, respectively. Hence, the join
index records can identify joinable tuples without performing costly join operations. Join indexing is
especially useful for maintaining the relationship between a foreign key and its matching primary keys,
from the joinable relation.

Example: Join indexing. Recall that we defined a star schema for AllElectronics of the form "sales_star
[time, item, branch, location]: dollars_sold = sum(sales_in_dollars)." An example of a join index
relationship between the sales fact table and the location and item dimension tables is shown in Figure 3.
For example, the "Main Street" value in the location dimension table joins with tuples T57, T238, and
T884 of the sales fact table. Similarly, the "Sony-TV" value in the item dimension table joins with tuples
T57 and T459 of the sales fact table. The corresponding join index tables are shown in Figure 4.



Figure 3: Linkages between a sales fact table and location and item dimension tables

Figure 4: Join index tables based on the linkages between the sales fact table and the location and item
dimension tables shown in Figure 3
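
A small, hedged Python sketch of the idea (hypothetical record identifiers in the spirit of the figures; not Teradata syntax): the join index stores precomputed pairs of joinable row IDs, so a join-style lookup never touches the base tables:

# Hypothetical dimension and fact rows keyed by record identifiers.
location_dim = {"L1": "Main Street", "L2": "5th Avenue"}
sales_fact = {"T57": "L1", "T238": "L1", "T459": "L2", "T884": "L1"}

# Join index: (dimension RID, fact RID) pairs of joinable rows.
join_index = [(loc_rid, fact_rid)
              for fact_rid, loc_rid in sales_fact.items()
              if loc_rid in location_dim]

# All fact tuples that join with "Main Street" (RID L1), from the index alone:
print([f for l, f in join_index if l == "L1"])  # ['T57', 'T238', 'T884']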

A JOIN INDEX is a materialized view. Its definition is permanently stored, and the data is updated
whenever the base tables referred to in the join index are updated. A JOIN INDEX may cover one or more
tables and may also contain pre-aggregated data. Join indexes are mainly used for improving
performance.
There are different types of join indexes available.

 Single Table Join Index (STJI)


 Multi Table Join Index (MTJI)
 Aggregated Join Index (AJI)



Single Table Join Index

Single-table join indexes are created from exactly one base table. Their purpose is to make the same table
available with a different primary index, different partitioning, or as a smaller table (the join index table)
with fewer columns to be spooled. This improves the performance of joins because no redistribution or
duplication of rows is needed. The user queries the base table, but the parsing engine (PE) decides
whether to access the base table or the single-table join index.

Single Table Join Index: fewer columns from a single table, with a different primary index.
Example:

CREATE JOIN INDEX Single_table_Join_Index_name AS
SELECT
    manager_id,
    hire_date,
    salary
FROM tutorial_db.employee
PRIMARY INDEX (manager_id);

Multi-Table Join Index

 A multi-table join index is created by joining more than one table. A multi-table join index can be
used to store the result set of frequently joined tables to improve performance.

 A multi-table join index holds a pre-join result set from two or more tables, so during join
processing the PE may decide to access data from the multi-table join index rather than joining the
underlying base tables again. We need to remember that a multi-table join index should be defined
only after careful analysis of the frequency and cost of the join.

 Multi-table join indexes allow us to move resource-intensive joins from the online window to the
batch window.

 Shifting the workload does not reduce the total workload, but it moves it to a point in time that is
beneficial for the system's overall performance.

Multi-Table Join Index: pre-joins multiple tables from the base tables.
CREATE JOIN INDEX Multi_table_Join_Index_name AS
SELECT
    emp.emp_name,
    emp.job_title,
    emp.salary,
    emp.dept_no,
    dept.department_name,
    dept.loc_name
FROM tutorial_db.employee AS emp
INNER JOIN tutorial_db.department AS dept
    ON emp.dept_no = dept.dept_no
PRIMARY INDEX (dept_no);

Aggregate Join Index


If a table is consistently aggregated on certain columns, then an aggregate join index can be defined on
the table to improve performance. One limitation of the aggregate join index is that it supports only the
SUM and COUNT functions.
Aggregate Join Index: Aggregates one or more columns from the base tables

CREATE JOIN INDEX Aggregate_Join_Index_name AS
SELECT
    dept_no,
    SUM(salary)
FROM tutorial_db.employee
GROUP BY 1;

** Only COUNT and SUM clauses are allowed in an aggregate join index.

Efficient Processing of OLAP Queries


The purpose of materializing cuboids and constructing OLAP index structures is to speed up query
processing in data cubes. Given materialized views, query processing should proceed as follows:

 Determine which operations should be performed on the available cuboids: transform any drill,
roll, etc. into corresponding SQL and/or OLAP operations; e.g., dice = selection + projection.

 Determine to which materialized cuboid(s) the relevant operations should be applied.

Determine which operations should be performed on the available cuboids: This involves transforming
any selection, projection, roll-up (group-by), and drill-down operations specified in the query into
corresponding SQL and/or OLAP operations. For example, slicing and dicing a data cube may correspond
to selection and/or projection operations on a materialized cuboid.

Determine to which materialized cuboid(s) the relevant operations should be applied: This involves
identifying all of the materialized cuboids that may potentially be used to answer the query, pruning the
set using knowledge of "dominance" relationships among the cuboids, estimating the costs of using the
remaining materialized cuboids, and selecting the cuboid with the least cost.
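
A toy sketch of this selection step in Python (hypothetical cuboid names and sizes; row count is used as a crude cost estimate, whereas a real optimizer would also weigh indexes, ordering, and partitioning):

# Hypothetical materialized cuboids and their sizes in rows.
# A cuboid can answer a query if its group-by dimensions are a
# superset of the dimensions the query groups on ("dominance").
materialized = {
    frozenset({"item", "city", "year"}): 100_000,
    frozenset({"item", "city"}): 12_000,
    frozenset({"item"}): 400,
}

def best_cuboid(query_dims):
    candidates = [(size, dims) for dims, size in materialized.items()
                  if dims >= query_dims]           # prune by dominance
    if not candidates:
        return None                                # fall back to base table
    return min(candidates, key=lambda c: c[0])[1]  # least estimated cost

print(best_cuboid(frozenset({"item"})))  # the 400-row {'item'} cuboid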

OLAP Server Architectures: ROLAP versus MOLAP versus HOLAP

OLAP Servers

Online Analytical Processing (OLAP) refers to a set of software tools used for data analysis in order to
make business decisions. OLAP provides a platform for gaining insights from databases retrieved from
multiple database systems at the same time. It is based on a multidimensional data model, which enables
users to extract and view data from various perspectives. A multidimensional database is used to store
OLAP data. Many Business Intelligence (BI) applications rely on OLAP technology.

Type of OLAP Servers:


The three major types of OLAP servers are as follows:
 Relational OLAP (ROLAP)
 Use relational or extended-relational DBMS to store and manage warehouse data and OLAP
middle ware to support missing pieces
 Include optimization of DBMS backend, implementation of aggregation navigation logic, and
additional tools and services
 Greater scalability

 Multidimensional OLAP (MOLAP)


 Array-based multidimensional storage engine (sparse matrix techniques)
 Fast indexing to pre-computed summarized data

 Hybrid OLAP (HOLAP)


 User flexibility, e.g., low level: relational, high-level: array

 Specialized SQL servers

Relational OLAP (ROLAP):

Relational On-Line Analytical Processing (ROLAP) is primarily used for data stored in a relational
database, where both the base data and dimension tables are stored as relational tables. ROLAP servers
are used to bridge the gap between the relational back-end server and the client's front-end tools.
ROLAP servers store and manage warehouse data using RDBMS, and OLAP middleware fills in the
gaps.

Benefits:
 It is compatible with data warehouses and OLTP systems.
 The data size limitation of ROLAP technology depends on the underlying RDBMS. As a
result, ROLAP itself does not limit the amount of data that can be stored.



Limitations:
 SQL functionality is constrained.
 It's difficult to keep aggregate tables up to date.

Figure 5: Relational OLAP (ROLAP)

Multidimensional OLAP (MOLAP):

Through array-based multidimensional storage engines, Multidimensional On-Line Analytical Processing


(MOLAP) supports multidimensional views of data. Storage utilization in multidimensional data stores
may be low if the data set is sparse.

MOLAP stores data on disk in the form of a specialized multidimensional array structure. It is used for
OLAP, which is based on the random access capability of arrays. Dimension instances determine array
elements, and the data or measured value associated with each cell is typically stored in the corresponding
array element. The multidimensional array is typically stored in MOLAP in a linear allocation based on a
nested traversal of the axes in some predetermined order.
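
A hedged illustration of that linear allocation in Python (hypothetical axis sizes, row-major nested traversal; production MOLAP engines use more elaborate schemes such as chunking):

# Hypothetical 3-D cube with axes ordered (city, item, year).
shape = (4, 3, 2)  # |city| = 4, |item| = 3, |year| = 2

def offset(i, j, k):
    # Linear position of cell (i, j, k) under nested traversal of the
    # axes in the fixed order city -> item -> year (row-major layout).
    return (i * shape[1] + j) * shape[2] + k

cube = [0.0] * (shape[0] * shape[1] * shape[2])  # dense allocation
cube[offset(2, 1, 0)] = 99.5                     # store one measure value
print(offset(2, 1, 0))                           # 14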

However, unlike ROLAP, which stores only records with non-zero facts, all array elements are defined in
MOLAP, and as a result the arrays tend to be sparse, with empty elements occupying a large portion of
them. Because both storage and retrieval costs are important when evaluating online performance,
MOLAP systems typically include provisions such as advanced indexing and hashing to locate data while
performing queries over sparse arrays. MOLAP cubes are ideal for slicing and dicing data and can
perform complex calculations. All calculations are pre-generated when the cube is created.



Figure 6: Multidimensional OLAP (MOLAP)

Benefits:
 Suitable for slicing and dicing operations.
 Outperforms ROLAP when data is dense.
 Capable of performing complex calculations.

Limitations:
 It is difficult to change the dimensions without re-aggregating.
 Since all calculations are performed when the cube is built, a large amount of data cannot be
stored in the cube itself.

Hybrid OLAP (HOLAP):


Hybrid On-Line Analytical Processing (HOLAP) combines ROLAP and MOLAP, offering greater
scalability than MOLAP and faster computation than ROLAP. HOLAP servers are capable of storing
large amounts of detailed data. On the one hand, HOLAP benefits from ROLAP's greater scalability;
on the other hand, it makes use of cube technology for faster performance and summary-type
information. Because detailed data is stored in a relational database, the cubes are smaller than in
MOLAP.



Figure 7: Hybrid OLAP (HOLAP)

Benefits:
 HOLAP combines the benefits of MOLAP and ROLAP.
 Provide quick access at all aggregation levels.

Limitations:
 Because it supports both MOLAP and ROLAP servers, the HOLAP architecture is extremely
complex.
 There is a greater likelihood of overlap, particularly in their functionalities.

Specialized SQL servers: To meet the growing demand for OLAP processing in relational databases, some
database system vendors implement specialized SQL servers that provide advanced query language and
query processing support for SQL queries over star and snowflake schemas in a read-only environment.

What is data mining

Data mining is the process of extracting and discovering patterns in large data sets, involving methods at
the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary
subfield of computer science and statistics, with the overall goal of extracting information (with
intelligent methods) from a data set and transforming the information into a comprehensible structure for
further use.



Challenges, Data Mining Tasks

Security and Social Challenges:

Decision-making strategies rely on data collection and sharing, so they require considerable security.
Private information about individuals and other sensitive information is collected to build customer
profiles and to understand user behaviour patterns. Illegal access to information and the confidentiality
of information are becoming important issues.

User Interface:

The knowledge discovered using data mining tools is useful only if it is interesting and, above all,
understandable by the user. Good visualization eases the interpretation of data mining results and helps
users better understand their requirements. Much research is carried out on big data sets to display and
manipulate mined knowledge.

(i) Mining based on Level of Abstraction: The data mining process needs to be interactive, because this
allows users to concentrate on finding patterns and to present and refine data mining requests based on
the returned results.

(ii) Integration of Background Knowledge: Previous information may be used to direct the exploration
process and to express the discovered patterns.

Mining Methodology Challenges:

These challenges are related to data mining approaches and their limitations. Aspects of the mining
approaches that cause these problems include:

 Versatility of the mining approaches,


 Diversity of data available,
 Dimensionality of the domain,
 Control and handling of noise in data, etc.

Data: Types of Data

6 Types of Data in Statistics & Research: Key in Data Science

Quantitative data – Quantitative data can be expressed as a number or can be quantified. Simply put, it
can be measured by numerical variables. Quantitative data are easily amenable to statistical manipulation
and can be represented by a wide variety of statistical graphs and charts, such as line graphs, bar graphs,
scatter plots, etc.

Qualitative data – Qualitative data can't be expressed as a number and can't be measured. Qualitative data
consist of words, pictures, and symbols, not numbers. Qualitative data is also called categorical data
because the information can be sorted by category, not by number.



Nominal data – Nominal data is used just for labeling variables, without any quantitative value. The
name 'nominal' comes from the Latin word "nomen," which means 'name'. Nominal data just name a
thing without assigning it an order. In fact, nominal data could simply be called "labels."

Examples of Nominal Data:

 Gender (Women, Men)


 Hair color (Blonde, Brown, Brunette, Red, etc.)
 Marital status (Married, Single, Widowed)
 Ethnicity (Hispanic, Asian)

Ordinal data – Ordinal data shows where a value stands in an order. This is the crucial difference from
nominal types of data. Ordinal data is data that is placed into some kind of order by its position on a
scale. Ordinal data may indicate superiority. Ordinal variables are considered to be "in between"
qualitative and quantitative variables. In other words, ordinal data is qualitative data for which the values
are ordered.

Examples of Ordinal Data:

 The first, second and third person in a competition.


 Letter grades: A, B, C, etc.
 When a company asks a customer to rate the sales experience on a scale of 1-10.
 Economic status: low, medium and high.

Discrete data – Discrete data is a count that involves only integers. Discrete values cannot be subdivided
into parts. For example, the number of children in a class is discrete data: you can count whole
individuals, but you can't count 1.5 kids.

Examples of discrete data:

 The number of students in a class.


 The number of workers in a company.
 The number of home runs in a baseball game.
 The number of test questions you answered correctly.

Continuous data – Continuous data is information that can be meaningfully divided into finer levels. It
can be measured on a scale or continuum and can have almost any numeric value. For example, you can
measure your height at very precise scales: meters, centimeters, millimeters, etc. You can record
continuous data in many different measurements: width, temperature, time, etc. This is where the key
difference from discrete types of data lies.



Examples of continuous data:

 The amount of time required to complete a project.


 The height of children.
 The square footage of a two-bedroom house.
 The speed of cars.

Data Quality

Data quality refers to the state of qualitative or quantitative pieces of information. There are many
definitions of data quality, but data is generally considered high quality if it is "fit for [its] intended uses
in operations, decision making and planning".

Data quality is the measure of how well suited a data set is to serve its specific purpose. Measures of data
quality are based on data quality characteristics such as accuracy, completeness, consistency, validity,
uniqueness, and timeliness.

Data quality refers to the development and implementation of activities that apply quality management
techniques to data in order to ensure the data is fit to serve the specific needs of an organization in a
particular context. Data that is deemed fit for its intended purpose is considered high quality data.
Examples of data quality issues include duplicated data, incomplete data, inconsistent data, incorrect data,
poorly defined data, poorly organized data, and poor data security.



Data Preprocessing

Data preprocessing can refer to manipulation or dropping of data before it is used in order to ensure or
enhance performance, and is an important step in the data mining process. The phrase "garbage in,
garbage out" is particularly applicable to data mining and machine learning projects.

Data preprocessing is a step in the data mining and data analysis process that takes raw data and
transforms it into a format that can be understood and analyzed by computers and machine learning
algorithms. Data preprocessing is a broad area and consists of a number of different strategies and
techniques that are interrelated in complex ways. We will present some of the most important ideas and
approaches, and try to point out the interrelationships among them (a small sketch of a few of these
strategies follows the list). Specifically, we will discuss the following topics:

 Aggregation
 Sampling
 Dimensionality reduction
 Feature subset selection
 Feature creation
 Discretization and binarization
 Variable transformation
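
A compact, hedged sketch of three of these strategies in Python (the records, bin edges, and sample size are illustrative only):

import random

# Hypothetical raw records: (date, store, amount).
raw = [("2024-01-01", "S1", 10.0), ("2024-01-01", "S1", 5.0),
       ("2024-01-02", "S2", 8.0), ("2024-01-02", "S1", 7.0)]

# Aggregation: combine records that share the same (date, store) key.
agg = {}
for date, store, amount in raw:
    agg[(date, store)] = agg.get((date, store), 0.0) + amount

# Sampling: analyze a random subset instead of every record.
random.seed(0)
sample = random.sample(list(agg.items()), k=2)
print(sample)

# Discretization: map a continuous amount onto ordered bins.
def discretize(x):
    return "low" if x < 8 else "medium" if x < 15 else "high"

print({k: discretize(v) for k, v in agg.items()})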

Measures of Similarity and Dissimilarity

Similarity and dissimilarity are important because they are used by a number of data mining techniques,
such as clustering, nearest-neighbor classification, and anomaly detection. The term proximity is used to
refer to either similarity or dissimilarity.


Similarity and Dissimilarity

Distance or similarity measures are essential in solving many pattern recognition problems, such as
classification and clustering. Various distance and similarity measures are available in the literature to
compare two data distributions. As the name suggests, a similarity measure quantifies how close two
distributions are. For multivariate data, complex summary measures have been developed to answer this
question.

Similarity Measure
Numerical measure of how alike two data objects are; it often falls between 0 (no similarity) and 1
(complete similarity).

Dissimilarity Measure
Numerical measure of how different two data objects are; it ranges from 0 (objects are alike) to ∞
(objects are completely different).



Proximity
Refers to a similarity or dissimilarity

Similarity/Dissimilarity for Simple Attributes

Here, p and q are the attribute values for two data objects.
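
The usual reference definitions for each simple attribute type (following the standard Tan, Steinbach, and Kumar presentation, with d a dissimilarity and s a similarity) are:

 Nominal: d = 0 if p = q, and d = 1 if p ≠ q; s = 1 − d.
 Ordinal: map the values to the integers 0 to n − 1, where n is the number of values; then d = |p − q| / (n − 1) and s = 1 − d.
 Interval or ratio: d = |p − q|; common choices of similarity are s = −d or s = 1 / (1 + d).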

Distance, such as the Euclidean distance, is a dissimilarity measure and has some well-known properties:
Common Properties of Dissimilarity Measures
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q (positivity),
2. d(p, q) = d(q, p) for all p and q (symmetry),
3. d(p, r) ≤ d(p, q) + d(q, r) for all p, q, and r (triangle inequality), where d(p, q) is the distance
(dissimilarity) between points (data objects) p and q.
A distance that satisfies these properties is called a metric. Following is a list of several common distance
measures to compare multivariate data. We will assume that the attributes are all continuous.



Euclidean Distance
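
The standard formula, where n is the number of attributes and p_k and q_k are the kth attributes of data objects p and q:

\[
d(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}
\]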

Minkowski Distance
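
The Minkowski distance generalizes the Euclidean distance with a parameter r:

\[
d(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^{r} \right)^{1/r}
\]

Setting r = 1 gives the Manhattan (L1) distance, r = 2 gives the Euclidean (L2) distance, and r → ∞ gives the supremum (L∞) distance.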



Mahalanobis Distance
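
The Mahalanobis distance corrects for correlations between attributes using the covariance matrix Σ of the data, with p and q treated as column vectors:

\[
d(p, q) = \sqrt{(p - q)^{T}\, \Sigma^{-1}\, (p - q)}
\]

When Σ is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance.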

