

Module 2
Data Warehouse Implementation & Data Mining
1. Explain Efficient Data Cube Computation. 7M

 At the core of multidimensional data analysis is the efficient computation of aggregations across many
sets of dimensions.
 One approach to cube computation extends SQL so as to include a compute cube operator. The
compute cube operator computes aggregates over all subsets of the dimensions specified in the
operation. This can require excessive storage space, especially for large numbers of dimensions.

 Example: A data cube is a lattice of cuboids.


Suppose that you want to create a data cube for AllElectronics sales that contains the following: city,
item, year, and sales in dollars.

You want to be able to analyze the data, with queries such as the following:


“Compute the sum of sales, grouping by city and item.”
“Compute the sum of sales, grouping by city.”
“Compute the sum of sales, grouping by item.”

Here, taking the three attributes, city, item, and year, as the dimensions for the
data cube, and sales in dollars as the measure, the total number of cuboids, or group-bys, that can be
computed for this data cube is 2^3 = 8.
The possible group-bys are the following: {(city, item, year), (city, item), (city, year), (item, year),
(city), (item), (year), ()} where () means that the group-by is empty.

 Fig: Lattice of cuboids making up a 3D data cube.

 An SQL query containing no group-by (e.g., “compute the sum of total sales”) is a zero-dimensional operation. An SQL query containing one group-by (e.g., “compute the sum of sales, group by city”) is a one-dimensional operation. A cube operator on n dimensions is equivalent to a collection of group-by statements, one for each subset of the n dimensions. Therefore, the cube operator is the n-dimensional generalization of the group-by operator. Similar to the SQL syntax, the data cube in the example could be defined as:
define cube sales cube [city, item, year]: sum(sales in dollars)
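For illustration only (not part of the original notes), here is a minimal Python/pandas sketch that mimics what the compute cube operator does: it computes the sum of sales for every subset of the three dimensions. The column names and values are hypothetical.

    from itertools import combinations
    import pandas as pd

    # Hypothetical sales records: one row per (city, item, year) with a sales measure.
    df = pd.DataFrame({
        "city":  ["Vancouver", "Vancouver", "Toronto"],
        "item":  ["phone", "computer", "phone"],
        "year":  [2023, 2023, 2024],
        "sales": [300.0, 1200.0, 450.0],
    })

    dims = ["city", "item", "year"]
    cuboids = {}
    for k in range(len(dims) + 1):                    # k = 0 is the apex cuboid (empty group-by)
        for subset in combinations(dims, k):
            if subset:
                cuboids[subset] = df.groupby(list(subset))["sales"].sum()
            else:
                cuboids[subset] = df["sales"].sum()   # total sales, no group-by

    print(len(cuboids))   # 2**3 = 8 cuboids for 3 dimensions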


 For a cube with n dimensions, there are a total of 2^n cuboids, including the base cuboid. A
statement such as compute cube sales cube would explicitly instruct the system to compute
the sales aggregate cuboids for all eight subsets of the set {city, item, year}, including the
empty subset.
 Online analytical processing may need to access different cuboids for different queries.
Therefore, it may seem like a good idea to compute in advance all or at least some of the
cuboids in a data cube. Pre-computation leads to fast response time and avoids some redundant computation.
 A major challenge related to this pre-computation, however, is that the required storage space
may explode if all the cuboids in a data cube are pre-computed, especially when the cube has
many dimensions. The storage requirements are even more excessive when many of the
dimensions have associated concept hierarchies, each with multiple levels. This problem is
referred to as the curse of dimensionality.
 How many cuboids are there in an n-dimensional cube whose dimensions have concept hierarchies? The total number of cuboids that can be generated is

Total number of cuboids = (L1 + 1) × (L2 + 1) × ... × (Ln + 1)

where Li is the number of levels associated with dimension i. One is added to Li to include the
virtual top level, all. (Note that generalizing to all is equivalent to the removal of the
dimension.)
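As a quick worked illustration: a cube with 10 dimensions, each having 5 hierarchy levels, has (5 + 1)^10 = 6^10 ≈ 60.5 million cuboids, which shows how quickly full pre-computation becomes impractical.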

 If there are many cuboids, and these cuboids are large in size, a more reasonable option is
partial materialization; that is, to materialize only some of the possible cuboids that can be
generated.

Materialize every (cuboid) (full materialization), none (no materialization), or
some (partial materialization)

Selection of which cuboids to materialize is based on size, sharing, access frequency, etc.

2 Write short notes on: 6M


i. ROLAP
ii. MOLAP
iii. HOLAP

i. ROLAP
o This uses relational or extended-relational DBMS to store and manage warehouse data.
o This may be considered a bottom-up approach to OLAP.
o This is typically based on using a data warehouse that has been designed using a star schema.
o The data warehouse provides the multidimensional capabilities by representing data in fact-table
and dimension-tables.
o The fact-table contains one column for each dimension and one column for each measure and
every row of the table provides one fact.


For example, each fact is represented as (Bsc, India, 2001-02, 30)


o This tool essentially groups the fact-table to find aggregates and uses some of the aggregates
already computed to find new aggregates.
o Some products in this category are Oracle OLAP mode and OLAP Discoverer.

Advantages:
→ This is more easily used with existing relational DBMSs.
→ The data can be stored efficiently using tables.
→ Greater scalability
Disadvantage:
→ Poor query performance

ii. MOLAP
 This is based on using a multidimensional DBMS rather than a data-warehouse to store and access
data.
 This may be considered as a top-down approach to OLAP.
 MOLAP systems do not have a standard approach to storing and maintaining their data. They often use special-purpose file systems or indexes that store pre-computations of all aggregations in the cube.
 In MOLAP, the dimension values do not need to be stored since all the values of the cube could be
stored in an array in a predefined way. For example, the cube in figure 8.2 may be represented as
an array like the following:

12 30 31 103 21 197 19 0 32 47 30 128 10 2 …..

 Some MOLAP products are Hyperion Essbase and Applix TM1.

Advantages:
→ Implementation is usually exceptionally efficient
→ Easier to use and therefore may be more suitable for inexperienced users
→ Fast indexing to pre-computed summarized-data
Disadvantages:
→ More expensive than ROLAP
→ Data is not always current
→ Difficult to scale a MOLAP system for very large OLAP problems
→ Storage utilization may be low if the data-set is sparse

iii. HOLAP
Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP
technology, benefiting from the greater scalability of ROLAP and the faster computation of
MOLAP. For example, a HOLAP server may allow large volumes of detailed data to be stored in a
relational database, while aggregations are kept in a separate MOLAP store. The Microsoft SQL
Server 2000 supports a hybrid OLAP server.


3 Distinguish between MOLAP & ROLAP. 6M

Feature | ROLAP | MOLAP
Full form | Relational Online Analytical Processing. | Multidimensional Online Analytical Processing.
Storage & fetch | Data is stored in and fetched from the main data warehouse. | Data is stored in and fetched from proprietary multidimensional databases (MDDBs).
Data form | Data is stored in the form of relational tables. | Data is stored in a large multidimensional array made of data cubes.
Data volumes | Handles large data volumes. | Limited summary data is kept in MDDBs.
Technology | Uses complex SQL queries to fetch data from the main warehouse. | The MOLAP engine creates precalculated, prefabricated data cubes for multidimensional data views; sparse-matrix technology is used to manage data sparsity.
View | Creates a multidimensional view of data dynamically. | Already stores a static multidimensional view of data in MDDBs.
Access | Slow access. | Faster access.

4. Describe the servers involved in the implementation of a data warehouse server. 8M

i. ROLAP
o This uses relational or extended-relational DBMS to store and manage warehouse data.
o This may be considered a bottom-up approach to OLAP.
o This is typically based on using a data warehouse that has been designed using a star schema.
o The data warehouse provides the multidimensional capabilities by representing data in fact-table
and dimension-tables.
o The fact-table contains one column for each dimension and one column for each measure and
every row of the table provides one fact.
For example, each fact is represented as (Bsc, India, 2001-02, 30)
o This tool essentially groups the fact-table to find aggregates and uses some of the aggregates
already computed to find new aggregates.
o Some products in this category are Oracle OLAP mode and OLAP Discoverer.

Advantages:
→ This is more easily used with existing relational DBMSs.
→ The data can be stored efficiently using tables.
→ Greater scalability


Disadvantage:
→ Poor query performance

ii. MOLAP
 This is based on using a multidimensional DBMS rather than a data-warehouse to store and access
data.
 This may be considered as a top-down approach to OLAP.
 MOLAP systems do not have a standard approach to storing and maintaining their data. They often use special-purpose file systems or indexes that store pre-computations of all aggregations in the cube.
 In MOLAP, the dimension values do not need to be stored since all the values of the cube could be
stored in an array in a predefined way. For example, the cube in figure 8.2 may be represented as
an array like the following:
12 30 31 103 21 197 19 0 32 47 30 128 10 2 …..

 Some MOLAP products are Hyperion Essbase and Applix TM1.

Advantages:
→ Implementation is usually exceptionally efficient
→ Easier to use and therefore may be more suitable for inexperienced users
→ Fast indexing to pre-computed summarized-data
Disadvantages:
→ More expensive than ROLAP
→ Data is not always current
→ Difficult to scale a MOLAP system for very large OLAP problems
→ Storage utilization may be low if the data-set is sparse

iii. HOLAP
Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP
technology, benefiting from the greater scalability of ROLAP and the faster computation of
MOLAP. For example, a HOLAP server may allow large volumes of detailed data to be stored in a
relational database, while aggregations are kept in a separate MOLAP store. The Microsoft SQL
Server 2000 supports a hybrid OLAP server.

4 What is indexing? Explain different forms of indexing methods. 8M

Indexing is a way to optimize performance of a database by minimizing the number of disk accesses
required when a query is processed. An index or database index is a data structure which is used to quickly
locate and access the data in a database table. Indexes are created using some database columns.

i. Bitmap indexing:

The bitmap indexing method is popular in OLAP products because it allows quick searching in data
cubes. The bitmap index is an alternative representation of the record ID (RID) list. In the bitmap index
for a given attribute, there is a distinct bit vector, Bv, for each value v in the attribute's domain. If a
given attribute's domain consists of n values, then n bits are needed for each entry in the bitmap index
(i.e., there are n bit vectors). If the attribute has the value v for a given row in the data table, then the bit
representing that value is set to 1 in the corresponding row of the bitmap index. All other bits for that
row are set to 0.

Fig: Indexing OLAP Data Using Bitmap Indexes

Bitmap indexing is advantageous compared to hash and tree indices. It is especially useful for low-
cardinality domains because comparison, join, and aggregation operations are then reduced to bit
arithmetic, which substantially reduces the processing time.
Bitmap indexing leads to significant reductions in space and input/output (I/O) since a string of characters
can be represented by a single bit. For higher-cardinality domains, the method can be adapted using
compression techniques.
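As a rough illustration (not from the text), the following Python sketch builds one bit vector per value of a low-cardinality attribute and answers a query with bit arithmetic; the item values are made up.

    # Base table: one entry per row (the RID is the position in the list).
    items = ["home entertainment", "computer", "phone", "security", "computer"]

    # One bit vector per distinct value in the attribute's domain.
    bitmaps = {v: [1 if item == v else 0 for item in items] for v in sorted(set(items))}

    # RIDs of rows where item == "computer": positions of the 1 bits.
    computer_rids = [rid for rid, bit in enumerate(bitmaps["computer"]) if bit]
    print(computer_rids)          # [1, 4]

    # Bit arithmetic: rows where item is "computer" OR "phone".
    either = [a | b for a, b in zip(bitmaps["computer"], bitmaps["phone"])]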

ii. Join Indexing:

The join indexing method gained popularity from its use in relational database query processing.
Traditional indexing maps the value in a given column to a list of rows having that value. In contrast, join
indexing registers the joinable rows of two relations from a relational database.

For example, if two relations R(RID, A) and S(B, SID) join on the attributes A and B, then the join index
record contains the pair (RID, SID), where RID and SID are record identifiers from the R and S relations,
respectively. Hence, the join index records can identify joinable tuples without performing costly join
operations.
Join indexing is especially useful for maintaining the relationship between a foreign key and its matching
primary keys, from the joinable relation.
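A small illustrative sketch (not from the text) of a join index built as a list of (RID, SID) pairs; the relation contents are hypothetical.

    # R(RID, A) and S(B, SID): A and B are the join attributes.
    R = {"r1": "Main Street", "r2": "Main Street", "r3": "Lakeview"}   # RID -> A
    S = {"Main Street": "s1", "Lakeview": "s2"}                        # B -> SID

    # Join index: record-identifier pairs for every joinable pair of tuples (A = B).
    join_index = [(rid, S[a]) for rid, a in R.items() if a in S]
    print(join_index)   # [('r1', 's1'), ('r2', 's1'), ('r3', 's2')]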

Example:


5 Explain the concept of data cube materialization. 7M

OLAP operations deal with aggregate data. Hence, materialization or pre-computation of summarized data
are often required to accelerate the DSS(Decision Support System) query processing and data warehouse
design. Otherwise, DSS queries may take a long time due to the huge size of the data warehouse and the complexity of the query, which is not acceptable in a DSS environment. Different techniques like query optimizers and
query evaluation techniques are being used to reduce query execution time. View materialization is also
one very important technique, which is used in DSS to reduce query response time.

There are three choices for data cube materialization given a base cuboid:

1. No materialization: Do not precompute any of the “nonbase” cuboids. This leads to computing
expensive multidimensional aggregates on-the-fly, which can be extremely slow.
2. Full materialization: Precompute all of the cuboids. The resulting lattice of computed cuboids is
referred to as the full cube. This choice typically requires huge amounts of memory space in order
to store all of the precomputed cuboids.
3. Partial materialization: Selectively compute a proper subset of the whole set of possible cuboids.
Alternatively, we may compute a subset of the cube, which contains only those cells that satisfy
some user-specified criterion, such as where the tuple count of each cell is above some threshold.
We will use the term subcube to refer to the latter case, where only some of the cells may be
precomputed for various cuboids.

Partial materialization represents an interesting trade-off between storage space and response time.
The partial materialization of cuboids or subcubes should consider three factors:
(1) Identify the subset of cuboids or subcubes to materialize;
(2) Exploit the materialized cuboids or subcubes during query processing; and
(3) Efficiently update the materialized cuboids or subcubes during load and refresh.


6 Explain bitmap indexing method. 6M

The bitmap indexing method is popular in OLAP products because it allows quick searching in data cubes.
The bitmap index is an alternative representation of the record ID (RID) list. In the bitmap index for a
given attribute, there is a distinct bit vector, Bv, for each value v in the attribute's domain. If a given
attribute's domain consists of n values, then n bits are needed for each entry in the bitmap index (i.e., there
are n bit vectors). If the attribute has the value v for a given row in the data table, then the bit representing
that value is set to 1 in the corresponding row of the bitmap index. All other bits for that row are set to 0.

Eg: Bitmap indexing: In the AllElectronics data warehouse, suppose the dimension item at the top level
has four values (representing item types): “home entertainment,” “computer,” “phone,” and “security.”
Each value (e.g., “computer”) is represented by a bit vector in the item bitmap index table. Suppose that the
cube is stored as a relation table with 100,000 rows. Because the domain of item consists of four values, the
bitmap index table requires four bit vectors (or lists), each with 100,000 bits. Figure 4.15 shows a base
(data) table containing the dimensions item and city, and its mapping to bitmap index tables for each of the
dimensions.

Fig: Indexing OLAP Data Using Bitmap Indexes


Bitmap indexing is advantageous compared to hash and tree indices. It is especially useful for low-
cardinality domains because comparison, join, and aggregation operations are then reduced to bit
arithmetic, which substantially reduces the processing time.
Bitmap indexing leads to significant reductions in space and input/output (I/O) since a string of characters
can be represented by a single bit. For higher-cardinality domains, the method can be adapted using
compression techniques.

7 Explain the concept of join index. 6M

The join indexing method gained popularity from its use in relational database query processing.
Traditional indexing maps the value in a given column to a list of rows having that value. In contrast, join
indexing registers the joinable rows of two relations from a relational database.

For example, if two relations R(RID, A) and S(B, SID) join on the attributes A and B, then the join index
record contains the pair (RID, SID), where RID and SID are record identifiers from the R and S relations,
respectively. Hence, the join index records can identify joinable tuples without performing costly join
operations.
Join indexing is especially useful for maintaining the relationship between a foreign key and its matching
primary keys, from the joinable relation.


Example:

8 What is data mining? 2M

Data mining is the process of discovering interesting patterns and knowledge from large amounts of data.
The data sources can include databases, data warehouses, the Web, other information repositories, or data
that are streamed into the system dynamically. It is the process of sorting through large data sets to identify
patterns and establish relationships to solve problems through data analysis. Data mining tools allow
enterprises to predict future trends.
9 Explain the steps involved in KDD with a neat diagram. 6M

 Data Mining is an integral part of KDD (Knowledge Discovery in Databases).


 KDD is the overall process of converting raw data into useful information (Figure: 1.1).

 The input-data is stored in various formats such as flat files, spread sheet or relational tables.


 Purpose of preprocessing: to transform the raw input-data into an appropriate format for
subsequent analysis. The steps involved in data-preprocessing include
→ combine data from multiple sources
→ clean data to remove noise & duplicate observations, and
→ select records & features that are relevant to the DM task at hand
Data-preprocessing is perhaps the most time-consuming step in the overall knowledge discovery
process.
 “Closing the loop” refers to the process of integrating DM results into decision support systems.
Such integration requires a post processing step. This step ensures that only valid and useful results
are incorporated into the decision support system.
 An example of post processing is visualization. Visualization can be used to explore data and DM
results from a variety of viewpoints. Statistical measures can also be applied during post
processing to eliminate bogus DM results.

10 Explain the tasks involved in data mining OR Explain core data mining tasks. 6M
DM tasks are generally divided into 2 major categories.
 Predictive Tasks
 The objective is to predict the value of a particular attribute based on the values of other attributes.
 The attribute to be predicted is commonly known as the target or dependent variable, while the
attributes used for making the prediction are known as the explanatory or independent variables.

 Descriptive Tasks
 The objective is to derive patterns (correlations, trends, clusters, trajectories and anomalies) that
summarize the relationships in data.
 Descriptive DM tasks are often exploratory in nature and frequently require post-processing
techniques to validate and explain the results.

11 How are data mining tasks categorized? Explain. 8M


DM tasks are generally divided into 2 major categories.
 Predictive Tasks
 The objective is to predict the value of a particular attribute based on the values of other attributes.
 The attribute to be predicted is commonly known as the target or dependent variable, while the
attributes used for making the prediction are known as the explanatory or independent variables.
 Descriptive Tasks
 The objective is to derive patterns (correlations, trends, clusters, trajectories and anomalies) that
summarize the relationships in data.
 Descriptive DM tasks are often exploratory in nature and frequently require post-processing
techniques to validate and explain the results.

Four of the Core Data Mining Tasks


1) Predictive Modeling
 This refers to the task of building a model for the target variable as a function of the explanatory
variables.
 The goal is to learn a model that minimizes the error between the predicted and true value
of the target variable.


 There are 2 types of predictive modeling tasks:


i) Classification: Used for discrete target variables
Ex: Predicting whether a web user will make a purchase at an online bookstore is a classification task, because the target variable is binary-valued.
ii) Regression: Used for continuous target variables.
Ex: Forecasting the future price of a stock is a regression task because price is a continuous-valued attribute.

2) Cluster Analysis
 This seeks to find groups of closely related observations so that observations that belong to the
same cluster are more similar to each other than observations that belong to other clusters.
 Clustering has been used
→ to group sets of related customers
→ to find areas of the ocean that have a significant impact on the Earth's climate

3) Association Analysis
 This is used to discover patterns that describe strongly associated features in the data.
 The goal is to extract the most interesting patterns in an efficient manner.
 Useful applications include
→ finding groups of genes that have related functionality or
→ identifying web pages that are accessed together
Ex: market basket analysis
 We may discover the rule that {diapers} -> {Milk}, which suggests that customers who buy
diapers also tend to buy milk.


4) Anomaly Detection
 This is the task of identifying observations whose characteristics are significantly different
from the rest of the data. Such observations are known as anomalies or outliers.
 The goal is
→ to discover the real anomalies and
→ to avoid falsely labeling normal objects as anomalous.
 Applications include the detection of fraud, network intrusions, and unusual patterns of disease.
 Example (Credit Card Fraud Detection).
 A credit card company records the transactions made by every credit card holder, along with
personal information such as credit limit, age, annual income, and address.
 Since the number of fraudulent cases is relatively small compared to the number of legitimate
transactions, anomaly detection techniques can be applied to build a profile of legitimate
transactions for the users.
 When a new transaction arrives, it is compared against the profile of the user.
 If the characteristics of the transaction are very different from the previously created profile,
then the transaction is flagged as potentially fraudulent
12 Describe the challenges that motivated the development of data mining. 10M

 Scalability
Nowadays, data-sets with sizes of terabytes or even petabytes are becoming common. DM
algorithms must be scalable in order to handle these massive data sets. Scalability may also
require the implementation of novel data structures to access individual records in an efficient
manner. Scalability can also be improved by developing parallel & distributed algorithms.
 High Dimensionality
Traditional data-analysis techniques can only deal with low-dimensional data. Nowadays, data-
sets with hundreds or thousands of attributes are becoming common. Data-sets with temporal
or spatial components also tend to have high dimensionality. The computational complexity
increases rapidly as the dimensionality increases.
 Heterogeneous and Complex Data
Traditional analysis methods can deal with homogeneous types of attributes. Recent years have also
seen the emergence of more complex data-objects. DM techniques for complex objects should take into
consideration relationships in the data, such as
→ temporal& spatial autocorrelation
→ parent-child relationships between the elements in semi-structured text & XML
documents
 Data Ownership & Distribution
 Sometimes, the data is geographically distributed among resources belonging to multiple entities.
 Key challenges include:
 How to reduce amount of communication needed to perform the distributed computation
 How to effectively consolidate the DM results obtained from multiple sources &
 How to address data-security issues

 Non Traditional Analysis


 The traditional statistical approach is based on a hypothesize-and-test paradigm. In other words, a hypothesis is proposed, an experiment is designed to gather the data, and then the data is analyzed with respect to the hypothesis.


 Current data analysis tasks often require the generation and evaluation of thousands of hypotheses,
and consequently, the development of some DM techniques has been motivated by the desire to
automate the process of hypothesis generation and evaluation.

13 What are the properties necessary to describe attributes? Explain different types of attributes. 6M
OR
Explain 4 types of attributes by giving appropriate example?
The properties (operations) of numbers typically used to describe attributes are:
1) Distinctness: =, ≠
2) Order: <, >
3) Addition: +, −
4) Multiplication: *, /
 Given these properties, we can define four types of attributes: Nominal, ordinal, interval and
ratio. Table 2.2 gives the definitions of these types, along with information about the statistical
operations that are valid for each type.
 Each attribute type possesses all of the properties or operations of the attribute types above it. That
means,
 Nominal attribute: Uses only distinctness. Examples: ID numbers, eye color, pin codes
 Ordinal attribute: Uses distinctness & order. Examples: Grades in {SC, FC, FCD}, Shirt sizes
in {S, M, L, XL}
 Interval attribute: Uses distinctness, order & addition. Examples: calendar dates, temperatures
in Celsius or Fahrenheit.
 Ratio attribute: Uses all 4 properties. Examples: temperature in Kelvin, length, time, counts.


 Nominal and ordinal are collectively referred to as categorical or qualitative attributes. As the name suggests, qualitative attributes, such as employee ID, lack most of the properties of numbers.
 The remaining two types of attributes, interval and ratio, are collectively referred to as quantitative or numeric attributes. Quantitative attributes are represented by numbers and have most of the properties of numbers.

 The types of attributes can also be described in terms of transformations that do not change the
meaning of an attribute. Such transformation is called as permissible transformations.

 For example, the meaning of a length attribute is unchanged if it is measured in meters instead of
feet.

 Table 2.3 shows the permissible (meaning-preserving) transformations for the four types of
attribute

15 With example, explain: i) Continuous ii) Discrete iii) Asymmetric Attributes. 10M

i. Discrete
 Has only a finite or countable set of values.
 Examples: pin codes, ID Numbers, or the set of words in a collection of documents.
 Often represented as integer variables.
 Binary attributes are a special case of discrete attributes and assume only 2 values, e.g., true/false, yes/no, male/female or 0/1.

ii. Continuous
 Has real numbers as attribute values.
 Examples: temperature, height, or weight.


 Often represented as floating-point variables.

iii. Asymmetric Attributes


 For asymmetric attributes, only presence – a non-zero attribute value – is regarded as important.
 Consider a data-set where each object is a student and each attribute records whether or not
a student took a particular course at a university. For a specific student, an attribute has a value of
1 if the student took the course associated with that attribute and a value of 0 otherwise.
 Because students take only a small fraction of all available courses, most of the values in such
data-set would be 0.
 Therefore, it is more meaningful and more efficient to focus on the non-zero values.
 Binary attributes where only non-zero values are important are called asymmetric binary
attributes.
 This type of attribute is particularly important for association analysis.

16 Explain general characteristics of data sets. 6M

Following 3 characteristics apply to many data-sets:

1) Dimensionality
 Dimensionality of a data-set is no. of attributes that the objects in the data-set possess.
 Data with a small number of dimensions tends to be qualitatively different than moderate or high-
dimensional data.
 The difficulties associated with analyzing high-dimensional data are sometimes referred to as the
curse of dimensionality. Because of this, an important motivation in preprocessing data is
dimensionality reduction.

2) Sparsity
 For some data-sets with asymmetric features, most attributes of an object have a value of 0.
 In practical terms, sparsity is an advantage because usually only the non-zero values need to be
stored & manipulated. This results in significant savings with respect to computation-time and
storage.
 Some DM algorithms work well only for sparse data.

3) Resolution
 It is frequently possible to obtain data at different levels of resolution, and often the properties
of the data are different at different resolutions.
 Ex: the surface of the earth seems very uneven at a resolution of few meters, but is relatively
smooth at a resolution of tens of kilometers.
 The patterns in the data also depend on the level of resolution.

17 Explain record data & its types. 10M

 Data-set is a collection of records (data objects), each of which consists of a fixed set of attributes.
Every record has the same set of attributes.
 For the most basic form of record data, there is no explicit relationship among records or attributes.
 The data is usually stored either in flat files or in relational databases


TYPES OF RECORD DATA


 Transaction (Market Basket Data)
• Each transaction consists of a set of items.
• Consider a grocery store. The set of products purchased by a customer represents a
transaction while the individual products represent items.
• This type of data is called market basket data because the items in each transaction are the
products in a person's "market basket."
• Data can also be viewed as a set of records whose fields are asymmetric attributes.

 Data Matrix
 If the data objects in a collection of data all have the same fixed set of numeric attributes,
then the data objects can be thought of as points in a multidimensional space.
 A set of such data objects can be interpreted as an m by n matrix, where there are m rows,
one for each object, and n columns, one for each attribute. This matrix is called a data-
matrix.
 Since data-matrix consists of numeric attributes, standard matrix operation can be applied
to manipulate the data.

 Sparse Data Matrix


• This is a special case of a data-matrix, in which the attributes are of the same type and are
asymmetric i.e. only non-zero values are important.
• Transaction data is an example for sparse data matrix that has only 0 and 1 entries.


• Another example is document data. A document can be represented as a 'vector', where each term is an attribute of the vector and the value of each attribute is the number of times the corresponding term occurs in the document, as shown in fig. 2.2(d).
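To make the document-data idea concrete, here is a small Python sketch (not from the text) that turns documents into term-frequency vectors, i.e., rows of a sparse data matrix; the documents are made up.

    from collections import Counter

    docs = ["the cat sat on the mat", "the dog chased the cat"]
    vocab = sorted({w for d in docs for w in d.split()})

    def to_vector(doc):
        counts = Counter(doc.split())
        return [counts.get(term, 0) for term in vocab]   # 0 for absent terms, hence sparse

    matrix = [to_vector(d) for d in docs]   # one term-frequency vector per document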

19 Explain graph based data. 6M

GRAPH BASED DATA

 Sometimes, a graph can be a convenient and powerful representation for data.


 We consider 2 specific cases:
 Data with Relationships among Objects
 The relationships among objects frequently convey important information.
 In particular, the data-objects are mapped to nodes of the graph, while relationships among
objects are captured by link properties such as direction & weight.
 For ex, in web, the links to & from each page provide a great deal of information about the
relevance of a web-page to a query, and thus, must also be taken into consideration.

 Data with Objects that are Graphs


 If the objects contain sub-objects that have relationships, then such objects are frequently
represented as graphs.
 For ex, the structure of chemical compounds can be represented by a graph, where nodes are
atoms and links between nodes are chemical bonds.
 Fig. 2.3(b) shows structure of the chemical compound benzene, which contains atoms of
carbons and hydrogen.

18 Explain ordered data & its types. 10M


For some types of data, the attributes have relationships that involve order in time or space.
Different types of ordered data are described below:

 Sequential Data (Temporal Data)


 This can be thought of as an extension of record-data, where each record has a time associated with it. A time can also be associated with each attribute. Consider a retail transaction data set that also stores the time at which the transaction took place.
 For example, each record could be the purchase history of a customer, with a listing of items
purchased at different times.
 Fig 2.4 (a) shows an example of sequential transaction data. There are five different times – t1,
t2, t3, t4 and t5; three different customers – C1,C2 and C3; and five different items – A, B C, D
and E.
 In the top table, each row corresponds to the items purchased at a particular time by each
customer. In bottom table, the same information is displayed, but each row corresponds to a
particular customer.

 Sequence Data
 This consists of a data-set that is a sequence of individual entities, such as a sequence of words
or letters.
 This is quite similar to sequential data, except that there are no time stamps; instead, there are
positions in an ordered sequence.
 For example, the genetic information of plants and animals can be represented in the form of
sequences of nucleotides that are known as genes.
 Fig 2.4(b) shows a section of the human genetic code expressed using the four nucleotides
from which all DNA is constructed: A, T, G and C.

 Time Series Data


 This is a special type of sequential data in which a series of measurements are taken over time.
 For example, consider fig 2.4 (c) which shows a time series of the average monthly
temperature during the years 1984 to 1994.
 An important aspect of temporal-data is temporal-autocorrelation i.e. if two measurements are
close in time, then the values of those measurements are often very similar.

 Spatial Data
 Some objects have spatial attributes, such as positions or areas.
 An example is weather-data (temperature, pressure) that is collected for a variety of
geographical locations.
 An important aspect of spatial-data is spatial-autocorrelation i.e. objects that are physically
close tend to be similar in other ways as well. Thus, two points on the Earth that are close to
each other usually have similar values for temperature and rainfall.


19 Explain types of datasets 12M

There are three types of data sets, they are:


1) Record data
 Transaction (or Market based data)
 Data matrix
 Document data or Sparse data matrix

2) Graph-based data
 Data with relationship among objects (World Wide Web)
 Data with objects that are Graphs (Molecular Structures)

3) Ordered data
 Sequential data (Temporal data)
 Sequence data
 Time series data
 Spatial data

1) RECORD DATA
 Data-set is a collection of records (data objects), each of which consists of a fixed set of attributes.
Every record has the same set of attributes.
 For the most basic form of record data, there is no explicit relationship among records or attributes.
 The data is usually stored either in flat files or in relational databases


TYPES OF RECORD DATA


 Transaction (Market Basket Data)
• Each transaction consists of a set of items.
• Consider a grocery store. The set of products purchased by a customer represents a
transaction while the individual products represent items.
• This type of data is called market basket data because the items in each transaction are the
products in a person's "market basket."
• Data can also be viewed as a set of records whose fields are asymmetric attributes.

 Data Matrix
 If the data objects in a collection of data all have the same fixed set of numeric attributes,
then the data objects can be thought of as points in a multidimensional space.
 A set of such data objects can be interpreted as an m by n matrix, where there are m rows,
one for each object, &n columns, one for each attribute. This matrix is called a data-
matrix.
 Since data-matrix consists of numeric attributes, standard matrix operation can be applied
to manipulate the data.

 Sparse Data Matrix


• This is a special case of a data-matrix, in which the attributes are of the same type and are
asymmetric i.e. only non-zero values are important.
• Transaction data is an example of a sparse data matrix that has only 0 and 1 entries.
• Another example is document data. A document can be represented as a 'vector', where each term is an attribute of the vector and the value of each attribute is the number of times the corresponding term occurs in the document, as shown in fig. 2.2(d).

2. GRAPH BASED DATA

 Sometimes, a graph can be a convenient and powerful representation for data.


 We consider 2 specific cases:
 Data with Relationships among Objects
 The relationships among objects frequently convey important information.
 In particular, the data-objects are mapped to nodes of the graph, while relationships among
objects are captured by link properties such as direction & weight.
 For ex, on the web, the links to & from each page provide a great deal of information about the relevance of a web-page to a query, and thus, must also be taken into consideration.

 Data with Objects that are Graphs


 If the objects contain sub-objects that have relationships, then such objects are frequently
represented as graphs.
 For ex, the structure of chemical compounds can be represented by a graph, where nodes are
atoms and links between nodes are chemical bonds.
 Fig. 2.3(b) shows structure of the chemical compound benzene, which contains atoms of
carbons and hydrogen.

3) ORDERED DATA

For some types of data, the attributes have relationships that involve order in time or space.
Different types of ordered data are described below:

 Sequential Data (Temporal Data)


 This can be thought of as an extension of record-data, where each record has a time associated with it. A time can also be associated with each attribute. Consider a retail transaction data set that also stores the time at which the transaction took place.
 For example, each record could be the purchase history of a customer, with a listing of items purchased at different times.
 Fig 2.4 (a) shows an example of sequential transaction data. There are five different times – t1,
t2, t3, t4 and t5; three different customers – C1,C2 and C3; and five different items – A, B C, D
and E.
 In the top table, each row corresponds to the items purchased at a particular time by each
customer. In bottom table, the same information is displayed, but each row corresponds to a
particular customer.

 Sequence Data
 This consists of a data-set that is a sequence of individual entities, such as a sequence of words or letters.
 This is quite similar to sequential data, except that there are no time stamps; instead, there are positions in an ordered sequence.
 For example, the genetic information of plants and animals can be represented in the form of sequences of nucleotides that are known as genes.
 Fig 2.4(b) shows a section of the human genetic code expressed using the four nucleotides
from which all DNA is constructed: A, T, G and C.

 Time Series Data


 This is a special type of sequential data in which a series of measurements are taken over time.
 For example, consider fig 2.4 (c), which shows a time series of the average monthly temperature during the years 1984 to 1994.
 An important aspect of temporal-data is temporal-autocorrelation, i.e., if two measurements are close in time, then the values of those measurements are often very similar.

 Spatial Data
 Some objects have spatial attributes, such as positions or areas.
 An example is weather-data (temperature, pressure) that is collected for a variety of geographical locations.
 An important aspect of spatial-data is spatial-autocorrelation, i.e., objects that are physically close tend to be similar in other ways as well. Thus, two points on the Earth that are close to each other usually have similar values for temperature and rainfall.


20 Explain shortly any five data pre-processing approaches 10M

Data preprocessing is a broad area and consists of a number of different strategies and
techniques that are interrelated in complex way.

Different data processing techniques are:


1. Aggregation
2. Sampling
3. Dimensionality reduction
4. Feature subset selection
5. Feature creation
6. Discretization and binarization
7. Variable transformation

1) AGGREGATION
 This refers to combining 2 or more attributes (or objects) into a single attribute (or object).
 Consider a data set consisting of transactions (data objects) recording the daily sales of products in
various store location for different days over the course of a year.
 One way to aggregate transactions for this data set is to replace all the transactions of a single store
with a single storewide transaction. This reduces the hundreds or thousands of transactions and
the number of data objects is reduced to the number of stores.
 An obvious issue is how an aggregate transaction is created i.e., how the values of each attribute
are combined across all the records corresponding to a particular location.
 Quantitative attributes, such as price, are aggregated by taking a sum or an average. A qualitative
attribute, such as item, can either be omitted or summarized as the set of all the items that were
sold at that location.
 Motivations for aggregation:
 Data reduction: The smaller data-sets require less memory and less processing time.
 Because of aggregation, more expensive algorithm can be used.
 Aggregation can act as a change of scale by providing a high-level view of the data
instead of a low-level view. E.g., cities aggregated into districts, states, countries, etc.
 The behavior of groups of objects is often more stable than that of individual objects.
 Disadvantage: The potential loss of interesting details.
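A minimal pandas sketch (not in the original notes) of the store-wide aggregation described above; the columns and values are hypothetical.

    import pandas as pd

    # Daily sales transactions for several stores.
    tx = pd.DataFrame({
        "store": ["S1", "S1", "S2", "S2", "S2"],
        "item":  ["pen", "book", "pen", "pen", "book"],
        "price": [10.0, 120.0, 12.0, 11.0, 115.0],
    })

    # Aggregate: the quantitative attribute is summed, the qualitative attribute is
    # summarized as the set of items sold at that store.
    per_store = tx.groupby("store").agg(
        total_price=("price", "sum"),
        items_sold=("item", lambda s: set(s)),
    )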

2) SAMPLING
 This is a method used for selecting a subset of the data-objects to be analyzed.
 This is often used for both preliminary investigation of the data and final data analysis.
 Obtaining & processing the entire set of “data of interest” is too expensive or time consuming.
So sampling can reduce data-size to the point where better & more expensive algorithm can be
used.

Sampling Methods
 Simple Random Sampling
• There is an equal probability of selecting any particular object.
• There are 2 variations on random sampling:


i) Sampling without Replacement


 As each object is selected, it is removed from the population.
ii) Sampling with Replacement
 Objects are not removed from the population as they are selected for the sample.
 The same object can be picked up more than once.
 When the population consists of different types (or numbers) of objects, simple random
sampling can fail to adequately represent those types of objects that are less
frequent.

 Stratified Sampling
 This starts with pre-specified groups of objects.
 In the simplest version, equal numbers of objects are drawn from each group even though the groups are of different sizes.
 In another variation, the number of objects drawn from each group is proportional to the size of that group.

Progressive Sampling
 If proper sample-size is difficult to determine, then progressive sampling can be used.
 This method starts with a small sample, and then increases the sample-size until a sample
of sufficient size has been obtained.
 This method requires a way to evaluate the sample to judge if it is large enough.
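A short Python sketch (not from the text) of the sampling flavours above, using the standard library's random module; the population and strata are hypothetical. Progressive sampling would simply repeat the draw with an increasing sample size until an evaluation criterion is met.

    import random

    population = list(range(1000))                     # hypothetical data objects

    # Simple random sampling
    without_repl = random.sample(population, 50)       # each object picked at most once
    with_repl = [random.choice(population) for _ in range(50)]   # objects may repeat

    # Stratified sampling from pre-specified groups of different sizes
    strata = {"rare": population[:100], "common": population[100:]}
    equal = {g: random.sample(objs, 10) for g, objs in strata.items()}                 # equal counts
    proportional = {g: random.sample(objs, len(objs) // 20) for g, objs in strata.items()}  # proportional counts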

3) DIMENSIONALITY REDUCTION
 Key benefit: many DM algorithms work better if the dimensionality is lower.

 Benefits:
May help to eliminate irrelevant features or reduce noise.
Can lead to a more understandable model (which can be easily visualized).
Reduce amount of time and memory required by DM algorithms.
Avoid curse of dimensionality.
 The Curse of Dimensionality
 Data-analysis becomes significantly harder as the dimensionality of the data increases.
 For classification, this can mean that there are not enough data-objects to allow the
creation of a model that reliably assigns a class to all possible objects.
 For clustering, the definitions of density and the distance between points (which are
critical for clustering) become less meaningful.
 As a result, we get reduced classification accuracy & poor quality clusters.

4) FEATURE SUBSET SELECTION

 Another way to reduce the dimensionality is to use only a subset of the features.
 It might seem that such an approach would lose information; however, this is not the case if redundant and irrelevant features are present.
 Redundant features duplicate much or all of the information contained in one or more
other attributes. For example: purchase price of a product and the amount of sales tax
paid.


 Irrelevant features contain almost no useful information for the DM task at hand. For
example: students' ID numbers are irrelevant to the task of predicting students' grade
point averages.

 Techniques for Feature Selection


Embedded approaches: Feature selection occurs naturally as part of DM algorithm.
Specifically, during the operation of the DM algorithm, the algorithm itself decides
which attributes to use and which to ignore.
Filter approaches: Features are selected before the DM algorithm is run, using some
approach that is independent of the data mining task.
Wrapper approaches: these methods use the DM algorithm as a black box to find best
subset of attributes.

An Architecture for Feature Subset Selection


 The feature selection process is viewed as consisting of 4 parts:
1) A measure of evaluating a subset,
2) A search strategy that controls the generation of a new subset of features,
3) A stopping criterion and
4) A validation procedure.

 Filter and wrapper methods differ only in the way in which they evaluate a subset of features. For
a wrapper method, subset evaluation uses the target data mining algorithm, while for a filter
approach, the evaluation technique is distinct from the target data mining algorithm.

 Feature subset selection is a search over all possible subsets of features to find optimal or near-optimal sets of features.

 An evaluation step is used to judge how the current subset of features compares to others that have been considered.

 Because the number of subsets can be enormous and it is impractical to examine them all, some
sort of stopping criterion is necessary. This strategy is usually based on one or more conditions.


 Finally, once a subset of features has been selected, the results of the target data mining algorithm
on the selected subset should be validated.

5) FEATURE CREATION

 It is frequently possible to create from the original attributes, a new set of attributes that captures
the important information in a data set much more effectively.

 The number of new attributes can be smaller than the number of original attributes, allowing us to
reap all the previously described benefits of dimensionality reduction.

 Three methodologies for creating new attributes are: feature extraction, mapping the data to a
new space, and feature construction.

 Feature extraction

 The creation of a new set of features from the original raw data is known as feature extraction.

 Example: consider a set of photographs, where each photograph is to be classified


according to whether or not it contains a human face. The raw data is a set of pixels, and as
such, is not suitable for many types of classification algorithms.

 If the data is processed to provide higher-level features, such as the presence or absence of
certain types of edges and areas that are highly correlated with the presence of human
faces, then many types of classification techniques can be applied to this problem.

 Mapping the data to a new space

 A totally different view of the data can reveal important and interesting features.

 For example, consider time series data, which contains periodic patterns. If there is only a
single periodic pattern and not much noise, then the pattern is easily detected. If, on the other
hand, there are a number of periodic patterns and a significant amount of noise is present,
then these patterns are hard to detect.

 Such patterns can be detected by applying a Fourier transform to the time series, which produces a new data object whose attributes are related to frequencies.
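An illustrative NumPy sketch (not from the text) of mapping a noisy time series to frequency space with a Fourier transform; the signal is synthetic.

    import numpy as np

    # Synthetic series containing two periodic patterns (5 Hz and 12 Hz) plus noise.
    t = np.linspace(0.0, 1.0, 400, endpoint=False)
    series = np.sin(2*np.pi*5*t) + 0.5*np.sin(2*np.pi*12*t) + 0.3*np.random.randn(t.size)

    # New data object: its attributes are frequencies instead of time points.
    spectrum = np.abs(np.fft.rfft(series))
    freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
    print(freqs[np.argsort(spectrum)[-2:]])   # the two dominant frequencies stand out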

 Feature construction

 Sometimes the features in the original data sets have the necessary information, but it is
not in a form suitable for the data mining algorithm.

 In this situation, one or more new features constructed out of the original features can be
more useful than the original features.

6) DISCRETIZATION AND BINARIZATION

 Some DM algorithms (especially classification algorithms) require that the data be in the form of
categorical attributes.


 Algorithms that find association patterns require that the data be in the form of binary attributes.
 Transforming continuous attributes into a categorical attribute is called discretization.
 And transforming continuous & discrete attributes into binary attributes is called as binarization.

Binarization
 A simple technique to binarize a categorical attribute is the following: If there are m categorical
values, then uniquely assign each original value to an integer in interval [0,m-1].
 Next, convert each of these m integers to a binary number.
 Since n = ⌈log2(m)⌉ binary digits are required to represent these integers, represent these binary numbers using n binary attributes.
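A tiny Python sketch (not in the notes) of this integer-to-binary encoding for a hypothetical categorical attribute with m = 5 values.

    import math

    values = ["awful", "poor", "OK", "good", "great"]     # m = 5 categorical values
    n = math.ceil(math.log2(len(values)))                 # binary digits needed (here 3)

    # Map each value to an integer in [0, m-1], then to n binary attributes.
    binarized = {v: [int(b) for b in format(i, f"0{n}b")] for i, v in enumerate(values)}
    print(binarized["OK"])                                # [0, 1, 0]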

Discretization of continuous attributes


 Discretization is typically applied to attributes that are used in classification or association
analysis.
 In general, the best discretization depends on the algorithm being used, as well as the other
attributes being considered
 Transformation of a continuous attribute to a categorical attribute involves two subtasks:
→ deciding how many categories to have and
→ determining how to map the values of the continuous attribute to these categories
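As a sketch of one common unsupervised choice (not prescribed by the text), equal-width binning in plain Python; the ages are made up.

    # Equal-width binning: split the attribute's range into a fixed number of intervals.
    ages = [3, 7, 15, 22, 29, 41, 58, 64, 77]
    num_bins = 3

    lo, hi = min(ages), max(ages)
    width = (hi - lo) / num_bins
    bins = [min(int((a - lo) / width), num_bins - 1) for a in ages]   # category index 0..2
    print(bins)   # [0, 0, 0, 0, 1, 1, 2, 2, 2]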

7) VARIABLE TRANSFORMATION

 This refers to a transformation that is applied to all the values of a variable.


 Ex: converting a floating point value to an absolute value.
 Two types are:
1) Simple Functions
• A simple mathematical function is applied to each value individually.
• If x is a variable, then examples of transformations include e^x, 1/x, log(x), sin(x).
2) Normalization (or Standardization)
• The goal is to make an entire set of values have a particular property.


• A traditional example is that of "standardizing a variable" in statistics.


• If x̄ is the mean of the attribute values and sx is their standard deviation, then the transformation x' = (x − x̄)/sx creates a new variable that has a mean of 0 and a standard deviation of 1.
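A minimal Python sketch (not in the notes) of standardizing a variable as defined above; the values are hypothetical.

    values = [12.0, 15.5, 9.0, 14.5, 11.0]

    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5

    standardized = [(v - mean) / std for v in values]   # new variable: mean 0, std deviation 1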

21 What is sampling? Explain simple random sampling vs. stratified sampling vs. progressive sampling. 10M

SAMPLING
 This is a method used for selecting a subset of the data-objects to be analyzed.
 This is often used for both preliminary investigation of the data and final data analysis
 Obtaining & processing the entire set of “data of interest” is too expensive or time consuming.
So sampling can reduce data-size to the point where better & more expensive algorithm can be
used.

Sampling Methods
 Simple Random Sampling
• There is an equal probability of selecting any particular object.
• There are 2 variations on random sampling:
i) Sampling without Replacement
 As each object is selected, it is removed from the population.
ii) Sampling with Replacement
 Objects are not removed from the population as they are selected for the sample.
 The same object can be picked up more than once.
 When the population consists of different types (or numbers) of objects, simple random
sampling can fail to adequately represent those types of objects that are less
frequent.

 Stratified Sampling
 This starts with pre-specified groups of objects.
 In the simplest version, equal numbers of objects are drawn from each group even though
the groups are of different sizes.
 In another variation, the number of objects drawn from each group is proportional to the
size of that group.

Progressive Sampling
 If proper sample-size is difficult to determine, then progressive sampling can be used.
 This method starts with a small sample, and then increases the sample-size until a sample
of sufficient size has been obtained.
 This method requires a way to evaluate the sample to judge if it is large enough.

22 Write a short note on the following: 6M


i) Dimensionality reduction ii) Variable transformation iii) Feature selection

i. Dimensionality Reduction
 Key benefit: many DM algorithms work better if the dimensionality is lower.


 Benefits:
May help to eliminate irrelevant features or reduce noise.
Can lead to a more understandable model (which can be easily visualized).
Reduce amount of time and memory required by DM algorithms.
Avoid curse of dimensionality.
 The Curse of Dimensionality
 Data-analysis becomes significantly harder as the dimensionality of the data increases.
 For classification, this can mean that there are not enough data-objects to allow the
creation of a model that reliably assigns a class to all possible objects.
 For clustering, the definitions of density and the distance between points (which are
critical for clustering) become less meaningful.
 As a result, we get reduced classification accuracy & poor quality clusters.

ii. Variable Transformation


 This refers to a transformation that is applied to all the values of a variable.
 Ex: converting a floating point value to an absolute value.
 Two types are:
1) Simple Functions
• A simple mathematical function is applied to each value individually.
• If x is a variable, then examples of transformations include e^x, 1/x, log(x), sin(x).
2) Normalization (or Standardization)
• The goal is to make an entire set of values have a particular property.
• A traditional example is that of "standardizing a variable" in statistics.
• If x̄ is the mean of the attribute values and sx is their standard deviation, then the transformation x' = (x − x̄)/sx creates a new variable that has a mean of 0 and a standard deviation of 1.

iii. Feature Subset Selection

 Another way to reduce the dimensionality is to use only a subset of the features.
 It might seem that such an approach would lose information; however, this is not the case if redundant and irrelevant features are present.
 Redundant features duplicate much or all of the information contained in one or more other attributes. For example: purchase price of a product and the amount of sales tax paid.
 Irrelevant features contain almost no useful information for the DM task at hand. For
example: students' ID numbers are irrelevant to the task of predicting students' grade
point averages

 Techniques for Feature Selection


Embedded approaches: Feature selection occurs naturally as part of DM algorithm.
Specifically, during the operation of the DM algorithm, the algorithm itself decides
which attributes to use and which to ignore.
Filter approaches: Features are selected before the DM algorithm is run, using some
approach that is independent of the data mining task.
Wrapper approaches: these methods use the DM algorithm as a black box to find best
subset of attributes.


23 Distinguish between: i) SMC & Jaccard coefficient ii) Discretization & Binarization. 6M

 SIMPLE MATCHING COEFFICIENT

 One commonly used similarity coefficient is the SMC, which is defined as

SMC = (f11 + f00) / (f00 + f01 + f10 + f11)

where f11 is the number of attributes where both x and y are 1, f00 the number where both are 0, and f01, f10 the numbers where they differ.
 This measure counts both presences and absences equally.

 JACCARD COEFFICIENT

 Jaccard coefficient is frequently used to handle objects consisting of asymmetric binary attributes.
 The Jaccard coefficient is given by the following equation:

J = f11 / (f01 + f10 + f11)
24 Explain various distances used for measuring dissimilarity between objects. 6M

Distances
 The Euclidean distance, d, between two points, x and y, in one-, two-, three- or higher-dimensional space, is given by the following familiar formula:

d(x, y) = sqrt( Σ k=1..n (xk − yk)^2 )        ... (2.1)

where n = number of dimensions, and xk and yk are, respectively, the kth attributes of x and y.

 The Euclidean distance measure given in equation 2.1 is generalized by the Minkowski distance metric, given by

d(x, y) = ( Σ k=1..n |xk − yk|^r )^(1/r)

where r is a parameter.


The following are the three most common examples of Minkowski distance:
 r = 1: City block (Manhattan, L1 norm) distance. A common example is the Hamming distance, which is the number of bits that are different between two objects that have only binary attributes, i.e., between two binary vectors.
 r = 2: Euclidean distance (L2 norm).
 r = ∞: Supremum (L∞ or Lmax norm) distance. This is the maximum difference between any attribute of the objects. It is defined by

d(x, y) = max over k = 1..n of |xk − yk|
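A small Python sketch (not in the original notes) computing these distances for two hypothetical points.

    def minkowski(x, y, r):
        return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

    x, y = [1, 0, 3], [4, 4, 1]
    manhattan = minkowski(x, y, 1)                         # r = 1, city block
    euclidean = minkowski(x, y, 2)                         # r = 2
    supremum  = max(abs(a - b) for a, b in zip(x, y))      # r -> infinity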

25 Explain various distances used for measuring similarity between objects. 6M


Let x and y be two objects that consist of n binary attributes. The comparison of two such objects, i.e., two binary vectors, leads to the following four quantities (frequencies):

f00 = the number of attributes where x is 0 and y is 0
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1


Simple Matching Coefficient (SMC):

SMC = (number of matching attribute values) / (number of attributes) = (f11 + f00) / (f00 + f01 + f10 + f11)

Jaccard Coefficient (J):

J = f11 / (f01 + f10 + f11)

Cosine Similarity:
If x and y are two document vectors, then

cos(x, y) = (x · y) / (||x|| ||y||)

where · indicates the vector dot product and ||x|| is the length (Euclidean norm) of vector x.
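For reference, a plain-Python sketch (not in the notes) of the three measures; binary vectors are assumed for SMC and Jaccard.

    import math

    def smc(x, y):
        matches = sum(1 for a, b in zip(x, y) if a == b)            # f11 + f00
        return matches / len(x)

    def jaccard(x, y):
        f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
        non00 = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)    # f01 + f10 + f11
        return f11 / non00 if non00 else 0.0

    def cosine(x, y):
        dot = sum(a * b for a, b in zip(x, y))
        return dot / (math.sqrt(sum(a*a for a in x)) * math.sqrt(sum(b*b for b in y)))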

26 Find the cosine similarity for the given two vectors: 6M
X = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
Y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
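Solution (worked out from the cosine formula above):
x · y = 3(1) + 2(0) + 0(0) + 5(0) + 0 + 0 + 0 + 2(1) + 0(0) + 0(2) = 5
||x|| = sqrt(3^2 + 2^2 + 5^2 + 2^2) = sqrt(42) ≈ 6.48
||y|| = sqrt(1^2 + 1^2 + 2^2) = sqrt(6) ≈ 2.45
cos(x, y) = 5 / (6.48 × 2.45) ≈ 0.31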

27 Consider the following two binary vectors: 6M
X = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
Find i) SMC ii) Jaccard Coefficient
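Solution (worked out from the SMC and Jaccard definitions above):
f01 = 2 (x = 0, y = 1), f10 = 1 (x = 1, y = 0), f00 = 7, f11 = 0
SMC = (f11 + f00) / (f00 + f01 + f10 + f11) = (0 + 7) / 10 = 0.7
J = f11 / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0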
