
18CS641: Data Mining and Data Warehousing


Module II : DATA WAREHOUSE IMPLEMENTATION AND
DATA MINING

By
Mrs. Hema N
ASSISTANT PROFESSOR
DEPARTMENT OF INFORMATION SCIENCE AND
ENGINEERING
RNS INSTITUTE OF TECHNOLOGY
CHANNASANDRA, RR NAGAR POST
Dr. VISHNUVARDHAN ROAD
BENGALURU – 560 098
hema.n@rnsit.ac.in
DATA WAREHOUSE IMPLEMENTATION AND DATA MINING

• Efficient Data Cube Computation: An Overview


At the core of multidimensional data analysis is the efficient computation of aggregations
across many sets of dimensions. In SQL terms, these aggregations are referred to as group-by’s.

Each group-by can be represented by a cuboid, where the set of group-by’s forms a lattice
of cuboids defining a data cube.

The compute cube Operator and the Curse of Dimensionality


One approach to cube computation extends SQL so as to include a compute cube operator.
The compute cube operator computes aggregates over all subsets of the dimensions.

5/29/2021 18CS641- Data Mining and Data Warehousing 2


• Example: A data cube is a lattice of cuboids. Suppose that you want to create a data cube for
AllElectronics sales that contains the following: city, item, year, and sales in dollars. You want to be able
to analyze the data, with queries such as the following:
“Compute the sum of sales, grouping by city and item.”
“Compute the sum of sales, grouping by city.”
“Compute the sum of sales, grouping by item.”

Taking the three attributes, city, item, and year, as the dimensions for the data cube, and sales in dollars
as the measure, the total number of cuboids, or group-by’s, that can be computed for this data cube is 2^3 = 8.

The possible group-by’s are the following: {(city, item, year), (city, item), (city, year), (item, year), (city), (item),
(year),()}, where () means that the group-by is empty (i.e., the dimensions are not grouped). These group-by’s
form a lattice of cuboids for the data cube, as shown in Figure.
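The eight group-by’s are just the subsets of {city, item, year}. A small sketch (the helper name `cuboids` is my own, not from the text) enumerates the lattice from the base cuboid down to the apex:

```python
from itertools import combinations

def cuboids(dims):
    """Enumerate every group-by (cuboid) over a list of dimensions."""
    lattice = []
    for k in range(len(dims), -1, -1):      # base cuboid (all dims) down to apex ()
        for combo in combinations(dims, k):
            lattice.append(combo)
    return lattice

lattice = cuboids(["city", "item", "year"])
print(len(lattice))                          # 8 = 2^3 cuboids
```

The empty tuple `()` in the output corresponds to the apex cuboid, where nothing is grouped.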



The base cuboid contains all three dimensions, city, item, and year. It can return the total sales for any combination of
the three dimensions. The apex cuboid, or 0-D cuboid, refers to the case where the group-by is empty. It contains the
total sum of all sales.

The base cuboid is the least generalized (most specific) of the cuboids. The apex cuboid is the most generalized (least
specific) of the cuboids, and is often denoted as all. If we start at the apex cuboid and explore downward in the lattice,
this is equivalent to drilling down within the data cube. If we start at the base cuboid and explore upward, this is akin to
rolling up.



The “Compute Cube” Operator
• Cube definition and computation in DMQL
define cube sales_cube [item, city, year]: sum (sales_in_dollars)
compute cube sales_cube

• A major challenge related to this precomputation, however, is that the required storage space may
explode if all of the cuboids in a data cube are precomputed, especially when the cube has many
dimensions. The storage requirements are even more excessive when many of the dimensions have
associated concept hierarchies, each with multiple levels. This problem is referred to as the curse of
dimensionality.

Lattice of cuboids for the dimensions city, item, and year:

()
(city) (item) (year)
(city, item) (city, year) (item, year)
(city, item, year)



• “How many cuboids are there in an n-dimensional data cube?”
For example, time is usually explored not at only one conceptual level (e.g., year), but rather at
multiple conceptual levels such as in the hierarchy “day < month < quarter < year.” For an n-
dimensional data cube, the total number of cuboids that can be generated (including the cuboids
generated by climbing up the hierarchies along each dimension) is

T = ∏_{i=1}^{n} (L_i + 1)

where L_i is the number of levels associated with dimension i. One is added to L_i to include the virtual top
level, all.

This formula is based on the fact that, at most, one abstraction level in each
dimension will appear in a cuboid.
For example, the time dimension as specified before has four conceptual levels, or
five if we include the virtual level all. If the cube has 10 dimensions and each
dimension has five levels (including all), the total number of cuboids that can be
generated is 5^10 = 9,765,625.
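The formula T = ∏(L_i + 1) can be checked directly; the function name `total_cuboids` below is illustrative, not from the text:

```python
from math import prod

def total_cuboids(levels):
    """levels[i] = number of hierarchy levels of dimension i, excluding the virtual level 'all'."""
    return prod(L + 1 for L in levels)       # +1 per dimension for the virtual level 'all'

print(total_cuboids([4]))        # time alone: day < month < quarter < year, plus 'all' -> 5
print(total_cuboids([4] * 10))   # 10 dimensions, 5 levels each (including 'all') -> 5^10
```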
The size of each cuboid also depends on the cardinality (i.e., number of distinct
values) of each dimension.

For example, if the AllElectronics branch in each city sold every item, there would
be |city| x |item| tuples in the city_item group-by alone.

As the number of dimensions, number of conceptual hierarchies, or cardinality
increases, the storage space required for many of the group-by’s will grossly exceed
the size of the input relation. Hence, it is unrealistic to precompute and materialize
all of the cuboids that can possibly be generated for a data cube.

If there are many cuboids, and these cuboids are large in size, a more reasonable option is partial materialization;
that is, to materialize only some of the possible cuboids that can be generated.



A popular approach is to materialize the set of cuboids on which other frequently
referenced cuboids are based. Alternatively, we can compute an iceberg cube,
which is a data cube that stores only those cube cells with an aggregate value (e.g.,
count) that is above some minimum support threshold.

Another common strategy is to materialize a shell cube. This involves precomputing


the cuboids for only a small number of dimensions (e.g., three to five) of a data
cube.
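As a rough illustration of the iceberg-cube idea, the sketch below keeps only cells whose count reaches a minimum support threshold; the function name, cell representation, and toy data are all hypothetical, not from the text:

```python
from itertools import combinations
from collections import Counter

def iceberg_cube(rows, dims, min_support):
    """Count every cube cell, then keep only cells meeting min_support."""
    cells = Counter()
    for row in rows:
        for k in range(len(dims) + 1):              # every group-by, apex to base
            for combo in combinations(dims, k):
                cells[(combo, tuple(row[d] for d in combo))] += 1
    return {cell: n for cell, n in cells.items() if n >= min_support}

sales = [{"city": "Vancouver", "item": "TV"},
         {"city": "Vancouver", "item": "TV"},
         {"city": "Toronto",   "item": "TV"}]
cube = iceberg_cube(sales, ["city", "item"], min_support=2)
print(len(cube))     # cells below the threshold (e.g., Toronto alone) are pruned
```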

• Once the selected cuboids have been materialized, it is important to take


advantage of them during query processing.
• Finally, during load and refresh, the materialized cuboids should be updated
efficiently.



Efficient Data Cube Computation
• Data cube can be viewed as a lattice of cuboids
• The bottom-most cuboid is the base cuboid
• The top-most cuboid (apex) contains only one cell
• How many cuboids in an n-dimensional cube with L levels?
T = ∏_{i=1}^{n} (L_i + 1)
• Materialization of data cube
• Materialize every cuboid (full materialization),
• none (no materialization), or
• some (partial materialization): selectively compute a proper subset of the whole set of
possible cuboids. Alternatively, we may compute a subcube, which contains only those cells
that satisfy some user-specified criterion, such as requiring that the tuple count of each cell be
above some threshold.



• The partial materialization of cuboids or subcubes should consider three factors:
(1) identify the subset of cuboids or subcubes to materialize; (2) exploit the materialized
cuboids or subcubes during query processing; and (3) efficiently update the materialized
cuboids or subcubes during load and refresh.

(1) The selection of the subset of cuboids or subcubes to materialize should take into
account the queries in the workload, their frequencies, and their accessing costs.
In addition, it should consider workload characteristics, the cost for incremental updates,
and the total storage requirements.
The selection must also consider the broad context of physical database design such as
the generation and selection of indices.



• A popular approach is to materialize the set of cuboids on which other frequently referenced
cuboids are based. Alternatively, we can compute an iceberg cube, which is a data cube
that stores only those cube cells with an aggregate value (e.g., count) that is above some
minimum support threshold.

• Another common strategy is to materialize a shell cube. This involves precomputing the
cuboids for only a small number of dimensions (e.g., three to five) of a data cube.

(2) Once the selected cuboids have been materialized, it is important to take advantage of
them during query processing. This involves several issues, such as how to determine the
relevant cuboid(s) from among the candidate materialized cuboids, how to use available
index structures on the materialized cuboids, and how to transform the OLAP operations
onto the selected cuboid(s).

(3) Finally, during load and refresh, the materialized cuboids should be updated efficiently.



Indexing OLAP Data: Bitmap Index
To facilitate efficient data accessing, most data warehouse systems support index structures
and materialized views (using cuboids).
The bitmap indexing method is popular in OLAP products because it allows quick searching in
data cubes.
The bitmap index is an alternative representation of the record ID (RID) list. In the bitmap
index for a given attribute, there is a distinct bit vector, Bv, for each value v in the attribute’s
domain.
If a given attribute’s domain consists of n values, then n bits are needed for each entry in the
bitmap index.
If the attribute has the value v for a given row in the data table, then the bit representing that
value is set to 1 in the corresponding row of the bitmap index. All other bits for that row are
set to 0.
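A minimal sketch of this idea (the helper name `bitmap_index` is my own): one bit vector per distinct value, one bit per row.

```python
def bitmap_index(values):
    """Build one bit vector per distinct attribute value, with one bit per row."""
    index = {v: [0] * len(values) for v in set(values)}
    for row, v in enumerate(values):
        index[v][row] = 1        # this row has value v; all other bits stay 0
    return index

city = ["V", "V", "T", "V", "T"]   # toy attribute column, 5 rows, 2 distinct values
idx = bitmap_index(city)
print(idx["V"])   # [1, 1, 0, 1, 0]
```

With n distinct values, each row contributes exactly one 1 across the n bit vectors, matching the n-bits-per-entry description above.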



Indexing OLAP Data: Join Indices
• join indexing registers the joinable rows of two relations from a relational database.
• For example, if two relations R(RID, A) and S(B, SID) join on the attributes A and B, then
the join index record contains the pair (RID, SID), where RID and SID are record
identifiers from the R and S relations, respectively.
• Hence, the join index records can identify joinable tuples without performing costly join
operations.
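The same (RID, SID) idea can be sketched in a few lines; the relation layout and names below are illustrative:

```python
def join_index(R, S):
    """R: (rid, a) pairs; S: (b, sid) pairs. Record (rid, sid) for rows joinable on a = b."""
    by_b = {}
    for b, sid in S:
        by_b.setdefault(b, []).append(sid)
    return [(rid, sid) for rid, a in R for sid in by_b.get(a, [])]

# dimension rows join to fact-table tuples without a full join
location = [("L1", "Main Street"), ("L2", "Park Ave")]
sales    = [("Main Street", "T57"), ("Main Street", "T238"), ("Park Ave", "T459")]
print(join_index(location, sales))   # [('L1', 'T57'), ('L1', 'T238'), ('L2', 'T459')]
```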



For example, the “Main Street” value in the location dimension table joins with tuples T57, T238, and
T884 of the sales fact table.
Similarly, the “Sony-TV” value in the item dimension table joins with tuples T57 and T459 of the sales
fact table.
The corresponding join index tables are shown in Figure.



Efficient Processing of OLAP Queries
• Determine which operations should be performed on the available cuboids
For example, slicing and dicing a data cube may correspond to selection and/or projection operations on
a materialized cuboid.
• Determine to which materialized cuboid(s) the relevant operations should be applied

Which should be selected to process the query?



• Cuboid 2 cannot be used because country is a more general concept than province or state. Cuboids 1,
3, and 4 can be used to process the query because
(1) they have the same set of the dimensions in the query, (2) the selection clause in the query can imply
the selection in the cuboid, and (3) the abstraction levels for the item and location dimensions in
these cuboids are at a finer level than brand and province or state, respectively.

• “How would the costs of each cuboid compare if used to process the query?” It is likely that using cuboid
1 would cost the most because both item name and city are at a lower level than the brand and
province or state concepts specified in the query.

cuboid 3 will be smaller than cuboid 4, and thus cuboid 3 should be chosen to process the query.
Some cost-based estimation is required to decide which set of cuboids should be selected for
query processing.



OLAP Server Architectures: ROLAP versus MOLAP versus
HOLAP

• OLAP servers present business users with multidimensional data from data warehouses or data
marts, without concerns regarding how or where the data are stored.
• However, the physical architecture and implementation of OLAP servers must consider data
storage issues. Implementations of a warehouse server for OLAP processing include the
following:
• Relational OLAP (ROLAP)
• Use relational or extended-relational DBMS to store and manage warehouse
data
• Includes optimization of the DBMS backend, implementation of aggregation
navigation logic, and additional tools and services; manages volatile data
• Greater scalability



• Multidimensional OLAP (MOLAP)
• These servers support multidimensional views of data through array-based multidimensional storage engines.
• They map multidimensional views directly to data cube array structures.
• The advantage of using a cube is that it allows fast indexing to precomputed summarized data.
• Fast query processing

• Hybrid OLAP (HOLAP) The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the
greater scalability of ROLAP and the faster computation of MOLAP.
• For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while
aggregations are kept in a separate MOLAP store. The Microsoft SQL Server 2000 supports a hybrid OLAP server.

• Specialized SQL servers: provide an advanced query language and query processing support.



What is Data Mining?
• Many Definitions
• Non-trivial extraction of implicit, previously unknown and
potentially useful information from data
• Exploration & analysis, by automatic or semi-automatic
means, of large quantities of data in order to discover
meaningful patterns



Motivating Challenges
• Scalability

• High Dimensionality

• Heterogeneous and Complex Data

• Data Ownership and Distribution

• Non-traditional Analysis



Data Mining Tasks …

Data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single           70K            No
4    Yes     Married         120K            No
5    No      Divorced         95K            Yes
6    No      Married          60K            No
7    Yes     Divorced        220K            No
8    No      Single           85K            Yes
9    No      Married          75K            No
10   No      Single           90K            Yes
11   No      Married          60K            No
12   Yes     Divorced        220K            No
13   No      Single           85K            Yes
14   No      Married          75K            No
15   No      Single           90K            Yes

Example pattern: Milk → Bread (an association)


• Data mining tasks are generally divided into two major categories:
Predictive tasks:
Use some variables to predict unknown or future values of other variables.
Ex: from the behavior of one variable we can predict the value of another
variable.
The attribute to be predicted is called the target or dependent variable.
Attributes used for making the prediction are called explanatory or independent
variables.
Descriptive tasks:
Here the objective is to derive patterns (correlations, anomalies, cluster etc..) that
summarize the relationships in data.
Descriptive tasks often require postprocessing of the data to validate and explain the results.



• Four of the Core data Mining tasks:
1) Predictive Modelling
2) Association analysis
3) Cluster analysis
4) Anomaly detection
• Predictive Modeling: refers to the task of building a model for the target variable as a function of the
explanatory variables.
• There are two types of predictive modeling tasks:
1) Classification: used for discrete target variables
Ex: predicting whether a Web user will make a purchase at an online bookstore is a classification task,
because the target variable is binary valued.
2) Regression: used for continuous target variables.
Ex: forecasting the future price of a stock is a regression task because price is a continuous-valued attribute.



• Association Analysis: a useful application of association analysis is to find groups of data
that have related functionality.
• The goal of association analysis is to extract the most interesting patterns in
an efficient manner.
• Ex: market basket analysis:
• We may discover the rule {Diapers} → {Milk}, which suggests that customers
who buy diapers also tend to buy milk.
• Cluster Analysis: clustering has been used to group sets of related customers.
• Ex: in a collection of news articles shown in the table below, the first three rows speak about the
economy and the next three speak about the health sector. A good clustering
algorithm should be able to identify these two clusters based on the similarity
between words that appear in the articles.



Anomaly Detection: is the task of identifying observations whose characteristics are
significantly different from the rest of the data. Such observations are known as anomalies or
outliers.



What is Data?
• A dataset is a collection of data objects
• An attribute is a property or characteristic of an object
• Examples: eye color of a person, temperature, etc.
• Attribute is also known as variable, field, characteristic, dimension, or feature
• A collection of attributes describes an object
• Object is also known as record, point, case, sample, entity, or instance

In the table below, each row is an object and each column is an attribute:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single           70K            No
4    Yes     Married         120K            No
5    No      Divorced         95K            Yes
6    No      Married          60K            No
7    Yes     Divorced        220K            No
8    No      Single           85K            Yes
9    No      Married          75K            No
10   No      Single           90K            Yes
Attribute Values
• Attribute values are numbers or symbols assigned to an
attribute for a particular object

• Distinction between attributes and attribute values


• Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters

• Different attributes can be mapped to the same set of values


• Example: Attribute values for ID and age are integers



Properties of Attribute Values
• The type of an attribute depends on which of the following
properties/operations it possesses:
• Distinctness: =, ≠
• Order: <, >
• Addition: +, −
• Multiplication: *, /



Attribute Type | Description | Examples | Operations

Categorical (Qualitative)
• Nominal: nominal attribute values only distinguish (=, ≠).
  Examples: zip codes, employee ID numbers, eye color, sex: {male, female}.
  Operations: mode, entropy, contingency correlation, χ² test
• Ordinal: ordinal attribute values also order objects (<, >).
  Examples: hardness of minerals, {good, better, best}, grades, street numbers.
  Operations: median, percentiles, rank correlation, run tests, sign tests

Numeric (Quantitative)
• Interval: for interval attributes, differences between values are meaningful (+, −).
  Examples: calendar dates, temperature in Celsius or Fahrenheit.
  Operations: mean, standard deviation, Pearson’s correlation, t and F tests
• Ratio: for ratio variables, both differences and ratios are meaningful (*, /).
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, current.
  Operations: geometric mean, harmonic mean, percent variation

This categorization of attributes is due to S. S. Stevens
Attribute Type | Transformation | Comments

Categorical (Qualitative)
• Nominal: any permutation of values.
  If all employee ID numbers were reassigned, would it make any difference?
• Ordinal: an order-preserving change of values, i.e., new_value = f(old_value),
  where f is a monotonic function.
  An attribute encompassing the notion of good, better, best can be represented
  equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Numeric (Quantitative)
• Interval: new_value = a * old_value + b, where a and b are constants.
  Thus, the Fahrenheit and Celsius temperature scales differ in terms of where
  their zero value is and the size of a unit (degree).
• Ratio: new_value = a * old_value.
  Length can be measured in meters or feet.

This categorization of attributes is due to S. S. Stevens



Discrete and Continuous Attributes
• Discrete Attribute
• Has only a finite or countably infinite set of values
• Examples: zip codes, counts, or the set of words in a collection of documents
• Often represented as integer variables.
• Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
• Has real numbers as attribute values
• Examples: temperature, height, or weight.
• Continuous attributes are typically represented as floating-point variables.

• Asymmetric Attributes
• Only presence (a non-zero attribute value) is regarded as important
• Words present in documents
• Items present in customer transactions
Important Characteristics of Data

• Dimensionality (number of attributes)


• High dimensional data brings a number of challenges
• Curse of dimensionality

• Sparsity
• Only presence counts (non zero)
• Saves computation time and storage

• Resolution
• Data are different at different resolutions
• Ex: surface of the earth



Types of data sets
• Record
• Data Matrix
• Document Data
• Transaction Data
• Graph
• World Wide Web
• Molecular Structures
• Ordered
• Sequential Data
• Genetic Sequence Data
• Time series data
• Spatial data



Record Data
• Data that consists of a collection of records, each of
which consists of a fixed set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat

1 Yes Single 125K No


2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes



Data Matrix
• If data objects have the same fixed set of numeric attributes,
then the data objects can be thought of as points in a multi-
dimensional space, where each dimension represents a distinct
attribute

• Such a data set can be represented by an m by n matrix, where
there are m rows, one for each object, and n columns, one for
each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1


Document Data
• Each document becomes a ‘term’ vector
• Each term is a component (attribute) of the vector
• The value of each component is the number of times the corresponding term
occurs in the document.

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3      0     5     0      2     6    0     2        0       2
Document 2      0      7     0     2      1     0    0     3        0       0
Document 3      0      1     0     0      1     2    2     0        3       0


Transaction Data
• A special type of data, where
• Each transaction involves a set of items.
• For example, consider a grocery store. The set of products purchased
by a customer during one shopping trip constitute a transaction, while
the individual products that were purchased are the items.
• Can represent transaction data as record data

TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
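The table above can be represented as record data by turning each transaction into a binary vector over the sorted item set; this is one possible encoding, sketched here, not a prescribed one:

```python
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}
items = sorted(set().union(*transactions.values()))   # fixed attribute order
records = {tid: [1 if item in basket else 0 for item in items]
           for tid, basket in transactions.items()}
print(items)        # ['Beer', 'Bread', 'Coke', 'Diaper', 'Milk']
print(records[1])   # [0, 1, 1, 0, 1]
```

Each transaction now has the same fixed set of (asymmetric binary) attributes, which is exactly the record-data form.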
Graph Data
• Examples: Generic graph, a molecule, and webpages


Benzene Molecule: C6H6


Ordered Data
• Sequences of transactions
• Time series data



Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG



Ordered Data

• Spatio-Temporal Data

Average Monthly
Temperature of
land and ocean



Data Quality
• Poor data quality negatively affects many data processing efforts

• Data mining example: a classification model for detecting people who are loan
risks is built using poor data
• Some credit-worthy candidates are denied loans
• More loans are given to individuals that default



Noise
• For objects, noise is an extraneous object
• For attributes, noise refers to modification of original values
• Examples: distortion of a person’s voice when talking on a poor phone
• The figures below show two sine waves of the same magnitude and different
frequencies, the waves combined, and the two sine waves with random
noise
• The magnitude and shape of the original signal is distorted



Outliers
• Outliers are data objects with characteristics that are considerably
different from those of most of the other data objects in the data set
• Case 1: Outliers are
noise that interferes
with data analysis
• Case 2: Outliers are
the goal of our analysis
• Credit card fraud
• Intrusion detection



Missing Values
• Reasons for missing values
• Information is not collected
(e.g., people decline to give their age and weight)
• Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

• Handling missing values


• Eliminate data objects or variables
• Estimate missing values
• Example: time series of temperature
• Example: census results
• Ignore the missing value during analysis



Duplicate Data
• Data set may include data objects that are duplicates, or almost
duplicates of one another
• Major issue when merging data from heterogeneous sources

• Examples:
• Same person with multiple email addresses

• Data cleaning
• Process of dealing with duplicate data issues



Data Preprocessing
The following preprocessing steps can be applied to make data more suitable for data
mining:
• Aggregation
• Sampling
• Discretization and Binarization
• Attribute Transformation
• Dimensionality Reduction
• Feature subset selection
• Feature creation



Aggregation
• Combining two or more attributes (or objects) into a single attribute
(or object)
• Purpose
• Data reduction - reduce the number of attributes or objects
• Change of scale
• Cities aggregated into regions, states, countries, etc.
• Days aggregated into weeks, months, or years
• More “stable” data - aggregated data tends to have less variability



Sampling
• Sampling is the main technique employed for data
reduction.
• It is often used for both the preliminary investigation of the
data and the final data analysis.

• Statisticians often sample because obtaining the entire


set of data of interest is too expensive or time
consuming.

• Sampling is typically used in data mining because


processing the entire set of data of interest is too
expensive or time consuming.



Sampling …
• The key principle for effective sampling is the following:

• Using a sample will work almost as well as using the entire data set, if the
sample is representative

• A sample is representative if it has approximately the same properties (of


interest) as the original set of data



Sample Size
• Example: the same data set drawn at 8000 points, 2000 points, and 500 points



Types of Sampling
• Simple Random Sampling
• There is an equal probability of selecting any particular item
• Sampling without replacement
• As each item is selected, it is removed from the population
• Sampling with replacement
• Objects are not removed from the population as they are selected
for the sample.
• In sampling with replacement, the same object can be picked up
more than once
• Stratified sampling
• Split the data into several partitions; then draw random samples
from each partition
• Progressive Sampling-small to increased sample size
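The sampling schemes above can be sketched with the standard library; the seed, population, and strata below are purely illustrative:

```python
import random

random.seed(0)
population = list(range(100))

no_repl   = random.sample(population, 10)                   # without replacement
with_repl = [random.choice(population) for _ in range(10)]  # with replacement

# stratified sampling: split into partitions, draw from each one separately
strata = {"low": list(range(50)), "high": list(range(50, 100))}
stratified = [x for part in strata.values() for x in random.sample(part, 5)]
```

`random.sample` cannot pick the same object twice, while `random.choice` in a loop can, which is exactly the with/without-replacement distinction.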



Discretization
• Discretization is the process of converting continuous attributes into categorical
attributes.

• Discretization of continuous attributes


In general, the best discretization depends on
→the algorithm being used, as well as
→the other attributes being considered
Transformation of a continuous attribute to a categorical attribute involves two subtasks:
→deciding how many categories to have and
→determining how to map the values of the continuous attribute to these
categories
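A minimal sketch of the second subtask, assuming equal-width intervals (one of several possible mappings; the function name and data are my own):

```python
def equal_width_bins(values, k):
    """Map continuous values to k categories (0..k-1) of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # clamp the maximum value into the last bin instead of bin k
    return [min(int((v - lo) / width), k - 1) for v in values]

ages = [3, 11, 25, 37, 44, 58, 62, 80]
print(equal_width_bins(ages, 4))   # [0, 0, 1, 1, 2, 2, 3, 3]
```

Here k answers "how many categories," and the interval boundaries answer "how to map values to them."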



Binarization
• Binarization maps a continuous or discrete attribute into one or
more binary variables



Attribute Transformation
• An attribute transform is a function that maps the entire set of
values of a given attribute to a new set of replacement values
such that each old value can be identified with one of the new
values
• Simple functions: x^k, log(x), e^x, |x|
• Normalization (or standardization)
• Refers to various techniques to adjust differences among
attributes in terms of frequency of occurrence, mean, variance,
range
• In statistics, standardization refers to subtracting off the means
and dividing by the standard deviation
• Transformation creates a new variable that has a mean of 0 and a
standard deviation of 1.
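The standardization described above, sketched directly (population standard deviation assumed; the function name is my own):

```python
def standardize(xs):
    """z-score: subtract the mean, divide by the (population) standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    std = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [(x - mean) / std for x in xs]

z = standardize([2.0, 4.0, 6.0, 8.0])
print(z)   # the transformed values have mean 0 and standard deviation 1
```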
Curse of Dimensionality
• When dimensionality increases, data becomes increasingly sparse
in the space that it occupies

• Definitions of density and distance between points, which are


critical for clustering and outlier detection, become less meaningful

• As a result: reduced classification accuracy and poor-quality clusters


Dimensionality Reduction
• Purpose:
• Avoid curse of dimensionality
• Reduce amount of time and memory required by data mining algorithms
• Allow data to be more easily visualized
• May help to eliminate irrelevant features or reduce noise

• Techniques
• Principal Components Analysis (PCA)
• Singular Value Decomposition
• Others: supervised and non-linear techniques



Feature Subset Selection
• Another way to reduce dimensionality of data
• Redundant features
• Duplicate much or all of the information contained in one or more other
attributes
• Example: purchase price of a product and the amount of sales tax paid
• Irrelevant features
• Contain no information that is useful for the data mining task at hand
• Example: students' ID is often irrelevant to the task of predicting students'
GPA



Techniques for Feature Selection
1) Embedded approaches: Feature selection occurs naturally as part of DM algorithm. Specifically,
during the operation of the DM algorithm, the algorithm itself decides which Attributes to use
and which to ignore.

2) Filter approaches: Features are selected before the DM algorithm is run.

3) Wrapper approaches: Use DM algorithm as a black box to find best subset of attributes.



Architecture for Feature Subset Selection
• The feature selection process is viewed as consisting of four parts:
1) A measure for evaluating a subset,
2) A search strategy that controls the generation of a new subset of features,
3) A stopping criterion, and
4) A validation procedure.



Similarity and Dissimilarity Measures
• Similarity measure
• Numerical measure of how alike two data objects are.
• Is higher when objects are more alike.
• Often falls in the range [0,1]
• Dissimilarity measure
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Proximity refers to a similarity or dissimilarity



Similarity/Dissimilarity for Simple Attributes
The following table shows the similarity and dissimilarity
between two objects, x and y, with respect to a single, simple
attribute.

Attribute Type    Dissimilarity                          Similarity
Nominal           d = 0 if x = y, 1 if x ≠ y             s = 1 if x = y, 0 if x ≠ y
Ordinal           d = |x − y| / (n − 1)                  s = 1 − d
                  (values mapped to integers 0 to n−1,
                  where n is the number of values)
Interval/Ratio    d = |x − y|                            s = −d, s = 1/(1 + d), etc.

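These per-attribute-type rules can be written directly as small functions; a minimal sketch (the helper names are illustrative, not a library API):

```python
def nominal_sim(x, y):
    # nominal: similarity is 1 on an exact match, 0 otherwise
    return 1 if x == y else 0

def ordinal_dissim(x, y, n_values):
    # ordinal: values assumed mapped to integers 0 .. n_values-1
    return abs(x - y) / (n_values - 1)

def interval_dissim(x, y):
    # interval/ratio: plain absolute difference
    return abs(x - y)

print(nominal_sim("red", "blue"))    # 0
print(ordinal_dissim(0, 2, 3))       # 1.0 (opposite ends of a 3-value scale)
print(interval_dissim(35.0, 38.5))   # 3.5
```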


Euclidean Distance

• Euclidean Distance:

    d(x, y) = sqrt( Σ_{k=1}^{n} (x_k − y_k)² )

where n is the number of dimensions (attributes) and x_k
and y_k are, respectively, the kth attributes (components) of
data objects x and y.
Euclidean Distance: Example

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

Distance Matrix
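The distance matrix for the four points above can be reproduced with a short pure-Python sketch:

```python
import math

def euclidean(x, y):
    """Euclidean (L2) distance between two equal-length tuples."""
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}
for a in points:
    # each row of the distance matrix, rounded to 3 decimals
    print(a, [round(euclidean(points[a], points[b]), 3) for b in points])
```

For example, d(p1, p2) = sqrt((0−2)² + (2−0)²) = sqrt(8) ≈ 2.828, matching the first row of the matrix.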
Minkowski Distance

• Minkowski Distance is a generalization of Euclidean Distance:

    d(x, y) = ( Σ_{k=1}^{n} |x_k − y_k|^r )^(1/r)

where r is a parameter, n is the number of dimensions (attributes), and
x_k and y_k are, respectively, the kth attributes (components) of data
objects x and y.



Minkowski Distance: Examples

• r = 1. City block (Manhattan, taxicab, L1 norm) distance.


• A common example of this for binary vectors is the Hamming
distance, which is just the number of bits that are different
between two binary vectors

• r = 2. Euclidean distance (L2 norm)

• r → ∞. “Supremum” (L_max norm, L_∞ norm) distance.

• This is the maximum difference between any component of the
vectors: d(x, y) = max_k |x_k − y_k|

• Do not confuse r with n, i.e., all these distances are defined for
all numbers of dimensions.



Minkowski Distance

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

L1      p1    p2    p3    p4
p1      0     4     4     6
p2      4     0     2     4
p3      4     2     0     2
p4      6     4     2     0

L2      p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

L∞      p1    p2    p3    p4
p1      0     2     3     5
p2      2     0     1     3
p3      3     1     0     2
p4      5     3     2     0

Distance Matrix
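A sketch of the general Minkowski distance, with r = ∞ handled as the supremum (L_max) case, reproduces the entries of the matrices above:

```python
def minkowski(x, y, r):
    """Minkowski distance of order r; r = inf gives the supremum norm."""
    if r == float("inf"):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

p1, p2, p4 = (0, 2), (2, 0), (5, 1)
print(minkowski(p1, p2, 1))              # L1 (city block): 4.0
print(round(minkowski(p1, p2, 2), 3))    # L2 (Euclidean): 2.828
print(minkowski(p1, p4, float("inf")))   # L_inf (supremum): 5
```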
Common Properties of a Distance
• Distances, such as the Euclidean distance, have
some well known properties.
1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 if and only if
x = y. (Positivity)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.
(Triangle Inequality)
where d(x, y) is the distance (dissimilarity) between points
(data objects) x and y.

• A distance that satisfies these properties is a metric.
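The three metric properties can be checked exhaustively on a small point set; a sketch using the Euclidean distance over the example points:

```python
import itertools
import math

def d(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

pts = [(0, 2), (2, 0), (3, 1), (5, 1)]
for x, y, z in itertools.product(pts, repeat=3):
    assert d(x, y) >= 0                          # positivity
    assert abs(d(x, y) - d(y, x)) < 1e-9         # symmetry
    assert d(x, z) <= d(x, y) + d(y, z) + 1e-9   # triangle inequality
print("all metric properties hold")
```

The small tolerance (1e-9) only guards against floating-point rounding; the properties themselves hold exactly.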


Common Properties of a Similarity
• Similarities, also have some well known
properties.
1. s(x, y) = 1 (or maximum similarity) only if x = y.
(does not always hold, e.g., cosine)
2. s(x, y) = s(y, x) for all x and y. (Symmetry)

where s(x, y) is the similarity between points (data


objects), x and y.



Similarity Between Binary Vectors
• A common situation is that objects, x and y, have only binary
attributes
• Compute similarities using the following quantities
f01 = the number of attributes where x was 0 and y was 1
f10 = the number of attributes where x was 1 and y was 0
f00 = the number of attributes where x was 0 and y was 0
f11 = the number of attributes where x was 1 and y was 1

• Simple Matching and Jaccard Coefficients


SMC = number of matches / number of attributes
= (f11 + f00) / (f01 + f10 + f11 + f00)
J = number of 11 matches / number of non-zero attributes
= (f11) / (f01 + f10 + f11)



SMC versus Jaccard: Example
x= 1000000000
y= 0000001001

f01 = 2 (the number of attributes where x was 0 and y was 1)


f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
f11 = 0 (the number of attributes where x was 1 and y was 1)

SMC = (f11 + f00) / (f01 + f10 + f11 + f00)


= (0+7) / (2+1+0+7) = 0.7

J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0

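The worked example above can be verified with a short sketch that counts the four frequencies and applies both formulas:

```python
def smc_jaccard(x, y):
    """Simple Matching Coefficient and Jaccard for binary vectors."""
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    smc = (f11 + f00) / (f01 + f10 + f11 + f00)
    jac = f11 / (f01 + f10 + f11) if (f01 + f10 + f11) else 0.0
    return smc, jac

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_jaccard(x, y))   # (0.7, 0.0), matching the slide
```

Note how SMC counts the shared 0s (f00 = 7) while Jaccard ignores them, which is why J = 0 here despite SMC = 0.7.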


Cosine Similarity
If x and y are two document vectors, then

    cos(x, y) = (x · y) / (‖x‖ ‖y‖)

where · indicates the vector dot product and ‖x‖ is the length
(Euclidean norm) of vector x.
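A minimal pure-Python sketch of cosine similarity (the document vectors below are made-up term counts for illustration):

```python
import math

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| ||y||)"""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

d1 = [3, 2, 0, 5]   # term counts in document 1 (illustrative)
d2 = [1, 0, 0, 2]   # term counts in document 2 (illustrative)
print(round(cosine_similarity(d1, d2), 4))   # 0.9431
```

Because only the angle between the vectors matters, documents of very different lengths can still score as highly similar.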


Correlation
• Correlation measures the linear relationship between objects.

    corr(x, y) = covariance(x, y) / (std_dev(x) · std_dev(y)) = s_xy / (s_x · s_y)

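Pearson correlation can be sketched directly from the definition, using sample covariance and standard deviations (the data below is made up to show a perfect negative linear relationship):

```python
import math

def correlation(x, y):
    """Pearson correlation: s_xy / (s_x * s_y), using sample statistics."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return sxy / (sx * sy)

x = [-3, 6, 0, 3, -6]
y = [1, -2, 0, -1, 2]     # y = -x/3, a perfect negative linear relationship
print(correlation(x, y))  # ≈ -1.0
```

Correlation ranges from −1 (perfect negative linear relationship) to +1 (perfect positive); a value near 0 means no linear relationship.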
