
18CS641: Data Mining and Data Warehousing


Module II : DATA WAREHOUSE IMPLEMENTATION AND
DATA MINING

By
Mrs. Hema N
ASSISTANT PROFESSOR
DEPARTMENT OF INFORMATION SCIENCE AND
ENGINEERING
RNS INSTITUTE OF TECHNOLOGY
CHANNASANDRA, RR NAGAR POST
Dr. VISHNUVARDHAN ROAD
BENGALURU – 560 098
hema.n@rnsit.ac.in
DATA WAREHOUSE IMPLEMENTATION AND DATA MINING

• Efficient Data Cube Computation: An Overview


At the core of multidimensional data analysis is the efficient computation of aggregations
across many sets of dimensions. In SQL terms, these aggregations are referred to as group-by’s.

Each group-by can be represented by a cuboid, where the set of group-by’s forms a lattice
of cuboids defining a data cube.

The compute cube Operator and the Curse of Dimensionality


One approach to cube computation extends SQL so as to include a compute cube operator.
The compute cube operator computes aggregates over all subsets of the dimensions.

5/29/2021 18CS641- Data Mining and Data Warehousing 2


• Example: A data cube is a lattice of cuboids. Suppose that you want to create a data cube for
AllElectronics sales that contains the following: city, item, year, and sales in dollars. You want to be able
to analyze the data, with queries such as the following:
“Compute the sum of sales, grouping by city and item.”
“Compute the sum of sales, grouping by city.”
“Compute the sum of sales, grouping by item.”

Taking the three attributes, city, item, and year, as the dimensions for the data cube, and sales in dollars
as the measure, the total number of cuboids, or group-by’s, that can be computed for this data cube is 2^3 = 8.

The possible group-by’s are the following: {(city, item, year), (city, item), (city, year), (item, year), (city), (item),
(year),()}, where () means that the group-by is empty (i.e., the dimensions are not grouped). These group-by’s
form a lattice of cuboids for the data cube, as shown in Figure.
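The eight group-by’s are just the subsets of {city, item, year}. A small sketch (the helper name `cuboids` is my own, not from the text) enumerates the lattice from the base cuboid down to the apex:

```python
from itertools import combinations

def cuboids(dims):
    """Enumerate every group-by (cuboid) over a list of dimensions."""
    lattice = []
    for k in range(len(dims), -1, -1):      # base cuboid (all dims) down to apex ()
        for combo in combinations(dims, k):
            lattice.append(combo)
    return lattice

lattice = cuboids(["city", "item", "year"])
print(len(lattice))                          # 8 = 2^3 cuboids
```

The empty tuple `()` in the output corresponds to the apex cuboid, where nothing is grouped.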



The base cuboid contains all three dimensions, city, item, and year. It can return the total sales for any combination of
the three dimensions. The apex cuboid, or 0-D cuboid, refers to the case where the group-by is empty. It contains the
total sum of all sales.

The base cuboid is the least generalized (most specific) of the cuboids. The apex cuboid is the most generalized (least
specific) of the cuboids, and is often denoted as all. If we start at the apex cuboid and explore downward in the lattice,
this is equivalent to drilling down within the data cube. If we start at the base cuboid and explore upward, this is akin to
rolling up.



The “Compute Cube” Operator
• Cube definition and computation in DMQL
define cube sales_cube [item, city, year]: sum (sales_in_dollars)
compute cube sales_cube

• A major challenge related to this precomputation, however, is that the required storage space may
explode if all of the cuboids in a data cube are precomputed, especially when the cube has many
dimensions. The storage requirements are even more excessive when many of the dimensions have
associated concept hierarchies, each with multiple levels. This problem is referred to as the curse of
dimensionality.

Lattice of cuboids for the dimensions city, item, and year:

()
(city) (item) (year)
(city, item) (city, year) (item, year)
(city, item, year)



• “How many cuboids are there in an n-dimensional data cube?”
For example, time is usually explored not at only one conceptual level (e.g., year), but rather at
multiple conceptual levels such as in the hierarchy “day < month < quarter < year.” For an n-
dimensional data cube, the total number of cuboids that can be generated (including the cuboids
generated by climbing up the hierarchies along each dimension) is

T = ∏_{i=1}^{n} (L_i + 1)

where L_i is the number of levels associated with dimension i. One is added to L_i to include the virtual top
level, all.

This formula is based on the fact that, at most, one abstraction level in each
dimension will appear in a cuboid.
For example, the time dimension as specified before has four conceptual levels, or
five if we include the virtual level all. If the cube has 10 dimensions and each
dimension has five levels (including all), the total number of cuboids that can be
generated is 5^10 = 9,765,625.
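The formula T = ∏(L_i + 1) can be checked directly; the function name `total_cuboids` below is illustrative, not from the text:

```python
from math import prod

def total_cuboids(levels):
    """levels[i] = number of hierarchy levels of dimension i, excluding the virtual level 'all'."""
    return prod(L + 1 for L in levels)       # +1 per dimension for the virtual level 'all'

print(total_cuboids([4]))        # time alone: day < month < quarter < year, plus 'all' -> 5
print(total_cuboids([4] * 10))   # 10 dimensions, 5 levels each (including 'all') -> 5^10
```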
The size of each cuboid also depends on the cardinality (i.e., number of distinct
values) of each dimension.

For example, if the AllElectronics branch in each city sold every item, there would
be |city| x |item| tuples in the city_item group-by alone.

As the number of dimensions, number of conceptual hierarchies, or cardinality
increases, the storage space required for many of the group-by’s will grossly exceed
the size of the input relation. Hence, it is unrealistic to precompute and materialize
all of the cuboids that can possibly be generated for a data cube.

If there are many cuboids, and these cuboids are large in size, a more reasonable option is partial materialization;
that is, to materialize only some of the possible cuboids that can be generated.



A popular approach is to materialize the set of cuboids on which other frequently
referenced cuboids are based. Alternatively, we can compute an iceberg cube,
which is a data cube that stores only those cube cells with an aggregate value (e.g.,
count) that is above some minimum support threshold.

Another common strategy is to materialize a shell cube. This involves precomputing


the cuboids for only a small number of dimensions (e.g., three to five) of a data
cube.
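As a rough illustration of the iceberg-cube idea, the sketch below keeps only cells whose count reaches a minimum support threshold; the function name, cell representation, and toy data are all hypothetical, not from the text:

```python
from itertools import combinations
from collections import Counter

def iceberg_cube(rows, dims, min_support):
    """Count every cube cell, then keep only cells meeting min_support."""
    cells = Counter()
    for row in rows:
        for k in range(len(dims) + 1):              # every group-by, apex to base
            for combo in combinations(dims, k):
                cells[(combo, tuple(row[d] for d in combo))] += 1
    return {cell: n for cell, n in cells.items() if n >= min_support}

sales = [{"city": "Vancouver", "item": "TV"},
         {"city": "Vancouver", "item": "TV"},
         {"city": "Toronto",   "item": "TV"}]
cube = iceberg_cube(sales, ["city", "item"], min_support=2)
print(len(cube))     # cells below the threshold (e.g., Toronto alone) are pruned
```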

• Once the selected cuboids have been materialized, it is important to take


advantage of them during query processing.
• Finally, during load and refresh, the materialized cuboids should be updated
efficiently.



Efficient Data Cube Computation
• Data cube can be viewed as a lattice of cuboids
• The bottom-most cuboid is the base cuboid
• The top-most cuboid (apex) contains only one cell
• How many cuboids in an n-dimensional cube with L levels?
T = ∏_{i=1}^{n} (L_i + 1)
• Materialization of data cube
• Materialize every cuboid (full materialization),
• none (no materialization), or
• some (partial materialization): selectively compute a proper subset of the whole set of
possible cuboids. Alternatively, we may compute a subcube, which contains only those cells
that satisfy some user-specified criterion, such as requiring that the tuple count of each cell be
above some threshold.



• The partial materialization of cuboids or subcubes should consider three factors:
(1) identify the subset of cuboids or subcubes to materialize; (2) exploit the materialized
cuboids or subcubes during query processing; and (3) efficiently update the materialized
cuboids or subcubes during load and refresh.

(1) The selection of the subset of cuboids or subcubes to materialize should take into
account the queries in the workload, their frequencies, and their accessing costs.
In addition, it should consider workload characteristics, the cost for incremental updates,
and the total storage requirements.
The selection must also consider the broad context of physical database design such as
the generation and selection of indices.



• A popular approach is to materialize the set of cuboids on which other frequently referenced
cuboids are based. Alternatively, we can compute an iceberg cube, which is a data cube
that stores only those cube cells with an aggregate value (e.g., count) that is above some
minimum support threshold.

• Another common strategy is to materialize a shell cube. This involves precomputing the
cuboids for only a small number of dimensions (e.g., three to five) of a data cube.

(2) Once the selected cuboids have been materialized, it is important to take advantage of
them during query processing. This involves several issues, such as how to determine the
relevant cuboid(s) from among the candidate materialized cuboids, how to use available
index structures on the materialized cuboids, and how to transform the OLAP operations
onto the selected cuboid(s).

(3) Finally, during load and refresh, the materialized cuboids should be updated efficiently.



Indexing OLAP Data: Bitmap Index
To facilitate efficient data accessing, most data warehouse systems support index structures
and materialized views (using cuboids).
The bitmap indexing method is popular in OLAP products because it allows quick searching in
data cubes.
The bitmap index is an alternative representation of the record ID (RID) list. In the bitmap
index for a given attribute, there is a distinct bit vector, Bv, for each value v in the attribute’s
domain.
If a given attribute’s domain consists of n values, then n bits are needed for each entry in the
bitmap index.
If the attribute has the value v for a given row in the data table, then the bit representing that
value is set to 1 in the corresponding row of the bitmap index. All other bits for that row are
set to 0.
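A minimal sketch of this idea (the helper name `bitmap_index` is my own): one bit vector per distinct value, one bit per row.

```python
def bitmap_index(values):
    """Build one bit vector per distinct attribute value, with one bit per row."""
    index = {v: [0] * len(values) for v in set(values)}
    for row, v in enumerate(values):
        index[v][row] = 1        # this row has value v; all other bits stay 0
    return index

city = ["V", "V", "T", "V", "T"]   # toy attribute column, 5 rows, 2 distinct values
idx = bitmap_index(city)
print(idx["V"])   # [1, 1, 0, 1, 0]
```

With n distinct values, each row contributes exactly one 1 across the n bit vectors, matching the n-bits-per-entry description above.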



Indexing OLAP Data: Join Indices
• join indexing registers the joinable rows of two relations from a relational database.
• For example, if two relations R(RID, A) and S(B, SID) join on the attributes A and B, then
the join index record contains the pair (RID, SID), where RID and SID are record
identifiers from the R and S relations, respectively.
• Hence, the join index records can identify joinable tuples without performing costly join
operations.
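The same (RID, SID) idea can be sketched in a few lines; the relation layout and names below are illustrative:

```python
def join_index(R, S):
    """R: (rid, a) pairs; S: (b, sid) pairs. Record (rid, sid) for rows joinable on a = b."""
    by_b = {}
    for b, sid in S:
        by_b.setdefault(b, []).append(sid)
    return [(rid, sid) for rid, a in R for sid in by_b.get(a, [])]

# dimension rows join to fact-table tuples without a full join
location = [("L1", "Main Street"), ("L2", "Park Ave")]
sales    = [("Main Street", "T57"), ("Main Street", "T238"), ("Park Ave", "T459")]
print(join_index(location, sales))   # [('L1', 'T57'), ('L1', 'T238'), ('L2', 'T459')]
```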



For example, the “Main Street” value in the location dimension table joins with tuples T57, T238, and
T884 of the sales fact table.
Similarly, the “Sony-TV” value in the item dimension table joins with tuples T57 and T459 of the sales
fact table.
The corresponding join index tables are shown in Figure.



Efficient Processing of OLAP Queries
• Determine which operations should be performed on the available cuboids
For example, slicing and dicing a data cube may correspond to selection and/or projection operations on
a materialized cuboid.
• Determine to which materialized cuboid(s) the relevant operations should be applied

Which should be selected to process the query?



• Cuboid 2 cannot be used because country is a more general concept than province or state. Cuboids 1,
3, and 4 can be used to process the query because
(1) they have the same set of the dimensions in the query, (2) the selection clause in the query can imply
the selection in the cuboid, and (3) the abstraction levels for the item and location dimensions in
these cuboids are at a finer level than brand and province or state, respectively.

• “How would the costs of each cuboid compare if used to process the query?” It is likely that using cuboid
1 would cost the most because both item name and city are at a lower level than the brand and
province or state concepts specified in the query.

cuboid 3 will be smaller than cuboid 4, and thus cuboid 3 should be chosen to process the query.
Some cost-based estimation is required to decide which set of cuboids should be selected for
query processing.



OLAP Server Architectures: ROLAP versus MOLAP versus
HOLAP

• OLAP servers present business users with multidimensional data from data warehouses or data
marts, without concerns regarding how or where the data are stored.
• However, the physical architecture and implementation of OLAP servers must consider data
storage issues. Implementations of a warehouse server for OLAP processing include the
following:
• Relational OLAP (ROLAP)
• Use relational or extended-relational DBMS to store and manage warehouse
data
• Includes optimization of the DBMS backend, implementation of aggregation
navigation logic, and additional tools and services; manages volatile data
• Greater scalability



• Multidimensional OLAP (MOLAP)
• These servers support multidimensional views of data through array-based multidimensional storage engines.
• They map multidimensional views directly to data cube array structures.
• The advantage of using a cube is that it allows fast indexing to precomputed summarized data.
• Fast query processing

• Hybrid OLAP (HOLAP) The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the
greater scalability of ROLAP and the faster computation of MOLAP.
• For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while
aggregations are kept in a separate MOLAP store. The Microsoft SQL Server 2000 supports a hybrid OLAP server.

• Specialized SQL servers: provide an advanced query language and query processing support.



What is Data Mining?
• Many Definitions
• Non-trivial extraction of implicit, previously unknown and
potentially useful information from data
• Exploration & analysis, by automatic or semi-automatic
means, of large quantities of data in order to discover
meaningful patterns



Motivating Challenges
• Scalability

• High Dimensionality

• Heterogeneous and Complex Data

• Data Ownership and Distribution

• Non-traditional Analysis



Data Mining Tasks …

Data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single           70K            No
4    Yes     Married         120K            No
5    No      Divorced         95K            Yes
6    No      Married          60K            No
7    Yes     Divorced        220K            No
8    No      Single           85K            Yes
9    No      Married          75K            No
10   No      Single           90K            Yes
11   No      Married          60K            No
12   Yes     Divorced        220K            No
13   No      Single           85K            Yes
14   No      Married          75K            No
15   No      Single           90K            Yes

Example pattern: Milk → Bread (an association)


• Data mining tasks are generally divided into two major categories:
Predictive tasks:
Use some variables to predict unknown or future values of other variables.
Ex: from the behavior of one variable we can predict the value of another
variable.
The attribute to be predicted is called the target or dependent variable.
Attributes used for making the prediction are called explanatory or independent
variables.
Descriptive tasks:
Here the objective is to derive patterns (correlations, anomalies, cluster etc..) that
summarize the relationships in data.
Descriptive tasks often require postprocessing of the data to validate and explain the results.



• Four of the Core data Mining tasks:
1) Predictive Modelling
2) Association analysis
3) Cluster analysis
4) Anomaly detection
• Predictive Modeling: refers to the task of building a model for the target variable as a function of the
explanatory variables.
• There are two types of predictive modeling tasks:
1) Classification: used for discrete target variables
Ex: predicting whether a Web user will make a purchase at an online bookstore is a classification task,
because the target variable is binary valued.
2) Regression: used for continuous target variables.
Ex: forecasting the future price of a stock is a regression task because price is a continuous-valued attribute.



• Association Analysis: a useful application of association analysis is to find groups of data
that have related functionality.
• The goal of association analysis is to extract the most interesting patterns in
an efficient manner.
• Ex: market basket analysis:
• We may discover the rule {Diapers} → {Milk}, which suggests that customers
who buy diapers also tend to buy milk.
• Cluster Analysis: clustering has been used to group sets of related customers.
• Ex: in a collection of news articles shown in the table below, the first three rows speak about the
economy and the next three speak about the health sector. A good clustering
algorithm should be able to identify these two clusters based on the similarity
between words that appear in the articles.



Anomaly Detection: is the task of identifying observations whose characteristics are
significantly different from the rest of the data. Such observations are known as anomalies or
outliers.



What is Data?
• A dataset is a collection of data objects
• An attribute is a property or characteristic of an object
• Examples: eye color of a person, temperature, etc.
• Attribute is also known as variable, field, characteristic, dimension, or feature
• A collection of attributes describes an object
• Object is also known as record, point, case, sample, entity, or instance

In the table below, each row is an object and each column is an attribute:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single           70K            No
4    Yes     Married         120K            No
5    No      Divorced         95K            Yes
6    No      Married          60K            No
7    Yes     Divorced        220K            No
8    No      Single           85K            Yes
9    No      Married          75K            No
10   No      Single           90K            Yes
Attribute Values
• Attribute values are numbers or symbols assigned to an
attribute for a particular object

• Distinction between attributes and attribute values


• Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters

• Different attributes can be mapped to the same set of values


• Example: Attribute values for ID and age are integers



Properties of Attribute Values
• The type of an attribute depends on which of the following
properties/operations it possesses:
• Distinctness: =, ≠
• Order: <, >
• Addition: +, −
• Multiplication: *, /



Attribute Type | Description | Examples | Operations

Categorical (Qualitative)
• Nominal: nominal attribute values only distinguish (=, ≠).
  Examples: zip codes, employee ID numbers, eye color, sex: {male, female}.
  Operations: mode, entropy, contingency correlation, χ² test
• Ordinal: ordinal attribute values also order objects (<, >).
  Examples: hardness of minerals, {good, better, best}, grades, street numbers.
  Operations: median, percentiles, rank correlation, run tests, sign tests

Numeric (Quantitative)
• Interval: for interval attributes, differences between values are meaningful (+, −).
  Examples: calendar dates, temperature in Celsius or Fahrenheit.
  Operations: mean, standard deviation, Pearson’s correlation, t and F tests
• Ratio: for ratio variables, both differences and ratios are meaningful (*, /).
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, current.
  Operations: geometric mean, harmonic mean, percent variation

This categorization of attributes is due to S. S. Stevens
Attribute Type | Transformation | Comments

Categorical (Qualitative)
• Nominal: any permutation of values.
  If all employee ID numbers were reassigned, would it make any difference?
• Ordinal: an order-preserving change of values, i.e., new_value = f(old_value),
  where f is a monotonic function.
  An attribute encompassing the notion of good, better, best can be represented
  equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Numeric (Quantitative)
• Interval: new_value = a * old_value + b, where a and b are constants.
  Thus, the Fahrenheit and Celsius temperature scales differ in terms of where
  their zero value is and the size of a unit (degree).
• Ratio: new_value = a * old_value.
  Length can be measured in meters or feet.

This categorization of attributes is due to S. S. Stevens



Discrete and Continuous Attributes
• Discrete Attribute
• Has only a finite or countably infinite set of values
• Examples: zip codes, counts, or the set of words in a collection of documents
• Often represented as integer variables.
• Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
• Has real numbers as attribute values
• Examples: temperature, height, or weight.
• Continuous attributes are typically represented as floating-point variables.

• Asymmetric Attributes
• Only presence (a non-zero attribute value) is regarded as important
• Words present in documents
• Items present in customer transactions
Important Characteristics of Data

• Dimensionality (number of attributes)


• High dimensional data brings a number of challenges
• Curse of dimensionality

• Sparsity
• Only presence counts (non zero)
• Saves computation time and storage

• Resolution
• Data are different at different resolutions
• Ex: surface of the earth



Types of data sets
• Record
• Data Matrix
• Document Data
• Transaction Data
• Graph
• World Wide Web
• Molecular Structures
• Ordered
• Sequential Data
• Genetic Sequence Data
• Time series data
• Spatial data



Record Data
• Data that consists of a collection of records, each of
which consists of a fixed set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat

1 Yes Single 125K No


2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes



Data Matrix
• If data objects have the same fixed set of numeric attributes,
then the data objects can be thought of as points in a multi-
dimensional space, where each dimension represents a distinct
attribute

• Such a data set can be represented by an m by n matrix, where
there are m rows, one for each object, and n columns, one for
each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1


Document Data
• Each document becomes a ‘term’ vector
• Each term is a component (attribute) of the vector
• The value of each component is the number of times the corresponding term
occurs in the document.

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3      0     5     0      2     6    0     2        0       2
Document 2      0      7     0     2      1     0    0     3        0       0
Document 3      0      1     0     0      1     2    2     0        3       0


Transaction Data
• A special type of data, where
• Each transaction involves a set of items.
• For example, consider a grocery store. The set of products purchased
by a customer during one shopping trip constitute a transaction, while
the individual products that were purchased are the items.
• Can represent transaction data as record data

TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
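The table above can be represented as record data by turning each transaction into a binary vector over the sorted item set; this is one possible encoding, sketched here, not a prescribed one:

```python
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}
items = sorted(set().union(*transactions.values()))   # fixed attribute order
records = {tid: [1 if item in basket else 0 for item in items]
           for tid, basket in transactions.items()}
print(items)        # ['Beer', 'Bread', 'Coke', 'Diaper', 'Milk']
print(records[1])   # [0, 1, 1, 0, 1]
```

Each transaction now has the same fixed set of (asymmetric binary) attributes, which is exactly the record-data form.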
Graph Data
• Examples: Generic graph, a molecule, and webpages


Benzene Molecule: C6H6


Ordered Data
• Sequences of transactions
• Time series data



Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG



Ordered Data

• Spatio-Temporal Data

Average Monthly
Temperature of
land and ocean



Data Quality
• Poor data quality negatively affects many data processing efforts

• Data mining example: a classification model for detecting people who are loan
risks is built using poor data
• Some credit-worthy candidates are denied loans
• More loans are given to individuals that default



Noise
• For objects, noise is an extraneous object
• For attributes, noise refers to modification of original values
• Examples: distortion of a person’s voice when talking on a poor phone
• The figures below show two sine waves of the same magnitude and different
frequencies, the waves combined, and the two sine waves with random
noise
• The magnitude and shape of the original signal is distorted



Outliers
• Outliers are data objects with characteristics that are considerably
different from those of most of the other data objects in the data set
• Case 1: Outliers are
noise that interferes
with data analysis
• Case 2: Outliers are
the goal of our analysis
• Credit card fraud
• Intrusion detection



Missing Values
• Reasons for missing values
• Information is not collected
(e.g., people decline to give their age and weight)
• Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

• Handling missing values


• Eliminate data objects or variables
• Estimate missing values
• Example: time series of temperature
• Example: census results
• Ignore the missing value during analysis



Duplicate Data
• Data set may include data objects that are duplicates, or almost
duplicates of one another
• Major issue when merging data from heterogeneous sources

• Examples:
• Same person with multiple email addresses

• Data cleaning
• Process of dealing with duplicate data issues



Data Preprocessing
The following preprocessing steps can be applied to make data more suitable for data
mining:
• Aggregation
• Sampling
• Discretization and Binarization
• Attribute Transformation
• Dimensionality Reduction
• Feature subset selection
• Feature creation



Aggregation
• Combining two or more attributes (or objects) into a single attribute
(or object)
• Purpose
• Data reduction - reduce the number of attributes or objects
• Change of scale
• Cities aggregated into regions, states, countries, etc.
• Days aggregated into weeks, months, or years
• More “stable” data - aggregated data tends to have less variability



Sampling
• Sampling is the main technique employed for data
reduction.
• It is often used for both the preliminary investigation of the
data and the final data analysis.

• Statisticians often sample because obtaining the entire


set of data of interest is too expensive or time
consuming.

• Sampling is typically used in data mining because


processing the entire set of data of interest is too
expensive or time consuming.



Sampling …
• The key principle for effective sampling is the following:

• Using a sample will work almost as well as using the entire data set, if the
sample is representative

• A sample is representative if it has approximately the same properties (of


interest) as the original set of data



Sample Size
• Example: the same data set drawn at 8000 points, 2000 points, and 500 points



Types of Sampling
• Simple Random Sampling
• There is an equal probability of selecting any particular item
• Sampling without replacement
• As each item is selected, it is removed from the population
• Sampling with replacement
• Objects are not removed from the population as they are selected
for the sample.
• In sampling with replacement, the same object can be picked up
more than once
• Stratified sampling
• Split the data into several partitions; then draw random samples
from each partition
• Progressive Sampling-small to increased sample size
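The sampling schemes above can be sketched with the standard library; the seed, population, and strata below are purely illustrative:

```python
import random

random.seed(0)
population = list(range(100))

no_repl   = random.sample(population, 10)                   # without replacement
with_repl = [random.choice(population) for _ in range(10)]  # with replacement

# stratified sampling: split into partitions, draw from each one separately
strata = {"low": list(range(50)), "high": list(range(50, 100))}
stratified = [x for part in strata.values() for x in random.sample(part, 5)]
```

`random.sample` cannot pick the same object twice, while `random.choice` in a loop can, which is exactly the with/without-replacement distinction.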



Discretization
• Discretization is the process of converting continuous attributes into categorical
attributes.

• Discretization of continuous attributes


In general, the best discretization depends on
→the algorithm being used, as well as
→the other attributes being considered
Transformation of a continuous attribute to a categorical attribute involves two subtasks:
→deciding how many categories to have and
→determining how to map the values of the continuous attribute to these
categories
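A minimal sketch of the second subtask, assuming equal-width intervals (one of several possible mappings; the function name and data are my own):

```python
def equal_width_bins(values, k):
    """Map continuous values to k categories (0..k-1) of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # clamp the maximum value into the last bin instead of bin k
    return [min(int((v - lo) / width), k - 1) for v in values]

ages = [3, 11, 25, 37, 44, 58, 62, 80]
print(equal_width_bins(ages, 4))   # [0, 0, 1, 1, 2, 2, 3, 3]
```

Here k answers "how many categories," and the interval boundaries answer "how to map values to them."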



Binarization
• Binarization maps a continuous or discrete attribute into one or
more binary variables



Attribute Transformation
• An attribute transform is a function that maps the entire set of
values of a given attribute to a new set of replacement values
such that each old value can be identified with one of the new
values
• Simple functions: x^k, log(x), e^x, |x|
• Normalization (or standardization)
• Refers to various techniques to adjust differences among
attributes in terms of frequency of occurrence, mean, variance,
range
• In statistics, standardization refers to subtracting off the means
and dividing by the standard deviation
• Transformation creates a new variable that has a mean of 0 and a
standard deviation of 1.
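The standardization described above, sketched directly (population standard deviation assumed; the function name is my own):

```python
def standardize(xs):
    """z-score: subtract the mean, divide by the (population) standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    std = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [(x - mean) / std for x in xs]

z = standardize([2.0, 4.0, 6.0, 8.0])
print(z)   # the transformed values have mean 0 and standard deviation 1
```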
Curse of Dimensionality
• When dimensionality increases, data becomes increasingly sparse
in the space that it occupies

• Definitions of density and distance between points, which are


critical for clustering and outlier detection, become less meaningful

• As a result: reduced classification accuracy and poor-quality clusters


Dimensionality Reduction
• Purpose:
• Avoid curse of dimensionality
• Reduce amount of time and memory required by data mining algorithms
• Allow data to be more easily visualized
• May help to eliminate irrelevant features or reduce noise

• Techniques
• Principal Components Analysis (PCA)
• Singular Value Decomposition
• Others: supervised and non-linear techniques



Feature Subset Selection
• Another way to reduce dimensionality of data
• Redundant features
• Duplicate much or all of the information contained in one or more other
attributes
• Example: purchase price of a product and the amount of sales tax paid
• Irrelevant features
• Contain no information that is useful for the data mining task at hand
• Example: students' ID is often irrelevant to the task of predicting students'
GPA



Techniques for Feature Selection
1) Embedded approaches: Feature selection occurs naturally as part of DM algorithm. Specifically,
during the operation of the DM algorithm, the algorithm itself decides which Attributes to use
and which to ignore.

2) Filter approaches: Features are selected before the DM algorithm is run.

3) Wrapper approaches: Use DM algorithm as a black box to find best subset of attributes.



Architecture for Feature Subset Selection
• The feature selection process is viewed as consisting of four parts:
1) A measure for evaluating a subset,
2) A search strategy that controls the generation of a new subset of features,
3) A stopping criterion, and
4) A validation procedure.



Similarity and Dissimilarity Measures
• Similarity measure
• Numerical measure of how alike two data objects are.
• Is higher when objects are more alike.
• Often falls in the range [0,1]
• Dissimilarity measure
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Proximity refers to a similarity or dissimilarity



Similarity/Dissimilarity for Simple Attributes
The following table shows the similarity and dissimilarity
between two objects, x and y, with respect to a single, simple
attribute.

Attribute Type    Dissimilarity                          Similarity
Nominal           d = 0 if x = y, 1 if x ≠ y             s = 1 if x = y, 0 if x ≠ y
Ordinal           d = |x − y| / (n − 1)                  s = 1 − d
                  (values mapped to integers 0 to n−1,
                  where n is the number of values)
Interval/Ratio    d = |x − y|                            s = −d, s = 1/(1 + d), etc.

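These per-attribute-type rules can be written directly as small functions; a minimal sketch (the helper names are illustrative, not a library API):

```python
def nominal_sim(x, y):
    # nominal: similarity is 1 on an exact match, 0 otherwise
    return 1 if x == y else 0

def ordinal_dissim(x, y, n_values):
    # ordinal: values assumed mapped to integers 0 .. n_values-1
    return abs(x - y) / (n_values - 1)

def interval_dissim(x, y):
    # interval/ratio: plain absolute difference
    return abs(x - y)

print(nominal_sim("red", "blue"))    # 0
print(ordinal_dissim(0, 2, 3))       # 1.0 (opposite ends of a 3-value scale)
print(interval_dissim(35.0, 38.5))   # 3.5
```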


Euclidean Distance

• Euclidean Distance:

    d(x, y) = sqrt( Σ_{k=1}^{n} (x_k − y_k)² )

where n is the number of dimensions (attributes) and x_k
and y_k are, respectively, the kth attributes (components) of
data objects x and y.
Euclidean Distance: Example

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

Distance Matrix
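The distance matrix for the four points above can be reproduced with a short pure-Python sketch:

```python
import math

def euclidean(x, y):
    """Euclidean (L2) distance between two equal-length tuples."""
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}
for a in points:
    # each row of the distance matrix, rounded to 3 decimals
    print(a, [round(euclidean(points[a], points[b]), 3) for b in points])
```

For example, d(p1, p2) = sqrt((0−2)² + (2−0)²) = sqrt(8) ≈ 2.828, matching the first row of the matrix.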
Minkowski Distance

• Minkowski Distance is a generalization of Euclidean Distance:

    d(x, y) = ( Σ_{k=1}^{n} |x_k − y_k|^r )^(1/r)

where r is a parameter, n is the number of dimensions (attributes), and
x_k and y_k are, respectively, the kth attributes (components) of data
objects x and y.



Minkowski Distance: Examples

• r = 1. City block (Manhattan, taxicab, L1 norm) distance.


• A common example of this for binary vectors is the Hamming
distance, which is just the number of bits that are different
between two binary vectors

• r = 2. Euclidean distance (L2 norm)

• r → ∞. “Supremum” (L_max norm, L_∞ norm) distance.

• This is the maximum difference between any component of the
vectors: d(x, y) = max_k |x_k − y_k|

• Do not confuse r with n, i.e., all these distances are defined for
all numbers of dimensions.



Minkowski Distance

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

L1      p1    p2    p3    p4
p1      0     4     4     6
p2      4     0     2     4
p3      4     2     0     2
p4      6     4     2     0

L2      p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

L∞      p1    p2    p3    p4
p1      0     2     3     5
p2      2     0     1     3
p3      3     1     0     2
p4      5     3     2     0

Distance Matrix
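A sketch of the general Minkowski distance, with r = ∞ handled as the supremum (L_max) case, reproduces the entries of the matrices above:

```python
def minkowski(x, y, r):
    """Minkowski distance of order r; r = inf gives the supremum norm."""
    if r == float("inf"):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

p1, p2, p4 = (0, 2), (2, 0), (5, 1)
print(minkowski(p1, p2, 1))              # L1 (city block): 4.0
print(round(minkowski(p1, p2, 2), 3))    # L2 (Euclidean): 2.828
print(minkowski(p1, p4, float("inf")))   # L_inf (supremum): 5
```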
Common Properties of a Distance
• Distances, such as the Euclidean distance, have
some well known properties.
1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 if and only if
x = y. (Positivity)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.
(Triangle Inequality)
where d(x, y) is the distance (dissimilarity) between points
(data objects) x and y.

• A distance that satisfies these properties is a metric.
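The three metric properties can be checked exhaustively on a small point set; a sketch using the Euclidean distance over the example points:

```python
import itertools
import math

def d(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

pts = [(0, 2), (2, 0), (3, 1), (5, 1)]
for x, y, z in itertools.product(pts, repeat=3):
    assert d(x, y) >= 0                          # positivity
    assert abs(d(x, y) - d(y, x)) < 1e-9         # symmetry
    assert d(x, z) <= d(x, y) + d(y, z) + 1e-9   # triangle inequality
print("all metric properties hold")
```

The small tolerance (1e-9) only guards against floating-point rounding; the properties themselves hold exactly.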


Common Properties of a Similarity
• Similarities, also have some well known
properties.
1. s(x, y) = 1 (or maximum similarity) only if x = y.
(does not always hold, e.g., cosine)
2. s(x, y) = s(y, x) for all x and y. (Symmetry)

where s(x, y) is the similarity between points (data


objects), x and y.



Similarity Between Binary Vectors
• A common situation is that objects, x and y, have only binary
attributes
• Compute similarities using the following quantities
f01 = the number of attributes where x was 0 and y was 1
f10 = the number of attributes where x was 1 and y was 0
f00 = the number of attributes where x was 0 and y was 0
f11 = the number of attributes where x was 1 and y was 1

• Simple Matching and Jaccard Coefficients


SMC = number of matches / number of attributes
= (f11 + f00) / (f01 + f10 + f11 + f00)
J = number of 11 matches / number of non-zero attributes
= (f11) / (f01 + f10 + f11)



SMC versus Jaccard: Example
x= 1000000000
y= 0000001001

f01 = 2 (the number of attributes where x was 0 and y was 1)


f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
f11 = 0 (the number of attributes where x was 1 and y was 1)

SMC = (f11 + f00) / (f01 + f10 + f11 + f00)


= (0+7) / (2+1+0+7) = 0.7

J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0

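The worked example above can be verified with a short sketch that counts the four frequencies and applies both formulas:

```python
def smc_jaccard(x, y):
    """Simple Matching Coefficient and Jaccard for binary vectors."""
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    smc = (f11 + f00) / (f01 + f10 + f11 + f00)
    jac = f11 / (f01 + f10 + f11) if (f01 + f10 + f11) else 0.0
    return smc, jac

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_jaccard(x, y))   # (0.7, 0.0), matching the slide
```

Note how SMC counts the shared 0s (f00 = 7) while Jaccard ignores them, which is why J = 0 here despite SMC = 0.7.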


Cosine Similarity
If x and y are two document vectors, then

    cos(x, y) = (x · y) / (‖x‖ ‖y‖)

where · indicates the vector dot product and ‖x‖ is the length
(Euclidean norm) of vector x.
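A minimal pure-Python sketch of cosine similarity (the document vectors below are made-up term counts for illustration):

```python
import math

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| ||y||)"""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

d1 = [3, 2, 0, 5]   # term counts in document 1 (illustrative)
d2 = [1, 0, 0, 2]   # term counts in document 2 (illustrative)
print(round(cosine_similarity(d1, d2), 4))   # 0.9431
```

Because only the angle between the vectors matters, documents of very different lengths can still score as highly similar.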


Correlation
• Correlation measures the linear relationship between objects.

    corr(x, y) = covariance(x, y) / (std_dev(x) · std_dev(y)) = s_xy / (s_x · s_y)

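Pearson correlation can be sketched directly from the definition, using sample covariance and standard deviations (the data below is made up to show a perfect negative linear relationship):

```python
import math

def correlation(x, y):
    """Pearson correlation: s_xy / (s_x * s_y), using sample statistics."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return sxy / (sx * sy)

x = [-3, 6, 0, 3, -6]
y = [1, -2, 0, -1, 2]     # y = -x/3, a perfect negative linear relationship
print(correlation(x, y))  # ≈ -1.0
```

Correlation ranges from −1 (perfect negative linear relationship) to +1 (perfect positive); a value near 0 means no linear relationship.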
