By
Mrs. Hema N
ASSISTANT PROFESSOR
DEPARTMENT OF INFORMATION SCIENCE AND
ENGINEERING
RNS INSTITUTE OF TECHNOLOGY
CHANNASANDRA, RR NAGAR POST
Dr. VISHNUVARDHAN ROAD
BENGALURU – 560 098
hema.n@rnsit.ac.in
DATA WAREHOUSE IMPLEMENTATION AND DATA MINING
Each group-by can be represented by a cuboid, where the set of group-by's forms a lattice
of cuboids defining a data cube.
Taking the three attributes, city, item, and year, as the dimensions for the data cube, and sales in dollars
as the measure, the total number of cuboids, or group-by's, that can be computed for this data cube is 2^3 = 8.
The possible group-by’s are the following: {(city, item, year), (city, item), (city, year), (item, year), (city), (item),
(year),()}, where () means that the group-by is empty (i.e., the dimensions are not grouped). These group-by’s
form a lattice of cuboids for the data cube, as shown in Figure.
The base cuboid is the least generalized (most specific) of the cuboids. The apex cuboid is the most generalized (least
specific) of the cuboids, and is often denoted as all. If we start at the apex cuboid and explore downward in the lattice,
this is equivalent to drilling down within the data cube. If we start at the base cuboid and explore upward, this is akin to
rolling up.
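As a sketch (assuming one abstraction level per dimension, with the helper name `cuboids` invented for illustration), the eight group-by's of the (city, item, year) cube can be enumerated with `itertools.combinations`:

```python
from itertools import combinations

# Enumerate every group-by (cuboid) of a data cube, assuming a single
# abstraction level per dimension (no concept hierarchies).
def cuboids(dimensions):
    """Return the lattice of group-by's, from the base cuboid down to the apex ()."""
    result = []
    for k in range(len(dimensions), -1, -1):   # base cuboid (all dims) first
        result.extend(combinations(dimensions, k))
    return result

lattice = cuboids(("city", "item", "year"))
print(len(lattice))    # 8 cuboids, i.e. 2^3
print(lattice[0])      # base cuboid: ('city', 'item', 'year')
print(lattice[-1])     # apex cuboid: ()
```

Drilling down corresponds to moving from the apex end of this list toward the base cuboid; rolling up moves the other way.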
The total number of cuboids T that can be generated for an n-dimensional data cube with concept hierarchies is:

    T = Π_{i=1}^{n} (L_i + 1)

where L_i is the number of levels associated with dimension i. One is added to L_i to include the virtual top
level, all.
This formula is based on the fact that, at most, one abstraction level in each
dimension will appear in a cuboid.
For example, the time dimension as specified before has four conceptual levels, or
five if we include the virtual level all. If the cube has 10 dimensions and each
dimension has five levels (including all), the total number of cuboids that can be
generated is 5^10 = 9,765,625.
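The count above follows directly from the formula T = Π_i (L_i + 1); a minimal sketch (the helper name `total_cuboids` is invented for illustration):

```python
from math import prod

# Cuboid-count formula T = prod_i (L_i + 1), where L_i is the number of
# levels of dimension i and the +1 accounts for the virtual top level `all`.
def total_cuboids(levels):
    return prod(l + 1 for l in levels)

# 10 dimensions, each with 4 conceptual levels (5 including `all`):
print(total_cuboids([4] * 10))   # 9765625, i.e. 5**10
```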
5/29/2021 18CS641- Data Mining and Data Warehousing 6
The size of each cuboid also depends on the cardinality (i.e., number of distinct
values) of each dimension.
For example, if the AllElectronics branch in each city sold every item, there would
be |city| x |item| tuples in the city_item group-by alone.
If there are many cuboids, and these cuboids are large in size, a more reasonable option is partial materialization;
that is, to materialize only some of the possible cuboids that can be generated.
(1) The selection of the subset of cuboids or subcubes to materialize should take into
account the queries in the workload, their frequencies, and their accessing costs.
In addition, it should consider workload characteristics, the cost for incremental updates,
and the total storage requirements.
The selection must also consider the broad context of physical database design such as
the generation and selection of indices.
• Another common strategy is to materialize a shell cube. This involves precomputing the
cuboids for only a small number of dimensions (e.g., three to five) of a data cube.
(2) Once the selected cuboids have been materialized, it is important to take advantage of
them during query processing. This involves several issues, such as how to determine the
relevant cuboid(s) from among the candidate materialized cuboids, how to use available
index structures on the materialized cuboids, and how to transform the OLAP operations
onto the selected cuboid(s).
(3) Finally, during load and refresh, the materialized cuboids should be updated efficiently.
• “How would the costs of each cuboid compare if used to process the query?” It is likely that using cuboid
1 would cost the most, because both item_name and city are at a lower abstraction level than the brand and
province_or_state concepts specified in the query.
Cuboid 3 will be smaller than cuboid 4, and thus cuboid 3 should be chosen to process the query.
Some cost-based estimation is required to decide which set of cuboids should be selected for
query processing.
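A hypothetical sketch of this cost-based selection (the cuboid names and sizes below are invented for illustration; real systems also weigh index availability and I/O costs): among the materialized cuboids whose dimension sets cover the query's grouping attributes, pick the one with the fewest tuples.

```python
# Choose the smallest materialized cuboid that can answer the query.
def choose_cuboid(candidates, query_dims):
    """candidates: list of (frozenset_of_dimensions, tuple_count) pairs."""
    usable = [c for c in candidates if query_dims <= c[0]]
    return min(usable, key=lambda c: c[1]) if usable else None

materialized = [
    (frozenset({"item", "city", "year"}), 1_000_000),  # base cuboid: costliest
    (frozenset({"item", "year"}), 80_000),             # small, but lacks city
    (frozenset({"item", "city"}), 150_000),
]
best = choose_cuboid(materialized, frozenset({"item", "city"}))
print(best[1])   # 150000: smallest cuboid covering {item, city}
```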
• OLAP servers present business users with multidimensional data from data warehouses or data
marts, without concerns regarding how or where the data are stored.
• However, the physical architecture and implementation of OLAP servers must consider data
storage issues. Implementations of a warehouse server for OLAP processing include the
following:
• Relational OLAP (ROLAP)
• Use relational or extended-relational DBMS to store and manage warehouse
data
• Include optimization of DBMS backend, implementation of aggregation
navigation logic, and additional tools and services
• Greater scalability
• Multidimensional OLAP (MOLAP)
• Use array-based multidimensional storage engines that map multidimensional
views directly to data cube array structures
• Fast indexing to precomputed summarized data
• Hybrid OLAP (HOLAP)
• Combines ROLAP and MOLAP technology, benefiting from the greater
scalability of ROLAP and the faster computation of MOLAP.
• For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while
aggregations are kept in a separate MOLAP store. The Microsoft SQL Server 2000 supports a hybrid OLAP server.
• High Dimensionality
• Non-traditional Analysis
Data
• An attribute is a property or characteristic of an object; also known as a
variable, field, characteristic, dimension, or feature
• A collection of attributes describe an object
• An object is also known as a record, point, case, sample, entity, or instance

Example data set:

Tid  Refund  Marital Status  Taxable Income  Cheat
 4   Yes     Married         120K            No
 5   No      Divorced        95K             Yes
 6   No      Married         60K             No
 7   Yes     Divorced        220K            No
 8   No      Single          85K             Yes
 9   No      Married         75K             No
10   No      Single          90K             Yes
Attribute Values
• Attribute values are numbers or symbols assigned to an
attribute for a particular object
• The type of an attribute depends on which properties it possesses:
Distinctness (=, ≠), Order (<, >), Addition (+, −), Multiplication (*, /)
• Asymmetric Attributes
• Only presence (a non-zero attribute value) is regarded as important
• Words present in documents
• Items present in customer transactions
Important Characteristics of Data
• Sparsity
• Only presence counts (non-zero values)
• Saves computation time and storage
• Resolution
• Data are different at different resolutions
• Ex: surface of the earth
Document-term matrix (word counts per document):

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0      5     0     2      6    0     2       0       2
Document 2    0     7      0     2     1      0    0     3       0       0
Document 3    0     1      0     0     1      2    2     0       3       0
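The document-term data above is sparse: most counts are zero. A small sketch of the space saving from storing only the non-zero entries (the dictionaries below mirror the three document rows):

```python
# Sparse representation: keep only non-zero term counts per document.
docs = [
    {"team": 3, "play": 5, "score": 2, "game": 6, "lost": 2, "season": 2},  # Document 1
    {"coach": 7, "ball": 2, "score": 1, "lost": 3},                         # Document 2
    {"coach": 1, "score": 1, "game": 2, "win": 2, "timeout": 3},            # Document 3
]

vocab = sorted({term for doc in docs for term in doc})
dense_cells = len(docs) * len(vocab)      # cells a dense matrix would hold
stored = sum(len(doc) for doc in docs)    # entries the sparse form stores

print(dense_cells)   # 30
print(stored)        # 15
```

Only presence counts here, which is why the sparse form needs half the entries of the dense matrix even for this tiny example.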
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
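The transactions above can be viewed as asymmetric binary attributes: only item presence matters. A sketch of the conversion to a binary (one-hot) representation:

```python
# Market-basket transactions (TIDs 1-5 from the table above) as sets of items.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}
items = sorted({i for t in transactions.values() for i in t})

def to_binary(tid):
    # 1 if the item is present in the transaction, 0 otherwise.
    return [1 if i in transactions[tid] else 0 for i in items]

print(items)         # ['Beer', 'Bread', 'Coke', 'Diaper', 'Milk']
print(to_binary(1))  # [0, 1, 1, 0, 1]
```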
Graph Data
• Examples: Generic graph, a molecule, and webpages
• Spatio-Temporal Data
• Example: average monthly temperature of land and ocean
• Data mining example: a classification model for detecting people who are loan
risks is built using poor data
• Some credit-worthy candidates are denied loans
• More loans are given to individuals who default
• Examples:
• Same person with multiple email addresses
• Data cleaning
• Process of dealing with duplicate data issues
• Using a sample will work almost as well as using the entire data set, if the
sample is representative
• Techniques
• Principal Components Analysis (PCA)
• Singular Value Decomposition
• Others: supervised and non-linear techniques
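As an illustrative sketch (not taken from the source), the idea behind PCA can be shown in two dimensions, where the first principal component of the covariance matrix [[a, b], [b, c]] has the closed-form angle θ = ½·atan2(2b, a − c):

```python
from math import atan2, cos, sin

# Pure-Python 2-D PCA sketch: find the direction of maximum variance.
def first_pc(points):
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    a = sum((x - mx) ** 2 for x, _ in points) / n      # var(x)
    c = sum((y - my) ** 2 for _, y in points) / n      # var(y)
    b = sum((x - mx) * (y - my) for x, y in points) / n  # cov(x, y)
    theta = 0.5 * atan2(2 * b, a - c)   # angle of the leading eigenvector
    return cos(theta), sin(theta)       # unit vector of maximum variance

# Points along the line y = x: the first PC should be ~(0.707, 0.707).
pc = first_pc([(0, 0), (1, 1), (2, 2), (3, 3)])
print(pc)
```

In higher dimensions the same job is done by an eigendecomposition or SVD of the covariance matrix, which is what library PCA implementations compute.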
3) Wrapper approaches: Use the target data mining algorithm as a black box to find the best subset of attributes.
• Euclidean Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
Distance Matrix
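The distance matrix can be reproduced with the Euclidean metric d(x, y) = √(Σ_k (x_k − y_k)²), e.g. via `math.dist` (Python 3.8+):

```python
from math import dist

# Recompute the Euclidean distance matrix for the four sample points.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}
matrix = {a: {b: round(dist(pa, pb), 3) for b, pb in points.items()}
          for a, pa in points.items()}

print(matrix["p1"]["p2"])   # 2.828
print(matrix["p1"]["p4"])   # 5.099
print(matrix["p3"]["p4"])   # 2.0
```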
Minkowski Distance
• A generalization of the Euclidean distance: d(x, y) = (Σ_{k=1}^{n} |x_k − y_k|^r)^(1/r),
where r is a parameter and n is the number of dimensions (attributes).
• Do not confuse r with n; all of these distances are defined for all
numbers of dimensions.
L∞ (supremum) distances:

     p1  p2  p3  p4
p1   0   2   3   5
p2   2   0   1   3
p3   3   1   0   2
p4   5   3   2   0
Distance Matrix
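A sketch of the Minkowski distance for r = 1 (L1, Manhattan), r = 2 (L2, Euclidean), and the r → ∞ limit (L∞, supremum), using p1 = (0, 2) and p2 = (2, 0) from the tables above:

```python
# Minkowski distance: d(x, y) = (sum_k |x_k - y_k|**r) ** (1/r).
def minkowski(x, y, r):
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

# Supremum (L_inf) distance: the limit of Minkowski as r -> infinity.
def l_inf(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

p1, p2 = (0, 2), (2, 0)
print(minkowski(p1, p2, 1))            # 4.0   (L1)
print(round(minkowski(p1, p2, 2), 3))  # 2.828 (L2)
print(l_inf(p1, p2))                   # 2     (L_inf)
```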
Common Properties of a Distance
• Distances, such as the Euclidean distance, have
some well known properties.
1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 if and only if
x = y. (Positivity)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.
(Triangle Inequality)
where d(x, y) is the distance (dissimilarity) between points
(data objects), x and y.
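These three properties can be checked numerically for the Euclidean distance on sample points, as a small sanity-check sketch:

```python
from math import dist
from itertools import permutations

# Verify positivity, symmetry, and the triangle inequality for the
# Euclidean distance on four sample points.
pts = [(0, 2), (2, 0), (3, 1), (5, 1)]

ok = all(dist(x, y) >= 0 and (dist(x, y) == 0) == (x == y)
         for x in pts for y in pts)                       # positivity
ok &= all(dist(x, y) == dist(y, x)
          for x in pts for y in pts)                      # symmetry
ok &= all(dist(x, z) <= dist(x, y) + dist(y, z) + 1e-12
          for x, y, z in permutations(pts, 3))            # triangle inequality
print(ok)   # True
```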