
Assignment

Data Mining: Concepts and Techniques


Chapter No. 01

Introduction
Question # 1.9:
List and describe the five primitives for
specifying a data mining task.
Answer:
The five primitives for specifying a data-mining task are:

Task-relevant data: This primitive specifies the data upon which mining is to be performed. It involves specifying the database and tables or data warehouse containing the relevant data, conditions for selecting the relevant data, the relevant attributes or dimensions for exploration, and instructions regarding the ordering or grouping of the data retrieved.

Knowledge type to be mined: This primitive specifies the specific data mining function to be performed, such as characterization, discrimination, association, classification, clustering, or evolution analysis. As well, the user can be more specific and provide pattern templates that all discovered patterns must match. These templates, or metapatterns (also called metarules or metaqueries), can be used to guide the discovery process.

Background knowledge: This primitive allows users to specify knowledge they have about the domain to be mined. Such knowledge can be used to guide the knowledge discovery process and evaluate the patterns that are found. Concept hierarchies and user beliefs regarding relationships in the data are forms of background knowledge.

Pattern interestingness measure: This primitive allows users to specify functions that are used to separate uninteresting patterns from knowledge, and may be used to guide the mining process as well as to evaluate the discovered patterns. This allows the user to confine the number of uninteresting patterns returned by the process, as a data mining process may generate a large number of patterns. Interestingness measures can be specified for such pattern characteristics as simplicity, certainty, utility, and novelty.

Visualization of discovered patterns: This primitive refers to the form in which discovered patterns are to be displayed. In order for data mining to be effective in conveying knowledge to users, data mining systems should be able to display the discovered patterns in multiple forms such as rules, tables, cross tabs (cross-tabulations), pie or bar charts, decision trees, cubes, or other visual representations.
Page 1 of 12


Question # 1.14:
Describe three challenges to data mining
regarding data mining methodology and user
interaction issues.
Answer:
Challenges to data mining regarding data mining methodology and
user interaction issues include the following: Mining different kinds of
knowledge in databases, interactive mining of knowledge at multiple
levels of abstraction, incorporation of background knowledge, data mining
query languages and ad hoc data mining, presentation and visualization of
data mining results, handling noisy or incomplete data, and pattern
evaluation. Below are the descriptions of the first three challenges
mentioned:

Mining different kinds of knowledge in databases: Different users are interested in different kinds of knowledge and will require a wide range of data analysis and knowledge discovery tasks such as data characterization, discrimination, association, classification, clustering, trend and deviation analysis, and similarity analysis. Each of these tasks will use the same database in different ways and will require different data mining techniques.

Interactive mining of knowledge at multiple levels of abstraction: Interactive mining, with the use of OLAP operations on a data cube, allows users to focus the search for patterns, providing and refining data mining requests based on returned results. The user can then interactively view the data and discover patterns at multiple granularities and from different angles.

Incorporation of background knowledge: Background knowledge, or information regarding the domain under study such as integrity constraints and deduction rules, may be used to guide the discovery process and allow discovered patterns to be expressed in concise terms and at different levels of abstraction. This helps to focus and speed up a data mining process, or judge the interestingness of discovered patterns.


Chapter No. 02

Data Preprocessing
Question # 2.9:

Suppose that the values for a given set of data are grouped into intervals. The intervals and corresponding frequencies are as follows.

Age       Frequency
1-5       200
5-15      450
15-20     300
20-50     1500
50-80     700
80-100    44

Compute an approximate median value for the data.
Answer:
Using the equation

median = L1 + ((N/2 - (Σfreq)_l) / freq_median) × width

we have

L1 = 20
N = 3194
(Σfreq)_l = 200 + 450 + 300 = 950
freq_median = 1500
width = 50 - 20 = 30

median = 20 + ((3194/2 - 950) / 1500) × 30 ≈ 32.94 years
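The computation above can be sketched in Python; the interval bounds and frequencies come from the table, and the variable names mirror the equation (they are illustrative, not from any particular library):

```python
# Approximate median for grouped (binned) data:
# median = L1 + ((N/2 - cum_freq_below) / freq_median) * width
intervals = [(1, 5), (5, 15), (15, 20), (20, 50), (50, 80), (80, 100)]
freqs = [200, 450, 300, 1500, 700, 44]

N = sum(freqs)       # 3194
half = N / 2         # 1597.0

# Find the median interval: the first interval where the cumulative
# frequency reaches N/2.
cum = 0
for (low, high), f in zip(intervals, freqs):
    if cum + f >= half:
        L1, width, freq_median, cum_below = low, high - low, f, cum
        break
    cum += f

median = L1 + (half - cum_below) / freq_median * width
print(round(median, 2))  # 32.94
```

The loop lands on the 20-50 interval (cumulative frequency 950 < 1597 ≤ 2450), reproducing the values substituted by hand above.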


Question # 2.11:

Using the data for age given in Exercise 2.4: suppose that the data for analysis includes the attribute age, and that the age values for the data tuples are (in increasing order)

13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.

Answer the following.

(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your steps. Comment on the effect of this technique for the given data.
Answer:
The following steps are required to smooth the above data using
smoothing by bin means with a bin depth of 3.
Step 1: Sort the data. (This step is not required here as the data are
already sorted.)
Step 2: Partition the data into equal-frequency bins of size 3.

Bin 1: 13, 15, 16    Bin 2: 16, 19, 20    Bin 3: 20, 21, 22
Bin 4: 22, 25, 25    Bin 5: 25, 25, 30    Bin 6: 33, 33, 35
Bin 7: 35, 35, 35    Bin 8: 36, 40, 45    Bin 9: 46, 52, 70

Step 3: Calculate the arithmetic mean of each bin.
Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin.

Bin 1: 14 2/3, 14 2/3, 14 2/3    Bin 2: 18 1/3, 18 1/3, 18 1/3    Bin 3: 21, 21, 21
Bin 4: 24, 24, 24                Bin 5: 26 2/3, 26 2/3, 26 2/3    Bin 6: 33 2/3, 33 2/3, 33 2/3
Bin 7: 35, 35, 35                Bin 8: 40 1/3, 40 1/3, 40 1/3    Bin 9: 56, 56, 56

The effect of this technique is to smooth out small random variation (noise) in the data: each value is replaced by the mean of its bin, so minor fluctuations are dampened while the overall distribution of the ages is preserved.
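The four steps above can be sketched in Python (a minimal illustration using the age data from the question; variable names are ours):

```python
# Equal-frequency binning with smoothing by bin means (bin depth 3).
data = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25,
        25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

depth = 3
data = sorted(data)  # step 1: sort (already sorted here)

# step 2: partition into equal-frequency bins of size `depth`
bins = [data[i:i + depth] for i in range(0, len(data), depth)]

# steps 3-4: replace every value in a bin by that bin's arithmetic mean
smoothed = []
for b in bins:
    mean = sum(b) / len(b)
    smoothed.extend([mean] * len(b))

for i, b in enumerate(bins, 1):
    print(f"Bin {i}: {[round(sum(b) / len(b), 2)] * len(b)}")
```

Running this reproduces the bin means listed in Step 4 (e.g., Bin 1 becomes 14.67, 14.67, 14.67 and Bin 9 becomes 56, 56, 56).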
(b) How might you determine outliers in the data?
Answer:
Outliers in the data may be detected by clustering, where similar values are organized into groups, or "clusters". Values that fall outside of the set of clusters may be considered outliers. Alternatively, a combination of computer and human inspection can be used, where a predetermined data distribution is implemented to allow the computer to identify possible outliers. These possible outliers can then be verified by human inspection with much less effort than would be required to verify the entire initial data set.
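As a crude illustration of the clustering idea on the sorted age data, one can group consecutive values into clusters wherever the gap between neighbors exceeds a threshold, then flag values left in tiny clusters. The gap threshold of 10 is an illustrative choice, not from the text; a real system would use a proper clustering algorithm:

```python
# Gap-based "clustering" of sorted 1-D data; singleton clusters are
# treated as falling outside the set of clusters, i.e. outliers.
data = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25,
        25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

gap = 10  # illustrative threshold
clusters = [[data[0]]]
for prev, cur in zip(data, data[1:]):
    if cur - prev > gap:
        clusters.append([cur])      # a large gap starts a new cluster
    else:
        clusters[-1].append(cur)

outliers = [v for c in clusters if len(c) == 1 for v in c]
print(outliers)  # [70]
```

Here the value 70 is isolated (the gap from 52 to 70 is 18), matching the intuition that it lies far from the rest of the ages.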
(c) What other methods are there for data smoothing?

Answer:
Other methods that can be used for data smoothing include alternate forms of binning, such as smoothing by bin medians or smoothing by bin boundaries. Alternatively, equal-width bins can be used to implement any of the forms of binning, where the interval range of values in each bin is constant. Methods other than binning include regression techniques, which smooth the data by fitting it to a function, as in linear or multiple regression. Classification techniques can be used to implement concept hierarchies that smooth the data by rolling up lower-level concepts to higher-level concepts.


Chapter No. 03

Data Warehouse and OLAP Technology: An Overview
Question # 3.11:
In data warehouse technology, a multiple
dimensional view can be implemented by a
relational database technique (ROLAP), or by a
multidimensional database technique (MOLAP), or
by a hybrid database technique (HOLAP).
(a) Briefly describe each implementation technique.
Answer:
A ROLAP technique for implementing a multiple dimensional view
consists of intermediate servers that stand in between a relational back-end server and client front-end tools, thereby using a relational or
extended-relational DBMS to store and manage warehouse data, and
OLAP middleware to support missing pieces. A MOLAP implementation
technique consists of servers, which support multidimensional views of
data through array-based multidimensional storage engines that map
multidimensional views directly to data cube array structures. A HOLAP
implementation approach combines ROLAP and MOLAP technology, which
means that large volumes of detailed data and some very low level
aggregations can be stored in a relational database, while some high level
aggregations are kept in a separate MOLAP store.
(b) For each technique, explain how each of the following
functions may be implemented:
1. The generation of a data warehouse (including aggregation)

Answer:
ROLAP: Using a ROLAP server, the generation of a data warehouse can be implemented by a relational or extended-relational DBMS using summary fact tables. The fact tables can store aggregated data and the data at the abstraction levels indicated by the join keys in the schema for the given data cube.

MOLAP: In generating a data warehouse, the MOLAP technique uses multidimensional array structures to store data and multi-way array aggregation to compute the data cubes.

HOLAP: The HOLAP technique typically uses a relational database to store the data and some low-level aggregations, and then uses a MOLAP store for higher-level aggregations.

2. Roll-up
Answer:

ROLAP: To roll-up on a dimension using the summary fact table, we look for the record in the table that contains a generalization on the desired dimension. For example, to roll-up the date dimension from day to month, select the record for which the day field contains the special value all. The value of the measure field (rupees_sold, for example) given in this record will contain the subtotal for the desired roll-up.

MOLAP: To perform a roll-up in a data cube, simply climb up the concept hierarchy for the desired dimension. For example, one could roll-up on the location dimension from city to country, which is more general.

HOLAP: The roll-up using the HOLAP technique will be similar to either ROLAP or MOLAP, depending on the techniques used in the implementation of the corresponding dimensions.
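The MOLAP-style roll-up (climbing a concept hierarchy) can be sketched with a toy dict-based cube. The city-to-country hierarchy, the measure name rupees_sold, and all of the figures below are made up for illustration:

```python
# Roll-up on the location dimension (city -> country) by re-aggregating
# cell values under the next level of the concept hierarchy.
from collections import defaultdict

city_to_country = {"Lahore": "Pakistan", "Karachi": "Pakistan",
                   "Mumbai": "India"}

# (city, month) -> rupees_sold
cube = {("Lahore", "Jan"): 100, ("Karachi", "Jan"): 250,
        ("Mumbai", "Jan"): 400, ("Lahore", "Feb"): 150}

rolled = defaultdict(int)
for (city, month), sold in cube.items():
    # climb the hierarchy: every city cell contributes to its country cell
    rolled[(city_to_country[city], month)] += sold

print(dict(rolled))
```

Each coarser cell is the sum of its finer-grained children, which is exactly what the subtotal rows of a ROLAP summary fact table precompute.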

3. Drill-down
Answer:
ROLAP: To drill-down on a dimension using the summary fact table, we look for the record in the table that contains a generalization on the desired dimension. For example, to drill-down on the location dimension from country to province_or_state, select the record for which only the next lowest field in the concept hierarchy for location contains the special value all. In this case, the city field should contain the value all. The value of the measure field (rupees_sold, for example) given in this record will contain the subtotal for the desired drill-down.

MOLAP: To perform a drill-down in a data cube, simply step down the concept hierarchy for the desired dimension. For example, one could drill-down on the date dimension from month to day in order to group the data by day rather than by month.

HOLAP: The drill-down using the HOLAP technique is similar to either ROLAP or MOLAP, depending on the techniques used in the implementation of the corresponding dimensions.

4. Incremental updating
Answer:
ROLAP: To perform incremental updating, check whether the
corresponding tuple is in the summary fact table. If not, insert
it into the summary table and propagate the result up.
Otherwise, update the value and propagate the result up.

MOLAP: To perform incremental updating, check whether the corresponding cell is in the MOLAP cuboid. If not, insert it into the cuboid and propagate the result up. Otherwise, update the value and propagate the result up.


HOLAP: Similar to either ROLAP or MOLAP, depending on the techniques used in the implementation of the corresponding dimensions.

(c) Which implementation techniques do you prefer, and why?

Answer:
HOLAP is often preferred since it integrates the strength of both ROLAP
and MOLAP methods and avoids their shortcomings. If the cube is quite
dense, MOLAP is often preferred. If the data are sparse and the
dimensionality is high, there will be too many cells (due to exponential
growth) and, in this case, it is often desirable to compute iceberg cubes
instead of materializing the complete cubes.

Chapter No. 05

Mining Frequent Patterns, Associations, and Correlations
Question # 5.3:

A database has five transactions. Let min_sup = 60% and min_conf = 80%.

Algorithm: Rule Generator. Given a set of frequent itemsets, output all of its strong rules.
Input:
    l, set of frequent itemsets;
    min_conf, the minimum confidence threshold.
Output: Strong rules of itemsets in l.
Method: The method is outlined as follows:
1. for each frequent itemset l
2.     rule_generator_helper(l, l, min_conf);

procedure rule_generator_helper(s: current subset of l; l: original frequent itemset; min_conf)
1.  k = length(s);
2.  if (k > 1) then {
3.      generate all the (k - 1)-subsets of s;
4.      for each (k - 1)-subset x of s
5.          if (support_count(l) / support_count(x) >= min_conf) then {
6.              output the rule "x => (l - x)";
7.              rule_generator_helper(x, l, min_conf);
8.          }
9.      // else do nothing, because each of x's subsets will have at least as
        // much support as x, and hence can never have greater confidence than x
10. }
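The pseudocode above can be transcribed into Python. The function names are ours, and the support counts in the example come from the Apriori result for this exercise (the frequent 3-itemset {o, k, e} and its subsets):

```python
# Direct transcription of the Figure 5.1 rule generator.
from itertools import combinations

def rule_generator(frequent, support_count, min_conf):
    """For each frequent itemset l, emit all strong rules x => (l - x)."""
    rules = []
    for l in frequent:
        _rule_generator_helper(l, l, support_count, min_conf, rules)
    return rules

def _rule_generator_helper(s, l, support_count, min_conf, rules):
    k = len(s)
    if k > 1:
        # generate all the (k - 1)-subsets of s
        for x in map(frozenset, combinations(s, k - 1)):
            # confidence of x => (l - x) is sup(l) / sup(x)
            if support_count[l] / support_count[x] >= min_conf:
                rules.append((x, l - x))
                _rule_generator_helper(x, l, support_count, min_conf, rules)

# Support counts for {o, k, e} and its subsets (from Exercise 5.3):
sup = {frozenset("oke"): 3, frozenset("ok"): 3, frozenset("oe"): 3,
       frozenset("ke"): 4, frozenset("o"): 3, frozenset("k"): 5,
       frozenset("e"): 4}
rules = rule_generator([frozenset("oke")], sup, 0.8)
for lhs, rhs in rules:
    print(sorted(lhs), "=>", sorted(rhs))
```

Note that, like the pseudocode, the recursion can reach the same subset via different supersets and so may emit a rule twice (here {o} => {k, e}); collecting rules in a set instead of a list would deduplicate them.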


Figure 5.1: An algorithm for generating strong rules from frequent itemsets.

TID     Items_bought
T100    {M, O, N, K, E, Y}
T200    {D, O, N, K, E, Y}
T300    {M, A, K, E}
T400    {M, U, C, K, Y}
T500    {C, O, O, K, I, E}

(a) Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the efficiency of the two mining processes.
Answer:
Apriori (min_sup = 60% of 5 transactions, i.e. a support count of 3):

C1 =
m: 3, o: 3, n: 2, k: 5, e: 4, y: 3, d: 1, a: 1, u: 1, c: 2, i: 1

L1 =
k: 5, e: 4, m: 3, o: 3, y: 3

C2 =
{m,o}: 1, {m,k}: 3, {m,e}: 2, {m,y}: 2, {o,k}: 3, {o,e}: 3, {o,y}: 2, {k,e}: 4, {k,y}: 3, {e,y}: 2

L2 =
{m,k}: 3, {o,k}: 3, {o,e}: 3, {k,e}: 4, {k,y}: 3

C3 =
{o,k,e}: 3

L3 =
{o,k,e}: 3

FP-growth: See Figure 5.2 for the FP-tree.

item  conditional pattern base                 conditional FP-tree  frequent patterns generated
y     {{k,e,m,o: 1}, {k,e,o: 1}, {k,m: 1}}     k:3                  {k,y: 3}
o     {{k,e,m: 1}, {k,e: 2}}                   k:3, e:3             {k,o: 3}, {e,o: 3}, {k,e,o: 3}
m     {{k,e: 2}, {k: 1}}                       k:3                  {k,m: 3}
e     {{k: 4}}                                 k:4                  {k,e: 4}

Efficiency comparison: Apriori has to do multiple scans of the database, while FP-growth builds the FP-tree with a single scan. Candidate generation in Apriori is expensive (owing to the self-join), while FP-growth does not generate any candidates.
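For reference, a brute-force level-wise sketch (counting every k-combination rather than Apriori's self-join/prune candidate generation) reproduces L1, L2, and L3 for the five transactions above:

```python
# Brute-force level-wise frequent-itemset mining on the transactions of
# Exercise 5.3; min_sup = 60% of 5 transactions = support count 3.
from itertools import combinations

transactions = [set("monkey"), set("donkey"), set("make"),
                set("mucky"), set("cookie")]
min_count = 3

items = sorted(set().union(*transactions))
frequent, k = {}, 1
while True:
    level = {}
    for cand in combinations(items, k):       # all k-item candidates
        count = sum(set(cand) <= t for t in transactions)
        if count >= min_count:
            level[frozenset(cand)] = count
    if not level:                             # no frequent k-itemsets: stop
        break
    frequent.update(level)
    k += 1

for itemset, count in sorted(frequent.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

This confirms the tables above: five frequent 1-itemsets, five frequent 2-itemsets, and the single frequent 3-itemset {o, k, e} with support count 3. (Real Apriori would prune candidates using the downward-closure property instead of enumerating all combinations.)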

(b) List all of the strong association rules (with support s and confidence c) matching the following metarule, where X is a variable representing customers, and item_i denotes variables representing items (e.g., "A", "B", etc.):

Answer:
k, o => e    [0.6, 1]
e, o => k    [0.6, 1]

Question # 5.4:

A database has four transactions. Let min_sup = 60% and min_conf = 80%.

cust_ID  TID   Items_bought (in the form of brand-item_category)
01       T100  {King's-Crab, Sunset-Milk, Dairyland-Cheese, Best-Bread}
02       T200  {Best-Cheese, Dairyland-Milk, Goldenfarm-Apple, Tasty-Pie, Wonder-Bread}
01       T300  {Westcoast-Apple, Dairyland-Milk, Wonder-Bread, Tasty-Pie}
03       T400  {Wonder-Bread, Sunset-Milk, Dairyland-Cheese}
(a) At the granularity of item_category (e.g., item_i could be "Milk"), for the following rule template, list the frequent k-itemset for the largest k, and all of the strong association rules (with their support s and confidence c) containing the frequent k-itemset for the largest k.
Answer:
k = 3 and the frequent 3-itemset is {Bread, Milk, Cheese}. The rules are:

Bread, Cheese => Milk    [75%, 100%]
Cheese, Milk => Bread    [75%, 100%]
Cheese => Milk, Bread    [75%, 100%]

(b) At the granularity of brand-item_category (e.g., item_i could be "Sunset-Milk"), for the following rule template, list the frequent k-itemset for the largest k. Note: do not print any rules.

Answer:
k = 3 and the frequent 3-itemset is:
{
(Wonder-Bread, Dairyland-Milk, Tasty-Pie),
(Wonder-Bread, Sunset-Milk, Dairyland-Cheese)
}
