
Data Mining Paper Solution

Question 1:
1) Solution
Knowledge discovery as a process consists of an iterative sequence of the following steps:

1. Data cleaning:

 It can be applied to remove noise and correct inconsistencies in the data.

2. Data integration:

 Data integration merges data from multiple sources into a coherent data store, such as a data
warehouse.

3. Data selection:

 where data relevant to the analysis task are retrieved from the database.

4. Data transformation:

 where data are transformed or consolidated into forms appropriate for mining by performing
summary or aggregation operations.
 For example, normalization may improve the accuracy and efficiency of mining algorithms
involving distance measurements (a short sketch follows the diagram below).

5. Data mining:

 an essential process where intelligent methods are applied in order to extract data patterns.

6. Pattern evaluation:

 to identify the truly interesting patterns representing knowledge based on some interestingness
measures.

7. Knowledge presentation:

 where visualization and knowledge representation techniques are used to present the mined
knowledge to the user.
DIAGRAM: the KDD (knowledge discovery) process, illustrating steps 1–7 above.
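As an aside on the data transformation step (step 4), the following is a minimal, illustrative sketch of min-max normalization in Python; the helper name is my own, and the sample values are simply the age column from Question 2.

# Min-max normalization: rescale numeric values into [new_min, new_max].
# Illustrative sketch only; it assumes the values are not all equal.
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]

ages = [43, 21, 25, 42, 57, 59]      # the age values from Question 2
print(min_max_normalize(ages))       # all values now lie in [0, 1]

After such a rescaling, distance-based methods (for example, k-nearest neighbours or k-means) no longer let one wide-ranged attribute dominate the distance computation.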
2) Solution
Snowflake schema

The snowflake schema is a variant of the star schema. Here, the centralized fact table is connected
to multiple dimension tables. In the snowflake schema, the dimensions are present in normalized form in
multiple related tables. The snowflake structure materializes when the dimensions of a star schema
are detailed and highly structured, having several levels of relationship, and the child tables have
multiple parent tables. The snowflaking affects only the dimension tables and does not affect
the fact tables.
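To make the idea concrete, here is a small, hypothetical sketch of a snowflake-style layout in Python/pandas (all table and column names are invented for illustration): the item dimension is normalized, with supplier details split into their own table, and a query walks the chain fact table -> dimension -> sub-dimension.

# A toy snowflake layout: a fact table plus normalized dimension tables.
# Table and column names here are hypothetical.
import pandas as pd

# Fact table: foreign keys into the dimensions plus the measures.
sales = pd.DataFrame({
    "item_key": [1, 2, 1],
    "units_sold": [10, 5, 7],
})

# The "item" dimension is normalized: supplier details live in a child table.
item = pd.DataFrame({
    "item_key": [1, 2],
    "item_name": ["pen", "notebook"],
    "supplier_key": [100, 200],
})
supplier = pd.DataFrame({
    "supplier_key": [100, 200],
    "supplier_name": ["Acme", "Globex"],
})

# A query joins outward along the snowflake: fact -> item -> supplier.
report = sales.merge(item, on="item_key").merge(supplier, on="supplier_key")
print(report[["item_name", "supplier_name", "units_sold"]])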

Fact constellation
Fact Constellation is a schema for representing multidimensional model. It is a collection of
multiple fact tables having some common dimension tables. It can be viewed as a collection of
several star schemas and hence, also known as Galaxy schema. It is one of the widely used schema
for Data warehouse designing and it is much more complex than star and snowflake schema. For
complex systems, we require fact constellations.
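Continuing the same toy example, a fact constellation can be sketched as two fact tables sharing one dimension table (again, all names and values are hypothetical).

# A toy fact constellation (galaxy) schema: two fact tables, one shared dimension.
import pandas as pd

item = pd.DataFrame({"item_key": [1, 2], "item_name": ["pen", "notebook"]})

# Two star schemas that share the "item" dimension form the constellation.
sales = pd.DataFrame({"item_key": [1, 2, 1], "units_sold": [10, 5, 7]})
shipping = pd.DataFrame({"item_key": [2, 1], "units_shipped": [5, 12]})

print(sales.merge(item, on="item_key"))
print(shipping.merge(item, on="item_key"))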

Starnet Query Model 


The querying of multidimensional databases can be based on a starnet query model. A
starnet model consists of radial lines emanating from a central point, where each line
represents a concept hierarchy for a dimension. Each abstraction level in a hierarchy is
called a footprint, representing a granularity available for OLAP operations such as
drill-down and roll-up.
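In code, a starnet can be sketched as nothing more than a mapping from each dimension to its ordered concept hierarchy; the dimensions and levels below are illustrative examples, and roll_up is a hypothetical helper showing movement outward along one radial line.

# A starnet as plain data: each radial line (dimension) is a concept hierarchy
# ordered from the finest level to the most general.  Names are illustrative.
starnet = {
    "time":     ["day", "month", "quarter", "year"],
    "location": ["street", "city", "country", "continent"],
    "item":     ["name", "brand", "category", "type"],
}

def roll_up(dimension, level):
    # Move one footprint outward (toward the more general level) on a line.
    levels = starnet[dimension]
    i = levels.index(level)
    return levels[min(i + 1, len(levels) - 1)]

print(roll_up("time", "month"))  # -> "quarter"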

Question 2: Solution
SUBJECT   AGE (x)   GLUCOSE LEVEL (y)   xy       x²       y²
1         43        99                  4257     1849     9801
2         21        65                  1365     441      4225
3         25        79                  1975     625      6241
4         42        75                  3150     1764     5625
5         57        87                  4959     3249     7569
6         59        81                  4779     3481     6561
Σ         247       486                 20485    11409    40022
From our table:

Σx = 247
Σy = 486
Σxy = 20,485
Σx² = 11,409
Σy² = 40,022
n is the sample size, in our case n = 6
The correlation coefficient:
r = [nΣxy − (Σx)(Σy)] / √{[nΣx² − (Σx)²] × [nΣy² − (Σy)²]}
  = [6(20,485) − (247 × 486)] / √{[6(11,409) − 247²] × [6(40,022) − 486²]}
  = 2868 / √(7445 × 3936)

= 0.5298

The range of the correlation coefficient is from -1 to 1. Our result is 0.5298 or 52.98%, which
means the variables have a moderate positive correlation.
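As a quick sanity check, the same coefficient can be reproduced in a few lines of Python using only the values from the table above (the variable names are my own).

# Recompute the Pearson correlation coefficient for the data in Question 2.
import math

x = [43, 21, 25, 42, 57, 59]   # age
y = [99, 65, 79, 75, 87, 81]   # glucose level
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(round(r, 4))  # 0.5298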

Question 3: Solution

Step-1: Scan the DB once and find the frequent 1-itemsets (a single item in each itemset).

Step-2: As the minimum support threshold = 40%, in this step we remove all the items
that are bought in less than 40% of the transactions, i.e., with a support count of less than 2.

Step-3: Create an F-list in which the frequent items are sorted in descending
order of support.

Step-4: Sort the frequent items in each transaction based on the F-list. The result is
also known as the FPDP.
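Steps 1-4 can be sketched in a few lines of Python. Only the first three transactions are stated explicitly in the question, so the last two below are placeholders chosen to be consistent with the support counts used in the remaining steps.

# Steps 1-4: count supports, drop infrequent items, build the F-list,
# and reorder each transaction.  Transactions 4 and 5 are hypothetical.
from collections import Counter

transactions = [
    ["B", "P"],
    ["B", "P"],
    ["B", "P", "M"],
    ["M", "E"],   # placeholder
    ["E"],        # placeholder
]
min_support_count = 2  # 40% of 5 transactions

# Steps 1-2: frequent 1-itemsets.
counts = Counter(item for t in transactions for item in t)
frequent = {i: c for i, c in counts.items() if c >= min_support_count}

# Step 3: F-list, items in descending order of support.
f_list = sorted(frequent, key=lambda i: -frequent[i])

# Step 4: keep only frequent items and order each transaction by the F-list.
ordered = [[i for i in f_list if i in t] for t in transactions]
print(f_list, ordered)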

Step-5: Construct the FP tree

 Read transaction 1: {B,P} -> Create two nodes B and P, set the path as null -> B
-> P, and set the counts of B and P to 1.

 Read transaction 2: {B,P} -> The path will be null -> B -> P. As transactions 1
and 2 share the same path, increase the counts of B and P to 2.
 Read transaction 3: {B,P,M} -> The path will be null -> B -> P -> M. As
transactions 2 and 3 share the same path up to node P, set the counts of B
and P to 3 and create node M with count 1.

 Continue until all the transactions are mapped to a path in the FP-tree.
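A minimal sketch of the tree-building loop in step 5 follows (the class and variable names are my own; the tree simply shares common prefixes and accumulates counts along them).

# Build an FP-tree from transactions whose items are already ordered by the F-list.
class FPNode:
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

def build_fp_tree(ordered_transactions):
    root = FPNode()                      # the "null" root
    for transaction in ordered_transactions:
        node = root
        for item in transaction:         # items already in F-list order
            child = node.children.get(item)
            if child is None:
                child = FPNode(item)
                node.children[item] = child
            child.count += 1             # shared prefixes just increment counts
            node = child
    return root

# e.g. tree = build_fp_tree([["B", "P"], ["B", "P"], ["B", "P", "M"], ...])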

Step-6: Construct the conditional FP-tree for each item in the reverse order of the
F-list {E, M, P, B} and generate the frequent itemsets. A conditional FP-tree is a
subtree built by considering only the transactions containing a particular item and
then removing that item from all of those transactions.
The resulting table has two items, {B, P}, that are bought together frequently.

For items E and M, the nodes in their conditional FP-trees have a count (support) of 1,
which is less than the minimum support threshold of 2, so the frequent itemsets are nil.
For item P, node B in its conditional FP-tree has a count (support) of 3, satisfying the
minimum support threshold. Hence a frequent itemset is generated by adding item P to B,
giving {B, P}.
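As an optional cross-check (not part of the original solution), the third-party mlxtend library can run FP-growth end to end. As before, only the first three transactions come from the question; the last two are placeholders consistent with the counts above, and with min_support=0.4 the only 2-itemset reported is {B, P}.

# End-to-end FP-growth using mlxtend (a third-party library), for comparison.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["B", "P"],
    ["B", "P"],
    ["B", "P", "M"],
    ["M", "E"],   # placeholder
    ["E"],        # placeholder
]

# One-hot encode the transactions, then mine itemsets with support >= 40%.
te = TransactionEncoder()
onehot = te.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=te.columns_)
print(fpgrowth(df, min_support=0.4, use_colnames=True))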

Question 4: Solution

For attribute a1, the corresponding counts and probabilities are:
