
DATA MINING AND DATA WAREHOUSING

QUESTION BANK
Module 1 and Module 2
Sl No. Questions Marks
1 Why do many enterprises need a data warehouse? 4
2 What are OLTP and OLAP database systems? 4
3 What is an ODS and what is it used for? 4
4 Explain why ETL must deal with dirty data when extracting information from the source systems. 8
5 List the major steps involved in the ETL process. 6
6 What is the need for a separate database for decision makers? 4
7 What is a data warehouse and how might it be defined? 4
8 What are the likely benefits of building an enterprise data warehouse? 6
9 What is the major difference between the star schema and the snowflake schema? 8
10 List some differences between an OLTP system and a data warehouse system. 7
11 Describe the features of a data warehouse. 6
12 What is an OLTP database system? 8
13 What is an ODS used for? How does it differ from an OLTP system? 7
14 Give the three most important guidelines for implementing a data warehouse for a large enterprise. 7
15 Give two major components of any data warehouse system. 8
16 What is ETL? 4
17 Give two reasons why dirty data is extracted from source systems. 7
18 List four steps of the ETL process. 8
19 Define the terms star schema and snowflake schema. 10
20 What types of queries do managers need to pose to the enterprise’s database systems? 8
21 Describe the type of metadata that is maintained in a data warehouse. 8
22 What are the major differences between OLTP and a data warehouse system? 10
23 Explain the star schema technique of modeling a data warehouse. 8
24 What are the types of metadata that are maintained in a data warehouse? 8
25 What are dimensions, members, measures and fact tables? 7
26 What is OLAP? 4
27 List the characteristics of OLAP systems. 4
28 List some of the motivations for using OLAP. 6
29 Explain the multidimensional view and a data cube. 8
30 What are the different implementations of a data cube? 8
31 What are the differences between ROLAP and MOLAP? 10
32 Describe the operations roll-up, drill-down, slice and dice, and pivot. 10
33 List some guidelines for implementing OLAP. 8
34 What OLAP software is available in the market? 6
35 List four types of aggregate queries that are possible with two variables. 7
36 What are dimensions? 4
37 What is a measure? 4
38 What is a fact and a fact table? 6
39 Give a simple definition of OLAP. 7
40 List two major characteristics of OLAP. 5
41 Define data cube in your own words. 7
42 Show what a data cube of two dimensions looks like. 7
43 Give a simple data cube implementation. 8
44 Are all data cube entries non-zero? If not, why not? 8
45 What is the difference between roll-up and pivot? 10

B.E. 6th Semester Information Science
46 What is the difference between drill-down and slicing? 10

Module 2:
Sl No. Questions Marks
1. What is data mining? 5
2.
3. Describe the data mining functionalities: classification, prediction, clustering and evolution analysis. 5
4. What are the challenges in the methodology of data mining technology? 5
5. Discuss the issues to consider during data mining. 5
6. What defines a data mining task? Explain at least 5 primitives. 5
7. What is knowledge discovery? 5
8. Explain the motivating challenges in development of data mining. 5
9. Explain the data mining tasks with examples. 10
10 What is data? What do you mean by quality of data? 4
11 What is a data set? Explain the various types of data sets. 10
12 What is data preprocessing?
13 Explain the following, giving an example of each. 5 marks each
i. Aggregation
ii. Sampling
iii. Dimensionality reduction
iv. Feature subset selection
v. Feature creation
vi. Discretization and binarization
vii. Variable transformation
14 Explain the similarity and dissimilarity between 2 objects 6
15 What is Euclidean distance? Write the generalized Minkowski distance metric for various values of r. 8
16 Explain the properties of Euclidean distance. 6
17 What are the simple matching coefficient and the Jaccard coefficient? Explain with examples. 8
18 What is meant by cosine similarity? Explain with an example. 6
19 What is Bregman divergence? 5
20 What are the issues related to proximity measures? 10
21 Discuss the selection of the right proximity measure. 7

Module 3:
1. What is the Apriori algorithm? 5
2. Explain association rule mining. 5
3. What is a more efficient method for generalizing association rules? Explain. 5
4. Suppose that the following table is derived by attribute-oriented induction. 10

Class        Birth_place   Count
Programmer   Canada        180
Programmer   others        120
DBA          Canada        20
DBA          others        80

a. Transform the table into a crosstab showing the associated t-weights and d-weights.
b. Map the class Programmer into a (bidirectional) quantitative descriptive rule,
for example, ∀X, Programmer(X) ⇔ (birth_place(X) = “Canada” ∧ …) [t: x%,
d: y%] … ∨ (…) [t: w%, d: z%].
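As a rough aid for part (a), the sketch below computes t-weights and d-weights from the four counts in the table. It assumes the usual definitions: the t-weight of a cell is its count divided by the total for its class, and the d-weight is its count divided by the total for that birth_place across both classes. The variable names are illustrative only.

```python
# Counts read from the table in question 4: (class, birth_place) -> count.
counts = {
    ("Programmer", "Canada"): 180,
    ("Programmer", "others"): 120,
    ("DBA", "Canada"): 20,
    ("DBA", "others"): 80,
}

classes = sorted({c for c, _ in counts})
places = sorted({p for _, p in counts})

# Row totals (per class) and column totals (per birth_place).
class_total = {c: sum(counts[c, p] for p in places) for c in classes}
place_total = {p: sum(counts[c, p] for c in classes) for p in places}

# Assumed definitions: t-weight is relative to the class total,
# d-weight is relative to the birth_place total over all classes.
t_weight = {(c, p): 100 * n / class_total[c] for (c, p), n in counts.items()}
d_weight = {(c, p): 100 * n / place_total[p] for (c, p), n in counts.items()}

for c in classes:
    for p in places:
        print(f"{c:<11}{p:<8}count={counts[c, p]:<5}"
              f"t={t_weight[c, p]:.1f}%  d={d_weight[c, p]:.1f}%")
```

The printed rows can be arranged into the crosstab by hand. The rule in part (b) then takes its t and d values from the Programmer rows.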
5. Suppose that the data for analysis includes the attribute age. The age values for the
data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25,
25, 25, 30, 33, 33, 35, 35, 35, 36, 40, 45, 46, 52, 70. 10

a. What is the mean of the data? What is the median?
b. What is the mode of the data? Comment on the data’s modality
(i.e., bimodal, trimodal, etc.)
c. What is the midrange of the data?
d. Can you find (roughly) the first quartile (Q1) and the third quartile
(Q3) of the data?
e. Give the five-number summary of the data.
f. Show a boxplot of the data.
g. How is a quantile-quantile plot different from a quantile plot?
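One way to check hand calculations for parts (a) to (e) is a short script using Python's standard `statistics` module, applied to the age values exactly as listed above. Quartile conventions differ between textbooks, so the `method="inclusive"` choice here is an assumption and Q1 or Q3 may differ slightly from a value found by another rule.

```python
import statistics

# Age values exactly as listed in question 5.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 36, 40, 45, 46, 52, 70]

mean = statistics.mean(ages)
median = statistics.median(ages)
modes = statistics.multimode(ages)          # all most-frequent values
midrange = (min(ages) + max(ages)) / 2

# Quartiles by linear interpolation; other conventions give nearby values.
q1, _, q3 = statistics.quantiles(ages, n=4, method="inclusive")

five_number_summary = (min(ages), q1, median, q3, max(ages))

print("mean:", round(mean, 2))
print("median:", median)
print("mode(s):", modes)
print("midrange:", midrange)
print("five-number summary:", five_number_summary)
```

The five-number summary gives the values needed to draw the boxplot in part (f) by hand.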
6. A database has four transactions. Let min_sup=60% and min_conf=80% 10

TID date items_bought


100 10/15/99 {K, A, B, D}
200 10/15/99 {D, A, C, E, B}
300 10/19/99 {C, A, B, E}
400 10/22/99 {B, A, D}

a. Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the
efficiency of the two mining processes.
b. List all of the strong association rules (with support s and confidence c)
matching the following metarule, where X is a variable representing
customers, and item_i denotes variables representing items (e.g., “A”, “B”,
etc.):
∀X ∈ transactions, buys(X, item1) ∧ buys(X, item2) ⇒ buys(X, item3) [s, c]
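The Apriori half of part (a) and the rule check in part (b) can be sketched in a few lines of Python. This is a minimal level-wise illustration over the four transactions above, not an optimized implementation, and it does not cover FP-growth. The function and variable names are hypothetical.

```python
from itertools import combinations

# The four transactions from question 6 (dates are irrelevant to the mining).
transactions = [
    {"K", "A", "B", "D"},
    {"D", "A", "C", "E", "B"},
    {"C", "A", "B", "E"},
    {"B", "A", "D"},
]
MIN_SUP, MIN_CONF = 0.60, 0.80


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def apriori():
    """Return {frequent itemset: support}, built level by level."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= MIN_SUP]
    frequent = {}
    k = 1
    while level:
        frequent.update({s: support(s) for s in level})
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must already be frequent.
        level = [c for c in candidates
                 if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
                 and support(c) >= MIN_SUP]
    return frequent


frequent = apriori()

# Rules of the metarule's shape: two items in the antecedent, one in the consequent.
strong_rules = []
for itemset, sup in frequent.items():
    if len(itemset) != 3:
        continue
    for consequent in sorted(itemset):
        antecedent = itemset - {consequent}
        conf = sup / frequent[antecedent]
        if conf >= MIN_CONF:
            strong_rules.append((tuple(sorted(antecedent)), consequent, sup, conf))
strong_rules.sort()

for antecedent, consequent, sup, conf in strong_rules:
    print(f"{antecedent} => {consequent}  [s={sup:.0%}, c={conf:.0%}]")
```

With min_sup = 60% and four transactions, an itemset must appear in at least three transactions to count as frequent, which is why items such as C, E and K drop out at the first level.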
7. Prove that each entry in the following table correctly characterizes its corresponding
rule constraint for frequent itemset mining. 10

   Rule Constraint      Antimonotone   Monotone      Succinct

a. v ∈ S                No             Yes           Yes
b. S ⊆ V                Yes            No            Yes
c. min(S) ≤ v           No             Yes           Yes
d. range(S) ≤ v         Yes            No            No
e. variance(S) ≤ v      Convertible    Convertible   No

Module-4 and 5

1. Define classification. Explain the purposes of using a classification model 6


2. Explain the general approach for building a classification model. 10
3. What is a decision tree? How does a decision tree work? 10
4. Explain Hunt's algorithm for inducing decision trees. 10
5. What are the various methods for expressing attribute test conditions? Explain with examples. 12
6. Explain the measures that can be used to determine the best way to split the records. 12
7. Explain the decision tree induction algorithm. 10
8. What are the various characteristics of decision tree induction? 12
9. Explain the rule-based classifier with an example. 5
10. Explain how a rule-based classifier works with a suitable example. 6
11. Discuss the rule-based ordering scheme and the class-based ordering scheme. 10
12. Explain the direct methods of extracting the classification rules 8
13. Explain the indirect methods for rule extraction 8
14. What are the characteristics of rule-based classifiers? 10
15. Explain the Nearest-Neighbor classifier. 6
16. Discuss the k-nearest neighbor classification algorithm. 8
17. Explain the characteristics of Nearest-Neighbor classifiers 8
1. How do you compute dissimilarities in variables? 5
2. What is clustering? Briefly describe the following approaches to clustering:
partitioning methods and model-based methods. 5
3. Why is outlier mining important? 5
4. Explain statistical-based, distance-based and deviation-based outlier detection. 5
5. Briefly outline how to compute the dissimilarity between objects described by the
following types of variables:
a. Asymmetric binary variables
b. Nominal variables
c. Ratio-scaled variables
d. Numerical (interval-scaled) variables
6. Given the following measurements for the variable age:
18, 22, 25, 42, 28, 43, 33, 35, 56, 28
standardize the variable by the following:
a. Compute the mean absolute deviation for age.
b. Compute the z-score for the first four measurements.
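The arithmetic in question 6 can be checked with the sketch below. It assumes that the z-score in part (b) divides by the mean absolute deviation from part (a) rather than by the standard deviation. If the course uses the standard deviation instead, only the divisor changes.

```python
# Age measurements from question 6.
ages = [18, 22, 25, 42, 28, 43, 33, 35, 56, 28]

mean = sum(ages) / len(ages)

# (a) Mean absolute deviation: average distance of each value from the mean.
mean_abs_dev = sum(abs(x - mean) for x in ages) / len(ages)

# (b) z-scores for the first four measurements.
# Assumption: standardize with the mean absolute deviation, not the standard deviation.
z_scores = [(x - mean) / mean_abs_dev for x in ages[:4]]

print("mean:", mean)
print("mean absolute deviation:", round(mean_abs_dev, 2))
print("z-scores:", [round(z, 2) for z in z_scores])
```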

1. What is spatial data mining? 5


2. What is multimedia data mining? 5
3. What is Web usage mining? 5
4. What are the differences between no coupling, loose coupling, semi-tight coupling and tight coupling? 5
5. What is the difference between row scalability and column scalability? 5
6. Explain the difference between direct query answering and intelligent query answering with an example. 5
7. What are the trends in data mining? 5
8. Suppose that you are in the market to purchase a data mining system. 10
a. Regarding the coupling of a Data Mining System with a database and/or
data warehouse system, what are differences between no coupling, loose
coupling, semi tight coupling, & tight coupling?
b. What is the difference between row scalability & column scalability?
c. Which feature(s) from those listed above would you look for when scaling a
data mining system?
9. General-purpose computers and domain-independent relational database systems have
become a large market in the last several decades. However, many people feel that
generic data mining systems will not prevail in the data mining market. What do you
think? For data mining, should we focus our efforts on developing domain-independent
data mining tools or on developing domain-specific data mining solutions? Present
your reasoning. 10

