Code No: 2321204

Set No. 1

III B.Tech II Semester Regular Examinations, April/May 2009
DATA WAREHOUSING AND DATA MINING
(Information Technology)
Time: 3 hours    Max Marks: 80
Answer any FIVE Questions
All Questions carry equal marks
⋆⋆⋆⋆⋆

1. (a) Explain the major issues in data mining.
   (b) Explain the three-tier data warehousing architecture. [8+8]

2. (a) Briefly discuss the data smoothing techniques.
   (b) Explain concept hierarchy generation for categorical data. [8+8]

3. (a) Explain the syntax for task-relevant data specification.
   (b) Explain the syntax for specifying the kind of knowledge to be mined. [8+8]

4. (a) Write and explain the basic algorithm for attribute-oriented induction.
   (b) What are the differences between concept description in large databases and OLAP? [8+8]

5. Explain the Apriori algorithm with an example. [16]

6. (a) Why is tree pruning useful in decision tree induction? What is a drawback of using a separate set of samples to evaluate pruning?
   (b) How are the rough set and fuzzy set approaches useful for classification? Explain. [8+8]

7. (a) What are the categories of major clustering methods? Explain.
   (b) Explain outlier analysis. [6+10]

8. An e-mail database is a database that stores a large number of electronic mail messages. It can be viewed as a semistructured database consisting mainly of text data. Discuss the following:
   (a) How can such an e-mail database be structured so as to facilitate multidimensional search, such as by sender, by receiver, by subject, by time, and so on?
   (b) What can be mined from such an e-mail database?
   (c) Suppose you have roughly classified a set of your previous e-mail messages as junk, unimportant, normal, or important. Describe how a data mining system may take this as the training set to automatically classify new e-mail messages or unclassified ones. [5+5+6]

⋆⋆⋆⋆⋆

Code No: 2321204

Set No. 2

III B.Tech II Semester Regular Examinations, April/May 2009
DATA WAREHOUSING AND DATA MINING
(Information Technology)
Time: 3 hours    Max Marks: 80
Answer any FIVE Questions
All Questions carry equal marks
⋆⋆⋆⋆⋆

1. (a) Discuss data mining on data warehousing.
   (b) Discuss the various types of warehouse servers for OLAP processing. [8+8]

2. Explain various data reduction techniques. [16]

3. (a) Explain the syntax for concept hierarchy specification.
   (b) Explain the syntax for specifying the kind of knowledge to be mined. [8+8]

4. (a) What is concept description? Explain.
   (b) What are the differences between concept description in large databases and OLAP? [8+8]

5. (a) Which algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules? Explain.
   (b) Discuss association mining using correlation rules. [8+8]

6. The following table consists of training data from an employee database. The data have been generalized. For a given row entry, count represents the number of data tuples having the values for department, status, age, and salary given in that row:

   Department  Status  Age      Salary     Count
   Sales       Senior  31...35  46K...50K  30
   Sales       Junior  26...30  26K...30K  40
   Sales       Junior  31...35  31K...35K  40
   Systems     Junior  21...25  46K...50K  20
   Systems     Senior  31...35  66K...70K  5
   Systems     Junior  26...30  46K...50K  3
   Systems     Senior  41...45  66K...70K  3
   Marketing   Senior  36...40  46K...50K  10
   Marketing   Junior  31...35  41K...45K  4
   Secretary   Senior  46...50  36K...40K  4
   Secretary   Junior  26...30  26K...30K  6

   Let salary be the class label attribute. Given a data sample with the values “systems”, “junior”, and “26...30” for the attributes department, status, and age, respectively, what would a naive Bayesian classification of the salary for the sample be? [16]

7. (a) Categorize the major clustering methods.
   (b) Explain the OPTICS algorithm.
   (c) What is an outlier? Why is outlier mining important? Briefly discuss statistical-based outlier detection. [4+4+8]

8. An e-mail database is a database that stores a large number of electronic mail messages. It can be viewed as a semistructured database consisting mainly of text data. Discuss the following:
   (a) How can such an e-mail database be structured so as to facilitate multidimensional search, such as by sender, by receiver, by subject, by time, and so on?
   (b) What can be mined from such an e-mail database?
   (c) Suppose you have roughly classified a set of your previous e-mail messages as junk, unimportant, normal, or important. Describe how a data mining system may take this as the training set to automatically classify new e-mail messages or unclassified ones. [5+5+6]

⋆⋆⋆⋆⋆


Code No: 2321204

Set No. 3

III B.Tech II Semester Regular Examinations, April/May 2009
DATA WAREHOUSING AND DATA MINING
(Information Technology)
Time: 3 hours    Max Marks: 80
Answer any FIVE Questions
All Questions carry equal marks
⋆⋆⋆⋆⋆

1. (a) Explain advanced database systems and advanced database applications.
   (b) Draw the integrated OLAM and OLAP architecture. Explain. [8+8]

2. Suppose that the data for analysis include the attribute age. The age values for the data tuples are (in increasing order): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
   (a) Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your steps. Comment on the effect of the technique for the given data.
   (b) How might you determine outliers in the data?
   (c) What other methods are there for data smoothing? [16]

3. (a) Briefly discuss task-relevant data specification.
   (b) Explain the syntax for task-relevant data specification. [8+8]

4. Suppose that the data for analysis include the attribute age. The age values for the data tuples are (in increasing order): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
   (a) What is the mean of the data?
   (b) What is the median?
   (c) What is the mode of the data? Comment on the data's modality.
   (d) What is the midrange of the data?
   (e) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
   (f) Give the five-number summary of the data.
   (g) Show a box plot of the data.
   (h) How is the quantile-quantile plot different from a quantile plot? [16]

5. (a) Which algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules? Explain.
   (b) Discuss association mining using correlation rules. [8+8]

6. The following table consists of training data from an employee database. The data have been generalized. For a given row entry, count represents the number of data tuples having the values for department, status, age, and salary given in that row:

   Department  Status  Age      Salary     Count
   Sales       Senior  31...35  46K...50K  30
   Sales       Junior  26...30  26K...30K  40
   Sales       Junior  31...35  31K...35K  40
   Systems     Junior  21...25  46K...50K  20
   Systems     Senior  31...35  66K...70K  5
   Systems     Junior  26...30  46K...50K  3
   Systems     Senior  41...45  66K...70K  3
   Marketing   Senior  36...40  46K...50K  10
   Marketing   Junior  31...35  41K...45K  4
   Secretary   Senior  46...50  36K...40K  4
   Secretary   Junior  26...30  26K...30K  6

   Let salary be the class label attribute. Given a data sample with the values “systems”, “junior”, and “26...30” for the attributes department, status, and age, respectively, what would a naive Bayesian classification of the salary for the sample be? [16]

7. (a) What are the types of data in cluster analysis? Explain.
   (b) Explain partitioning methods in detail. [8+8]

8. (a) What kinds of association can be mined in multimedia data? What are the differences between mining association rules in multimedia databases versus transactional databases?
   (b) How does latent semantic indexing reduce the size of the term frequency matrix? Explain.
   (c) Describe the construction of a multilayered web information base. [3+3+6+4]

⋆⋆⋆⋆⋆


Code No: 2321204

Set No. 4

III B.Tech II Semester Regular Examinations, April/May 2009
DATA WAREHOUSING AND DATA MINING
(Information Technology)
Time: 3 hours    Max Marks: 80
Answer any FIVE Questions
All Questions carry equal marks
⋆⋆⋆⋆⋆

1. (a) Draw and explain the architecture for on-line analytical mining.
   (b) Briefly discuss data warehouse applications. [8+8]

2. Briefly discuss the discretization and concept hierarchy techniques. [16]

3. The four major types of concept hierarchies are: schema hierarchies, set-grouping hierarchies, operation-derived hierarchies, and rule-based hierarchies.
   (a) Briefly define each type of hierarchy.
   (b) For each hierarchy type, provide an example. [16]

4. Write short notes on the following in detail:
   (a) Attribute-oriented induction.
   (b) Efficient implementation of attribute-oriented induction. [8+8]

5. (a) Explain the basic concept of association rule mining and a road map of it.
   (b) Briefly explain constraint-based association mining. [8+8]

6. (a) Write an algorithm for k-nearest neighbor classification given k and n, the number of attributes describing each sample.
   (b) What is linear regression? Give an example of linear regression using the method of least squares. [8+8]

7. (a) Given the following measurements for the variable age: 16, 25, 28, 46, 29, 44, 38, 37, 54, 27. Standardize the variable by the following:
       i. Compute the mean absolute deviation of age.
       ii. Compute the z-score for the first four measurements.
   (b) Explain the Clustering Using Representatives algorithm with an example.
   (c) Write an algorithm for DBSCAN and give an example of DBSCAN. [4+4+4+4]

8. (a) What are the different approaches for similarity-based retrieval in image databases?
   (b) Define similarity search. Explain similarity search in time-series analysis.
   (c) Write a note on mining the World Wide Web. [4+6+6]

⋆⋆⋆⋆⋆
