
COURSE OBJECTIVE: Data Warehousing and Data Mining are recent developments in database technology that aim to address the problem of extracting information from the overwhelmingly large amounts of data modern societies are capable of amassing. Data warehousing focuses on supporting the analysis of data in a multidimensional way; data mining focuses on inducing compressed representations of data in the form of descriptive and predictive models. This course gives students a good overview of the ideas and techniques behind recent developments in data warehousing. It also helps students understand On-Line Analytical Processing (OLAP), create data models, and work with data mining query languages, conceptual design methodologies, and storage techniques. Students learn to identify and develop algorithms that discover useful knowledge from tremendous data volumes, and to determine in which application areas data mining can be applied. Research-oriented topics such as web mining and text mining are introduced to discuss the scope of research in this subject. At the end of the course the student will understand the difference between a data warehouse and a database, and will be able to implement the mining algorithms practically.

SYLLABUS:

III Year IT II Semester
DATA WAREHOUSING AND MINING

UNIT-I
Introduction: Fundamentals of data mining, data mining functionalities, classification of data mining systems, major issues in data mining. Data Warehouse and OLAP Technology for Data Mining: data warehouse, multidimensional data model, data warehouse architecture, data warehouse implementation, further development of data cube technology, from data warehousing to data mining. Data Preprocessing: need for preprocessing the data, data cleaning, data integration and transformation, data reduction, discretization and concept hierarchy generation, online data storage.

UNIT-II
Data Warehouse and OLAP Technology for Data Mining: data warehouse, multidimensional data model, data warehouse architecture, data warehouse implementation, further development of data cube technology, from data warehousing to data mining. Data Cube Computation and Data Generalization: efficient methods for data cube computation, further development of data cube and OLAP technology, attribute-oriented induction.

UNIT-III
Mining Frequent Patterns, Associations and Correlations: basic concepts, efficient and scalable frequent itemset mining methods, mining various kinds of association rules, from association mining to correlation analysis, constraint-based association mining.

UNIT-IV
Classification and Prediction: issues regarding classification and prediction, classification by decision tree induction, Bayesian classification, rule-based classification, classification by backpropagation, support vector machines, associative classification, lazy learners, other classification methods, prediction, accuracy and error measures, evaluating the accuracy of a classifier or predictor, ensemble methods for increasing accuracy.
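The UNIT-I preprocessing topics (data cleaning, noisy data, discretization) are usually illustrated with small numeric examples. As an illustrative sketch only (the function name and sample values are assumptions, not part of the prescribed syllabus text), smoothing noisy data by equal-depth binning and bin means can be written as:

```python
def smooth_by_bin_means(values, depth):
    """Smooth numeric data: sort it, partition it into equal-depth bins,
    then replace each value by the mean of its bin."""
    data = sorted(values)
    smoothed = []
    for i in range(0, len(data), depth):
        bin_ = data[i:i + depth]
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative sample values
print(smooth_by_bin_means(prices, 3))
# bins [4,8,15], [21,21,24], [25,28,34] -> means 9.0, 22.0, 29.0
```

Bin boundaries or bin medians could be used in place of bin means; only the replacement step changes.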

UNIT-V
Cluster Analysis: introduction, types of data in cluster analysis, a categorization of major clustering methods, partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based clustering methods, clustering high-dimensional data, constraint-based cluster analysis, outlier analysis.

UNIT-VI
Mining Streams, Time-Series and Sequence Data: mining data streams, mining time-series and sequence data, mining sequence patterns in transactional databases, mining sequence patterns in biological data, graph mining, social network analysis, multirelational data mining.

UNIT-VII
Mining Object, Spatial, Multimedia, Text, and Web Data: multidimensional analysis and descriptive mining of complex data objects, spatial data mining, multimedia data mining, text data mining, mining the World Wide Web.

UNIT-VIII
Applications and Trends in Data Mining: data mining applications, data mining system products and research prototypes, additional themes on data mining, social impacts of data mining.
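The partitioning methods of UNIT-V can be illustrated with a tiny sketch of the k-means idea (assign each point to its nearest centroid, then recompute centroids as cluster means). The data and parameter values below are illustrative assumptions, not prescribed course material:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means over tuples: nearest-centroid assignment followed by
    centroid recomputation, repeated for a fixed number of iterations."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        for j, cl in enumerate(clusters):
            if cl:  # leave an empty cluster's centroid unchanged
                centroids[j] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centroids, clusters

cents, _ = kmeans([(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.0)], 2)
print(sorted(cents))
```

Real implementations add a convergence test and better seeding (e.g. k-means++); this sketch only shows the two alternating steps.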

SUGGESTED BOOKS:

(i) TEXT BOOKS:
T1. Jiawei Han & Micheline Kamber, "Data Mining – Concepts and Techniques", Morgan Kaufmann Publishers.
T2. Pang-Ning Tan et al., "Introduction to Data Mining", Pearson Education, 2006.

(ii) REFERENCES:
R1. Arun K Pujari, "Data Mining Techniques", University Press.
R2. Sam Anahory & Dennis Murray, "Data Warehousing in the Real World", Pearson Education Asia.
R3. K. P. Soman, "Insight into Data Mining", PHI, 2008.
R4. Paulraj Ponniah, "Data Warehousing Fundamentals", Wiley Student Edition.
R5. Ralph Kimball, "The Data Warehouse Lifecycle Toolkit", Wiley Student Edition.

SESSION PLAN
Subject: Data Warehousing and Mining    Year/Sem: IT III/II

UNIT-I (Lectures 1–11; mainly T1 Chs. 1–3):
Overview of all units and course delivery plan; data mining fundamentals; data mining functionalities – what kinds of patterns can be mined; data mining as a confluence of multiple disciplines; classification of data mining systems; major issues in data mining; data mining task primitives – primitives for specifying a data mining task; integration of a data mining system with a database or data warehouse system; need for preprocessing the data – missing values, noisy data, inconsistent data; data cleaning; data integration and transformation; data reduction – data cube aggregation, dimensionality reduction, data compression, numerosity reduction for numeric data; discretization and concept hierarchy generation for numeric and categorical data; revision of Unit-I.

UNIT-II (Lectures 12–19; mainly T1 Chs. 2–4):
What a data warehouse is and its importance; the multidimensional data model – from tables and spreadsheets to data cubes; star, snowflake and fact constellation schemas for multidimensional databases; steps for the design and construction of a data warehouse; a three-tier data warehouse architecture; types of OLAP servers; metadata repository; data warehouse back-end tools; data warehouse implementation – efficient computation of data cubes, indexing OLAP data; data warehouse usage; from OLAP to OLAM; efficient methods for data cube computation – different kinds of cubes, the full cube, BUC, Star-Cubing for fast high-dimensional OLAP, cubes with iceberg conditions; further development of data cube and OLAP technology – discovery-driven exploration of data cubes, complex aggregation at multiple granularities; attribute-oriented induction – efficient implementation, presentation of the derived generalization, mining class comparisons, class description; revision of Unit-II.

UNIT-III (Lectures 20–24; mainly T1 Ch. 5):
Market basket analysis; frequent itemsets, closed itemsets and association rules; frequent pattern mining; the Apriori algorithm; generating association rules from frequent itemsets; improving the efficiency of Apriori; mining frequent itemsets without candidate generation; mining frequent itemsets using the vertical data format; mining closed frequent itemsets; multilevel association rules; mining multidimensional association rules from relational databases and data warehouses; strong rules are not necessarily interesting – from association analysis to correlation analysis; metarule-guided mining of association rules; mining guided by additional rule constraints.

UNIT-IV (Lectures 25–38; mainly T1 Ch. 6):
Preparing the data for classification and prediction; comparing classification methods; decision tree induction – attribute selection measures, tree pruning, rule extraction from a decision tree, scalability and decision tree induction; Bayesian classification – Bayes' theorem, naive Bayesian classification, Bayesian belief networks and their training; rule-based classification – using IF-THEN rules for classification, rule induction using a sequential covering algorithm; classification by backpropagation – a multilayer feed-forward neural network, defining a network topology, backpropagation and interpretability; support vector machines – the linearly separable and linearly inseparable cases; associative classification – classification by association rule analysis; lazy learners – k-nearest-neighbor classifiers, case-based reasoning; other classification methods – genetic algorithms, rough set approach, fuzzy set approaches; prediction – linear and multiple regression, nonlinear regression, other regression models; classifier accuracy measures and predictor error measures; evaluating accuracy – holdout method and random subsampling, cross-validation, bootstrap; ensemble methods – bagging and boosting; revision of Units III and IV.

UNIT-V (Lectures 39–47; mainly T1 Ch. 7):
Types of data in cluster analysis – interval-scaled variables, binary variables, nominal, ordinal and ratio-scaled variables, variables of mixed types, vector objects; a categorization of major clustering methods; partitioning methods – classical partitioning methods; hierarchical methods – agglomerative and divisive, BIRCH, ROCK, Chameleon; density-based methods – DBSCAN, OPTICS, DENCLUE; grid-based methods – STING, WaveCluster; model-based clustering methods – expectation-maximization, conceptual clustering, neural network approach; clustering high-dimensional data – CLIQUE, PROCLUS, frequent-pattern-based cluster analysis; constraint-based cluster analysis – clustering with obstacle objects, user-constrained cluster analysis, semi-supervised cluster analysis; outlier analysis – statistical-distribution-based, distance-based, density-based and deviation-based approaches; revision of Unit-V.

UNIT-VI (Lectures 48–55; mainly T1 Chs. 8–9):
Mining data streams – methodologies for stream data processing and stream data systems, stream OLAP and stream data cubes, frequent-pattern mining in data streams, classification and clustering of evolving data streams; mining time-series data – trend analysis, similarity search in time-series analysis; mining sequence patterns in transactional databases – scalable methods for sequential pattern mining, constraint-based sequential pattern mining, periodicity analysis for time-related data; mining sequence patterns in biological data – alignment of biological sequences, hidden Markov models; graph mining – methods for mining frequent subgraphs, mining variant and constrained substructure patterns, applications; social network analysis – what social networks are, their characteristics, link mining, mining on social networks; multirelational data mining – what multirelational data mining is, the ILP approach, tuple-ID propagation, multirelational classification, multirelational clustering; revision of Unit-VI.

UNIT-VII (Lectures 56–61; mainly T1 Ch. 10):
Multidimensional analysis and descriptive mining of complex data objects – generalization of structured data, aggregation and approximation in spatial and multimedia data generalization, generalization of object identifiers and class/composition hierarchies, construction and mining of object cubes, generalization-based mining of plan databases by divide-and-conquer; spatial data mining – spatial data cube construction and spatial OLAP, spatial association analysis, clustering, classification and trend analysis, mining raster databases; multimedia data mining – similarity search in multimedia data, multidimensional analysis, classification and prediction analysis, mining associations in multimedia data; text data mining – text data analysis and information retrieval, dimensionality reduction, text mining; mining the World Wide Web – mining the web's link structures, automatic classification of web documents, construction of a multilayered web information base, web usage mining.

UNIT-VIII (Lectures 62–67; mainly T1 Ch. 11):
Data mining applications – financial data analysis, the retail industry, the telecommunication industry, biological data analysis, other scientific applications, intrusion detection; data mining system products and research prototypes – how to choose a data mining system, examples of commercial data mining systems; revision of Units VII and VIII; subject summary.

WEBSITES:
1. www.the-data-mine.com/bin/view/Misc/IntroductionToDataMining : Links about different journals of Data Warehousing and Mining, and links to books for Data Mining.
2. http://dataminingwarehousing.blogspot.com
3. www.qub.ac.uk/tec/courses/DataMining/ohp/dm-OHP-final_1.html : Online notes for DM.
4. http://pagesperso-orange.fr/bernard.lupin/english/ : Provides information about OLAP.
5. www.cs.sfu.ca/~han/dmbook : Slides of the text book (T1).

JOURNALS:
1. Data Mining and Knowledge Discovery: publishes original technical papers in both the research and practice of data mining and knowledge discovery, surveys and tutorials of important areas and techniques, and detailed descriptions of significant applications.
2. IEEE Transactions on Knowledge and Data Engineering: publishes papers on data mining techniques such as classification, prediction and cluster analysis, which gives students exposure to how these techniques can be used in real-world applications.
3. Journal of Intelligent Systems: provides readers with a compilation of stimulating and up-to-date articles within the field of intelligent systems. The focus of the journal is on high-quality research that addresses paradigms, development, applications and implications in the field of intelligent systems. The list of topics spans all areas of modern intelligent systems, such as: Artificial Intelligence, the Knowledge Discovery Process, Knowledge Management and Representation, Intelligent System Design, Data Mining, Data Mining Methods, Algorithms for Data Mining, Supervised, Semi-Supervised and Unsupervised Learning, Bayesian Learning, Association Analysis, Prediction, Evolving Clustering Methods, Swarm Intelligence, Evolutionary Algorithms, Decision Support Systems, Intelligent Agents and Multi-Agent Systems, Natural Language Processing and Fusion, as well as Theory and Foundational Issues and Application Issues.
4. IEEE Transactions on Pattern Analysis and Machine Intelligence: includes all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence. Areas such as machine learning, search techniques, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition, and relevant specialized hardware and/or software architectures are also covered.

STUDENT SEMINAR TOPICS:
1. V. Koutsonikola, S. Petridou, A. Vakali & G. Papadimitriou, "Time-Aware Web Users Clustering", IEEE Trans. Knowledge and Data Engineering, Vol. 20, 2008.
2. "Automatic Website Summarization By Image Content".
3. F. J. Ruiz, C. Angulo & N. Agell, "IDD: A Supervised Interval Distance-Based Method for Discretization", IEEE Trans. Knowledge and Data Engineering, Vol. 20, 2008.
4. J. M. Perez, R. Berlanga, M. J. Aramburu & T. B. Pedersen, "Integrating Data Warehouses With Web Data", IEEE Trans. Knowledge and Data Engineering, Vol. 20, 2008.
5. Jürgen Hofer & Peter Brezany, "Distributed Decision Tree Induction within the Grid Data Mining Framework GridMiner-Core".
6. Philip K. Chan et al., "Distributed Data Mining in Credit Card Fraud Detection".
7. Fei Wang & Changshui Zhang, "Label Propagation through Linear Neighborhoods", IEEE Trans. Knowledge and Data Engineering, Vol. 20, 2008.
8. Xiang Lian & Lei Chen, "Efficient Similarity Search Over Future Stream Time Series", IEEE Trans. Knowledge and Data Engineering, Vol. 20, No. 1, January 2008.

QUESTION BANK:

UNIT-I
1. (a) Explain data mining as a step in the process of knowledge discovery. (b) Differentiate operational database systems and data warehousing. (R07 APRIL 2011)
2. (a) Briefly explain about the forms of data preprocessing. (b) Briefly discuss about the architectures of data mining systems. (R07 APRIL 2011)
3. Discuss about primitives for specifying a data mining task. (R07 APRIL 2011)
4. (a) Briefly discuss about specifying the kind of knowledge to be mined. (b) Briefly discuss about data transformation. (R07 APRIL 2011)
5. Write short notes on: (i) Discriminating different classes (ii) Statistical measures in large databases. (R07 APRIL 2011)
6. (a) Explain about the graph displays of basic statistical class description. (b) Briefly explain about the presentation of class comparison descriptions. (R07 APRIL 2011)
7. (a) Justify the role of data cube aggregation in the data reduction process with an example. (b) Discuss the role of numerosity reduction in the data reduction process in detail. (R07 APRIL 2011)
8. Discuss about the role of data integration and transformation in data preprocessing. (R07 APRIL 2011)
9. (a) Briefly discuss about data integration. (b) Discuss issues to be considered during the data integration process. (R07 APRIL 2011)
10. Explain the syntax for the following data mining primitives: (a) Task-relevant data (b) The kind of knowledge to be mined (c) Background knowledge (d) Interestingness measures. (R07 APRIL 2011)
11. List the statistical measures for the characterization of data dispersion, and discuss how they can be computed efficiently in large databases. (R07 APRIL/MAY 2011)
12. Explain the following terms in detail: (a) Concept description (b) Variance and standard deviation (c) Mean, median, and mode (d) Quartiles, outliers, and boxplots. (R07 APRIL/MAY 2011)
13. Suppose that the data for analysis includes the attribute age, and that the age values for the data tuples are, in increasing order: 13, 16, 16, 23, 23, 25, 25, 25, 25, 30, 30, 30, 30, 35, 35, 35, 40, 40, 45, 45, 45, 70. (a) How might you determine the outliers in the data? (b) What other methods are there for data smoothing? (R07 APRIL/MAY 2011)
14. List and describe the primitives for the data mining task. (R07 APRIL/MAY 2011)
15. (a) Describe why it is important to have a data mining query language. (b) Explain the syntax for specifying the kind of knowledge to be mined. (R07 APRIL/MAY 2011)
16. (a) Discuss various issues in data integration. (b) Explain concept hierarchy generation for categorical data. (R07 APRIL/MAY 2011)
17. (a) Why is it important to have a data mining query language? (b) Define schema and operation-derived hierarchies. (R07 APRIL/MAY 2011)
18. List and describe the various types of concept hierarchies. (R07 APRIL/MAY 2011)
19. (a) Explain the storage models of OLAP. (b) How do data warehousing and data mining work together? (R07 APRIL/MAY 2011)
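For the outlier question on the age data above, one standard answer is the interquartile-range rule. A hedged sketch (the linear-interpolation quartile convention below is only one of several in use):

```python
def iqr_outliers(sorted_vals):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    n = len(sorted_vals)

    def quartile(q):
        # linear-interpolation quartile (one common convention)
        pos = q * (n - 1)
        lo, hi = int(pos), min(int(pos) + 1, n - 1)
        frac = pos - lo
        return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

    q1, q3 = quartile(0.25), quartile(0.75)
    iqr = q3 - q1
    return [v for v in sorted_vals
            if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

ages = [13, 16, 16, 23, 23, 25, 25, 25, 25, 30, 30, 30, 30,
        35, 35, 35, 40, 40, 45, 45, 45, 70]
print(iqr_outliers(ages))  # [70]
```

Under this convention Q1 = 25, Q3 = 38.75, so only the value 70 falls outside the fences.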

20. What are the various issues in data mining? Explain each one in detail. (R07 APRIL/MAY 2011)
21. Why preprocess the data? Explain in brief. (R07 APRIL/MAY 2011)
22. Write short notes on GUI and DMQL. How can a GUI be designed based on DMQL? (R07 APRIL/MAY 2011)
23. (a) What are the various issues relating to the diversity of database types? (b) Explain how data mining is used in health care analysis. (R07 APRIL/MAY 2011)
24. (a) When is a summary table too big to be useful? (b) Relate and discuss the various degrees of aggregation within summary tables. (NR APRIL 2011)
25. Write short notes on the following: (a) Association analysis (b) Classification and prediction (c) Cluster analysis (d) Outlier analysis. (NR APRIL 2011)
26. (a) Discuss construction and mining of object cubes. (b) Give a detailed note on trend analysis. (JNTU Dec 2010)
27. Briefly discuss the following data mining primitives: (a) Task-relevant data (b) The kind of knowledge to be mined (c) Interestingness measures (d) Presentation and visualization of discovered patterns. (JNTU Dec 2010)
28. (a) Draw and explain the architecture of a typical data mining system. (b) Why is preprocessing of data needed? (JNTU Dec 2010)
29. (a) How can you go about filling in the missing values in the data cleaning process? (b) Discuss the data smoothing techniques. (JNTU Dec 2010)
30. (a) List and describe any four primitives for specifying a data mining task. (b) Briefly explain about concept hierarchies. (JNTU Dec 2010)
31. (a) Explain the design and construction process of data warehouses. (b) Discuss the issues regarding data warehouse architecture. (JNTU Dec 2010)
32. (a) What are the desired architectures for data mining systems? (b) Differentiate OLTP and OLAP. (JNTU Dec 2010)
33. (a) Explain data mining as a step in the process of knowledge discovery. (b) Explain the architecture of a typical data mining system. (JNTU Dec 2010)
34. (a) Briefly discuss the data smoothing techniques. (b) Explain about concept hierarchy generation for categorical data. (JNTU Dec 2010)
35. (a) Briefly discuss about data integration. (b) Briefly discuss about data transformation. (JNTU Dec 2010)
36. (a) List and describe any four primitives for specifying a data mining task. (b) Describe why concept hierarchies are useful in data mining. (JNTU Dec 2010)
37. (a) Explain data mining as a step in the process of knowledge discovery. (b) Differentiate operational database systems and data warehousing. (JNTU Dec 2010)
38. Explain various data reduction techniques. (JNTU Aug/Sep 2008)
39. (a) List and describe any four primitives for specifying a data mining task. (b) Describe why concept hierarchies are useful in data mining. (JNTU Aug/Sep 2008)
40. (a) Explain the design and construction process of data warehouses. (b) Differentiate operational database systems and data warehousing. (JNTU Aug/Sep 2008)
41. (a) Briefly discuss the data smoothing techniques. (b) Explain about concept hierarchy generation for categorical data. (JNTU Aug/Sep 2008)
42. (a) How can we smooth out noise in the data cleaning process? Explain. (b) Describe why concept hierarchies are useful in data mining. (JNTU Aug/Sep 2008)
43. (a) Explain data mining as a step in the process of knowledge discovery. (b) Explain the architecture of a typical data mining system. (JNTU Aug/Sep 2008)
44. (a) Draw and explain the architecture of a typical data mining system. (b) Differentiate OLTP and OLAP. (JNTU Aug/Sep 2008)
45. (a) Explain data mining as a step in the process of knowledge discovery. (b) Differentiate operational database systems and data warehousing. (JNTU Aug/Sep 2008)
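The summary-table aggregation question above is essentially a data-cube roll-up. A minimal sketch of one cuboid computed by group-by aggregation (the dimension names and sales figures are illustrative assumptions):

```python
def rollup(rows, dims):
    """Aggregate the 'sales' measure over the chosen dimensions (one cuboid)."""
    out = {}
    for row in rows:
        key = tuple(row[d] for d in dims)
        out[key] = out.get(key, 0) + row["sales"]
    return out

rows = [
    {"city": "Delhi", "item": "PC", "sales": 5},
    {"city": "Delhi", "item": "TV", "sales": 3},
    {"city": "Pune",  "item": "PC", "sales": 2},
]
print(rollup(rows, ["city", "item"]))  # base cuboid
print(rollup(rows, ["city"]))          # roll-up over item
print(rollup(rows, []))                # apex cuboid: grand total
```

Each subset of dimensions yields one cuboid; the full data cube is the lattice of all such subsets, which is why summary tables grow so quickly with dimensionality.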

46. (a) How can we specify a data mining query for characterization with DMQL? (b) Describe the transformation of a data mining query to a relational query. (JNTU Aug/Sep 2008)
47. Discuss the role of data compression and numerosity reduction in the data reduction process. (JNTU Aug/Sep 2008, May 2008)
48. Briefly discuss the role of data cube aggregation and dimension reduction in the data reduction process. (JNTU May 2008)
49. Write the syntax for the following data mining primitives: (a) Task-relevant data (b) Concept hierarchies. (JNTU May 2008)
50. Write the syntax for the following data mining primitives: (a) The kind of knowledge to be mined (b) Measures of pattern interestingness. (JNTU May 2008)
51. (a) Briefly discuss various forms of presenting and visualizing the discovered patterns. (b) Discuss about the objective measures of pattern interestingness. (JNTU May 2008)
52. (a) Explain the major issues in data mining. (b) Briefly discuss the data warehouse applications. (JNTU May 2008)
53. Briefly compare the following concepts; use an example to explain your points: (a) Snowflake schema, fact constellation, starnet query model (b) Data cleaning, data transformation, refresh (c) Discovery-driven cube, multifeature cube, virtual warehouse. (JNTU May 2008)
54. The four major types of concept hierarchies are: schema hierarchies, set-grouping hierarchies, operation-derived hierarchies, and rule-based hierarchies. Briefly define each type of hierarchy. (JNTU May 2008)
55. (a) Draw and explain the architecture for on-line analytical mining. (b) Explain the three-tier data warehousing architecture. (JNTU May 2008)
56. (a) Draw and explain the architecture of a typical data mining system. (b) Briefly discuss the data warehouse applications. (JNTU May 2008)

UNIT-II
1. (a) What is concept description? Explain. (b) What are the differences between concept description in large databases and OLAP? (R07 APRIL 2011)
2. (a) What does the data warehouse provide for the business analyst? Explain. (b) How are data warehousing and OLAP related to data mining? (R07 APRIL 2011)
3. Write short notes for the following in detail: (a) Attribute-oriented induction (b) Efficient implementation of attribute-oriented induction. (R07 APRIL 2011)
4. (a) How can we perform attribute relevance analysis for concept description? Explain. (b) Explain the analytical characterization with an example. (R07 APRIL 2011)
5. Explain the syntax for task-relevant data specification. (R07 APRIL 2011)
6. (a) Differentiate between OLAP and OLTP. (R07 APRIL/MAY 2011)

UNIT-III
1. (a) Explain how concept hierarchies are used in mining multilevel association rules. (R07 APRIL 2011)

(b) Draw and explain the star schema for the data warehouse. (R07 APRIL/MAY 2011)
7. What is data compression? How would you compress data using principal component analysis (PCA)? (R07 APRIL/MAY 2011)
8. How is class comparison performed? Can class comparison mining be implemented efficiently using data cube techniques? If yes, explain. (R07 APRIL/MAY 2011)
9. (a) Explain the ad hoc query and automation in the data warehouse delivery process. (b) Explain the idea "Can we do without an enterprise data warehouse?" (NR APRIL 2011)
10. (a) Describe the server management features of a data warehouse system. (b) Describe the recovery strategies of a data warehouse system. (NR APRIL 2011)
11. (a) Explain the basic levels of testing a data warehouse. (b) Explain a plan for testing the data warehouse. (NR APRIL 2011)
12. (a) Discuss the design strategies to implement backup strategies. (b) "Management tools are required to manage a large, dynamic and complex system such as a data warehouse system" – support your explanation with justification. (NR APRIL 2011)
13. Explain the significance of tuning the data warehouse, and explain the steps involved in it. (NR APRIL 2011)
14. (a) Differentiate attribute generalization threshold control and generalized relation threshold control. (b) Differentiate between predictive and descriptive data mining. (JNTU Dec 2010)
15. Write short notes for the following in detail: (a) Measuring the central tendency (b) Measuring the dispersion of data. (JNTU Dec 2010)
16. (a) What are the differences between concept description in large databases and OLAP? (b) Explain about the graph displays of basic statistical class description. (JNTU Dec 2010)
17. Suppose that the following table is derived by attribute-oriented induction, and that the class Programmer is to be described by a quantitative rule of the form
∀X, Programmer(X) ⇒ (birth_place(X) = "Canada") [t: x%, d: y%] ∨ … ∨ (…) [t: w%, d: z%]
CLASS       BIRTH_PLACE   COUNT
Programmer  Canada        180
Programmer  others        120
DBA         Canada        20
DBA         others        80

(a) Transform the table into a crosstab showing the associated t-weights and d-weights. (b) Map the class Programmer into a (bi-directional) quantitative descriptive rule. (JNTU Dec 2010)
18. (a) Write the algorithm for attribute-oriented induction. (b) Explain the analytical characterization with an example. (JNTU Dec 2010)
19. (a) How can we perform discrimination between different classes? Explain. (b) Explain the measures of central tendency in detail. (JNTU Aug/Sep 2008)
20. Write short notes for the following in detail: (a) Measuring the central tendency (b) Measuring the dispersion of data. (JNTU Aug/Sep 2008, May 2008)
21. (a) How can we perform attribute relevance analysis for concept description? Explain. (b) How can concept description mining be performed incrementally and in a distributed manner? (JNTU May 2008)
22. Give a detailed note on classification based on concepts from association rule mining. (JNTU May 2008)
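For the crosstab question above, the weights follow directly from the counts: the t-weight is the fraction of the class's tuples having that birth place, and the d-weight is the fraction of tuples with that birth place belonging to the class. A small sketch (the dictionary layout is an assumption for illustration):

```python
def weights(table):
    """table: {(cls, place): count} -> {(cls, place): (t_weight, d_weight)}"""
    cls_totals, place_totals = {}, {}
    for (c, p), n in table.items():
        cls_totals[c] = cls_totals.get(c, 0) + n
        place_totals[p] = place_totals.get(p, 0) + n
    return {(c, p): (n / cls_totals[c], n / place_totals[p])
            for (c, p), n in table.items()}

counts = {("Programmer", "Canada"): 180, ("Programmer", "others"): 120,
          ("DBA", "Canada"): 20, ("DBA", "others"): 80}
w = weights(counts)
print(w[("Programmer", "Canada")])  # t-weight 180/300, d-weight 180/200
```

So for (Programmer, Canada) the t-weight is 60% and the d-weight is 90%, which is what the quantitative descriptive rule records.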

(R07 APRIL 2011) 4. (a) Explain distance-based discretization. (JNTU Dec-2010)
11. [5+6+5] (R07 APRIL 2011)
5. b) Explain association rule generation from frequent itemsets.
(b) Can we design a method that mines the complete set of frequent itemsets without candidate generation? If yes, explain. (JNTU May 2008)
16. (a) Discuss about association rule mining. (a) Write the FP-growth algorithm. Compare and contrast the Apriori algorithm with the frequent pattern growth algorithm. Consider a data set, apply both algorithms and explain the results.
(a) Discuss automatic classification of web documents. (b) What is an iceberg query? Explain with an example. (b) Give a detailed note on iceberg queries.
(b) Briefly explain about classification of database systems.
(b) Give the classification of association rules in detail.
a) How are association rules mined from large databases? b) Describe the different classifications of association rule mining. (R07 APRIL/MAY 2011)
Sequential patterns can be mined in methods similar to the mining of association rules. Design an efficient algorithm to mine multilevel sequential patterns from a transaction database. An example of such a pattern is the following: "A customer who buys a PC will buy Microsoft software within three months", on which one may drill down to find a more refined version of the pattern, such as "A customer who buys a Pentium PC will buy Microsoft Office within three months".
(a) Which is an influential algorithm for mining frequent itemsets for Boolean association rules? Explain. Describe an example of a data set for which the Apriori check would actually increase the cost. (R07 APRIL/MAY 2011)
8. Explain the Apriori algorithm with an example. (JNTU May 2008)
17. List and explain the five techniques to improve the efficiency of the Apriori algorithm.
(R07 APRIL/MAY 2011) 6. a) Describe mining multidimensional association rules using static discretization of quantitative attributes. (JNTU Aug/Sep 2008, May 2008)
16. (a) Discuss about concept hierarchy. (b) What are the approaches for mining multilevel association rules? Explain. (JNTU Dec-2010)
12. Why perform attribute relevance analysis? Explain its various methods. (R07 APRIL/MAY 2011)
7. How will you solve a classification problem using decision trees? (R07 APRIL/MAY 2011)
What is divide and conquer? How could it be helpful for the FP-growth method in generating frequent itemsets without candidate generation? (R07 APRIL/MAY 2011)
9. Explain mining multilevel association rules from transaction databases. (JNTU Dec 2010)
3. (a) How can we mine multilevel association rules efficiently using concept hierarchies? Explain. (R07 APRIL 2011)
(b) Give a note on the naïve Bayesian classifier. (JNTU Dec-2010)
6. (a) Explain how concept hierarchies are used in mining multilevel association rules. (b) Give the classification of association rules in detail. (c) Explain mining raster databases. (R07 APRIL 2011)
14. (a) Discuss the various measures available to judge a classifier. (JNTU Aug/Sep 2008)
15. (a) How scalable is decision tree induction? Explain. (b) Discuss classification based on concepts from association rule mining. (JNTU Aug/Sep 2008)
Explain in detail the major steps of decision tree classification. (JNTU May 2008)
UNIT-IV
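The Apriori questions above (candidate generation, the join and prune steps, support counting) can be grounded with a minimal sketch. The toy database and the absolute min_sup count below are illustrative assumptions, not data from any question paper:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori: every subset of a frequent itemset must itself
    be frequent, so level k candidates are pruned against level k-1."""
    def support(itemset):
        return sum(itemset <= t for t in transactions)

    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
    frequent = {fs: support(fs) for fs in level}
    k = 2
    while level:
        # join step: union pairs of frequent (k-1)-itemsets into k-itemsets
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # prune step (the "Apriori check"): drop candidates with an infrequent subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        level = [c for c in candidates if support(c) >= min_sup]
        frequent.update({c: support(c) for c in level})
        k += 1
    return frequent

# hypothetical toy database; min_sup is an absolute count here
db = [frozenset(t) for t in [{'a','b','e'}, {'b','d'}, {'b','c'},
                             {'a','b','d'}, {'a','c'}, {'b','c'},
                             {'a','c'}, {'a','b','c','e'}, {'a','b','c'}]]
freq = apriori(db, min_sup=2)
```

The prune step is exactly the "Apriori check" several questions refer to: it avoids counting support for candidates that cannot possibly be frequent.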

20. (b) How does the naïve Bayesian classification work? Explain. Outline a data cube-based incremental algorithm for mining analytical class comparisons. (R07 APRIL/MAY 2011)
9. (b) Describe backpropagation classification. Discuss about backpropagation classification. (b) Explain how rules can be extracted from trained neural networks.
(b) Explain the holdout method for estimating classifier accuracy.
8. The following table consists of training data from an employee database. The data have been generalized. For a given row entry, count represents the number of data tuples having the values for department, status, age and salary given in that row. Let salary be the class label attribute. Design a multilayer feed-forward neural network for the given data. Label the nodes in the input and output layers. (JNTU May 2008)
(a) Explain about the basic decision tree induction algorithm. (c) How does tree pruning work? What are some enhancements to basic decision tree induction? (JNTU Aug/Sep 2008)
17. (a) What is classification? What is prediction? (b) What is Bayes theorem? Explain about naïve Bayesian classification. (JNTU Dec 2010)
8. Can we get classification rules from decision trees? If so, how? What are the enhancements to the basic decision tree? (R07 APRIL/MAY 2011)
11. Explain the various preprocessing steps to improve the accuracy, efficiency, and scalability of the classification or prediction process. (R07 APRIL/MAY 2011)
12. What is backpropagation? Explain classification by backpropagation. (R07 APRIL/MAY 2011)
10. (a) Discuss the five criteria for the evaluation of classification and prediction methods.
(c) Discuss about k-nearest-neighbor classifiers and case-based reasoning. (c) Discuss the fuzzy set approach for classification.
(a) Explain decision tree induction classification. (JNTU Aug/Sep 2008)
15. Explain in detail the major steps of decision tree classification. (JNTU Dec-2010)
9. (a) Give a note on log-linear models. (b) Explain training Bayesian belief networks. (c) Explain classifier accuracy. (JNTU Dec-2010)
18. (b) Write a detailed note on genetic algorithms for classification. (JNTU Aug/Sep 2008)
19. (a) Describe the data classification process with a neat diagram. (JNTU May 2008)
(a) Explain with an example a measure of the goodness of a split. [5+5+6] (JNTU Dec-2010)
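The naïve Bayesian questions above turn on one idea: class-conditional independence lets P(X|C) factor into per-attribute probabilities estimated by counting. A minimal sketch on hypothetical categorical data (not the employee table from the question paper; Laplacian smoothing is omitted for brevity):

```python
from collections import Counter, defaultdict

def nb_train(rows, labels):
    """Train a naive Bayesian classifier on categorical attributes:
    estimate P(class) and P(attribute value | class) by counting."""
    prior = Counter(labels)
    cond = defaultdict(Counter)          # (attr index, class) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, y)][v] += 1

    def classify(row):
        def score(y):
            p = prior[y] / len(labels)
            for i, v in enumerate(row):
                # class-conditional independence assumption
                p *= cond[(i, y)][v] / prior[y]
            return p
        return max(prior, key=score)
    return classify

# hypothetical training data: (outlook, windy) -> play
rows = [('sunny', 'no'), ('sunny', 'yes'), ('rain', 'no'),
        ('rain', 'no'), ('overcast', 'yes')]
labels = ['no', 'no', 'yes', 'yes', 'yes']
classify = nb_train(rows, labels)
```

The same counting scheme answers the "predict the class for a new tuple" style of question: multiply the prior by each conditional estimate and pick the class with the largest product.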

(b) Discuss the expectation maximization algorithm for clustering.
(b) Explain the OPTICS algorithm for clustering. (JNTU Aug/Sep 2008)
(b) Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8): i. Compute the Euclidean distance between the two objects. ii. Compute the Manhattan distance between the two objects. (R07 APRIL 2011)
What are the different types of data used in cluster analysis? Explain each one in brief with an example. (R07 APRIL/MAY 2011)
10. a) What are the differences between clustering and nearest neighbor prediction? b) Define nominal, ordinal, and ratio-scaled variables. (R07 APRIL/MAY 2011)
11. Why is outlier mining important? Discuss different outlier detection approaches. Briefly discuss any two hierarchical clustering methods with suitable examples. (R07 APRIL/MAY 2011)
9. a) What are the fields in which clustering techniques are used? b) What are the major requirements of clustering analysis? (R07 APRIL/MAY 2011)
8. (a) What major advantages does DENCLUE have in comparison with other clustering algorithms? (b) What advantages does STING offer over other clustering methods? (c) Why is wavelet transformation useful for clustering? (d) Explain about outlier analysis.
(a) Give a detailed note on the CLIQUE algorithm. (JNTU Dec 2010)
(b) Discuss in detail DENCLUE clustering methods.
(a) Discuss about binary, nominal, ordinal and ratio-scaled variables. (a) Discuss interval-scaled variables and binary variables. (R07 APRIL 2011)
(a) Define mean absolute deviation. Given the following measurements for the variable age: 18, 22, 25, 42, 28, 43, 33, 35, 56, 28, standardize the variable by the following: i. Compute the mean absolute deviation of age. ii. Compute the z-score for the first four measurements.
(a) Explain competitive learning and self-organizing feature map methods for clustering. (R07 APRIL 2011)
(a) Explain the k-means algorithm for clustering. (b) Explain in detail the k-medoids algorithm for clustering. (JNTU Dec-2010)
(a) What are the categories of major clustering methods? Explain. (b) What are the different types of hierarchical methods? Explain. (R07 APRIL 2011)
(b) Discuss about density-based methods. (b) Explain about grid-based methods.
(a) Discuss distance-based outlier detection. (b) What is an outlier? Explain outlier analysis in brief. (R07 APRIL 2011)
(b) Discuss in detail the BIRCH algorithm. (a) Explain how the COBWEB method is used for clustering. (JNTU Dec-2010)
Compute the Euclidean distance, city block (Manhattan) distance, and Minkowski distance between the two objects. (JNTU Aug/Sep 2008)
UNIT-V (JNTU May 2008)
1. What is association analysis? Discuss cluster analysis. Explain the correlation between these two types of analysis.
(a) How are association rules mined from large databases? Explain. (b) Explain in detail constraint-based association mining.
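The distance and standardization exercises above use fixed data from the questions (the tuples (22, 1, 42, 10) and (20, 0, 36, 8), and the age list), so the computations can be sketched directly:

```python
def minkowski(x, y, q):
    """L_q distance; q = 1 gives Manhattan (city block), q = 2 Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

x, y = (22, 1, 42, 10), (20, 0, 36, 8)    # the two objects from the question
euclid = minkowski(x, y, 2)               # sqrt(4 + 1 + 36 + 4) = sqrt(45) ~ 6.71
manhattan = minkowski(x, y, 1)            # 2 + 1 + 6 + 2 = 11
mink3 = minkowski(x, y, 3)                # (8 + 1 + 216 + 8)**(1/3) ~ 6.15

age = [18, 22, 25, 42, 28, 43, 33, 35, 56, 28]
mean = sum(age) / len(age)                         # 33.0
mad = sum(abs(a - mean) for a in age) / len(age)   # mean absolute deviation = 8.8
z = [(a - mean) / mad for a in age]                # standardized (z-score) values
```

Note that standardizing with the mean absolute deviation (rather than the standard deviation) makes the z-scores less sensitive to outliers, which is why it is the convention in cluster analysis texts.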

(b) Give a note on the term frequency matrix.
Discuss how to process a descriptive mining query in such a system using a generalization-based approach.
Write short notes on: i) Data objects ii) Sequence data mining iii) Mining text databases. (JNTU Dec 2010)
9. (b) Discuss the major algorithms of the sequence mining problem.
(b) What is web usage mining? Explain with a suitable example.
(a) Describe the cosine measure for similarity in documents. (b) Explain construction of a multilayered web information base. (JNTU May 2008)
21. (a) Discuss data transformation from the time domain to the frequency domain. (b) Explain the latent semantic indexing technique. (R07 APRIL 2011)
(a) Describe the essential features of temporal data and temporal inference. (a) Discuss various ways to estimate the trend. (JNTU Aug/Sep 2008)
11. (b) Define web mining. (NR APRIL 2011)
7. (a) Which frequent itemset mining is suitable for text mining and why? (b) Discuss about mining text databases. (JNTU Aug/Sep 2008)
UNIT-VI
12. (a) Explain spatial data cube construction and spatial OLAP. (JNTU Dec 2010)
10. (a) Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8): i. Compute the Euclidean distance between the two objects. ii. Compute the Manhattan distance between the two objects. iii. Compute the Minkowski distance between the two objects, using q = 3. (JNTU Aug/Sep 2008)
UNIT-VII
13. (a) Discuss web content mining and web usage mining. (R07 APRIL/MAY 2011)
(b) Compare information retrieval with text mining. (JNTU Dec 2010)
(b) Explain the HITS algorithm for web structure mining. (R07 APRIL 2011)
(a) Define spatial database, multimedia database, sequence database, time-series database, and text database. (R07 APRIL 2011)
A heterogeneous database system consists of multiple database systems that are defined independently, but that need to exchange and transform information among themselves and answer global queries. Explain.
(b) Discuss the relationship between text mining, information retrieval and information extraction. (R07 APRIL 2011)
(a) How to mine multimedia databases? Explain.
Write short notes on: i) Mining spatial databases ii) Mining the World Wide Web.
What are the observations made in mining the web for effective resource and knowledge discovery? (c) What is web usage mining? (JNTU Aug/Sep 2008)
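The cosine measure asked about above compares two documents as term-frequency vectors; the similarity is the cosine of the angle between them, independent of document length. A minimal sketch (the two-word documents below are hypothetical):

```python
import math
from collections import Counter

def cosine(doc1, doc2):
    """Cosine similarity of the term-frequency vectors of two documents."""
    v1, v2 = Counter(doc1.lower().split()), Counter(doc2.lower().split())
    dot = sum(v1[t] * v2[t] for t in v1)
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0
```

Identical documents score 1.0 and documents sharing no terms score 0.0, which is the property information-retrieval ranking relies on.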
(a) Write algorithms for k-Means and k-Medoids. (a) How to mine Multimedia databases? Explain. Explain? (b) Discuss the relationship between text mining and information retrieval and information extraction. (b) Explain about Statistical-based outlier detection and Deviation-based outlier detection. (b) Explain HITS algorithm for web structure mining.10) and (20. (JNTU Dec 2010) (b) Explain in detail similarity search in time-series analysis. (a) Discuss web content mining and web usage mining. iii. (a) Define spatial database. Write short notes on: i) Mining Spatial Databases ii) Mining the World Wide Web. (b) Discuss about density-based methods. Compute the Minkowski distance between the two objects. What are the observations made in mining the web for effective resource and knowledge discovery? (c) What is web usage mining? (JNTU Aug/Sep 2008) .
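The k-means algorithm requested above alternates two steps until the centers stop moving: assign each point to its nearest center, then move each center to the mean of its cluster. A sketch on hypothetical 2-D points (initial centers are taken deterministically as the first k points here for reproducibility; real implementations sample them randomly):

```python
def kmeans(points, k, iters=20):
    """Plain k-means: nearest-center assignment, then mean update."""
    centers = list(points[:k])            # deterministic seeding (an assumption)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[i].append(p)
        # recompute each center as the componentwise mean of its cluster
        centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

pts = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
       (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centers, clusters = kmeans(pts, k=2)
```

k-medoids differs only in the update step: the representative object must be an actual point of the cluster (the one minimizing total distance), which makes it robust to outliers at extra cost.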

14. An e-mail database is a database that stores a large number of electronic mail messages. It can be viewed as a semistructured database consisting mainly of text data. Discuss the following. (a) How can such an e-mail database be structured so as to facilitate multi-dimensional search, such as by sender, by receiver, by time, by subject, and so on? (b) What can be mined from such an e-mail database? (c) Suppose you have roughly classified a set of your previous e-mail messages as junk, unimportant, normal, or important. Describe how a data mining system may take this as the training set to automatically classify new e-mail messages or unclassified ones. (JNTU May 2008)
15. (a) Give an example of generalization-based mining of plan databases by divide-and-conquer. (b) What is sequential pattern mining? Explain. (c) Explain the construction of a multilayered web information base. (JNTU May 2008)
16. Suppose that a city transportation department would like to perform data analysis on highway traffic for the planning of highway construction based on the city traffic data collected at different hours every day. (a) Design a spatial data warehouse that stores the highway traffic information so that people can easily see the average and peak time traffic flow by highway, by time of day, and by weekdays, and the traffic situation when a major accident occurs. (b) What information can we mine from such a spatial data warehouse to help city planners? (c) This data warehouse contains both spatial and temporal data. Propose one mining technique that can efficiently mine interesting patterns from such a spatio-temporal data warehouse. (JNTU May 2008)
17. Explain the following: (a) Construction and mining of object cubes (b) Mining associations in multimedia data (c) Periodicity analysis (d) Latent semantic indexing (JNTU May 2008)
UNIT-VIII

sketch a method to mine one kind of knowledge from such stream data efficiently. such as stream/sensor data analysis.. However. Present an example where data mining is crucial to the success of a business . how are they similar? 5. 7. Thus an important consideration for computing descriptive data summary is whether a measure can be computed efficiently in incremental manner. new data sets are incrementally added to the existing large data sets. 6. exceptions in credit card transactions can help us detect the fraudulent use of credit cards. 10. Use a flowchart to summarize the following procedures for attribute subset selection: (a) stepwise forward selection . Give three additional commonly used statistical measures (i. What are the major challenges of mining a huge amount of data (such as billions of tuples) in comparison with mining a small amount of data (such as a few hundred tuple data set)? 3. 11. Describe various methods for handling this problem. and median as examples to show that a distributive or algebraic measure facilitates efficient incremental computation. (c) Identify and discuss the major challenges in spatiotemporal data mining. or bioinformatics 4. In many applications. one person‟s garbage could be another‟s treasure. not illustrated in this chapter) for the characterization of data dispersion. (b) Discuss what kind of interesting knowledge can be mined from such data streams. Taking fraudulence detection as an example. spatiotemporal data analysis. loose coupling.e. Describe the differences between the following approaches for the integration of a data mining system with a database or data warehouse system: no coupling. with limited time and resources. Use count.What data mining functions does this business need? Can they be performed alternatively by data query processing or simple statistical analysis? 9. 8. standard deviation. and discuss how they can be computed efficiently in large databases. In real-world data. 
Outliers are often discarded as noise. propose two methods that can be used to detect outliers and discuss which one is more reliable. A spatiotemporal data stream contains spatial information that changes over time. (d) Using one application example. For example. Describe three challenges to data mining regarding data mining methodology and user interaction issues. and is in the form of stream data (i. (a) Present three application examples of spatiotemporal data streams. Recent applications pay special attention to spatiotemporal data streams. whereas a holistic measure does not. tuples with missing values for some attributes are a common occurrence. Outline the major research challenges of data mining in one specific application domain.e. What is the difference between discrimination and classification? Between characterization and clustering? Between classification and prediction? For each of these pairs of tasks. 12. 2.. the data flow in and out like possibly infinite streams).ASSIGNMENT QUESTIONS: UNIT-I Unit-I 1.
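The distributive/algebraic-versus-holistic question above can be made concrete: keeping the triple (n, sum, sum of squares) per partition is enough to recover count, mean, and standard deviation after merging new data in O(1), whereas the median (a holistic measure) cannot be updated from any fixed-size summary. A minimal sketch:

```python
def summarize(xs):
    """Distributive summary of a batch: (count, sum, sum of squares)."""
    return (len(xs), sum(xs), sum(x * x for x in xs))

def merge(a, b):
    """Summaries compose componentwise, so incremental updates are O(1)."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_std(s):
    """Mean and (population) std dev follow algebraically from the summary."""
    n, sx, sxx = s
    mean = sx / n
    var = sxx / n - mean * mean
    return mean, var ** 0.5

old = summarize([2.0, 4.0, 6.0])
new = summarize([8.0])                 # incremental batch arrives
m, sd = mean_std(merge(old, new))      # no rescan of the old data needed
# the median would require revisiting all values: no such summary exists
```

This is exactly why data warehouses precompute distributive and algebraic measures per cuboid cell but must approximate holistic ones.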

A data warehouse can be modeled by either a star schema or a snowflake schema. Suppose that a data warehouse consists of the four dimensions. (a) Draw a star schema diagram for the data warehouse. Spectators may be students. Your design should facilitate efficient querying and on-line analytical processing. . Explain why it is not possible to analyze some large data sets using classical modeling techniques.structured. what specific OLAP operations should one perform in order to list the total charge paid by student spectators at GM Place in 2004? (c) Bitmap indexing is useful in data warehousing. and the two measures. Compare the data quality issues involved in observational science with those of experimental science and data mining. suggest a heuristic strategy to balance between accuracy and complexity and then apply it to all methods you have given.(b) stepwise backward elimination (c) a combination of forward selection and backward elimination 13. and derive general weather patterns in multidimensional space. 14. which are scattered throughout various land and ocean locations in the region to collect basic weather data. count and charge. (b) Starting with the base cuboid [date. The weather bureau has about 1. adults. 19. 17. briefly discuss advantages and problems of using a bitmap index structure. date. semi . Analyze their respective complexity under different parameter settings and decide to what extent the real value can be approximated. 18. All data are sent to the central station. 2. location. which has collected such data for over 10 years. Moreover. Can a set with 50. 15. Why is it important that the data miner understand data well? Give examples of structured. spectator. spectator. Enumerate the tasks that a data warehouse may solve as a part of the data – mining process. where charge is the fare that a spectator pays when watching a game on a given date.learning approaches to the analysis of large data sets. and game. 
Design a data warehouse for a regional weather bureau. and precipitation at each hour.mining applications? 16. location. or seniors. Briefly describe the similarities and the differences of the two models. Unit-II 1. Which of the following quantities is likely to show more temporal autocorrelation: daily rainfall or daily temperature? Why? 20. Give your opinion of which might be more empirically useful and state the reasons behind your answer. Give examples of data where the time component may be recognized explicitly and other data where the time component is given implicitly in a data organization. The median is one of the most important holistic measures in data analysis. including air pressure. Taking this cube as an example. Propose several methods for median approximation. Many sciences rely on observation instead of (or in addition to) designed experiments. Explain the differences between statistical and machine .000 samples be called a large data set? Explain your answer. with each category having its own charge rate. game]. and then analyze their advantages and disadvantages with regard to one another. temperature. and unstructured data from everyday situations. Why are preprocessing and dimensionality reduction important phases in successful data .000 probes. 3.

virtual warehouse 16. 13. the fact table is deep. (a) Enumerate three classes of schemas that are popularly used for modeling data warehouses. star net query model (b) Data cleaning. (5) type of deal. You are the data design specialist on the data warehouse project team for a manufacturing company. 10. parts used. Present an example illustrating such a huge and sparse data cube. 15. In a STAR schema to track the shipments for a distribution company. (3) ship-from. (c) Starting with the base cuboid [day. this may often generate a huge. A popular data warehouse implementation is to construct a multidimensional database. (b) Draw a schema diagram for the above data warehouse using one of the schema classes listed in (a). fact constellation. and (6) mode of shipment. List the possible data sources from which you will bring the data into your data warehouse. 8. 6. designate a primary key for each table. doctor. A data warehouse is subject-oriented. refresh (c) Enterprise warehouse. (4) product. (2) customer ship-to. 11. What is a factless fact table? Design a simple STAR schema with a factless fact table to track patients in a hospital by diagnostic procedures and time. 5. Describe situations where the query-driven approach is preferable over the update-driven approach. what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2004? . rather than the query-driven approach (which applies wrappers and integrators). data transformation. for the integration of multiple heterogeneous information sources. data mart. You are the data analyst on the project team building a data warehouse for an insurance company. Prepare a table showing all the potential users and information delivery methods for a data warehouse supporting a large national grocery chain. Suppose that a data warehouse consists of the three dimensions time. where charge is the fee that a doctor charges a patient for a visit. production facility. Also. 
and the two measures count and charge. and production run. time. and patient. identify three operational applications that would feed into the data warehouse. patient]. Review these dimensions and list the possible attributes for each of the dimension tables. Briefly compare the following concepts. 9. many companies in industry prefer the update-driven approach (which constructs and uses data warehouses). A dimension table is wide. Production quantities are normally analyzed along the business dimensions of product. What would be the major critical business subjects for the following companies? (a) an international manufacturing company (b) a local community bank (c) a domestic hotel chain 12. State your assumptions. (a) Snowflake schema. You may use an example to explain your point(s). For an airlines company. Unfortunately. Explain.4. Design a STAR schema to track the production quantities. State why. doctor. 7. the following dimension tables are found: (1) time. yet very sparse multidimensional matrix. known as a data cube. What would be the data load and refresh cycles? 14. State your assumptions.
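The OLAP operations asked for above (roll-up day to year, slice on year = 2004, aggregate charge per doctor) have a direct relational rendering. The rows below are hypothetical, and the measure column is named cnt here only because count collides with the SQL aggregate; the exercise's schema is fee(day, doctor, hospital, patient, count, charge):

```python
import sqlite3

# minimal in-memory sketch of the fee fact table from the question
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fee (day TEXT, doctor TEXT, hospital TEXT,"
            " patient TEXT, cnt INTEGER, charge REAL)")
rows = [("2004-01-03", "Smith", "General", "P1", 1, 100.0),
        ("2004-02-07", "Smith", "General", "P2", 1, 150.0),
        ("2004-05-01", "Jones", "City",    "P3", 1, 80.0),
        ("2003-12-30", "Smith", "General", "P4", 1, 999.0)]  # outside 2004
con.executemany("INSERT INTO fee VALUES (?,?,?,?,?,?)", rows)

# roll-up day -> year, slice year = 2004, total charge per doctor
totals = dict(con.execute(
    "SELECT doctor, SUM(charge) FROM fee"
    " WHERE day LIKE '2004%' GROUP BY doctor"))
```

The GROUP BY plays the role of the roll-up and the WHERE clause the slice; a dice would simply add further predicates.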

for a given student. b.. b. d. Let min sup = 60% and min con f = 80%. e} {b. the avg grade measure stores the actual course grade of the student. avg grade stores the average grade for the given combination. c} {a. (a) Draw a snowflake schema diagram for the data warehouse. c. c. and instructor combination). count. (b) Starting with the base cuboid [student. 18. d. b. e} {c.g. and instructor.. e} {b. month. patient. (a) Find all frequent itemsets using Apriori and FP-growth. d. semester. roll-up from semester to year) should one perform in order to list the average grade of CS courses for each Big University student. b. year. 17. c. charge). Cust_ID 1 1 2 2 3 3 4 4 5 5 Transaction ID 0001 0024 0012 0031 0015 0022 0029 0040 0033 0038 Items Bought {a.g. “A”. course. and two measures count and avg grade. At higher conceptual levels. e} {a. hospital. doctor. and itemi denotes variables representing items (e. respectively. e} .(d) To obtain the same list. Suppose that a data warehouse for Big University consists of the following four dimensions: student. “B”. When at the lowest conceptual level (e.): 2. semester. etc. e} {a. instructor]. Compare the efficiency of the two mining processes. write an SQL query assuming the data are stored in a relational database with the schema fee (day. e} {a. where X is a variable representing customers. e} {a.. what specific OLAP operations (e. Consider the following data set. Below database has five transactions.g. semester. d. (b) List all of the strong association rules (with support s and confidence c) matching the following metarule. d} {a. d. UNIT-III 1. course. course.

6. Discuss whether there are any relationships between s1 and s2 or c1 and c2. Transactions in each component database have the same format. using a programming language that you are familiarwith. using a programming language that you are familiar with. and pattern density) where one algorithm may perform better than the others. Implement frequent itemset mining algorithm Apriori (mining using vertical data format). 7. d. such as C++ or Java. 3. d} → {e} and {e} → {b. What is the essential difference between association rules and decision rules? 8. where Tj is a transaction identifier.(a) Compute the support for itemsets {e}. Write a report to analyze the situations (such as data size. Your algorithm should not require shipping all of the data to one site and should not cause excessive network communication overhead. {b. Propose an efficient algorithm to mine global association rules (without considering multilevel associations). data distribution. d} → {e} and {e} → {b. Also. Discuss effective methods that can be used to reduce the number of rules generated while still preserving most of the interesting rules. Implement frequent itemset mining algorithm ECLAT (mining using vertical data format). (b) Use the results in part (a) to compute the confidence for the association rules {b. minimal support threshold setting. im). Is confidence a symmetric measure? (b) Repeat part (a) by treating each customer ID as a market basket. data distribution. and {b. minimal support threshold setting. Compare the performance of each algorithm with various kinds of large data sets. and ik (1<=k<=m) is the identifier of an item purchased in the transaction. …. Suppose that a large store has a transaction database that is distributed among four locations. d}. You may present your algorithm in the form of an outline. and 0 otherwise. and state why. 4. d}. 
(e) Suppose s1 and c1 are the support and confidence values of an association rule r when treating each transaction ID as a market basket. let s2 and c2 be the support and confidence values of r when treating each customer ID as a market basket.) (c) Use the results in part (c) to compute the confidence for the association rules {b.. Compare the performance of each algorithm with various kinds of large data sets. d}. and pattern density) where one algorithm may perform better than the others. Write a report to analyze the situations (such as data size. namely Tj : (i1. e} by treating each transaction ID as a market basket. and state why. Given a simple transactional database X: . Association rule mining often generates a large number of rules. 5. Each item should be treated as a binary variable (1 if an item appears in at least one transaction bought by the customer. such as C++ or Java.
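The support/confidence computations in the question above, and the asymmetry of confidence it probes, reduce to two one-line definitions. The baskets below are a small hypothetical set, not the table from the question paper:

```python
def support(itemset, baskets):
    """Fraction of baskets containing every item in the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    """conf(A -> B) = support(A union B) / support(A)."""
    return support(lhs | rhs, baskets) / support(lhs, baskets)

# hypothetical baskets for illustration
baskets = [frozenset(b) for b in
           [{'b','d','e'}, {'b','d'}, {'d','e'}, {'b','e'}, {'a','e'}]]
c1 = confidence(frozenset('bd'), frozenset('e'), baskets)  # {b,d} -> {e}
c2 = confidence(frozenset('e'), frozenset('bd'), baskets)  # {e} -> {b,d}
```

Here c1 and c2 differ even though both rules share the same support, which is the point of the question: support is symmetric in the two sides of a rule, confidence is not.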

Comment on the processing cost of mining multilevel associations with this method in comparison to mining single-level associations. Let the support and confidence thresholds be 10% and 60%. Describe the nature of the relationship between item a and item b in terms of the interest measure. 10. Implement frequent itemset mining algorithm “FP-growth” (mining using vertical data format). What are the common values for support and confidence parameters in the Apriori algorithm? Explain using the retail industry as an example. data distribution. 9. 12. identifying frequent and sub frequent items.Using the threshold values support = 25% and confi dence = 60%. (c) analyze misleading associations for the rule set obtained in (b). Solve question 1 with support of 50% and confidence 60%. 16. and pattern density) where one algorithm may perform better than the others. using a programming language that you are familiar with. Compare the performance of each algorithm with various kinds of large data sets. Propose and outline a level-shared mining approach to mining multilevel association rules in which each item is encoded by its level position. 13. (c) What conclusions can you draw from the results of parts (a) and (b)? . and an initial scan of the database collects the count for each item at each concept level. respectively. Is the rule interesting according to the confidence measure? (b) Compute the interest measure for the association pattern {a. From the transactional database of question 7. (a) Compute the confidence of the association rule {a} → {b}. b} is 20%. Write a report to analyze the situations (such as data size. (b) find strong association rules for database X. Why is the process of discovering association rules relatively simple compared with generating large itemsets in transactional databases? 11. minimal support threshold setting. Using data in question 2 find all association rules using FP-Growth with a support of 50% and confidence 60%. 17. 
If the support for item a is 25%. find FP tree for this database if (a) support threshold is 5 (b) support threshold is 3 (c) support threshold is 4 draw the tree step by step. Solve question 18 with support of 50% and confidence 60%. (a) find all large itemsets in database X. 14. the support for item b is 90% and the support for itemset {a. and state why. Suppose we have market basket data consisting of 100 transactions and 20 items. b}. 15. such as C++ or Java.

then: i. c({ā} → {b}) > s({b}); ii. c({ā} → {b}) > c({a} → {b}), where ā denotes the absence of item a, c(·) denotes rule confidence, and s(·) denotes itemset support.
20. Generate association rules from the following dataset. Minimum support is taken as 22% and minimum confidence is 70%.

TID   List of Items
T1    I1, I2, I5
T2    I2, I4
T3    I2, I3
T4    I1, I2, I4
T5    I1, I3
T6    I2, I3
T7    I1, I3
T8    I1, I2, I3, I5
T9    I1, I2, I3

UNIT-IV
1) It is difficult to assess classification accuracy when individual data objects may belong to more than one class at a time. In such cases, comment on what criteria you would use to compare different classifiers modeled after the same data.
2) Show that accuracy is a function of sensitivity and specificity; prove this equation.
3) Suppose that we would like to select between two prediction models, M1 and M2. We have performed 10 rounds of 10-fold cross-validation on each model, where the same data partitioning in round i is used for both M1 and M2. The error rates obtained for M1 are 30.5, 32.2, 20.7, 20.6, 31.0, 41.0, 27.7, 26.0, 21.5, 26.0. The error rates for M2 are 22.4, 14.5, 22.4, 19.6, 20.7, 20.4, 22.1, 19.4, 16.2, 35.0. Comment on whether one model is significantly better than the other considering a significance level of 1%.
4) Given a decision tree, you have the option of (i) converting the decision tree to rules and then pruning the resulting rules, or (ii) pruning the decision tree and then converting the pruned tree to rules. What advantage does (i) have over (ii)?
5) Take a data set with at least 5 attributes and 15 records and apply decision tree (information gain) classification.
6) Why is naïve Bayesian classification called "naïve"? Briefly outline the major ideas of naïve Bayesian classification.
7) Take a data set with at least 5 attributes and 15 records and apply decision tree (gain ratio) classification.
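The model-comparison question (3) above calls for a paired t-test on the per-round differences, since the same partitioning is used for both models in each round. A sketch using the ten error rates as quoted in the question; the critical-value comparison is stated as a comment because the t-table lookup is done by hand:

```python
import math

def paired_t(a, b):
    """t statistic for paired samples: d_i = a_i - b_i, df = n - 1."""
    n = len(a)
    d = [x - y for x, y in zip(a, b)]
    dbar = sum(d) / n
    var = sum((x - dbar) ** 2 for x in d) / (n - 1)   # sample variance of d
    return dbar / math.sqrt(var / n)

m1 = [30.5, 32.2, 20.7, 20.6, 31.0, 41.0, 27.7, 26.0, 21.5, 26.0]
m2 = [22.4, 14.5, 22.4, 19.6, 20.7, 20.4, 22.1, 19.4, 16.2, 35.0]
t = paired_t(m1, m2)
# compare |t| with the two-sided critical value t_{0.005, df=9} ~ 3.25:
# |t| below 3.25 means the difference is NOT significant at the 1% level.
```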

8) The following table shows the midterm and final exam grades obtained for students in a database course.

x (Midterm exam)   y (Final exam)
72                 84
50                 63
81                 77
74                 78
94                 90
86                 75
59                 49
83                 79
65                 77
33                 52
88                 74
81                 90

(i) Plot the data. Do x and y seem to have a linear relationship?
(ii) Use the method of least squares to find an equation for the prediction of a student's final exam grade based on the student's midterm grade in the course.
(iii) Predict the final exam grade of a student who received an 86 on the midterm exam.
9) Consider the following data set for a binary class problem.

A   B   Class Label
T   F   +
T   T   +
T   T   +
T   F   -
T   T   +
F   F   -
F   F   -
F   F   -
T   T   -
T   F   -

(a) Calculate the information gain when splitting on A and B. Which attribute would the decision tree induction algorithm choose?
(b) Calculate the gain in the Gini index when splitting on A and B. Which attribute would the decision tree induction algorithm choose?
(c) In real-world data, tuples with missing values for some attributes are a common occurrence. Describe various methods for handling this problem.
10) The following table summarizes a data set with three attributes A, B, C and two class labels +, −. Build a two-level decision tree.

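The least-squares question (8) above has fixed data, so the fitted line can be computed exactly; the values below are the twelve (midterm, final) pairs from the table:

```python
x = [72, 50, 81, 74, 94, 86, 59, 83, 65, 77, 88, 81]  # midterm grades
x = [72, 50, 81, 74, 94, 86, 59, 83, 65, 33, 88, 81]  # midterm grades
y = [84, 63, 77, 78, 90, 75, 49, 79, 77, 52, 74, 90]  # final grades

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
# slope b = S_xy / S_xx, intercept a = ybar - b * xbar
b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
     / sum((xi - xbar) ** 2 for xi in x))
a = ybar - b * xbar
predicted = a + b * 86        # part (iii): midterm grade of 86
```

The fit comes out near y = 32.03 + 0.58x, so a midterm grade of 86 predicts a final grade of roughly 82.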
11) Using the Bayesian method and the data set given in the table below, with credit rating as the class label attribute, predict the credit rating for a new customer who is (a) short, has red hair and blue eyes.

Customer ID   Height    Hair    Eyes    Credit Rating
E1            Short     Dark    Blue    A
E2            Tall      Dark    Black   B
E3            Tall      Dark    Blue    B
E4            Tall      Red     Black   C
E5            Short     Blond   Brown   B
E6            Tall      Blond   Black   B
E7            Average   Blond   Blue    C
E8            Average   Blond   Blue    B
E9            Tall      Grey    Blue    A
E10           Average   Grey    Black   B
E11           Tall      Blond   Brown   A
E12           Short     Blond   Blue    B
E13           Average   Grey    Brown   B
E14           Tall      Red     Brown   C

(a) According to the classification error rate, which attribute would be chosen as the first splitting attribute? For each attribute, show the contingency table and the gains in classification error rate.
(b) Repeat for the two children of the root node.
(c) How many instances are misclassified by the resulting decision tree?
(d) Repeat parts (a), (b), and (c) using C as the splitting attribute.
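The attribute-selection questions above (information gain, Gini gain, classification error gain) all instantiate the same impurity-reduction template. A sketch of the information-gain variant, checked against the small A/B data set of question 9:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Expected reduction in entropy from splitting on attribute index attr."""
    n = len(rows)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attr], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return entropy(labels) - remainder

# data set of question 9 (attributes A and B, classes + and -)
rows = [('T','F'), ('T','T'), ('T','T'), ('T','F'), ('T','T'),
        ('F','F'), ('F','F'), ('F','F'), ('T','T'), ('T','F')]
labels = ['+', '+', '+', '-', '+', '-', '-', '-', '-', '-']
gain_a = info_gain(rows, labels, 0)   # ~ 0.281, so A is the better split
gain_b = info_gain(rows, labels, 1)   # ~ 0.256
```

Swapping entropy for the Gini index (1 minus the sum of squared class fractions) or the classification error (1 minus the majority fraction) answers parts (b) and the error-rate questions with the identical splitting loop.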

what modified design would you suggest? 16) What is boosting? State why it may improve the accuracy of decision tree induction. And Predict the eyes when credit rating is „B‟ . such as one from a week ago).g. 20) Why is tree pruning useful in decision tree induction? What is a drawback of using a separate set of tuples to evaluate pruning? .. 18) Write an algorithm for k-nearest-neighbor classification given k and n.g. short height and dark hair. Develop a scalable naive Bayesian classification algorithm that requires just a single scan of the entire data set for most databases. the number of attributes describing each tuple.. Bayesian. decision tree. 19) What is associative classification? Why is associative classification able to achieve higher classification accuracy than a classical decision tree method? Explain how associative classification can be used for text document classification.g.e. 15) Design an efficient method that performs effective naïve Bayesian classification over an infinite data stream (i. Discuss how to overcome this difficulty and develop a scalable SVM algorithm for efficient SVM classification in large datasets.. Discuss whether such an algorithm can be refined to incorporate boosting to further enhance its classification accuracy.b)tall. has blond hair and brown eyes. If we wanted to discover the evolution of such classification schemes (e. 12) Apply GINI on the data in the table of question number-11 and create decision tree. case based reasoning). 17) The support vector machine (SVM) is a highly accurate classification method. neural network) versus lazy classification (e. comparing the classification scheme at this moment with earlier schemes. you can scan the data stream only once). 14) Compare the advantages and disadvantages of eager classification (e. SVM classifiers suffer from slow processing when training with a large set of data tuples.. However. k-nearest neighbor. 
13) RainForest is an interesting scalable algorithm for decision tree induction.
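The least-squares part of question 8 can be checked directly from the closed-form formulas for slope and intercept. A minimal sketch in plain Python, using the data values given in the question:

```python
# Least-squares fit y = a + b*x for the midterm/final grades of question 8.
data = [(72, 84), (50, 63), (81, 77), (74, 78), (94, 90), (86, 75),
        (59, 49), (83, 79), (65, 77), (33, 52), (88, 74), (81, 90)]

n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n

# b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),  a = y_bar - b*x_bar
b = sum((x - mean_x) * (y - mean_y) for x, y in data) / \
    sum((x - mean_x) ** 2 for x, _ in data)
a = mean_y - b * mean_x

print(f"y = {a:.2f} + {b:.2f}x")                   # roughly y = 32.03 + 0.58x
print(f"prediction for x = 86: {a + b * 86:.1f}")  # about 82
```

The prediction for a midterm grade of 86 comes out near 82, which answers part (iii).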
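The two impurity measures asked for in question 9 can likewise be verified with a short script. This is a sketch: `gain` computes the drop in impurity for a binary split on one attribute of the question-9 table, and works for either entropy or the Gini index:

```python
from math import log2

# Rows of the question-9 table: (A, B, class).
rows = [('T', 'F', '+'), ('T', 'T', '+'), ('T', 'T', '+'), ('T', 'F', '-'),
        ('T', 'T', '+'), ('F', 'F', '-'), ('F', 'F', '-'), ('F', 'F', '-'),
        ('T', 'T', '-'), ('T', 'F', '-')]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n)
                for c in (labels.count('+'), labels.count('-')) if c)

def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(v) / n) ** 2 for v in ('+', '-'))

def gain(attr_index, impurity):
    """Impurity of the parent minus the weighted impurity of the children."""
    parts = {}
    for r in rows:
        parts.setdefault(r[attr_index], []).append(r[2])
    children = sum(len(p) / len(rows) * impurity(p) for p in parts.values())
    return impurity([r[2] for r in rows]) - children

print(gain(0, entropy), gain(1, entropy))  # A (~0.281) beats B (~0.256)
print(gain(0, gini), gain(1, gini))        # B (~0.163) beats A (~0.137)
```

Notice that the two measures disagree here: information gain favors splitting on A, while the Gini gain favors B, which is the point of comparing parts (a) and (b).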
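A plain (unsmoothed) naive Bayesian classifier over the question-11 table can be sketched as below. Note that a zero frequency count (e.g. no class-A customer has red hair) zeroes out a whole class score, which is why Laplace smoothing is normally added in practice; this sketch omits it:

```python
# Question-11 table: (height, hair, eyes) -> credit rating.
train = [('Short', 'Dark', 'Blue', 'A'),    ('Tall', 'Dark', 'Black', 'B'),
         ('Tall', 'Dark', 'Blue', 'B'),     ('Tall', 'Red', 'Black', 'C'),
         ('Short', 'Blond', 'Brown', 'B'),  ('Tall', 'Blond', 'Black', 'B'),
         ('Average', 'Blond', 'Blue', 'C'), ('Average', 'Blond', 'Blue', 'B'),
         ('Tall', 'Grey', 'Blue', 'A'),     ('Average', 'Grey', 'Black', 'B'),
         ('Tall', 'Blond', 'Brown', 'A'),   ('Short', 'Blond', 'Blue', 'B'),
         ('Average', 'Grey', 'Brown', 'B'), ('Tall', 'Red', 'Brown', 'C')]

def predict(height, hair, eyes):
    scores = {}
    for cls in ('A', 'B', 'C'):
        cls_rows = [r for r in train if r[3] == cls]
        score = len(cls_rows) / len(train)   # prior P(class)
        for i, value in enumerate((height, hair, eyes)):
            # conditional P(attribute value | class), no smoothing
            score *= sum(1 for r in cls_rows if r[i] == value) / len(cls_rows)
        scores[cls] = score
    return max(scores, key=scores.get)

print(predict('Tall', 'Blond', 'Brown'))  # customer (b) is rated 'B'
```

For customer (b) the class-B score (8/14 · 3/8 · 3/8 · 2/8 ≈ 0.020) narrowly beats A and C (≈ 0.016 each). For customer (a), every class contains a zero count, so smoothing would be needed before the scores become comparable.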
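For question 18, the k-nearest-neighbor procedure amounts to ranking the training tuples by distance to the query and taking a majority vote over the k closest. A minimal sketch; the six labeled points are made up purely for illustration:

```python
from collections import Counter
from math import dist

def knn_classify(train, query, k):
    """train: list of (point, label) pairs; returns the majority label
    among the k training points nearest to query (Euclidean distance)."""
    nearest = sorted(train, key=lambda t: dist(t[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Illustrative data: two well-separated clusters with classes '+' and '-'.
train = [((1, 1), '+'), ((1, 2), '+'), ((2, 1), '+'),
         ((5, 5), '-'), ((6, 5), '-'), ((5, 6), '-')]

print(knn_classify(train, (1.5, 1.5), k=3))  # '+': all 3 nearest are positive
```

Sorting all n tuples costs O(n log n) per query; for large data sets the question's intent is to discuss index structures or partial selection that avoid the full sort.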


Handbook of Data Mining — JNTU syllabus
