
# Amity Campus Uttar Pradesh India 201303

ASSIGNMENTS
PROGRAM: BSc IT SEMESTER-VI
Subject Name: Data Warehousing and Mining
Study COUNTRY:
Roll Number (Reg.No.):
Student Name:

INSTRUCTIONS

a) Students are required to submit all three assignment sets.

| ASSIGNMENT | DETAILS | MARKS |
|---|---|---|
| Assignment A | Five Subjective Questions | 10 |
| Assignment B | Three Subjective Questions + Case Study | 10 |
| Assignment C | Objective or one line Questions | 10 |

b) Total weightage given to these assignments is 30%, or 30 marks.
c) All assignments are to be completed as typed in Word/PDF.
d) All questions are required to be attempted.
e) All three assignments are to be completed by the due dates and need to be submitted for evaluation by Amity University.
f) Students have to attach a scanned signature in the form.

Signature: _________________________________

Date: _________________________________

## Data Warehousing and Mining

Assignment A
Q1. Discuss various types of concept hierarchies by providing two examples for each type.

Q2. Illustrate the typical requirements of clustering in data mining.

Q3. State various evaluation criteria that are essential for classification and prediction methods.

Q4. What is meant by data reduction? Discuss any two data reduction strategies for obtaining a reduced data representation.

Q5. Differentiate between STAR and SNOWFLAKE schemas.

Q6. State the salient differences between a data query and a knowledge query.

Assignment B
Q1. Case Study

Suppose that a data warehouse consists of four dimensions (date, viewer, cinema hall, and movie) and two measures (count and charge), where charge is the ticket fee that the viewer pays for watching the movie on a given date. The viewers can be children below 5, children above 5, adults, or seniors, with each category having its own charge rate.

i) Draw a star schema diagram for the data warehouse.
ii) Starting with the base cuboid [date, viewer, cinema hall, movie], what specific OLAP operations should one perform in order to list the total charge paid by adults at the cinema hall Paradise in 2004?

Q2. Give an example to show that items in a strong association rule may actually be negatively correlated.

Q3. What are Bayesian classifiers? Explain the theorem on which Bayesian classification is based.

Q4. Explain the application of data mining in CRM in healthcare. How can data mining algorithms be implemented in CRM?
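As an illustration of the Q1 case study setup, the following is a minimal sketch of one possible star schema and of the roll-up and slice steps expressed as a SQL query, using Python's built-in `sqlite3` module. All table names, column names, and sample rows (`fact_sales`, `dim_date`, `ticket_count`, the hall "Regal", and so on) are assumptions made for the example; the assignment names only the four dimensions and the two measures. It is not a model answer, only one way to picture the structure.

```python
import sqlite3

# Hypothetical star schema: one fact table surrounded by four dimension tables.
# Names are illustrative assumptions, not given by the assignment.
SCHEMA = """
CREATE TABLE dim_date   (date_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);
CREATE TABLE dim_viewer (viewer_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_hall   (hall_key INTEGER PRIMARY KEY, hall_name TEXT);
CREATE TABLE dim_movie  (movie_key INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    viewer_key   INTEGER REFERENCES dim_viewer(viewer_key),
    hall_key     INTEGER REFERENCES dim_hall(hall_key),
    movie_key    INTEGER REFERENCES dim_movie(movie_key),
    ticket_count INTEGER,  -- measure: count
    charge       REAL      -- measure: charge
);
"""


def build_demo():
    """Create an in-memory warehouse with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?, ?)",
                     [(1, 15, 3, 2004), (2, 20, 7, 2004), (3, 5, 1, 2005)])
    conn.executemany("INSERT INTO dim_viewer VALUES (?, ?)",
                     [(1, "child_below_5"), (2, "child_above_5"),
                      (3, "adult"), (4, "senior")])
    conn.executemany("INSERT INTO dim_hall VALUES (?, ?)",
                     [(1, "Paradise"), (2, "Regal")])
    conn.executemany("INSERT INTO dim_movie VALUES (?, ?)",
                     [(1, "Movie A"), (2, "Movie B")])
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?, ?)", [
        (1, 3, 1, 1, 2, 20.0),  # adult, Paradise, 2004
        (2, 3, 1, 2, 1, 10.0),  # adult, Paradise, 2004
        (2, 4, 1, 1, 1, 6.0),   # senior, Paradise, 2004
        (1, 3, 2, 1, 3, 30.0),  # adult, Regal, 2004
        (3, 3, 1, 2, 1, 10.0),  # adult, Paradise, 2005
    ])
    return conn


def adult_charge(conn, hall_name, year):
    """Total charge paid by adults at one hall in one year.

    Summing over all dates in the year and over all movies corresponds to
    rolling up those dimensions; the WHERE clause corresponds to selecting
    on viewer category, hall, and year.
    """
    row = conn.execute(
        """
        SELECT COALESCE(SUM(f.charge), 0)
        FROM fact_sales f
        JOIN dim_date   d ON d.date_key   = f.date_key
        JOIN dim_viewer v ON v.viewer_key = f.viewer_key
        JOIN dim_hall   h ON h.hall_key   = f.hall_key
        WHERE v.category = 'adult' AND h.hall_name = ? AND d.year = ?
        """,
        (hall_name, year),
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    print(adult_charge(build_demo(), "Paradise", 2004))
```

In this sketch the fact table holds only foreign keys and the two measures, while descriptive attributes such as the viewer category and the year live in the dimension tables, which is the defining trait of a star schema.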

Assignment C
1) Which of the following statements correctly describe a Dimension table in Dimensional Modeling?
   1: Dimension tables contain fields that describe the facts.
   2: Dimension tables do not contain numeric fields.
   3: Dimension tables are typically larger than fact tables.
   4: Dimension tables do not need system-generated keys.
   5: Dimension tables usually have fewer fields than fact tables.

2) How are dimensions in a Multi-Dimensional Database related?
   1: Hierarchically.
   2: Through foreign keys.
   3: Through a hierarchy and foreign keys.
   4: Through a network.
   5: Through an inverse list.

3) What is a primary risk of a 'phased' implementation?
   1: Previous implementations may need to be reworked.
   2: The project may lose momentum.
   3: Business Analysts will find problems in the data sooner.
   4: Executives will lose focus.
   5: The project budget may be exceeded.

4) How do highly distributed source systems impact the Data Warehouse or Data Mart project?
   1: The source data exists in multiple environments.
   2: The location of the source systems has minimal impact on the Data Warehouse or Data Mart implementation.
   3: The timing and coordination of software development, extraction, and data updates are more complex.
   4: Large volumes of data must be moved between locations.
   5: Additional network and data communication hardware will be needed.

5) OLAP tool (as described above)?
   1: Drill down to another level of detail.
   2: Display the top 10 items that meet a specific selection criteria.
   3: Trend analysis.
   4: Calculate a rolling average on a set of data.
   5: Display a report based on specific selection criteria.

6) In a Data Mart Only architecture, what will the Data Mart Development Team(s) encounter?
   1: There is little or no data redundancy across all of the Data Mart databases.
   2: Issues such as inconsistent definitions and dirty data in extracting data from multiple source systems will be addressed several times.
   3: Database design will be easier than expected because Data Mart databases support only a single user.
   4: There is ease in consolidating the Data Marts to create a Data Warehouse.
   5: It is easy to develop the data extraction system due to the use of the warehouse as a single data source.

7) What is the primary responsibility of the 'project sponsor' during a Data Warehouse project?
   1: To manage the day-to-day project activity.
   2: To review and approve all decisions concerning the project.
   3: To approve and monitor the project budget.
   4: To ensure cooperation and support from all 'involved' departments.
   5: To communicate project status to higher management and the board of directors.

8) What are Metadata?
   1: Data used only by the IS organization.
   2: Information that describes and defines the organization's data.
   3: Definitions of data elements.
   4: Any business data occurring in large volumes.
   5: Summarized data.

9) How can the managers of a department best understand the cost of their use of the data warehouse?
   1: A percentage of the business department's budget should be directed to the maintenance and enhancement of the Data Warehouse.
   2: Institute a charge-back system of computer costs for the access to the Data Warehouse.
   3: Develop a training program for department management.
   4: Provide executive management with computer utilization reports that show what percentage of utilization is due to the Data Warehouse.
   5: Business managers should participate in the acquisition process for computer hardware and software.

10) Which of the following is NOT a consequence of the creation of independent Data Marts?
   1: Potentially different answers to a single business question if the question is asked of more than one Data Mart.
   2: Increase in data redundancy due to duplication of data between the Data Marts.
   3: Consistent definitions of the data in the Data Marts.
   4: Creation of multiple application systems that have duplicate processing due to the duplication of data between the Data Marts.
   5: Increased costs of hardware as the databases in the Data Marts grow.

11) What is meant by artificial intelligence when it is applied to data cleansing and transformation tools?
   1: The tool can perform highly complex mathematical and statistical calculations to create derived data elements.
   2: The tool can accomplish highly complex code translations when data comes from multiple source systems.
   3: The tool can determine through heuristics the changes needed for a set of dirty data and then make the changes.
   4: The tool can perform highly complex summarizations across multiple databases.
   5: The tool can identify data that appears to be inconsistent between multiple source systems and provide reporting to assist in the clean up of the source system data.

12) Which of the following classes of corporations can gain the most insights from their legacy data?
   1: A corporation that wants to determine the attitude of its customers towards the corporation.
   2: A corporation that offers new products and services.
   3: A new corporation.
   4: A corporation that has existed for a long time.
   5: A corporation that is constantly introducing new and different products and services.

13) Which of the following is NOT found in an Entity Relationship Model?
   1: A definition for each Entity and Data Element.
   2: Entity Relationship Diagram.
   3: Entity and Data Element Names.
   4: Fact and Dimension Tables.
   5: Business Rules associated with the entities, entity relationships, and the data elements.

14) What is Data Mining?
   1: The capability to drill down into an organization's data once a question has been raised.
   2: The setting up of queries to alert management when certain criteria are met.
   3: The process of performing trend analysis on the financial data of an organization.
   4: The automated process of discovering patterns and relationships in an organization's data.
   5: A class of tools that support the manual process of identifying patterns in large databases.

15) What does implementing a Data Warehouse or Data Mart help reduce?
   1: The data gathering effort for data analysis.
   2: Hardware costs.
   3: User requests for custom reports.
   4: Costs when management downsizes the organization.
   5: All of the above.

16) Profitability Analysis is one of the most common applications of data warehousing. Why is Profitability Analysis in data warehousing more difficult than usually expected?
   1: Almost every manager in an organization wants to get profitability reports.
   2: Revenue data cannot be tracked accurately.
   3: Expense data is often tracked at a higher level of detail than revenue data.
   4: Revenue data is difficult to collect and organize.
   5: Transaction grain data is required to properly compute profitability figures.

17) An operational system is which of the following?
   A. A system that is used to run the business in real time and is based on historical data.
   B. A system that is used to run the business in real time and is based on current data.
   C. A system that is used to support decision making and is based on current data.
   D. A system that is used to support decision making and is based on historical data.

18) A data warehouse is which of the following?
   A. Can be updated by end users.
   B. Contains numerous naming conventions and formats.
   C. Organized around important subject areas.
   D. Contains only current data.

19) The load and index is which of the following?
   A. A process to reject data from the data warehouse and to create the necessary indexes
   B. A process to load the data in the data warehouse and to create the necessary indexes
   C. A process to upgrade the quality of data after it is moved into a data warehouse
   D. A process to upgrade the quality of data before it is moved into a data warehouse

20) The extract process is which of the following?
   A. Capturing all of the data contained in various operational systems
   B. Capturing a subset of the data contained in various operational systems
   C. Capturing all of the data contained in various decision support systems
   D. Capturing a subset of the data contained in various decision support systems

21) A star schema has what type of relationship between a dimension and fact table?
   A. Many-to-many
   B. One-to-one
   C. One-to-many
   D. All of the above.

22) What does the term 'Ad-hoc Analysis' mean?
   1: Business analysts use a subset of the data for analysis.
   2: Business analysts access the Data Warehouse data infrequently.
   3: Business analysts access the Data Warehouse data from different locations.
   4: Business analysts do not know data requirements prior to beginning work.
   5: Business analysts use sampling techniques.

23) What should be the business analyst's involvement in monitoring the performance of a Data Warehouse or Data Mart?
   1: Be patient when load monitoring on the Data Warehouse or Data Mart is taking place.
   2: Become experts in SQL queries.
   3: No involvement in performance monitoring.
   4: Contact IT if a query takes too long or does not complete.
   5: Complete all required training on the query tools they will be using.

24) What factor heavily influences data warehouse size estimates?
   1: The design of the warehouse schemas
   2: The size of the source system schemas
   3: The record size of the source tables
   4: The number of expected data warehouse users
   5: The number of customers an organization has

25) Data warehouses or data marts allow organizations to define 'alert' conditions: an alert is raised when something noteworthy has taken place. For implementing a facility of 'alerts', what is the advantage of using a WEB interface over a client/server approach?
   1: Access to the 'Alert' report is possible through a highly accessible means already available within the organization.
   2: The selection criteria used in determining when an 'alert' needs to be issued is easier to implement using a WEB browser.
   3: As long as the appropriate individual can access the 'alert', how it is implemented does not present an advantage.
   4: 'Alerts' can be directed only to the requestor of the 'alert'.
   5: Access to the 'alert' data can be tightly controlled.

26) Transient data is which of the following?
   A. Data in which changes to existing records cause the previous version of the records to be eliminated
   B. Data in which changes to existing records do not cause the previous version of the records to be eliminated
   C. Data that are never altered or deleted once they have been added
   D. Data that are never deleted once they have been added

27) A multifield transformation does which of the following?
   A. Converts data from one field into multiple fields
   B. Converts data from multiple fields into one field
   C. Converts data from multiple fields into multiple fields
   D. All of the above

28) A snowflake schema is which of the following types of tables?
   A. Fact
   B. Dimension
   C. Helper
   D. All of the above

29) The generic two-level data warehouse architecture includes which of the following?
   A. At least one data mart
   B. Data that can be extracted from numerous internal and external sources
   C. Near real-time updates
   D. All of the above.

30) Fact tables are which of the following?
   A. Completely denormalized
   B. Partially denormalized
   C. Completely normalized
   D. Partially normalized

31) Data transformation includes which of the following?
   A. A process to change data from a detailed level to a summary level
   B. A process to change data from a summary level to a detailed level
   C. Joining data from one source into various sources of data
   D. Separating data from one source into various sources of data

32) Information is
   a. Data
   b. Processed Data
   c. Manipulated input
   d. Computer output

33) Data by itself is not useful unless
   a. It is massive
   b. It is processed to obtain information
   c. It is collected from diverse sources
   d. It is properly stated

34) What are the three essential components of a learning system? Give a definition of each. Give an example of each, including equations where necessary. (1 mark)
   A. Model, gradient descent, learning algorithm
   B. Error function, model, learning algorithm
   C. Accuracy, Sensitivity, Specificity
   D. Model, error function, cost function

35) The error function most suited for gradient descent using logistic regression is
   A. The entropy function
   B. The squared error
   C. The cross-entropy function
   D. The number of mistakes

36) After SVM learning, each Lagrange multiplier a_i takes either a zero or a non-zero value. What does it indicate in each situation?
   A. A non-zero a_i indicates the datapoint i is a support vector, meaning it touches the margin boundary.
   B. A non-zero a_i indicates that the learning has not yet converged to a global minimum.
   C. A zero a_i indicates that the datapoint i has become a support vector datapoint, on the margin.
   D. A zero a_i indicates that the learning process has identified support for vector i.

37) A Bayesian Network is most accurately described as
   A. A special case of a neural network that makes use of Bayes' Theorem.
   B. The network variant of Bayes' Theorem, assuming independent features.
   C. A probabilistic model of which Naive Bayes is a special case.
   D. A network of probabilistic learning functions, connected by Bayes' Rule.

38) Data scrubbing is which of the following?
   A. A process to reject data from the data warehouse and to create the necessary indexes
   B. A process to load the data in the data warehouse and to create the necessary indexes
   C. A process to upgrade the quality of data after it is moved into a data warehouse
   D. A process to upgrade the quality of data before it is moved into a data warehouse

39) The active data warehouse architecture includes which of the following?
   A. At least one data mart
   B. Data that can be extracted from numerous internal and external sources
   C. Near real-time updates
   D. All of the above.

40) A goal of data mining includes which of the following?
   A. To explain some observed event or condition
   B. To confirm that data exists
   C. To analyze data for expected relationships
   D. To create a new data warehouse