Subject: DATA WAREHOUSING AND DATA MINING (CT802)

Semester: 8TH SEM
Branch: Computer Technology (C.T.) - OLD SYLLABUS

NOTE: CORRECT ANSWERS ARE SHOWN IN BOLD


UNIT 1
1. ____ table is easy to maintain and saves storage space due to normalization.
a. Star
b. Snowflake (Correct Answer)
c. Fact Constellation
d. Starnet
2. Metadata repository can be categorized as independent or dependent
a. True
b. False
3. A _____ is a set of views over operational databases
a. Data Mart
b. virtual warehouse
c. Enterprise Warehouse
d. Metadata repository
4. A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by ____
a. dimensions and records
b. dimensions and properties
c. dimensions and attributes
d. dimensions and facts
5. The ______ operation defines a subcube by performing a selection on two or more dimensions.
a. Roll-up
b. Drill-down
c. Slice
d. Dice
e. Pivot (rotate)
6. An OLAP query needs only read-only access to stored data.
a. True
b. False
7. The _____ operation uses relational SQL facilities to drill through the bottom level of a data cube
down to its back-end relational tables.
a. Drill-across
b. Drill-down
c. Drill-through
d. Slice and Dice
e. Pivot (rotate)
8. Stepping down a concept hierarchy for a dimension is the ____ operation
a. Roll-up
b. Drill-down
c. Slice and dice
d. Pivot (rotate)
9. _________________ structure can reduce the effectiveness of browsing, since more joins will
be needed to execute a query
a. Star
b. Snowflake
c. Fact Constellation
d. Starnet
10. When _____ is performed, one or more dimensions from the data cube are added.
a. Roll-up
b. Drill-down
c. Slice and dice
d. Pivot (rotate)
11. For data marts, _________ schema is commonly used, since both are geared toward modeling single
subjects, although the star schema is more popular and efficient
a. Star
b. Snowflake
c. Fact Constellation
d. Starnet
12. ______ is a visualization operation that rotates the data axes in view to provide an alternative data
presentation
a. Roll-up
b. Drill-down
c. Slice
d. Dice
e. Pivot (rotate)
13. The dimension tables of the ____ model may be kept in normalized form to reduce redundancies
a. Star
b. Snowflake
c. Fact Constellation
d. Starnet
14. The _____ operation performs a selection on one dimension of the given cube, resulting in a subcube.
a. Roll-up
b. Drill-down
c. Slice
d. Dice
e. Pivot (rotate)
15. The _____ operation executes queries involving (i.e., across) more than one fact table.
a. Drill-across
b. Drill-down
c. Drill-through
d. Slice and Dice
e. Pivot (rotate)
16. Concept hierarchies may also be defined by discretizing or grouping values for a given dimension or
attribute, resulting in a set-grouping hierarchy.
a. True
b. False
17. The data are extracted using application program interfaces known as
a. Gates
b. Gateways
c. Path
d. Pathways
18. Which tier contains a metadata repository in DW Architecture?
a. Bottom Tier
b. Middle Tier
c. Top Tier
d. Data Source
19. Dimensions are numeric measures.
a. True
b. False
20. _______ table/s may have redundancy in the star schema.
a. fact
b. dimensions
c. both dimensions and facts
d. No
21. ___________ is a subset of a data warehouse that contains small slices of data.
a. Flat Files
b. Metadata Repository
c. Data Marts
d. Data warehouse Server
22. For data warehouses, the __________ schema is commonly used, since it can model multiple,
interrelated subjects.
a. Star
b. Snowflake
c. Fact Constellation
d. Starnet
23. Aggregating data by ascending the location hierarchy from the level of "city" to the level of
"country" (city->country) is done using
a. Roll-up
b. Drill-down
c. Slice and dice
d. Pivot (rotate)
24. Snowflake schema is more popular than Star schema.
a. True
b. False
25. The front-end tools and utilities that data warehouse systems use to populate and refresh their data are
called the ETL process.
a. True
b. False
26. The term "Data Warehouse" was first coined in the year
a. 1990
b. 1991
c. 1992
d. 1993
27. _____ performs aggregation on a data cube, either by climbing up a concept hierarchy for a
dimension or by dimension reduction
a. Roll-up
b. Drill-down
c. Slice and dice
d. Pivot (rotate)
28. A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level,
more general concepts
a. True
b. False
29. The term "Data Warehouse" was first coined by
a. Bill Inmonn
b. Bill Inmon
c. Bill Imnon
d. Bill Innonn
30. An extended relational DBMS that maps operations on multidimensional data to standard relational
operations is
a. HOLAP
b. ROLAP
c. MOLAP
d. All of the above
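Illustration for the OLAP operations in Unit 1: a minimal Python sketch of roll-up, slice, and dice over a tiny list of fact rows. The row layout, the sample values, and the function names are assumptions made for this illustration only; a real OLAP server works on stored cubes or relational tables.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, country, quarter, item, units_sold)
FACTS = [
    ("Vancouver", "Canada", "Q1", "phone", 10),
    ("Toronto", "Canada", "Q1", "phone", 5),
    ("Toronto", "Canada", "Q2", "laptop", 7),
    ("Chicago", "USA", "Q1", "laptop", 3),
    ("Chicago", "USA", "Q2", "phone", 4),
]


def roll_up_city_to_country(facts):
    """Roll-up: climb the location hierarchy (city -> country) and sum the measure."""
    totals = defaultdict(int)
    for _city, country, quarter, item, units in facts:
        totals[(country, quarter, item)] += units
    return dict(totals)


def slice_cube(facts, quarter):
    """Slice: selection on a single dimension (here, quarter)."""
    return [row for row in facts if row[2] == quarter]


def dice_cube(facts, quarters, items):
    """Dice: selection on two or more dimensions (here, quarter and item)."""
    return [row for row in facts if row[2] in quarters and row[3] in items]


rolled = roll_up_city_to_country(FACTS)
print(rolled[("Canada", "Q1", "phone")])          # Vancouver + Toronto phone sales in Q1
print(slice_cube(FACTS, "Q1"))                    # only Q1 rows
print(dice_cube(FACTS, {"Q1"}, {"phone"}))        # Q1 rows restricted to phones
```

Drill-down is the reverse of the roll-up shown here: it would return from the country level to the more detailed city level.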
UNIT 2
1. Data Cleaning is:
a. to combine multiple data sources
b. to transform data into forms appropriate for mining
c. to identify the interesting patterns
d. to remove noise and inconsistent data
2. Data Integration is:
a. to combine multiple data sources
b. to transform data into forms appropriate for mining
c. to identify the interesting patterns
d. to remove noise and inconsistent data
3. Data Transformation is:
a. to combine multiple data sources
b. to transform data into forms appropriate for mining
c. to identify the interesting patterns
d. to remove noise and inconsistent data
4. Pattern Evaluation is:
a. to combine multiple data sources
b. to transform data into forms appropriate for mining
c. to identify the interesting patterns
d. to remove noise and inconsistent data
5. Knowledge Presentation is:
a. to combine multiple data sources
b. to transform data into forms appropriate for mining
c. visualization to present mined knowledge
d. to remove noise and inconsistent data
6. Is Data Mining a part of the KDD process?
a. True
b. False
7. Is KDD a part of the Data Mining Process?
a. True
b. False
8. Data Mining is:
a. to combine multiple data sources
b. to discover interesting patterns from large amounts of data
c. visualization to present mined knowledge
d. to remove noise and inconsistent data
9. What kind of data can be mined?
a. database data
b. data warehouses
c. Transactional data
d. All of the above
10. An example of time-related data is
a. Historical records, biological sequence, stock exchange records
b. Video Surveillance and sensor data
c. design of building and system components
d. text, images, video, audio data
11. An example of multimedia data is
a. Historical records, biological sequence, stock exchange records
b. Video Surveillance and sensor data
c. design of building and system components
d. text, images, video, audio data
12. An example of engineering design data is
a. Historical records, biological sequence, stock exchange records
b. Video Surveillance and sensor data
c. design of building and system components
d. text, images, video, audio data
13. An example of data streams is
a. Historical records, biological sequence, stock exchange records
b. Video Surveillance and sensor data
c. design of building and system components
d. text, images, video, audio data
14. Data discrimination is
a. Summarization of general features of target class of data
b. comparison of general features of target class of data
c. Differences between the general features of target class of data
d. None of the above
15. Data Characterization is
a. Summarization of general features of target class of data
b. comparison of general features of target class of data
c. Differences between the general features of target class of data
d. None of the above
16. Mining Frequent Pattern is
a. Summarization of general features of target class of data
b. comparison of general features of target class of data
c. Differences between the general features of target class of data
d. None of the above
17. A decision Tree is
a. Process of finding a model that describes data classes or concepts.
b. collection of neuron-like processing units with weighted connections between the units
c. flowchart-like structure
d. statistical methodology used for numeric prediction
18. Classification is
a. Process of finding a model that describes data classes or concepts.
b. collection of neuron-like processing units with weighted connections between the units
c. flowchart-like structure
d. statistical methodology used for numeric prediction
19. Neural Network is
a. Process of finding a model that describes data classes or concepts.
b. collection of neuron-like processing units with weighted connections between the units
c. flowchart-like structure
d. statistical methodology used for numeric prediction
20. Outlier Analysis is also called Anomaly Detection
a. True
b. False
21. Objective measures of interesting patterns are
a. Average and Mean
b. Mean and Mode
c. Mode and Support
d. Support and Confidence
UNIT 3
1. The data set {brown, black, blue, green, red} is an example of Select one:
a. Continuous attribute
b. Ordinal attribute
c. Numeric attribute
d. Nominal attribute
2. Which of the following is not a data pre-processing method? Select one:
a. Data Visualization
b. Data Discretization
c. Data Cleaning
d. Data Reduction
3. Which of the following is NOT an example of ordinal attributes? Select one:
a. Zip codes
b. Ordered numbers
c. Movie ratings
d. Military ranks
4. In an asymmetric attribute Select one:
a. No value is considered important over other values
b. All values are equal
c. Only non-zero value is important
d. Range of values is important
5. Identify the example of a nominal attribute Select one:
a. Temperature
b. Salary
c. Mass
d. Gender
6. Nominal and ordinal attributes can be collectively referred to as _________ attributes. Select one:
a. perfect
b. Qualitative
c. consistent
d. optimized
7. In Binning, we first sort data and partition into (equal-frequency) bins and then which of the following
is not a valid step Select one:
a. smooth by bin boundaries
b. smooth by bin median
c. smooth by bin means
d. smooth by bin values
8. Incorrect or invalid data is known as _________ Select one:
a. Missing data
b. Outlier
c. Changing data
d. Noisy data
9. Which of the following is NOT a data quality related issue? Select one:
a. Missing values
b. Outlier records
c. Duplicate records
d. Attribute value range
10. Which of the following is not a Data discretization Method? Select one:
a. Histogram analysis
b. Cluster Analysis
c. Data compression
d. Binning
11. Euclidean distance measure is
A. A stage of the KDD process in which new data is added to the existing selection.
B. The process of finding a solution for a problem simply by enumerating all possible solutions
according to some predefined order and then testing them
C. The distance between two points as calculated using the Pythagoras theorem
D. None of these
12. Binary attributes are
A. Attributes that take only two values. In general, these values will be 0 and 1 (1 bit)
B. The natural environment of a certain species
C. Systems that can be used without knowledge of internal operations
D. None of these
13. The distance between two points calculated using Pythagoras theorem is Select one:
a. Supremum distance
b. Euclidean distance
c. Linear distance
d. Manhattan Distance
14. The most common and effective numeric measure of the “center” of a set of data is the_______
a. Mean
b. Median
c. Mode
d. midrange
15. The value that occurs most frequently in the set is____
a. Mean
b. Median
c. Mode
d. midrange
16. _____ is the average of the largest and smallest values in the set.
a. Mean
b. Median
c. Mode
d. midrange
17. The Euclidean, Manhattan, and Minkowski distances are proximity measures for _____ data
a. Ordinal
b. Binary
c. Numeric
d. Nominal
18. Chebyshev distance is also popularly known as ____ distance
a. Euclidean
b. Manhattan
c. Minkowski
d. Supremum
19. City Block Distance is also popularly known as __ distance
a. Euclidean
b. Manhattan
c. Minkowski
d. Supremum
20. Tanimoto distance is a simple variation of
a. Minkowski Distance
b. Supremum Distance
c. cosine similarity
d. Manhattan Distance
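Illustration for the distance measures in Unit 3: a minimal Python sketch of the Manhattan (city block), Euclidean, Minkowski, and supremum (Chebyshev) distances between two numeric points. The function names and the sample points are assumptions chosen for this illustration.

```python
import math


def minkowski(x, y, h):
    """Minkowski distance of order h (h=1 is Manhattan, h=2 is Euclidean)."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)


def manhattan(x, y):
    """City block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """Straight-line distance, i.e. the Pythagorean theorem in n dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def supremum(x, y):
    """Chebyshev distance: the largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


p, q = (1, 2), (3, 5)          # coordinate differences are 2 and 3
print(manhattan(p, q))         # 2 + 3
print(euclidean(p, q))         # sqrt(2**2 + 3**2)
print(supremum(p, q))          # max(2, 3)
print(minkowski(p, q, 2))      # same value as euclidean(p, q)
```

As h grows, the Minkowski distance approaches the supremum distance, which is why the supremum distance is treated as the limiting case of the same family.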
UNIT 4
1. Which of the following data mining tasks is known as Market Basket Analysis? Select one:
a. Association Analysis
b. Regression
c. Classification
d. Outlier Analysis
2. Correlation analysis is used for Select one:
a. handling missing values
b. identifying redundant attributes
c. handling different data formats
d. eliminating noise
3. In a data mining task where it is not clear what type of patterns could be interesting, the data mining
system should Select one:
a. allow interaction with the user to guide the mining process
b. performs both descriptive and predictive tasks
c. performs all possible data mining tasks
d. handle different granularities of data and patterns
4. Which of the following are descriptive data mining activities? Select one:
a. Deviation detection
b. Classification
c. Clustering
d. Regression
5. Which of the following is an Entity identification problem? Select one:
a. One person with different email addresses
b. One person’s name written in different way
c. Title for person
d. One person with multiple phone numbers
6. This data transformation technique works well when minimum and maximum values for a real-valued
attribute are known. Select one:
a. z-score normalization
b. min-max normalization
c. logarithmic normalization
d. decimal scaling
7. The number of iterations in apriori ___________ Select one:
a. increases with the size of the data
b. decreases with the increase in size of the data
c. increases with the size of the maximum frequent set
d. decreases with increase in size of the maximum frequent set
8. The set of frequent itemsets is a Select one:
a. Superset of only closed frequent item sets
b. Superset of only maximal frequent item sets
c. Subset of maximal frequent item sets
d. Superset of both closed frequent item sets and maximal frequent item sets
9. In the Apriori algorithm, if there are 100 frequent 1-itemsets, then the number of candidate 2-itemsets is
Select one:
a. 100
b. 4950
c. 200
d. 5000
10. Significant Bottleneck in the Apriori algorithm is Select one:
a. Finding frequent itemsets
b. Pruning
c. Candidate generation
d. Number of iterations
11. The probability of a hypothesis before the presentation of evidence. Select one:
a. a priori
b. posterior
c. conditional
d. subjective
12. ________ in a decision tree is the root node.
a. Each non-leaf node
b. Each leaf node
c. The topmost node
d. None of the above
13. _______ are the splitting rules which determine how the tuples at a given node are to be split.
a. Attribute selection measures
b. Attribute splitting measures
c. Features Splitting measures
d. Rules Splitting measures
14. Information Gain, Gain Ratio, Gini Index are the three popular _____
a. Attribute selection measures
b. Attribute splitting measures
c. Features Splitting measures
d. Rules Splitting measures
15. The Gini index is biased toward ______ attributes
A. Single valued
B. Multivalued
C. Both A and B
D. None of the above
16. Prepruning and postpruning are the two popular methods of ____
a. Branch pruning
b. Leaf pruning
c. Tree pruning
d. Root pruning
17. The Apriori algorithm is an algorithm for mining frequent itemsets for Boolean association rules.
a. True
b. False
18. Frequent pattern growth is a method of mining frequent itemsets without candidate generation.
a. True
b. False
19. Apriori is better and cheaper than Frequent pattern growth as it is a method that mines the complete set
of frequent itemsets without a costly candidate generation process
a. True
b. False
20. ________ associations involve data at more than one abstraction level
a. Multidimensional
b. Multilevel
c. Quantitative
d. Qualitative
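Illustration for the Apriori candidate-generation question in Unit 4: a minimal Python sketch that counts the candidate 2-itemsets produced from 100 frequent 1-itemsets. The item names and the helper function are hypothetical, and the sketch covers only the pairing step, not support counting or pruning.

```python
from itertools import combinations
from math import comb


def candidate_2_itemsets(frequent_items):
    """Pair every two distinct frequent 1-items to form candidate 2-itemsets."""
    return [frozenset(pair) for pair in combinations(sorted(frequent_items), 2)]


# Hypothetical item labels: 100 frequent 1-itemsets
items = [f"item{i:03d}" for i in range(100)]
candidates = candidate_2_itemsets(items)

print(len(candidates))   # number of unordered pairs of 100 items
print(comb(100, 2))      # closed form: 100 * 99 / 2
```

The count grows quadratically with the number of frequent 1-itemsets, which is one reason candidate generation is regarded as the costly step of Apriori.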

UNIT 5
1. This clustering algorithm terminates when mean values computed for the current iteration of the
algorithm are identical to the computed mean values for the previous iteration Select one:
a. K-Means clustering
b. conceptual clustering
c. expectation maximization
d. agglomerative clustering
2. Find odd man out Select one:
a. DBSCAN
b. K means
c. PAM
d. K medoid
3. Which statement is true about the K-Means algorithm? Select one:
a. The output attribute must be categorical.
b. All attribute values must be categorical.
c. All attributes must be numeric
d. Attribute values may be either categorical or numeric
4. Which of the following is cluster analysis? Select one:
a. Simple segmentation
b. Grouping similar objects
c. Labeled classification
d. Query results grouping
5. A good clustering method will produce high quality clusters with Select one:
a. high inter class similarity
b. low intra class similarity
c. high intra class similarity
d. no inter class similarity
6. Which statement about outliers is true? Select one:
a. Outliers should be part of the training dataset but should not be present in the test data.
b. Outliers should be identified and removed from a dataset.
c. The nature of the problem determines how outliers are used
d. Outliers should be part of the test dataset but should not be present in the training data.
7. What does K refer to in the K-Means algorithm which is a non-hierarchical clustering approach? Select
one:
a. Complexity
b. Fixed value
c. No of iterations
d. number of clusters
8. Which of the following mentioned clustering methods divides the data into k groups such that each
group must contain at least one object.
a. Partitioning methods
b. Hierarchical methods
c. Density-based methods
d. Grid-based methods
9. Which of the following mentioned clustering methods creates a hierarchical decomposition of the
given set of data objects
a. Partitioning methods
b. Hierarchical methods
c. Density-based methods
d. Grid-based methods
10. Which of the following mentioned clustering methods continue to grow a given cluster as long as the
density (number of objects or data points) in the “neighborhood” exceeds some threshold.
a. Partitioning methods
b. Hierarchical methods
c. Density-based methods
d. Grid-based methods
11. Which of the following mentioned clustering methods quantize the object space into a finite number of
cells that form a grid structure.
a. Partitioning methods
b. Hierarchical methods
c. Density-based methods
d. Grid-based methods
12. The k-means method is not guaranteed to converge to the global optimum and often
terminates at a local optimum.
a. True
b. False
13. The Partitioning Around Medoids (PAM) algorithm is a popular realization of ________________
clustering.
a. K Means
b. K-medoids
c. Extended-PAM
d. BIRCH
14. BIRCH stands for
a. Balanced Iterative Reducing and Clustering using Hierarchies
b. Balanced Iterative Reducing and Classification using Hierarchies
c. Balanced Iterative Regression and Clustering using Hierarchies
d. Balanced Iterative Regression and Classification using Hierarchies
15. DBSCAN stands for
a. Density-Based Classification Based on Connected Regions with High Density
b. Density-Based Classification Based on Connected Regions with Low Density
c. Density-Based Clustering Based on Connected Regions with High Density
d. Density-Based Clustering Based on Connected Regions with Low Density
16. A cluster is a collection of data objects that are ______
a. similar to one another within the same cluster and similar to the objects in other clusters
b. dissimilar to one another within the same cluster and are dissimilar to the objects in other clusters
c. similar to one another within the same cluster and are dissimilar to the objects in other clusters
d. dissimilar to one another within the same cluster and are similar to the objects in other clusters
17. This method can be classified as being either agglomerative (bottom-up) or divisive (top-down), based on
how the hierarchical decomposition is formed
a. Partitioning methods
b. Hierarchical methods
c. Density-based methods
d. Grid-based methods
18. __________ are the simplest form of outlier and the easiest to detect.
a. contextual outliers
b. collective outliers
c. conceptual outliers
d. global outliers
19. _________ methods consult the neighborhood of an object, defined by a given radius. An object is an
outlier if its neighborhood does not have enough other points.
a. Clustering-based outlier detection
b. Classification-based outlier detection
c. Proximity-based outlier detection
d. Distance-based outlier detection
20. _______ methods assume that the normal data objects belong to large and dense clusters, whereas
outliers belong to small or sparse clusters, or do not belong to any clusters.
a. Clustering-based outlier detection
b. Classification-based outlier detection
c. Proximity-based outlier detection
d. Distance-based outlier detection
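Illustration for the k-means questions in Unit 5: a minimal Python sketch of the algorithm that stops when the cluster means of the current iteration are identical to those of the previous one. The function name, the sample points, and the fixed initial centers are assumptions for this illustration; practical implementations usually choose initial centers randomly, which is why the result can be a local optimum.

```python
def k_means(points, initial_centers, max_iter=100):
    """Plain k-means on numeric tuples; k is the number of initial centers."""
    centers = [tuple(c) for c in initial_centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[nearest].append(p)
        # Update step: recompute each mean; keep the old one for an empty cluster.
        new_centers = []
        for old, members in zip(centers, clusters):
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centers.append(old)
        # Termination: means unchanged from the previous iteration.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centers, clusters = k_means(points, [(1.0, 1.0), (9.0, 9.0)])
print(centers)
print(clusters)
```

Because every attribute is averaged, this sketch only makes sense for numeric attributes, and each resulting group is represented by its mean rather than by an actual object as in k-medoids.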
Question Bank : BECT406T: Data Warehousing & Mining (MCQ)

Unit 1
1) __________ is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of
management decisions.
A. Data Mining.
B. Data Warehousing.
C. Web Mining.
D. Text Mining

2) __________ is the heart of the warehouse.


A) Data mining database servers.
B) Data warehouse database servers.
C) Data mart database servers.
D) Relational database servers

3) Data can be updated in the _____ environment.


A) data warehouse.
B) data mining.
C) operational.
D) informational

4) The star schema is composed of __________ fact table.


A) one
B) two.
C) three.
D) four.

5) The source of all data warehouse data is the____________.


A) Operational environment.
B) Informal environment.
C) Formal environment.
D) Technology environment

6) Data warehouse contains _____________ data that is never found in the operational environment.


A) normalized.
B) informational.
C) summary.
D) denormalized
7) An operational system
A) runs the business in real time and is based on historical data
B) runs the business in real time and is based on current data
C) is used to support decision making and is based on current data.
D) supports decision making and is based on historical data.

8) State whether the following statements about the three-tier data warehouse architecture are True or False.
i) OLAP server is the middle tier of data warehouse architecture.
ii) The bottom tier of data warehouse architecture does not include a metadata repository.
A) i-True, ii-False
B) i-False, ii-True
C) i-True, ii-True
D) i-False, ii-False

9) Data warehouses support _________________


A) OLTP
B) OLAP and OLTP
C) OLAP
D) Operational databases

10) __________describes the data contained in the data warehouse.


A) Relational data.
B) Operational data.
C) Metadata.
D) Informational data.

11. The … of the data warehouse architecture contains query and reporting tools, analysis tools, and data mining
tools.
A) bottom tier
B) middle tier
C) top tier
D) both B and C

12. Which of the following are examples of gateways of the bottom tier of data warehouse architecture?
i) ODBC (Open Database Connectivity)
ii) OLEDB (Object Linking and Embedding, Database)
iii) JDBC (Java Database Connectivity)
A) i and ii only
B) ii and iii only
C) i and iii only
D) All i, ii and iii

13. Back-end tools and utilities are used to feed data into the … from operational databases or other external sources.
A) bottom tier
B) middle tier
C) top tier
D) both A and B

14. From the architecture point of view, there are… data warehouse models.
A) two
B) three
C) four
D) five

15. A … contains a subset of corporate-wide data that is of value to a specific group of users.
A) primary warehouse
B) virtual warehouse
C) enterprise warehouse
D) data mart

16. A … is a set of views over operational databases.


A) primary warehouse
B) virtual warehouse
C) enterprise warehouse
D) data mart

17. State whether the following statements about the enterprise warehouse are True or False.
i) Enterprise warehouse contains details as well as summarized data.
ii) It provides corporate-wide data integration.
A) i-True, ii-False
B) i-False, ii-True
C) i-True, ii-True
D) i-False, ii-False

18. State whether the following statements about the OLTP system are True.
i) Clerks, database administrators, and database professionals are the users of the OLTP system.
ii) It is used on long-term informational requirements.
iii) It has short and simple transactions.
A) i and ii only
B) ii and iii only
C) i and iii only
D) All i, ii and iii

19. State whether the following statements about the OLAP system are True or False.
i) Knowledge workers such as managers, executives, and analysts are the users of the OLAP system.
ii) This system is used in day-to-day operations.
iii) The database size of the OLAP system will be 100GB to TB.
A) i-True, ii-False, iii-True
B) i-False, ii-True, iii-True
C) i-True, ii-True, iii-False
D) i-False, ii-False, iii-True

20. Multidimensional model of a data warehouse can exist in the form of the following schema.
i) Star Schema
ii) Snowflake Schema
iii) Fact Constellation Schema
A) i and ii only
B) ii and iii only
C) i and iii only
D) All i, ii and iii

21. In the … the dimension tables are displayed in a radial pattern around the central fact table.
A) snowflake schema
B) star schema
C) fact schema
D) fact constellation schema

22. The dimension tables of the … model can be kept in the normalized form to reduce the redundancies.
A) snowflake schema
B) star schema
C) fact schema
D) fact constellation schema
23. State whether the following statements about the fact constellation schema are True or False.
i) The fact constellation schema is also called galaxy schema.
ii) The fact constellation schema allows dimension tables to be shared between fact tables.
iii) This kind of schema can be viewed as a collection of snowflakes.
A) i-True, ii-False, iii-True
B) i-False, ii-True, iii-True
C) i-True, ii-True, iii-False
D) i-False, ii-False, iii-True

24. Which of the following are the different OLAP operations performed in the multidimensional data model.
i) Roll-up
ii) Roll-down
iii) Drill-down
iv) Slice
A) i, ii, and iii only
B) ii, iii, and iv only
C) i, iii, and iv only
D) All i, ii, iii, and iv

25. When … operation is performed, one or more dimensions from the data cube are removed.
A) roll-up
B) roll-down
C) drill-down
D) drill-up

26. The … operation selects one particular dimension from a given cube and provides a new subcube.
A) drill
B) dice
C) pivot
D) slice

27. The … operation rotates the data axes in view in order to provide an alternative presentation of data.
A) drill
B) dice
C) pivot
D) slice

28. Which of the following are the different types of OLAP servers.
i) Relational OLAP
ii) Multidimensional OLAP
iii) Hybrid OLAP
iv) Specialized SQL Servers
A) i, ii, and iii only
B) ii, iii, and iv only
C) i, iii, and iv only
D) All i, ii, iii, and iv

29. … servers allow storing a large data volume of detailed information.


A) Relational OLAP
B) Multidimensional OLAP
C) Hybrid OLAP
D) Specialized SQL Servers

30. Data that can be modeled as dimension attributes and measure attributes are called _______ data.
a) Multidimensional
b) Singledimensional
c) Measured
d) Dimensional

UNIT II
1. ...................... is an essential process where intelligent methods are applied to extract data patterns.
A) Data warehousing
B) Data mining
C) Text mining
D) Data selection

2. Data mining can also be applied to other forms of data such as ................


i) Data streams
ii) Sequence data
iii) Networked data
iv) Text data
v) Spatial data
A) i, ii, iii and v only
B) ii, iii, iv and v only
C) i, iii, iv and v only
D) All i, ii, iii, iv and v

3. Which of the following is not a data mining functionality?


A) Characterization and Discrimination
B) Classification and regression
C) Selection and interpretation
D) Clustering and Analysis

4. ............................. is a summarization of the general characteristics or features of a target class of data.


A) Data Characterization
B) Data Classification
C) Data discrimination
D) Data selection

5. ............................. is a comparison of the general features of the target class data objects against the
general features of objects from one or multiple contrasting classes.
A) Data Characterization
B) Data Classification
C) Data discrimination
D) Data selection

6. Strategic value of data mining is ......................


A) cost-sensitive
B) work-sensitive
C) time-sensitive
D) technical-sensitive

7. ............................. is the process of finding a model that describes and distinguishes data classes or
concepts.
A) Data Characterization
B) Data Classification
C) Data discrimination
D) Data selection

8. The various aspects of data mining methodologies is/are ...................


i) Mining various and new kinds of knowledge
ii) Mining knowledge in multidimensional space
iii) Pattern evaluation and pattern or constraint-guided mining.
iv) Handling uncertainty, noise, or incompleteness of data
A) i, ii and iv only
B) ii, iii and iv only
C) i, ii and iii only
D) All i, ii, iii and iv

9. The full form of KDD is ..................


A) Knowledge Database
B) Knowledge Discovery Database
C) Knowledge Data House
D) Knowledge Data Definition

10. The output of KDD is .............


A) Data
B) Information
C) Query
D) Useful information

11. Which of the following is true for Classification?


a) A subdivision of a set
b) A measure of the accuracy
c) The task of assigning a classification
d) All of these

12. Which of the following is a summarization of the general characteristics or features of a target class of data?


a) Data selection
b) Data discrimination
c) Data Classification
d) Data Characterization

13. Which of the following statements is correct about data mining?


a. It can be referred to as the procedure of mining knowledge from data
b. Data mining can be defined as the procedure of extracting information from a set of data
c. The procedure of data mining also involves several other processes like data cleaning, data transformation,
and data integration
d. All of the above

14. Which of the following can be considered as the classification or mapping of a set or class with some
predefined group or classes?
a. Data set
b. Data Characterization
c. Data Sub Structure
d. Data Discrimination

15. Which one of the following can be defined as the data object which does not comply with the general
behavior (or the model of available data)?
a. Evaluation Analysis
b. Outlier Analysis
c. Classification
d. Prediction

16. Which one of the following statements is not correct about data cleaning?
a. It refers to the process of data cleaning
b. It refers to the transformation of wrong data into correct data
c. It refers to correcting inconsistent data
d. All of the above

17. Which of the following is the correct advantage of the Update-Driven Approach?


a. This approach provides high performance.
b. The data can be copied, processed, integrated, annotated, summarized and restructured in the semantic
data store in advance.
c. Both A and B
d. None of the above

18. Which of the following correctly refers to data selection?


a. A subject-oriented integrated time-variant non-volatile collection of data in support of management
b. The actual discovery phase of a knowledge discovery process
c. The stage of selecting the right data for a KDD process
d. All of the above

19. Which of the following is also used as the first step in the knowledge discovery process?
a. Data selection
b. Data cleaning
c. Data transformation
d. Data integration

20. Which of the following refers to the step of the knowledge discovery process in which several data
sources are combined?
a. Data selection
b. Data cleaning
c. Data transformation
d. Data integration

21. The term "DMQL" stands for _____


a. Data Marts Query Language
b. DBMiner Query Language
c. Data Mining Query Language
d. None of the above

22. Which of these is correct about data mining?


a. It is a procedure in which knowledge is mined from data.
b. It involves processes like Data Transformation, Data Integration, Data Cleaning.
c. It is a procedure using which one can extract information out of huge sets of data.
d. All of the above

23. The issues of “Scalability and efficiency of the data mining algorithms” come under:
a. User Interaction and Mining Methodology Issues
b. Diverse Data Types Issues
c. Performance Issues
d. None of the above

24. The primary use of data cleaning is:


a. Removing the noisy data
b. Correction of the data inconsistencies
c. Transformations for correcting the wrong data
d. All of the above

25. The class under study in Data Characterization is known as:


a. Final Class
b. Target Class
c. Initial Class
d. Study Class

26. __________ describes and models regularities or trends for objects whose behavior
changes over time.
a. Evolution Analysis
b. Outlier Analysis
c. Classification
d. Prediction

27.The issue of Pattern evaluation comes under which of these?


a. Performance Issues
b. Diverse Data Types Issues
c. User Interaction and Mining Methodology Issues
d. None of the above

28. The issue of “Handling complex and relational types of data” comes under:
a. User Interaction and Mining Methodology Issues
b. Diverse Data Types Issues
c. Performance Issues
d. None of the above

29.Multiple numbers of data sources get combined in which step of the Knowledge Discovery?
a. Data Transformation
b. Data Selection
c. Data Integration
d. Data Cleaning

30.Which of the following is correct application of data mining?


A. Market Analysis and Management
B. Corporate Analysis & Risk Management
C. Fraud Detection
D. All of the above

Unit 3:-Classification & Clustering


1) ____________ maps data into predefined groups
A) Regression
B) Time series analysis
C) Prediction
D) Classification

2) BIRCH is a ________
A) agglomerative clustering algorithm.
B) hierarchical algorithm.
C) hierarchical-agglomerative algorithm.
D) divisive.

3) Which of the following is a clustering algorithm?


A) Apriori.
B) CLARA.
C) Pincer-Search.
D) FP-growth

4) In ________ algorithm each cluster is represented by the center of gravity of the cluster.
A) k-medoid.
B) k-means.
C) STIRR
D) ROCK.

5) In ___________ each cluster is represented by one of the objects of the cluster located near the center.
A) k-medoid.
B) k-means.
C) STIRR.
D) ROCK.

6) Pick out a hierarchical clustering algorithm.


A) DBSCAN
B) BIRCH.
C) PAM.
D) CURE.

7)Which one of the following correctly defines the term cluster?


a. Group of similar objects that differ significantly from other objects
b. Symbolic representation of facts or ideas from which information can potentially be extracted
c. Operations on a database to transform or simplify data in order to prepare it for a machine-learning
algorithm
d. All of the above

8)Which one of the following refers to the binary attribute?


a. This takes only two values. In general, these values will be 0 and 1, and they can be coded as one bit
b. The natural environment of a certain species
c. Systems that can be used without knowledge of internal operations
d. All of the above
9)Which is needed by K-means clustering?
(A) defined distance metric
(B) number of clusters
(C) initial guess as to cluster centroids
(D)all of these
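
The three inputs named in question 9 (a distance metric, the number of clusters, and initial centroids) can be seen in a minimal k-means sketch. The sample points and starting centroids below are made up for illustration, Euclidean distance is assumed, and the function name `k_means` is hypothetical.

```python
import math

def k_means(points, centroids, max_iterations=10):
    """Cluster 2-D points with Euclidean distance, starting from given centroids."""
    centroids = list(centroids)  # the number of clusters is len(centroids)
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: each centroid moves to the mean (center of gravity) of its cluster.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return centroids, clusters

# Illustrative data: two visually obvious groups and two rough initial guesses.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centroids, final_clusters = k_means(points, [(0, 0), (10, 10)])
print(final_clusters)
```

Replacing the mean in the update step with the most central member of each cluster would give the k-medoid behavior described in question 5.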

10)Which clustering technique requires a merging approach?


(A). Partitional
(B). Hierarchical
(C). Naive Bayes
(D). None of the mentioned

11)Which of the following clustering algorithm follows a top to bottom approach?


A)K-means
B)Divisive
C)Agglomerative
D)None

12)Which algorithm does not require a dendrogram?


A)K-means
B)Divisive
C)Agglomerative
D)All of above

13)What is a dendrogram?
A)A hierarchical structure
B)A diagram structure
C)A graph structure
D)None

14)Which of the following is not clustering method?


A)Density-Based
B) Hierarchical Based
C) Grid-based
D) Project Based

15) __________ methods treat clusters as dense regions of similar objects that are separated from
lower-density regions of the space
A) Density-Based
B) Hierarchical Based
C) Grid-based
D) None of these

16)Agglomerative has _________ approach


A) top down
B) bottom up
C) down up
D) None of these

17)Divisive has _________ approach


A) top down
B) bottom up
C) down up
D) None of these

18) A _________ is a decision support tool that uses a tree-like graph or model of decisions and their possible
consequences, including chance event outcomes, resource costs, and utility.
a) Decision tree
b) Graphs
c) Trees
d) Neural Networks

19)What is Decision Tree?


a) Flow-Chart
b) Structure in which each internal node represents a test on an attribute, each branch represents an outcome of
the test, and each leaf node represents a class label
c) Flow-Chart & Structure in which each internal node represents a test on an attribute, each branch represents
an outcome of the test, and each leaf node represents a class label
d) None of the mentioned

20) Which of the following is NOT an example of an ordinal attribute? Select one:


a. Zip codes
b. Ordered numbers
c. Movie ratings
d. Military ranks

21) Identify the example of a nominal attribute. Select one:


a. Temperature
b. Salary
c. Mass
d. Gender

22)Cluster analysis is ...................


a. Supervised learning
b.Unsupervised learning
c. Hybrid learning
d.Reinforcement learning

23)Which of the following refers to the problem of finding abstracted patterns (or structures) in the unlabeled
data?
a.Supervised learning
b.Unsupervised learning
c.Hybrid learning
d.Reinforcement learning

24)Which one of the following statements about the K-means clustering is incorrect?
a. The goal of the k-means clustering is to partition (n) observation into (k) clusters
b. K-means clustering can be defined as the method of quantization
c. The nearest neighbor is the same as the K-means
d. All of the above

25) Which one of the clustering techniques needs the merging approach?
a. Partitioned
b. Naïve Bayes
c. Hierarchical
d. Both A and C

26) How do you choose the right node while constructing a decision tree?
(A) An attribute having high entropy
(B) An attribute having high entropy and information gain
(C) An attribute having the lowest information gain.
(D) An attribute having the highest information gain
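
A small sketch of how information gain can be computed when choosing a split attribute, as asked in question 26. The toy weather-style rows and the function names are assumptions made for illustration; this is not a full decision tree learner.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical training rows: "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": True, "play": "yes"},
]
best_attribute = max(["outlook", "windy"], key=lambda a: information_gain(rows, a, "play"))
print(best_attribute)
```

The attribute with the highest gain is the one whose split leaves the purest subsets, which is why it is chosen for the node.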

27)The task of inferring a model from labelled training data is called


A) Unsupervised learning
B) Supervised learning
C) Reinforcement learning
D) Deep learning

28) In a decision tree, which of the following is used to represent a segment?


A) Root node
B) Leaf node
C) Interior nodes
D) Exterior nodes

29) The goal of clustering analysis is to:


a) Maximize the inter-cluster similarity
b) Maximize the intra-cluster similarity
c) Maximize the number of clusters
d) Minimize the intra-cluster similarity

30) In decision tree algorithms, attribute selection measures are used to


a) Reduce the dimensionality
b) Select the splitting criteria which best separate the data
c) Reduce the error rate
d) Rank attributes
UNIT IV Mining frequent patterns and Association Rules:
1) The non-root node of item-prefix-tree consists of ________ fields.
A) two.
B) three.
C) four.
D) five.

2)The paths from root node to the nodes labelled 'a' are called __________.
A)transformed prefix path.
B)suffix subpath.
C)transformed suffix path.
D) prefix subpath

3) The transformed prefix paths of a node 'a' form a truncated database of patterns that co-occur with
'a'; this is called the _______.
A)suffix path.
B)FP-tree.
C)conditional pattern base.
D) prefix path

4) The number of iterations in Apriori ___________


a. increases with the size of the data
b. decreases with the increase in size of the data
c. increases with the size of the maximum frequent set
d. decreases with increase in size of the maximum frequent set

5). Which of the following are interestingness measures for association rules?
a. recall
b. lift
c. accuracy
d. compactness

6). Frequent item sets is


a. Superset of only closed frequent item sets
b. Superset of only maximal frequent item sets
c. Subset of maximal frequent item sets
d. Superset of both closed frequent item sets and maximal frequent item sets
7) In the Apriori algorithm, if there are 100 frequent 1-itemsets, then the number of candidate 2-itemsets is
a. 100
b. 4950
c. 200
d. 5000
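
The count in question 7 is the number of unordered pairs that can be formed from 100 items. A quick check, assuming every pair of frequent 1-itemsets becomes a candidate:

```python
import math
from itertools import combinations

n_frequent_items = 100

# Candidate 2-itemsets are unordered pairs of frequent items: C(100, 2).
candidate_pairs = math.comb(n_frequent_items, 2)

# Cross-check by enumerating the pairs explicitly.
enumerated = sum(1 for _ in combinations(range(n_frequent_items), 2))

print(candidate_pairs, enumerated)  # 4950 4950
```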

8). Significant Bottleneck in the Apriori algorithm is


a. Finding frequent itemsets
b. Pruning
c. Candidate generation
d. Number of iterations
9). Which Association Rule would you prefer
a. High support and medium confidence
b. High support and low confidence
c. Low support and high confidence
d. Low support and low confidence

10). The FP-growth algorithm has ________ phases.


A. one.
B. two.
C. three.
D. four.

11). Which of the following is a predictive model?


A. Clustering.
B. Regression.
C. Summarization.
D. Association rules.

12). The basic idea of the apriori algorithm is to generate________ item sets of a particular size & scans the
database.
A. candidate.
B. primary.
C. secondary.
D. Superkey.

13). If an item set ‘XYZ’ is a frequent item set, then all subsets of that frequent item set are
a. Undefined
b. Not frequent
c. Frequent
d. Can not say
14) A frequent pattern tree is a tree structure consisting of ________
A) an item-prefix-tree
B) a frequent-item-header table.
C) a frequent-item-node.
D) both A &B
15) Frequency of occurrence of an itemset is called as _____

(a) Support
(b) Confidence
(c) Support Count
(d) Rules

16) An itemset whose support is greater than or equal to a minimum support threshold is ______

(a) Itemset
(b) Frequent Itemset
(c) Infrequent items
(d) Threshold values

17 ) What does FP growth algorithm do?

(a) It mines all frequent patterns through pruning rules with lesser support
(b) It mines all frequent patterns through pruning rules with higher support
(c) It mines all frequent patterns by constructing an FP-tree
(d) It mines all frequent patterns by constructing itemsets
18) What techniques can be used to improve the efficiency of apriori algorithm?

(a) Hash-based techniques


(b)Transaction Increases
(c) Sampling
(d)Cleaning

19) What do you mean by support(A)?

(a) Total number of transactions containing A


(b) Total Number of transactions not containing A
(c) Number of transactions containing A / Total number of transactions
(d) Number of transactions not containing A / Total number of transactions

20) How do you calculate Confidence (A -> B)?


(a) Support(A ∩ B) / Support (A)
(b) Support(A ∩ B) / Support (B)
(c) Support(A ∪ B) / Support (A)
(d) Support(A ∪ B) / Support (B)
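
Questions 19 and 20 can be illustrated with a short sketch. It assumes the usual itemset convention, in which Support(A ∪ B) counts the transactions that contain every item of both A and B. The basket data and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of the combined itemset divided by the support of the antecedent."""
    combined = set(antecedent) | set(consequent)
    return support(transactions, combined) / support(transactions, antecedent)

# Made-up baskets for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
bread_support = support(baskets, {"bread"})               # 3 of 4 baskets
bread_to_milk = confidence(baskets, {"bread"}, {"milk"})  # 2 of the 3 bread baskets
print(bread_support, bread_to_milk)
```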
21) Which of the following is the direct application of frequent itemset mining?

(a) Social Network Analysis


(b) Market Basket Analysis
(c) Outlier Detection
(d) Intrusion Detection
22) What is not true about FP growth algorithms?

a) It mines frequent itemsets without candidate generation


(b) There are chances that FP trees may not fit in the memory
(c) FP trees are very expensive to build
(d) It expands the original database to build FP trees
23) When do you consider an association rule interesting?

(a) If it only satisfies min_support


(b) If it only satisfies min_confidence
(c) If it satisfies both min_support and min_confidence
(d) There are other measures to check so
24 ) What is the relation between a candidate and frequent itemsets?

(a) A candidate itemset is always a frequent itemset


(b)A frequent itemset must be a candidate itemset
(c) No relation between these two
(d)Strong relation with transactions

25) Which of the following is not a frequent pattern mining algorithm?

(a) Apriori
(b) FP growth
(c) Decision trees
(d) Eclat

26) Which algorithm requires fewer scans of data?

(a) Apriori
(b)FP Growth
(c) Naive Bayes
(d)Decision Trees

27) For the question given below consider the data Transactions :

1. I1, I2, I3, I4, I5, I6


2. I7, I2, I3, I4, I5, I6
3. I1, I8, I4, I5
4. I1, I9, I10, I4, I6
5. I10, I2, I4, I11, I5

With support as 0.6 find all frequent itemsets?

(a) <I1>, <I2>, <I4>, <I5>, <I6>, <I1, I4>, <I2, I4>, <I2, I5>, <I4, I5>, <I4, I6>, <I2, I4, I5>
(b) <I2>, <I4>, <I5>, <I2, I4>, <I2, I5>, <I4, I5>, <I2, I4, I5>
(c) <I11>, <I4>, <I5>, <I6>, <I1, I4>, <I5, I4>, <I11, I5>, <I4, I6>, <I2, I4, I5>
(d) <I1>, <I4>, <I5>, <I6>
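
The five transactions in question 27 can be checked with a small level-wise search. This is an illustrative sketch in the spirit of Apriori (join, prune, count), not a tuned implementation, and the function name `frequent_itemsets` is hypothetical.

```python
from itertools import combinations

transactions = [
    {"I1", "I2", "I3", "I4", "I5", "I6"},
    {"I7", "I2", "I3", "I4", "I5", "I6"},
    {"I1", "I8", "I4", "I5"},
    {"I1", "I9", "I10", "I4", "I6"},
    {"I10", "I2", "I4", "I11", "I5"},
]

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)
    found = {}
    # Level 1 candidates: every single item that appears anywhere.
    candidates = [frozenset([item]) for item in sorted(set().union(*transactions))]
    while candidates:
        # Count step: keep candidates whose support reaches the threshold.
        level = {}
        for candidate in candidates:
            count = sum(candidate <= t for t in transactions)
            if count / n >= min_support:
                level[candidate] = count / n
        found.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        size = len(next(iter(level))) + 1 if level else 0
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        # Prune step: drop candidates that have an infrequent k-item subset.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        ]
    return found

found = frequent_itemsets(transactions, 0.6)
for itemset in sorted(found, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), found[itemset])
```

With a minimum support of 0.6 (at least 3 of the 5 transactions), the search yields 11 itemsets: five single items, five pairs, and the triple I2, I4, I5.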

28) What will happen if support is reduced?

(a) Number of frequent itemsets remains the same


(b) Some itemsets will add to the current set of frequent itemsets
(c) Some itemsets will become infrequent while others will become frequent
(d) Can not say
29) What is association rule mining?

(a) Same as frequent itemset mining


(b) Finding of strong association rules using frequent itemsets
(c) Using association to analyze correlation rules
(d) Finding Itemsets for future trends
30) A definition or a concept is ______ if it classifies any examples as coming within the concept

(a) Concurrent
(b) Consistent
(c) Constant
(d) Complete

Unit 5:-Web Data Mining


1) Web content mining describes the discovery of useful information from the _______ contents.
A) text
B) web.
C) page.
D) level

2) _______ mining is concerned with discovering the model underlying the link structures of the web.
A) Data structure.
B) Web structure
C) Text structure
D) Image structure.

3) The ________ proposes a measure of the standing of a node based on path counting.


A)open web.
B) close web.
C) link web.
D) hidden web.
4) In web mining, _______ is used to find natural groupings of users, pages, etc.
A) clustering.
B) associations.
C) sequential analysis.
D) classification.

5) In web mining, _________ is used to know the order in which URLs tend to be accessed.
A) clustering.
B) associations.
C) sequential analysis.
D) classification.

6) In web mining, _________ is used to know which URLs tend to be requested together.
A) clustering.
B) associations.
C) sequential analysis.
D) classification.

7) __________ describes the discovery of useful information from the web contents.
A)Web content mining.
B) Web structure mining.
C) Web usage mining.
D) All of the above.

8)_______ is concerned with discovering the model underlying the link structures of the web
A) Web content mining.
B) Web structure mining
C) Web usage mining.
D) All of the above
9)A link is said to be _________ link if it is between pages with different domain names.
A) intrinsic.
B) transverse.
C) direct.
D) contrast.

10) A link is said to be _______ link if it is between pages with the same domain name.
A) intrinsic.
B) transverse.
C) direct.
D) contrast.
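
The distinction in questions 9 and 10 can be sketched in a few lines. The sketch assumes that "domain name" means the full hostname of each URL; the function name `link_type` and the example URLs are hypothetical.

```python
from urllib.parse import urlparse

def link_type(source_url, target_url):
    """Label a hyperlink by comparing the hostnames of its two endpoints."""
    same_domain = urlparse(source_url).hostname == urlparse(target_url).hostname
    return "intrinsic" if same_domain else "transverse"

same_site = link_type("http://example.com/index.html", "http://example.com/about.html")
cross_site = link_type("http://example.com/index.html", "http://example.org/")
print(same_site, cross_site)
```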
11) Hierarchical, partitioning, grid-based and density-based methods are methods of
A) Clustering
B) Classification
C) Association
D) Outlier Detection

12) Web structure mining is the process of discovering ____ information from the web
A) Semi structured
B) Unstructured
C) Structured
D) None of the above

13) Web mining is the application of _______
A) Data Mining
B) Text Mining
C) Both A and B
D) None of these

14) Select the non-predictive data mining technique from the options below
A) Summarization
B) Classification
C) Regression
D) Time Series Analysis

15) K-means is an example of
A) Association rule
B) Clustering
C) Regression
D) Classification

16) PageRank is a metric for ________ documents based on their quality
A) ranking hypertext
B) ranking document structure
C) ranking web content
D) None of these

17) Select the non-descriptive data mining technique from the options below
A) Clustering
B) Summarization
C) Sequence Discovery
D) Classification

18) Select the non-predictive data mining technique from the options below
A) Summarization
B) Classification
C) Regression
D) Time Series Analysis

19) In data mining, data objects that do not comply with the general behavior or model of the data are called
A) Clusters
B) Centroids
C) Outliers
D) None of these

20) Web usage mining refers to the discovery of user access patterns from Web usage logs
A) True
B) False

21) BIRCH stands for
A) Balanced Interactive Regression and Clustering using Hierarchies
B) Balanced Iterative Reducing and classification using Hierarchies
C) Balanced Iterative Reducing and Clustering Using Hierarchies
D) Balanced Interactive Reducing and Clustering using Hierarchies

22) Which of the following is not a data preprocessing step?
A) Data cleaning
B) Data Reduction
C) Regression
D) Data Loading

23) In data mining, data objects that do not comply with the general behavior or model of the data are called
A) Clusters
B) Centroids
C) Outliers
D) None of these

24) Web server data includes ________
A) IP address
B) Page reference
C) Access time
D) All of the above

25) The main purpose of structure mining is to extract previously unknown relationships between
A) Web pages
B) Web hyperlinks
C) Web data
D) Web contents

UNIT VI Big data Analytics
1. Hadoop is a framework that works with a variety of related tools. Common cohorts include ____________
a) MapReduce, Hive and HBase
b) MapReduce, MySQL and Google Apps
c) MapReduce, Hummer and Iguana
d) MapReduce, Heron and Trumpet

2. What was Hadoop named after?


a) Creator Doug Cutting’s favorite circus act
b) Cutting’s high school rock band
c) The toy elephant of Cutting’s son
d) A sound Cutting’s laptop made during Hadoop development

3. __________ can best be described as a programming model used to develop Hadoop-based applications
that can process massive amounts of data.
a) MapReduce
b) Mahout
c) Oozie
d) All of the mentioned
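
The programming model in question 3 can be illustrated with a word count, the customary first MapReduce example. The sketch below runs in a single Python process and does not use any Hadoop API; the function names and sample documents are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

documents = ["big data big cluster", "data node"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(word_counts)
```

In a real cluster the map and reduce calls run in parallel on different nodes, which is what lets the model scale to very large inputs.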

4. Facebook Tackles Big Data With _______ based on Hadoop.


a) ‘Project Prism’
b) ‘Prism’
c) ‘Project Big’
d) ‘Project Data’

5. What are the five V’s of Big Data?


a) Volume
b) Velocity
c) Variety
d) All the above

6. Above the file systems comes the ________ engine, which consists of one JobTracker, to which client
applications submit MapReduce jobs.
A. MapReduce
B. Google
C. Functional Programming
D. Facebook
7. ________ is a platform for constructing data flows for extract, transform, and load (ETL) processing and
analysis of large datasets.
A. Pig Latin
B. Oozie
C. Pig
D. Hive

8. According to analysts, for what can traditional IT systems provide a foundation when they’re integrated
with big data technologies like Hadoop?
a) Big data management and data mining
b) Data warehousing and business intelligence
c) Management of Hadoop clusters
d) Collecting and storing unstructured data

9. What are the different features of Big Data Analytics?


A. Open Source
B. Data Recovery
C. Scalability
D. all of the above

10. What are the main components of Big Data?


A. MapReduce
B. HDFS
C. YARN
D. all the above

11. Data in ____ bytes size is called Big Data.

A. Tera
B. Giga
C. Peta
D. Meta

12. Unprocessed data or processed data are observations or measurements that can be expressed as
text, numbers, or other types of media.

A. True
B. False

13. In computers, a ____ is a symbolic representation of facts or concepts from which information
may be obtained with a reasonable degree of confidence.

A. Data
B. Knowledge
C. Program
D. Algorithm

14. In Big Data environments, velocity means that:

A. Data can arrive at fast speed


B. Enormous datasets can accumulate within very short periods of time
C. Velocity of data translates into the amount of time it takes for the data to be processed
D. All of the mentioned above

15. In Big Data environments, variety of data:

A. Includes multiple formats and types of data


B. Includes structured data in the form of financial transactions,
C. Includes semi-structured data in the form of emails and unstructured data in the form of images
D. All of the mentioned above

16. In Big Data environments, veracity of data refers to:

A. Quality or fidelity of data


B. Large size of the data that cannot be processed
C. Small size of the data that can easily be processed
D. All of the mentioned above

17. Virtualization separates resources and services from the underlying physical delivery environment.

A. True
B. False

18. What is a Virtual Machine (VM)?

A. Virtual representation of a physical computer


B. Virtual representation of a logical computer
C. Virtual System Integration
D. All of the mentioned above

19. In the given Virtual Architecture, name the missing layer,

A. Virtualization layer
B. Storage layer
C. Abstract layer
D. None of the mentioned above

20. MongoDB is a ____ database.

A. SQL
B. DBMS
C. NoSQL
D. RDBMS

21. MongoDB is cross-platform and is written in the _____ language.

A. Python
B. C++
C. R
D. Java

22. Which of the following is/are true of running MongoDB?

A. High availability through built-in replication and failover


B. Management tooling for automation, monitoring, and backup
C. Fully elastic database as a service with built-in best practices
D. All of the mentioned above

23. Big data deals with high-volume, high-velocity and high-variety information assets,

A. True
B. False

24. _____ hypervisor runs directly on the underlying host system. It is also known as "Native Hypervisor"
or "Bare metal hypervisor".

A. TYPE-1 Hypervisor
B. TYPE- 2 Hypervisor
C. Both A and B
D. None of the mentioned above

25. ____ is also known as "Hosted Hypervisor".

A. TYPE-1 Hypervisor
B. TYPE- 2 Hypervisor
C. Both A and B
D. None of the mentioned above

26. In the layered architecture of the Big Data stack, interfaces and feeds refer to:

A. Internally managed data


B. Data feeds from external sources.
C. It provides access to each and every layer & components of big data stack
D. All of the mentioned above

27. _____ is the supporting physical infrastructure that is fundamental to the operation and scalability of
big data architecture.

A. Redundant physical infrastructure


B. Integrated System
C. Integrated Database
D. All of the mentioned above

28. The physical infrastructure of a big data architecture is based on a distributed computing model.

A. True
B. False

29. Security infrastructure means that the data about your constituents needs to be protected to ____.

A. Meet compliance requirements


B. Protect the privacy
C. Both A and B
D. None of the mentioned above
30. Reporting and visualization enable:

A. Processing of data
B. User friendly representation
C. Both A and B
D. None of the mentioned above

31. Data interpretation refers -

A. Process of attaching meaning to the data


B. Convert text into insightful information
C. Effective conclusion
D. All of the mentioned above

32. The significance of metadata is to provide information about a dataset’s characteristics and
structure.

A. True
B. False

33. Data throttling means that the performance of a solution is throttled.

A. True
B. False

34. Which of the following are Benefits of Big Data Processing?

A. Cost Reduction
B. Time Reductions
C. Smarter Business Decisions
D. All of the mentioned above

35. Which of the following is/are not Big Data technologies?

A. Apache Hadoop
B. Apache Spark
C. Apache Kafka
D. Apache Pytarch
1. Information can be converted into knowledge about ___ patterns and future trends.
Ans: Historical

2. Data about data is called ___.


Ans: Metadata

3. Facts, numbers, or text is called ___.


Ans: Data

4. ___ and ___ are the key to emerging Business Intelligence technologies.
Ans: Data warehouse and data mining

5. Data mining is also called ___.


Ans: Knowledge discovery

6. Online Analytical Processing (OLAP) is a technology that is used to create ___ software.
Ans: Decision support

7. OLAP Supports ___ user access and multiple queries.


Ans: Multiple

8. Statistics techniques are incorporated into Data mining methods. (True/False).


Ans: True

9. ___ Optimization techniques are based on the concepts of genetic combination, mutation, and natural selection.
Ans: Genetic algorithms

10. What is Mineset?


Ans: MineSet is software that provides tools for searching, sorting, filtering and drilling down enabling previously
complex data models to be viewed intuitively through real-time 3-D graphical representation.

11. A data warehouse refers to a database that is maintained separately from an organization’s operational databases.
(True/False)
Ans: True

12. A data warehouse is usually constructed by integrating multiple heterogeneous sources. (True/False)
Ans: True

13. ___ system is customer-oriented and is used for transaction and query processing by clerks, clients, and
information technology professionals.
Ans: OLTP

14. A ___ allows data to be modelled and viewed in multiple Dimensions.


Ans: Data cube

15. In ___ schema some dimension tables are normalized, thereby further splitting the data into additional tables.
Ans: Snowflake

16. The ___ data model is commonly used in the design of relational databases.
Ans: Entity-relationship

17. Data warehouses and OLAP tools are based on ___ data model.
Ans: Multidimensional
18. The ___ exposes the information being captured, stored, and managed by operational systems.
Ans: Data source view

19. ___ are the intermediate servers that stand in between a relational back-end server and client front-end
tools.
Ans: Relational OLAP (ROLAP) servers

20. A ___ is a set of views over operational databases.


Ans: Virtual warehouse

21. The ___ software gives the user the opportunity to look at the data from a variety of different dimensions.
Ans: Multidimensional Analysis

22. Which of the following statements defines Business Intelligence?


A. Converting data into knowledge and making it available throughout the organization
B. Analytical software and solutions for gathering, consolidating, analyzing and providing access to information in a
way that is supposed to let the users of an enterprise make better business decisions.
C. Both A & B
Ans: C. Both A & B

23. Based on the overall requirements of business intelligence, the ___ layer is required to extract, cleanse and
transform data into load files for the information warehouse.
Ans: Data integration

24. Data Mining is not a business solution; it is just a technology. (True/False)


Ans: True

25. ___ is a random error or variance in measured variables.


Ans: Noise

26. State true or false


I. BI applications can also help managers to be better informed about actions that a company’s competitors are
taking
II. BI can help companies share selected strategic information with business partners.
III. “BI 2.0” is used to describe the acquisition, provision and analysis of “real-time” data
A. i-T, ii-F, iii-F
B. i-T, ii-T, iii-F
C. i-T, ii-F, iii-T
D. i-T, ii-T, iii-T
Ans: D.

27. ___ routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct
inconsistencies in the data.
Ans: Data cleaning

28. ___ is used to refer to systems and technologies that provide the business with the means for decision-makers
to extract personalized meaningful information about their business and industry.
Ans: Business Intelligence

29. In ___ each value in a bin is replaced by the mean value of the bin.
Ans: Smoothing by bin means
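
Smoothing by bin means can be sketched as follows, assuming equal-depth bins over the sorted values. The price list and the function name `smooth_by_bin_means` are illustrative.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-depth bins, and replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed

# Illustrative sorted prices, three values per bin.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, 3)
print(smoothed_prices)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```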
30. ___ regression involves finding the “best” line to fit two variables so that one variable can be used to predict
the other.
Ans: Linear

31. ___ works to remove the noise from the data that includes techniques like binning, clustering, and regression.
Ans: Smoothing

32. Redundancies can be detected by correlation analysis. (True/False)


Ans: True

33. The ___ technique uses encoding mechanisms to reduce the data set size.
Ans: Data compression

34. In which Strategy of data reduction redundant attributes are detected.


A. Data cube aggregation
B. Numerosity reduction
C. Data compression
D. Dimension reduction
Ans: D. Dimension reduction

35. ___ hierarchies can be used to reduce the data by collecting and replacing low-level concepts by higher-level
concepts.
Ans: Concept

36. The ___ rule can be used to segment numeric data into relatively uniform, “natural” intervals.
Ans: 3-4-5

37. Oracle, SQL/Server, DB2 are examples for ___.


Ans: DBMS

38. Data Base Management System (DBMS) supports query languages. (True/False)
Ans: True

39. The ___ item sets are all sets of items (item sets) whose support is greater than the user-specified minimum
support, σ.
Ans: Frequent set

40. A frequent set is a ___ if it is a frequent set and no superset of this is a frequent set.
Ans: Maximal frequent set

41. ___ techniques are used to detect relationships or associations between specific values of categorical variables
in large data sets.
Ans: Association rule mining

42. A Decision Tree is a ___ model.


Ans: Predictive model

43. Using a decision tree, only categorical variables would be modelled. (True/False).
Ans: False

44. Clustering is an unsupervised learning method (True/false).


Ans: True
45. Neural networks are made up of many ___.
Ans: Artificial neurons

46. For a given transaction database T, a ___ is an expression of the form X => Y, where X and Y are subsets of
the item set A, and X => Y holds with confidence τ if τ% of the transactions in T that support X also support Y.
Ans: Association rule

47. The ___ rule describes associations between quantitative items or attributes.
Ans: Quantitative association

48. The ___ step eliminates the extensions of (k-1)-itemsets that are not found to be frequent from being
considered for counting support.
Ans: Pruning

49. In the first phase of the Partition algorithm, the algorithm logically divides the database into a number of ___.
Ans: non-overlapping partitions.

50. The Apriori algorithm operates in a ___ and ___.


Ans: bottom-up, breadth-first search method.

51. ___ algorithm works like a train running over the data, with stops at intervals M between transactions. When
the train reaches the end of the transaction file it completes one pass.
Ans: DIC Algorithm

52. FP–Tree Growth Algorithm can be implemented in ___ Phases.


Ans: Two

53. FP – tree stands for ___.


Ans: Frequent pattern tree

54. Data mining systems should provide capabilities to mine association rules at multiple levels of abstraction and
traverse easily among different abstraction spaces (True/False).
Ans: True

55. Which one of the following is alternative search strategies for mining multiple-level associations with reduced
support?
a) Level-by-level independent
b) Level-cross filtering by a single item
c) Level-cross filtering by k-itemset
d) All the above
Ans: d) All the above

56. Which of the following is NOT a common binning strategy?


a) Equiwidth binning
b) Equidepth binning
c) Homogeneity-based binning
d) Equilength binning
Ans: d) Equilength binning

57. Association rules that involve two or more dimension or predicates can be referred to as ___.
Ans: Multidimensional association rules.

58. An algorithm that performs a series of “walks” through itemset space is called a ___.
Ans: Random walk algorithm.

59. What are knowledge type constraints?


Ans: They specify the type of knowledge to be mined.
60. A standard measure of within-cluster similarity is ___.
Ans: variance

61. The process of grouping a set of physical or abstract objects into classes of similar objects is called ___.
Ans: Clustering

62. Clustering may also be considered as ___.


Ans: Segmentation

63. Clustering is also called:


a. Segmentation
b. Compression
c. Partitions with similar objects
d. All the above
Ans: d. All the above

64. Clustering is used only in data mining (True/False).


Ans: False

65. Clustering is a form of learning by observation rather than ___.


Ans: By example

66. Weight and height of an individual fall into ___ kind of variables.
Ans: Continuous

67. In the K-means algorithm for partitioning, each cluster is represented by the ___ of objects in the cluster.
Ans: Means

68. K-means clustering requires prior knowledge of the number of clusters as its input. (True/False).
Ans: True

69. One form of unsupervised learning is ___.


Ans: Clustering

70. ___ software provides a set of partitioned clustering algorithms that treat the clustering problem as an
optimization process.
Ans: CLUTO

71. Data classification is a ___ step process.


Ans: Two

72. ___ can be viewed as the construction and use of a model to assess the class of an unlabeled sample, or to
assess the value or value ranges of an attribute that a given sample is likely to have.
Ans: Prediction

73. ___ of data removes or reduces noise (by applying smoothing techniques) and the treatment of missing values.
Ans: Pre-processing

74. ___ method refers to the ability to construct the model efficiently given a large amount of data.
Ans: Scalability

75. What is a decision tree?


Ans: This is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each
branch represents an outcome of the test, and leaf nodes represent classes or class distributions.

76. The basic algorithm for decision tree induction is a ___ algorithm.
Ans: greedy

77. The ___ measure is used to select the test attribute at each node in the tree.
Ans: information gain

78. A user session is a ___ record spanning the entire Web.


Ans: Clickstream

79. ___ are simple text files that are automatically generated every time someone accesses a Website.
Ans: Log files

80. ___ files are frequently used in sequential mining.


Ans: Web log files

81. ___ is used to examine the structure of a particular website and collate and analyze related data.
Ans: Structural mining

82. Which of the following techniques are concerned about user navigation accessing?
a. Web structural mining
b. Web usage mining
c. Web content mining
d. Web data definition mining
Ans: b. Web usage mining

83. Web data is ___.


a. Structured data
b. Un-structured data
c. Only text data
d. Binary data
Ans: b. Un-structured data

84. The ___ to Web mining involves the development of sophisticated artificial intelligence systems.
Ans: Agent-based approach

85. The ___ approaches to Web mining have generally focused on techniques for integrating and organizing the
heterogeneous and semi-structured data on the Web into more structured and high-level collections of resources.
Ans: database

86. Association rules involving multimedia objects can be mined in ___ and ___ databases.
Ans: Image and video

87. In ___ approach, the signature of an image includes color histograms based on the color composition of an
image regardless of its scale or orientation.
Ans: Color histogram-based signature

88. Which of the following are the measures of the text retrieval documents?
a. Precision
b. Recall
c. F-score
d. a,b,c
Ans: d. a,b,c
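
The three retrieval measures in question 88 can be sketched as follows. This is a minimal example that assumes the F-score is the balanced F1 (the harmonic mean of precision and recall); the document IDs and the function name `precision_recall_f1` are hypothetical.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for a set of retrieved documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical query result: 4 documents retrieved, 3 documents truly relevant.
p, r, f = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
print(round(p, 3), round(r, 3), round(f, 3))  # 0.5 0.667 0.571
```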

89. Data stored in most text databases are ___.


Ans: Semi-structured

90. Which of the following is the first step in text retrieval systems?
a. Stemming
b. Term words finding
c. Tokenization
d. Replacing the null data with keywords
Ans: c. Tokenization

91. Which of the following are the stop words?


a. A
b. The
c. of
d. a,b,c
Ans: d. a,b,c
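
Questions 90 and 91 can be illustrated with a small sketch: tokenize the text first, then drop the stop words. The stop-word set below contains only the three words from question 91 (real systems use much longer lists), and the regular-expression tokenizer is a simplifying assumption.

```python
import re

STOP_WORDS = {"a", "the", "of"}  # only the words from question 91; real lists are longer

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Keep only the tokens that are not stop words."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("The structure of a text database")
print(remove_stop_words(tokens))  # ['structure', 'text', 'database']
```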

92. Text databases are also called ___.


Ans: Document databases

93. Insurance and direct mail are two industries that rely on ___ to make profitable business decisions.
Ans: data analysis

94. To aid decision-making, analysts construct ___ models using warehouse data to predict the outcomes of a
variety of decision alternatives.
Ans: predictive

95. A ___ profile is a model that predicts the future purchasing behaviour of an individual customer, given historical
transaction data for both the individual and for the larger population of all of a particular company’s customers.
Ans: predictive

96. Data mining can be used to help predict future patient behaviour and to improve treatment programs (True/False).
Ans: True

98. Data mining in the telecommunication industry helps to understand the business involved, identify
telecommunication patterns (True/False).
Ans: True

99. GDP stands for ___.


Ans: gross domestic product

100. ___ is proving to be a critical link between theory, simulation, and experiment.
Ans: data-intensive computing

101. IDS are based on ___ that are developed by the manual encoding of expert knowledge.
Ans: Handcrafted signatures

102. Choose the correct option.


Data mining can be used to improve ___.
a) Efficiency
b) Quality of data
c) Marketing
d) All the above
Ans: D. All the above.

103. To improve accuracy, data mining programs are used to analyze audit data and extract features that can
distinguish normal activities from intrusions. (True/False)
Ans: True

104. Data mining-based IDSs (especially anomaly detection systems) have higher false-positive rates than traditional
handcrafted signature-based methods. (True/False)
Ans: True
105. ___ is a new class of intrusion detection algorithms that do not rely on labelled data.
Ans: Unsupervised anomaly detection

106. ___ algorithm uses the frequency distribution of each feature’s values to proportionally generate a sufficient
amount of anomalies.
Ans: Distribution Based Artificial Anomaly

107. OLAP typically includes the following kinds of analyses: simple, comparison, trend, ___ and ___.
Ans: Variance and ranking

108. Patient Rule Induction Method (PRIM) and Weighted Item Sets (WIS) are types of ___ technique.
Ans: Association rule

109. ___ tools cannot discover high average regions or find new patterns in data.
Ans: OLAP

110. ___ method is useful for finding patterns or associations between attributes.
Ans: WIS
KDK COLLEGE OF ENGINEERING, NAGPUR
8TH SEM
D.W.M.
MCQ QUESTION BANK

1. An itemset whose support is greater than or equal to a minimum support threshold is.......................
Option A. itemset
Option B. Frequent itemset
Option C. Threshold values
Option D. None of these
Answer: B

2. The process that analyzes customer buying habits by finding associations between the different items that
customers place in their “shopping baskets”
Option A. frequent Item set mining
Option B. Market Basket Analysis
Option C. FP growth
Option D. Predictive analysis
Answer : B

3. Which one manages both current and historic transactions?


A) OLTP B) OLAP C) Spread sheet D) XML
Answer: B

4. When is an association rule considered interesting?


Option A. If it only satisfies min_support
Option B. If it only satisfies min_confidence
Option C. If it satisfies both min_support and min_confidence
Option D. There are other measures to check so
Answer : C

5. Frequent itemset mining leads to the discovery of associations and correlations among items in large
transactional or relational data sets.
Option A. True
Option B. False
Answer : A

6. Sequential pattern mining has focused extensively on mining _____________.


Option A. Symbolic sequences
Option B. Symbolic Name
Option C. Symbolic Pattern
Option D. Symbolic form
Answer : A

7. The temporal data which are free of any temporal reference


Option A. Static
Option B. Sequences
Option C. Timestamped
Option D. None of these
Answer : A

8. The full form of OLAP is

A) Online Analytical Processing B) Online Advanced Processing
C) Online Advanced Preparation D) Online Analytical Performance

Answer : A

9. The full form of KDD is ..................


A) Knowledge Database B) Knowledge Discovery in Databases
C) Knowledge Data House D) Knowledge Data Definition
Answer: B

10. A partitioning method available for improving the efficiency of the algorithm requires just _____ database
scans to mine the frequent itemset.
Option A. one
Option B. two
Option C. three
Option D. none of the above
Answer : B

11. Apriori candidate generate-and-test method significantly reduces the size of candidate sets, leading to good
performance gain. However, it can suffer from some nontrivial costs also.
Option A. True
Option B. False
Answer : A

12. Text mining usually requires structuring the _________.


Option A. Input text
Option B. Output text
Option C. No text
Option D. None of these
Answer : A

13. Listed below are the three steps that are followed to deploy a Big Data Solution except
Option A. Data Ingestion
Option B. Data Processing
Option C. Data dissemination
Option D. Data Storage
Answer : C

14. The benefits of Big Data Processing is/are


Option A. Businesses can utilize outside intelligence while taking decisions
Option B. Improved customer service
Option C. Better operational efficiency
Option D. All of the above
Answer : D

15. __________ is an interdisciplinary field that draws on information retrieval, data mining, machine learning,
statistics, and computational linguistics.
Option A. Data mining
Option B. Web mining
Option C. Text mining
Option D. None of these
Answer : C

16. The techniques that can be used to improve the efficiency of Apriori algorithm is/are
Option A. hash based techniques
Option B. transaction reduction
Option C. Partitioning
Option D. All of these
Answer: D

17. Web mining is the application of _____________ to discover patterns, structures, and knowledge from
the Web.
Option A. data mining classification
Option B. data mining application
Option C. data mining features
Option D. data mining techniques
Answer : D

18. A typical example of frequent itemset mining


Option A. Social Network Analysis
Option B. Market Basket Analysis
Option C. Outlier detection
Option D. Intrusion detection
Answer: B

19. An .................. system is market-oriented and is used for data analysis by knowledge workers, including
managers, executives, and analysts.

A) OLAP B) OLTP C) Both of the above D) None of the above

Answer: A

20. Frequency of occurrence of an itemset is called as _____


Option A. Support
Option B. Confidence
Option C. Support Count
Option D. Rules
Answer: C

21. Frequent pattern mining can be classified in various ways, based on the following criteria :
Option A. Based on the completeness of patterns to be mined
Option B. Based on the levels of abstraction involved in the rule set
Option C. Based on the number of data dimensions involved in the rule
Option D. All of these
Answer : D

22. ________ method(s) transforms the problem of finding long frequent patterns to searching for shorter ones
recursively and then concatenating the suffix.
Option A. The FP-growth
Option B. Apriori
Option C. Vertical data format
Option D. All of these
Answer : A

23. Web content mining, web structure mining, and web usage mining are the main areas of _________.
Option A. Text mining
Option B. web mining
Option C. Both a and b
Option D. None of these
Answer : B

24. The form of data having an associated time interval during which it is valid is known as
Option A. Temporal data
Option B. Snapshot data
Option C. Point in time data
Option D. None of these
Answer : A

25. The main purpose for structure mining is to extract previously unknown relationships between
Option A. Web pages
Option B. Web hyperlinks
Option C. Web data
Option D. Web contents
Answer : A

26. ........................ is a good alternative to the star schema.

A) Star schema B) Snowflake schema C) Fact constellation D) Star-snowflake schema


Answer:C

27. Fact tables are………………

A) Completely denormalized B) Partially denormalized

C) Completely normalized D) Partially normalized

Answer: C

28. What is incorrect about FP growth algorithms?


Option A. It mines frequent itemsets without candidate generation
Option B. There are chances that FP trees may not fit in the memory
Option C. FP trees are very expensive to build
Option D. It expands the original database to build FP trees
Answer : D

29. Which of the following are interestingness measure for association rules?
Option A. recall
Option B. lift
Option C. accuracy
Option D. compactness
Answer : B
30. Web mining can be organized into _______ main areas.
Option A. One
Option B. Two
Option C. Three
Option D. Four
Answer : C

31. The simple text files that are automatically generated every time someone accesses one Website are
Option A. Multimedia files
Option B. Text files
Option C. Log Files
Option D. None of these
Answer : C

32. _________discovers implicit and useful knowledge from large data sets using data and/or knowledge
visualization techniques.
Option A. Text Data mining
Option B. Web mining
Option C. Visual data mining
Option D. Spatial data mining
Answer : C

33. What is Apriori property?


Option A. All nonempty subsets of a frequent itemset must also be frequent
Option B. All empty subsets of a frequent itemset must also be frequent
Option C. All nonempty subsets of a frequent itemset must not be frequent
Option D. All of these
Answer: A
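
The Apriori property is what lets the algorithm prune candidates: a candidate k-itemset can be discarded as soon as one of its (k-1)-subsets is not frequent. The sketch below shows only that pruning check; the item names and the helper `has_infrequent_subset` are hypothetical.

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """Return True if any (k-1)-subset of the candidate is not in frequent_prev."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_prev
               for sub in combinations(candidate, k - 1))

# Hypothetical frequent 2-itemsets found in an earlier pass.
frequent_2 = {
    frozenset(pair)
    for pair in [("milk", "bread"), ("milk", "eggs"), ("bread", "eggs"), ("bread", "jam")]
}

# Every 2-subset is frequent, so this candidate is kept for counting.
print(has_infrequent_subset(frozenset({"milk", "bread", "eggs"}), frequent_2))  # False
# {milk, jam} is not frequent, so this candidate is pruned without a database scan.
print(has_infrequent_subset(frozenset({"milk", "bread", "jam"}), frequent_2))  # True
```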

34. __________ are data that relate to both space and time.
Option A. Spatial data
Option B. Spatiotemporal data
Option C. Temporal data
Option D. None of these
Answer : B

35. Web structure mining is the process of discovering ____ information from the web.
Option A. Semi structured
Option B. Unstructured
Option C. Structured
Option D. None of these
Answer : C

36. The examination of large amounts of data to see what patterns or other useful information can be found is
known as
Option A. Data examination
Option B. Information analysis
Option C. Big data analytics
Option D. Data analysis
Answer : C

37. The new source of big data that will trigger a big data revolution in the years to come is
Option A. Business transactions
Option B. Social Media
Option C. Transactional data and sensor data
Option D. RDBMS
Answer : C

38. Which is general-purpose computing model and runtime system for distributed data analytics?
Option A. MapReduce
Option B. Drill
Option C. Oozie
Option D. None of these
Answer : A

39. ___________ discovers patterns and knowledge from spatial data.


Option A. Spatial data mining
Option B. Text data mining
Option C. web data mining
Option D. Audio data mining
Answer : A

40. Apache Kafka is an open source platform that was created by


Option A. LinkedIn
Option B. Facebook
Option C. Google
Option D. IBM
Answer : A

41. Which of the following is not a component of a data warehouse?


A) Metadata B) Current detail data C) Lightly summarized data D) Component Key

Answer:D

42. Star Schema is composed of …………….. fact table

A) One B) Two C) Three D) Four

Answer:A

For questions 43-46, use the following data set: 2, 4, 6, 8, 10, 12

43. Determine the range of the data.


A) 2 B) 8 C) 10 D) 12
Answer:C

44. Determine the median of the data.

A) 6 B) 7 C) 8 D) 9
Answer:B

45. Determine the midrange of the data.

A) 4 B) 7 C) 8 D) 10
Answer:B

46. Determine the mean deviation of the data.

A) 3 B) 3.74 C) 6 D)18

Answer:A
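
For the data set used in questions 43-46, the four statistics can be checked with a few lines of Python. This is a sketch under the usual textbook definitions (midrange as the average of the smallest and largest values, mean deviation as the average absolute distance from the mean); the variable names are chosen only for this example.

```python
from statistics import mean, median

data = [2, 4, 6, 8, 10, 12]

data_range = max(data) - min(data)        # largest value minus smallest value
data_median = median(data)                # average of the two middle values
midrange = (max(data) + min(data)) / 2    # average of the extremes
center = mean(data)
mean_deviation = sum(abs(x - center) for x in data) / len(data)

print(data_range, data_median, midrange, mean_deviation)  # 10 7.0 7.0 3.0
```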

47. The type of relationship in star schema is ...............


A) many to many B) one to one C) one to many D) many to one

Answer:C

48. computer => antivirus software [support = 2%; confidence = 60%]. A support of 2% in the above association rule
means that
Option A. 2% of all the transactions under analysis show that computer and antivirus software are purchased
together.
Option B. 2% of all the transactions under analysis show that computer or antivirus software may be purchased
together.
Option C. 2% of the customers who purchased a computer also bought the software
Option D. 2% of the customers purchased a computer
Answer : A

49. ____________ integrates data mining and data visualization to discover implicit and useful knowledge from
large data sets.
Option A. Audio data mining
Option B. Video data mining
Option C. Text data mining
Option D. Visual data mining
Answer : D

50. The information gathered through web mining is evaluated by


Option A. Clustering
Option B. Classification
Option C. Association
Option D. All of above
Answer : D
51. Sequential pattern mining searches for frequent substructures in a structured data set.
Option A. True
Option B. False
Answer : B

52. ____________ is the process of extracting useful information (e.g., user click streams) from server logs.
Option A. Audio mining
Option B. Data mining
Option C. Web usage mining
Option D. Text mining
Answer : C

53. The data mining algorithm used by Google Search to rank web pages in their search engine results, is
Option A. K-means Algorithm
Option B. PageRank Algorithm
Option C. Naive Bayes Algorithm
Option D. Adaboost Algorithm
Answer : B
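
To make the idea behind PageRank concrete, here is a minimal power-iteration sketch on a three-page link graph. It is a simplified textbook version, not Google's production algorithm; the damping factor of 0.85, the iteration count, and the toy graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: links maps each page to the pages it points to."""
    pages = list(links)
    rank = {p: 1 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            targets = outgoing or pages  # a page with no links spreads its rank evenly
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

# Hypothetical graph: A links to B and C, B links to C, C links back to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print(max(ranks, key=ranks.get))  # C
```

Page C ranks highest because it receives links from both A and B, while B receives only half of A's rank.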

54. Facebook tackles big data with________________________based on Hadoop


Option A. Project Prism
Option B. Prism
Option C. Project Big
Option D. Project Data
Answer : A

55. Which of the following is not a feature of Hadoop?


Option A. Suitable for Big data analysis
Option B. scalability
Option C. Robust
Option D. Fault Tolerance
Answer : C

56. Which of the following are incorrect Big Data Technologies?


Option A. Apache Hadoop
Option B. Apache Spark
Option C. Apache Kafka
Option D. Apache Pytarch
Answer : D

57. Which of the following platforms does Hadoop run on?


Option A. Bare metal
Option B. Debian
Option C. Cross-platform
Option D. Unix-Like
Answer : C
58. Opinion extraction from online sources is a ___________________ task
Option A. Web content mining
Option B. Web Usage mining
Option C. Web structure mining
Option D. None of these
Answer : A

59. Both the Apriori and FP-growth methods mine frequent patterns from a set of transactions in TID-itemset
format (that is, {TID : itemset}), where TID is a transaction-id and itemset is the set of items bought in
transaction TID.
Option A. True
Option B. False
Answer : A

60. .......................... is a subject-oriented, integrated, time-variant, nonvolatile collection of data in
support of management decisions.

A) Data Mining B) Data Warehousing C) Document Mining D) Text Mining

Answer: B

61. Expansion for DSS in DW is……….

A) Decisions Support System B) Decision Single System

C) Data Suitable System D) Data Support System

Answer: A

62. What does FP growth algorithm do?


Option A. It mines all frequent patterns by constructing a FP tree
Option B. It mines all frequent patterns through pruning rules with higher support
Option C. It mines all frequent patterns through pruning rules with lesser support
Option D. All of these
Answer: A

63. Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf)
are called as ______ rule
Option A. strong
Option B. weak
Option C. primary
Option D. none of the above
Answer : A

64. ___________uses audio signals to indicate the patterns of data or the features of data mining results.
Option A. Audio data mining
Option B. Visual data mining
Option C. Web mining
Option D. Text Data mining
Answer : A
65. A user session is a ___ record spanning the entire Web.
Option A. Log
Option B. Clickstream
Option C. Web log
Option D. None of these
Answer : B

66. The MapReduce algorithm contains two important tasks namely


Option A. mapped, reduce
Option B. mapping, reduction
Option C. map, reduction
Option D. map , reduce
Answer : D

67. Which of the following fields come under the umbrella of big data?
Option A. Black Box Data
Option B. Power Grid Data
Option C. Search Engine Data
Option D. All of the above
Answer : D

68. Which of the following is not a kind of data warehouse application?


A) Information processing B) Analytical processing
C) Data mining D) Transaction processing

Answer: D

69. In ______ , the number of transactions scanned in future iterations are reduced.
Option A. Transaction reduction
Option B. Partitioning
Option C. Sampling
Option D. None of these
Answer : A

70. __________ is a frequent subsequence existing in a single sequence or a set of sequences.


Option A. A sequential rule
Option B. A sequential pattern
Option C. A frequent pattern
Option D. None of these
Answer : B

71. Who has the world’s largest Hadoop cluster?


Option A. Apple
Option B. Datamatics
Option C. Facebook
Option D. None of the mentioned
Answer : C
72. Who was the developer of Hadoop?
Option A. Apache Software Foundation
Option B. Hadoop Software Foundation
Option C. Sun Microsystems
Option D. Bell Labs
Answer : A

73. ……………… is a subject oriented, integrated, time-variant, non volatile collection of data in support of
management decisions.

A)Data Mining B) Data Warehousing C) Web Mining D) text Mining

Answer: B

74. Adding candidate itemsets at different points during a scan is known as _________
Option A. Dynamic itemset counting
Option B. Partitioning
Option C. Dynamic itemset partitioning
Option D. none of the above
Answer : A

75. __________ include text categorization, text clustering, concept /entity extraction, production of granular
taxonomies, sentiment analysis, document summarization, and entity-relation modeling.
Option A. Data mining tasks
Option B. Text mining tasks
Option C. Web mining tasks
Option D. Video mining tasks
Answer : B

76. The main components of Big Data is/are


Option A. MapReduce
Option B. HDFS
Option C. YARN
Option D. All of the above
Answer : D

77. ..................... is an essential process where intelligent methods are applied to extract data
patterns.

A) Data warehousing B) Data mining C) Text mining D) Data selection

Answer: B

78. Sequential pattern mining searches for frequent subsequences in a sequence data set, where a sequence
records an ordering of events.
Option A. True
Option B. False
Answer : A
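
The notion of a frequent subsequence can be sketched as follows: a pattern is supported by a sequence when its events occur in the same order, with gaps allowed. The event names and the helpers `is_subsequence` and `sequence_support` are hypothetical, and this check covers only simple sequences of single events.

```python
def is_subsequence(pattern, sequence):
    """True if the pattern's events appear in sequence in the same order (gaps allowed)."""
    remaining = iter(sequence)
    # Each membership test consumes the iterator, so order is enforced.
    return all(event in remaining for event in pattern)

def sequence_support(pattern, sequences):
    """Number of sequences in the data set that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in sequences)

# Hypothetical clickstream sequences, one per user session.
sequences = [
    ["login", "search", "buy"],
    ["login", "buy"],
    ["search", "login"],
]

print(sequence_support(["login", "buy"], sequences))     # 2
print(sequence_support(["search", "login"], sequences))  # 1
```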

79. The set of closed graphs where a graph g is closed if there exists no proper ___________ g’ that carries
the same support count as g.
Option A. sub graph
Option B. no graph
Option C. super graph
Option D. none of these
Answer : C

80. 90% of the world's total data has been created just within the past two years. This statement is true or false?
Option A. True
Option B. False
Answer : A

81. According to analysts, for what can traditional IT systems provide a foundation when they’re integrated with
big data technologies like Hadoop?
Option A. Big data management and data mining
Option B. Data warehousing and business intelligence
Option C. Management of Hadoop clusters
Option D. Collecting and storing unstructured data
Answer : A

82. ------------------ techniques are concerned about user navigation accessing.


Option A. Web structural mining
Option B. Web usage mining
Option C. Web content mining
Option D. Web data definition mining
Answer : B

83. confidence(A => B) = P(B | A) = ?


Option A. support(A U B) / support(A)
Option B. support(A -> B) / support(A)
Option C. support(A) / confidence (A)
Option D. none of the above
Answer : A
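
The formula in question 83 can be checked on a small example. In the sketch below, support(A ∪ B) means the fraction of transactions that contain every item of both A and B. The transactions are made up for illustration, and `confidence` assumes the antecedent occurs at least once (otherwise it would divide by zero).

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """confidence(A => B) = support(A U B) / support(A); assumes support(A) > 0."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical transaction database.
transactions = [
    {"computer", "antivirus"},
    {"computer", "antivirus", "printer"},
    {"computer"},
    {"printer"},
    {"computer", "printer"},
]

print(support(transactions, {"computer", "antivirus"}))       # 0.4
print(confidence(transactions, {"computer"}, {"antivirus"}))  # 0.5
```

A rule is called strong when both values meet the chosen min_sup and min_conf thresholds.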

84. The data Warehouse is……….

A) Read Only B) Write Only C) Read Write only D) None

Answer: A

85. Web data is ___


Option A. Structured data
Option B. Un-structured data
Option C. Only text data
Option D. Binary data
Answer : B

86. Concerning the Forms of Big Data, which one of these is odd?
Option A. Processed
Option B. Semi-structured
Option C. Structured
Option D. Unstructured
Answer : A

87. All of the following accurately describe Hadoop except


Option A. open source
Option B. real time
Option C. java based
Option D. distributed computing approach
Answer : B

88. Apriori is a seminal algorithm proposed in which of the following year?


Option A. 1990
Option B. 1992
Option C. 1994
Option D. 1998
Answer : C

89. The category of temporal data in which temporal information is explicit is


Option A. Static
Option B. Sequences
Option C. Timestamped
Option D. None of these
Answer : C

90. What is the relation between candidate and frequent itemsets?


Option A. A frequent itemset must be a candidate itemset
Option B. A candidate itemset is always a frequent itemset
Option C. No relation between the two
Option D. None of these
Answer : A

91. In FP-Growth we are finding frequent itemsets without candidate generation.


Option A. True
Option B. False
Answer : A

92. The different features of Big Data Analytics is/are?


Option A. Open-Source
Option B. Scalability
Option C. Data Recovery
Option D. All the above
Answer : D

93. In how many stages does a MapReduce program execute?


Option A. 2
Option B. 3
Option C. 4
Option D. 5
Answer : B

94. What are two measures of rules of interestingness ?


Option A. support and confidence
Option B. maximum and minimum
Option C. frequent and closed
Option D. none of the above
Answer : A

95. ________ often also uses WordNet, the Semantic Web, Wikipedia, and other information sources to enhance
the understanding and mining of text data.
Option A. Text mining
Option B. Data mining
Option C. Web mining
Option D. None of these
Answer : A

96. Data in________________________bytes sizes is called big data.


Option A. Tera
Option B. Giga
Option C. Peta
Option D. Meta
Answer : C

97. Which task takes the output from a map as an input and combines those data tuples into smaller set of tuples?
Option A. Map
Option B. Reduce
Option C. Node
Option D. None of these
Answer : B

98. What makes Big Data analysis difficult to optimize?


Option A. Big Data is not difficult to optimize
Option B. Both data and cost effective ways to mine data to make business sense out of it
Option C. The technology to mine data
Option D. All of the above
Answer : B

99. ____ is the ratio of the measure of an item when compared with that of its parent, its child, or its sibling in
frequent pattern analysis.
Option A. gradient
Option B. association
Option C. support
Option D. None of these
Answer : A
100. ---------------takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples
Option A. Map
Option B. Reduce
Option C. Node
Option D. none
Answer : A
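
The map and reduce tasks described in questions 97 and 100 can be imitated in plain Python with the classic word-count example. This is a single-process sketch of the programming model, not Hadoop's API; the function names and input lines are assumptions.

```python
from collections import defaultdict

def map_phase(line):
    """Map: break one input line into (word, 1) tuples."""
    return [(word, 1) for word in line.lower().split()]

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between the two tasks."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single, smaller result."""
    return key, sum(values)

lines = ["big data big ideas", "data mining"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'ideas': 1, 'mining': 1}
```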

101. Which of the following is not a frequent pattern mining algorithm?


Option A. Apriori
Option B. FP growth
Option C. Decision trees
Option D. Eclat
Answer: C

102. ___________ analyzes web content such as text, multimedia data, and structured data (within web pages
or linked across web pages).
Option A. Web content mining
Option B. web mining
Option C. Web usage mining
Option D. Web structure mining
Answer : A

103. The feature of big data that refers to the quality of the stored data is ______
Option A. Variety
Option B. Volume
Option C. Variability
Option D. Veracity
Answer : D

104. Data can also be presented in item-TID_set format (that is, {item : TID_set}), where item is an item name,
and TID_set is the set of transaction identifiers containing the item. This format is known as __________
Option A. vertical data format.
Option B. horizontal data format
Option C. Parallel data format
Option D. none of the above
Answer : A
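
The difference between the horizontal {TID : itemset} layout and the vertical {item : TID_set} layout is easy to see in code. The sketch below converts one into the other and then obtains a support count by intersecting TID sets, which is the idea used by Eclat-style algorithms. The transactions and the helper `to_vertical` are hypothetical.

```python
def to_vertical(transactions):
    """Convert {TID: itemset} (horizontal) into {item: TID_set} (vertical)."""
    vertical = {}
    for tid, items in transactions.items():
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical

# Hypothetical horizontal database.
horizontal = {
    "T1": {"bread", "milk"},
    "T2": {"bread", "jam"},
    "T3": {"bread", "milk", "jam"},
}

vertical = to_vertical(horizontal)
# The support count of {bread, milk} is the size of the TID-set intersection.
support_count = len(vertical["bread"] & vertical["milk"])
print(support_count)  # 2
```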

105. In how many forms can Big Data be found?


Option A. 2
Option B. 3
Option C. 4
Option D. 5
Answer : B
106. Although the Hadoop framework is implemented in Java, MapReduce applications need not be written in
Option A. c
Option B. c#
Option C. java
Option D. None
Answer : C

107. Transaction data of the bank is


Option A. structured data
Option B. unstructured data
Option C. Both
Option D. None
Answer : A

108. The data is stored, retrieved and updated in ....................

A) OLAP B) OLTP C) SMTP D) FTP

Answer: B

109. ____________ is the process of using graph and network mining theory and methods to analyze the nodes
and connection structures on the Web.
Option A. web mining
Option B. Web structure mining
Option C. Web usage mining
Option D. Text mining
Answer : B

110. Big Data analysis does the following except


Option A. Collects data
Option B. Spreads data
Option C. Organizes data
Option D. Analyzes Data
Answer : B

111. DSS in data warehouse stands for _____________


A) Decision Single system
B) Decision support system
C) Data support system
D) Data Storable system
Answer: B

112. Identify the main characteristic of OLTP.


A) Provides advanced database support
B) Does not support client/server architecture
C) Uses single dimension data analysis technique
D) None
Answer: A

113. Why is the snowflake schema applied?


A) Transformation

B) Aggregation

C) Normalization

D) Generalization

Answer:C

114. K-means clustering consists of a number of iterations and is not deterministic.


(A). True
(B). False

Answer: A

115. Which is needed by K-means clustering?


(A). defined distance metric
(B). number of clusters
(C). initial guess as to cluster centroids
(D). all of these

Answer: D
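
The three inputs named in question 115 map directly onto a basic K-means loop: a distance metric (Euclidean here), the number of clusters (the length of the centroid list), and an initial guess for the centroids. The sketch below is a plain-Python illustration with made-up points. It uses fixed starting centroids so the run is repeatable; with random starting centroids the result can vary between runs.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Basic K-means: assign points to the nearest centroid, then recompute the means."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Each cluster is represented by the mean of its points (empty clusters keep their centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: assignments will not change any more
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D points and an initial guess of two centroids (k = 2).
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)  # [(1.0, 1.5), (8.5, 8.0)]
```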

116. Which clustering technique requires a merging approach?


(A). Partitional
(B). Hierarchical
(C). Naive Bayes
(D). None of the mentioned
Answer: B

117. The problem of finding hidden structure in unlabeled data is called

A) Supervised learning

B) Unsupervised learning

C) Reinforcement learning

Answer: B

118. Task of inferring a model from labeled training data is called

A) Unsupervised learning

B) Supervised learning

C) Reinforcement learning
Answer: B

119. A telecommunication company wants to segment its customers into distinct groups in order to send
appropriate subscription offers. This is an example of

A. Supervised learning

B. Data extraction

C. Serration

D. Unsupervised learning

Answer: D

120. Self-organizing maps are an example of


A. Unsupervised learning

B. Supervised learning

C. Reinforcement learning

D. Missing data imputation

Answer: A

121. You are given data about seismic activity in Japan, and you want to predict the magnitude of the next
earthquake. This is an example of

A. Supervised learning

B. Unsupervised learning

C. Serration

D. Dimensionality reduction

Answer: A

122. Assume you want to perform supervised learning to predict the number of newborns from the size of the stork
population (http://www.brixtonhealth.com/storksBabies.pdf). This is an example of

A. Classification

B. Regression

C. Clustering

D. Structural equation modeling

Answer: B
