
1. What are the uses of statistics in data mining?

Statistics is used to
• Estimate the complexity of a data mining problem.
• Suggest which data mining techniques are most likely to be successful, and
• Identify data fields that contain the most “surface information”.
2. What is the main goal of statistics?
The basic goal of statistics is to extend knowledge about a subset of a collection to
the entire collection.
3. What are the factors to be considered while selecting the sample in statistics?
The sample should be
• Large enough to be representative of the population.
• Small enough to be manageable.
• Accessible to the sampler.
• Free of bias.
4. Name some advanced database systems?
• Object-oriented databases.
• Object-relational databases.
5. Name some specific application oriented databases?
• Spatial databases.
• Time-series databases.
• Text databases
• Multimedia databases.
6. Define Relational databases?
Relational databases are a collection of tables, each of which is assigned a unique
name. Each table consists of a set of attributes (columns or fields) and usually stores a
large set of tuples (rows or records). Each tuple in a relational table represents an object
identified by a unique key and described by a set of attribute values.
7. Define Transactional databases?
Transactional databases consist of a file where each record represents a
transaction. A transaction typically includes a unique transaction identity number
(trans_ID), and a list of the items making up the transaction.

8. Define Spatial Databases?
Spatial databases contain spatial-related information. Such databases include
geographic (map) databases, VLSI chip design databases, and medical and satellite image
databases. Spatial data may be represented in raster format, consisting of n-dimensional
bit maps or pixel maps.
9. What is Temporal Database?
A temporal database stores time-related data. It usually stores relational data that
include time-related attributes. These attributes may involve several time stamps, each
having different semantics.
10. What is a Time-Series database?
A Time-Series database stores sequences of values that change with time, such as
data collected regarding the stock exchange.
11. What is Legacy database?
A Legacy database is a group of heterogeneous databases that combines different
kinds of data systems, such as relational or object-oriented databases, hierarchical
databases, network databases, spreadsheets, multimedia databases, or file systems.
12. What are the steps in the data mining process?
• Data Cleaning
• Data Integration
• Data Selection
• Data Transformation
• Data Mining
• Pattern Evaluation
• Knowledge Representation
13. Define data cleaning?
Data Cleaning means removing the inconsistent data or noise and collecting
necessary information.
14. Define data mining?
Data mining is the process of extracting or mining knowledge from huge amounts of
data.
15. Define pattern evaluation?
Pattern evaluation is used to identify the truly interesting patterns representing
knowledge, based on some interestingness measures.

16. Define Knowledge representation?
Knowledge representation techniques are used to present the mined knowledge to
the user.
17. Define class/ concept description?
Data can be associated with classes or concepts. It can be useful to describe
individual classes and concepts in summarized, concise and yet precise terms. Such
description of a class or a concept is called class/ concept descriptions.
18. What is Data Characterization?
Data Characterization is a summarization of the general characteristics or features
of a target class of data. The data corresponding to the user-specified class are typically
collected by a database query.
19. What is data discrimination?
Data discrimination is a comparison of the general features of target class data
objects with the general features of objects from one or a set of contrasting classes.
20. What is Association analysis?
Association analysis is the discovery of association rules showing attribute-value
conditions that occur frequently together in a given set of data.
21. Define association rules?
Association rules are of the form X ⇒ Y, that is, “A1 ∧ … ∧ Am ⇒ B1 ∧ … ∧ Bn”,
where Ai (for i ∈ {1, …, m}) and Bj (for j ∈ {1, …, n}) are attribute-value pairs.
The association rule X ⇒ Y is interpreted as “database tuples that satisfy the
conditions in X are also likely to satisfy the conditions in Y”.
22. List out the major components of a typical data mining system?
The major components of a typical data mining system architecture are
• Database, data warehouse, World Wide Web, or other information repositories
• Database or data warehouse server
• Knowledge base
• Data mining engine
• Pattern evaluation module
• User interface

23. How does a data warehouse differ from a database? How are they similar?
Difference:
A database system, or DBMS, consists of a collection of interrelated data, known as
a database, and a set of software programs to manage and access the data.
A data warehouse is a repository of information collected from multiple sources, stored
under a unified schema.
Similarity:
Queries can be applied to both a database and a data warehouse. A data warehouse is
often modeled by a multidimensional database structure.
24. What is concept description of hierarchies?
Concept description generates descriptions for the characterization and comparison
of the data. It is sometimes called class description when the concept to be described
refers to a class of objects.
25. What is constraint based association mining?
Specification of constraints or expectations that confine the search space of the
mining process is called constraint-based association mining. The
constraints can be,
• Knowledge type constraints
• Data constraints
• Dimension/level constraints
• Interestingness constraints
• Rule constraints.
26. What is linear regression?
Linear regression involves finding the best line to fit two attributes, so that one
attribute can be used to predict the other.
Example:
A random variable y (the response variable) can be modeled as a linear function of
another random variable x (the predictor variable), with the equation
y = wx + b.
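The least-squares estimates of w and b can be computed directly from the data. A minimal sketch in pure Python, using illustrative points that happen to lie exactly on y = 2x + 1:

```python
def fit_line(xs, ys):
    """Return (w, b) minimizing squared error for y = w*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope w = covariance(x, y) / variance(x)
    w = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - w * mean_x          # the fitted line passes through the means
    return w, b

w, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

Here `fit_line` recovers w = 2 and b = 1, so either attribute can then be used to predict the other.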
27. What are the two data structures in cluster analysis?
Two data structures in cluster analysis are,
• Data matrix (object by variable structure)
• Dissimilarity matrix (object by object structure)
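The dissimilarity matrix can be derived from the data matrix once a distance measure is chosen. A minimal sketch using Euclidean distance on illustrative data:

```python
import math

# object-by-variable data matrix: 3 objects measured on 2 variables
data_matrix = [
    [1.0, 2.0],
    [4.0, 6.0],
    [1.0, 2.0],
]

def dissimilarity_matrix(data):
    """Object-by-object Euclidean distances; symmetric with a zero diagonal."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(data[i], data[j])))
            d[i][j] = d[j][i] = dist
    return d

d = dissimilarity_matrix(data_matrix)
```

Identical objects (rows 0 and 2) get dissimilarity 0, while rows 0 and 1 are at Euclidean distance 5.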

28. How are concept hierarchies useful in OLAP?
In the multidimensional model, data are organized onto multiple dimensions, and
each dimension contains multiple levels of abstraction defined by concept hierarchies.
This organization provides users with a flexibility to view data from different
perspectives. OLAP provides a user-friendly environment for interactive data analysis.
29. What do you mean by virtual warehouse?
A virtual warehouse is a set of views over operational databases. For effective
query processing only some of the possible summary views may be materialized. A
virtual warehouse is easy to build but requires excess capacity on operational database
servers.
30. List out five data mining tools
• IBM’s Intelligent Miner
• DataMind Corporation’s DataMind
• Pilot’s Discovery Server
• Tools from Business Objects and SAS Institute
• End-user tools.
31. What is KDD?
Knowledge discovery is a process and consists of an iterative sequence of the
following steps.
• Data cleaning
• Data Integration
• Data Selection
• Data transformation
• Data Mining
• Pattern evaluation
• Knowledge presentation
32. List out the classification of data mining system?
• Classification according to the kinds of databases mined.
• Classification according to the kinds of knowledge mined.
• Classification according to the techniques utilized.
• Classification according to the application adapted.

33. What is concept description?
Concept description is a form of data generalization. A concept typically refers
to a collection of data such as frequent-buyers, graduate-students etc. Concept description
generates descriptions for the characterization and comparison of the data.
34. What is association rule mining?
It consists of first finding frequent itemsets (sets of items, such as A and B, that
satisfy a minimum support threshold, or percentage of the task-relevant tuples), from
which strong association rules of the form A => B are generated. The rules also satisfy a
minimum confidence threshold. Association rules can be further analyzed to uncover
correlation rules.
35. What is tree pruning?
Tree pruning is used to remove anomalies in the training data due to noise or
outliers. It addresses the problem of overfitting the data. There are two approaches to tree pruning.
Pre-pruning -- the tree is pruned by halting its construction early.
Post-pruning -- removes subtrees from a fully grown tree.
36. What is cluster analysis?
The process of grouping a set of physical or abstract object into classes of similar
objects is called clustering.
A cluster is a collection of data objects that are similar to one another within the
same cluster and are dissimilar to the objects in other clusters.
37. What is concept hierarchy?
A concept hierarchy defines a sequence of mappings from a set of low-level
concepts to higher-level, more general concepts. Concept hierarchies are often implicit within
the database schema. A concept hierarchy that is a total or partial order among attributes in a
database schema is called a schema hierarchy.
38. What is Aggregation and metadata?
Aggregation is where summary or aggregation operations are applied to the
data. For example, daily sales data may be aggregated to compute monthly and
annual total amounts. This is used in constructing a data cube for analysis of data at
multiple granularities.
Metadata are data about data which define warehouse objects. Metadata are
created for the data names and definition of the given warehouse.

39. What is star schema and snowflake schema?
Star schema is the most common modeling paradigm in which the data
warehouse contains
• A large central table (fact table) containing the bulk of data with no redundancy.
• A set of smaller attendant tables, one for each dimension.
Snowflake schema is a variant of the star schema model where some dimension
tables are normalized; thereby further splitting the data into additional tables.
40. Write short notes on spatial clustering?
Spatial data clustering identifies clusters or densely populated regions, according
to some distance measurements in a large, multidimensional data set.
41. State the types of Linear Model and state its use?
Generalized linear models represent the theoretical foundation on which linear
regression can be applied to the modeling of categorical response variables. The types of
generalized linear models are
• Logistic regression
• Poisson regression
42. What are the goals of Time series analysis?
• Finding patterns in the data
• Predicting future values.
43. What is smoothing?
Smoothing is an approach that is used to remove nonsystematic behaviors found
in a time series. It can be used to detect trends in time series.
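A simple moving average is one common smoothing technique. A minimal sketch on illustrative values, where an isolated spike is damped so the underlying level is easier to see:

```python
def moving_average(series, window):
    """Smooth `series` with a trailing moving average of width `window`."""
    return [sum(series[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(series))]

# the nonsystematic spike at 30 is damped in the smoothed series
smoothed = moving_average([10, 12, 9, 11, 30, 10, 12], window=3)
```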
44. What is Lag?
The time difference between related items is referred to as Lag.
45. Write the preprocessing steps that may be applied to the data for classification
and prediction?
• Data cleaning
• Relevance analysis
• Data transformation
46. Define Data Classification?
It is a two-step process. In the first step, a model is built describing a
predetermined set of data classes or concepts. The model is constructed by analyzing
database tuples described by attributes. In the second step, the model is used for
classification.
47. What are Bayesian Classifiers?
Bayesian Classifiers are statistical classifiers. They can predict class membership
probabilities, such as the probability that a given sample belongs to a particular class.
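Such a class-membership probability can be computed from counted frequencies via Bayes' theorem. A minimal naive-Bayes-style sketch on invented single-attribute data (the attribute values and labels are purely illustrative):

```python
# training samples: (attribute value, class label) -- invented for the example
train = [("sunny", "no"), ("sunny", "no"), ("rain", "yes"),
         ("rain", "yes"), ("sunny", "yes"), ("rain", "no")]

def class_probability(value, label):
    """P(class | value) via Bayes' theorem, normalized over all classes."""
    labels = [c for _, c in train]

    def score(cls):
        prior = labels.count(cls) / len(train)            # P(C)
        likelihood = (sum(1 for v, c in train
                          if v == value and c == cls)
                      / labels.count(cls))                # P(value | C)
        return prior * likelihood

    return score(label) / sum(score(cls) for cls in set(labels))

p = class_probability("sunny", "no")   # probability the class is "no" given "sunny"
```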
48. What is a decision tree?
It is a flowchart-like tree structure, where each internal node denotes a test on an
attribute, each branch represents an outcome of the test, and each leaf node represents a class
or class distribution. A decision tree is a predictive model. Each branch of the tree is a
classification question, and the leaves of the tree are partitions of the data set with their
classification.
49. Where are Decision Trees mainly used?
• Used for exploration of data set and business problems
• Data preprocessing for other predictive analysis
• Statisticians use decision trees for exploratory analysis.
50. How will you solve a classification problem using Decision Tree?
• Decision tree induction: construct a decision tree using training data.
• For each ti ∈ D, apply the decision tree to determine its class.
(ti - a tuple; D - the database)
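The second step can be sketched as follows. The tree here is hand-built for illustration (nested dictionaries standing in for internal nodes and leaves), not the output of a real induction algorithm:

```python
# hand-built tree: internal nodes test an attribute, leaves carry a class label
tree = {"attr": "income",
        "branches": {"high": {"label": "buys"},
                     "low":  {"attr": "student",
                              "branches": {"yes": {"label": "buys"},
                                           "no":  {"label": "does_not_buy"}}}}}

def classify(node, t):
    """Descend from the root, following the branch matching each attribute test."""
    while "label" not in node:
        node = node["branches"][t[node["attr"]]]
    return node["label"]

D = [{"income": "high", "student": "no"},
     {"income": "low",  "student": "no"}]
classes = [classify(tree, t) for t in D]   # step 2: classify every tuple in D
```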
51. How is association rules mined from large databases?
Association rule mining is a two step process.
• Find all frequent itemsets.
• Generate strong association rules from the frequent itemsets.
52. What is the classification of association rules based on various criteria?
1. Based on the types of values handled in the rule
a. Boolean association rule
b. Quantitative association rule.
2. Based on the dimensions of data involved in the rule
a. Single dimensional association rule
b. Multidimensional association rule
3. Based on the levels of abstraction involved in the rule
a. Single level association rule
b. Multilevel association rule
4. Based on various extensions to association mining
a. Maxpatterns
b. Frequent closed itemsets
53. What is Apriori algorithm?
Apriori algorithm is an influential algorithm for mining frequent item sets for
Boolean association rules using prior knowledge. The Apriori algorithm uses prior knowledge
of frequent itemset properties and employs an iterative approach known as level-wise
search, where k-itemsets are used to explore (k+1)-itemsets.
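The level-wise idea can be sketched compactly on invented transactions; this is a simplification for illustration (the full algorithm also prunes candidates whose subsets are infrequent before counting):

```python
transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"},
                {"B", "C"}, {"A", "B", "C"}]
min_support = 3   # absolute count threshold

def apriori(db, min_sup):
    items = sorted({i for t in db for i in t})
    frequent = []
    k_sets = [frozenset([i]) for i in items]          # candidate 1-itemsets
    while k_sets:
        # count support and keep the frequent k-itemsets
        survivors = [s for s in k_sets
                     if sum(1 for t in db if s <= t) >= min_sup]
        frequent.extend(survivors)
        # join step: merge surviving k-itemsets into (k+1)-candidates
        k_sets = list({a | b for a in survivors for b in survivors
                       if len(a | b) == len(a) + 1})
    return frequent

freq = apriori(transactions, min_support)
```

On this data all three 1-itemsets and all three 2-itemsets are frequent, while {A, B, C} fails the threshold and is discarded at level 3.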
54. Define a Data mart?
A data mart is a pragmatic collection of related facts, but it does not have to be
exhaustive or exclusive. A data mart is both a kind of subject area and an application.
A data mart is a collection of numeric facts.
55. What is data warehouse performance issue?
The performance of a data warehouse is largely a function of the quantity and
type of data stored within the database and the query/data-loading workload placed upon
the system.
56. What is Data Inconsistency Cleaning?
This can be summarized as the process of cleaning up the small inconsistencies
that introduce themselves into the data. Examples include duplicate keys and
unreferenced foreign keys.
57. Merits of Data warehouse.
* Ability to make effective decisions from database
* Better Analysis of data and decision support
* Discover trends and correlations that benefits business
* Handle huge amount of data
58. What are the characteristics of data warehouse?
* Separate
* Available
* Integrated
* Subject oriented
* Not dynamic
* Consistency
* Iterative Development
* Aggregation Performance
59. List some of the data warehouse tools.
* OLAP (Online Analytic Processing)
* ROLAP (Relational OLAP)
* End User Data Access Tool
* Ad Hoc Query Tool
* Data Transformation Services
* Replication
60. Explain OLAP.
The general activity of querying and presenting text and number data from data
warehouses, as well as a specifically dimensional style of querying and presenting that is
exemplified by a number of "OLAP vendors". The OLAP vendors' technology is non-
relational and is almost always based on an explicit multidimensional cube of data.
OLAP databases are also known as multidimensional cube databases.
61. Explain ROLAP.
ROLAP is a set of user interfaces and applications that give a relational database,
a dimensional flavor. ROLAP stands for Relational Online Analytic Processing.
62. Explain End User Data Access Tool?
End User Data Access Tool is a client of the data warehouse. In a relational data
warehouse, such a client maintains a session with the presentation server, sending a
stream of separate SQL requests to the server. Eventually the End User Data Access Tool
is done with the SQL session and turns around to present a screen of data or a report, a
graph, or some other higher form of analysis to the user. An End User Data Access Tool
can be as simple as an Ad Hoc Query Tool or as complex as a sophisticated data
mining or modeling application.
63. Explain Ad Hoc Query Tool?
It is a specific kind of end user data access tool that invites the user to form their
own queries by directly manipulating relational tables and their joins. Ad Hoc Query
Tools, as powerful as they are, can only be effectively used and understood by about 10%
of all the potential end users of a data warehouse.
64. Name some of the data mining applications.
* Data mining for biomedical and DNA Data Analysis
* Data Mining for Financial Data Analysis
* Data Mining for the Retail Industry
* Data Mining for the Telecommunication Industry
65. What are the contributions of Data Mining to DNA Analysis?
* Semantic Integration of heterogeneous, distributed genome databases
* Similarity Search and Comparison among DNA Sequences
* Association Analysis: identification of co-occurring gene sequences
* Path Analysis: Linking genes to different stages of disease development
* Visualization tools in genetic data analysis
66. Name some examples of Data Mining in Retail Industry.
* Design and Construction of Data Warehouses based on the benefits of Data
Mining
* Multidimensional Analysis of sales, customers, products, time and region
* Analysis of the effectiveness of sales campaigns
* Customer retention analysis of customer loyalty
* Purchase recommendation and cross-reference of item
67. What is the difference between "supervised" and "unsupervised" learning
scheme?
In data mining, during classification the class label of each training sample is
provided; this type of training is called "supervised learning", i.e., the learning of the
model is supervised in that it is told to which class each training sample belongs. E.g.,
classification.
In unsupervised learning, the class label of each training sample is not known, and
the number or set of classes to be learned may not be known in advance. E.g., clustering.
68. Discuss the importance of similarity metric clustering? Why is it difficult to
handle categorical data for clustering?
The process of grouping a set of physical or abstract objects into classes of similar
objects is called "clustering". The similarity metric is important because it is used for outlier
detection. Clustering algorithms that are main-memory based can operate only on the
following two data structures,
a) Data matrix
b) Dissimilarity matrix
Because both structures rely on numeric distance computations, it is difficult to handle
categorical data.

69. Mention at least 3 advantages of Bayesian Networks for data analysis. Explain
each one
a) A Bayesian network is a graphical representation of uncertain knowledge that is
easy to construct and interpret.
b) The representation has formal probabilistic semantics, making it suitable for
statistical manipulation
c) The representation is used for encoding uncertain expert knowledge in expert
systems.
70. Why do we need to prune a decision tree? Why should we use a separate pruning
data set instead of pruning the tree with the training database?
When a decision tree is built, many of the branches will reflect anomalies in
the training data due to noise or outliers. Tree pruning methods are needed to address this
problem of overfitting the data. A separate pruning set is used because evaluating the
pruned tree on the training data itself would favor the very overfitted branches that
pruning is meant to remove.
71. Explain the various OLAP operations?
a) Roll-up: performs aggregation on a data cube, either by climbing up a concept
hierarchy for a dimension or by dimension reduction.
b) Drill-down: the reverse of roll-up; it navigates from less detailed data to
more detailed data.
c) Slice: performs a selection on one dimension of the given cube, resulting in a
subcube.
72. Discuss the concepts of frequent itemset, support & confidence?
A set of items is referred to as an itemset. An itemset that contains k items is
called a k-itemset. An itemset that satisfies minimum support is referred to as a frequent
itemset.
Support is the ratio of the number of transactions that include all items in
the antecedent and consequent parts of the rule to the total number of transactions.
Confidence is the ratio of the number of transactions that include all items
in the consequent as well as the antecedent to the number of transactions that include all
items in the antecedent.
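Both measures can be computed directly from a transaction database. A minimal sketch on invented transactions:

```python
db = [{"milk", "bread"}, {"milk"}, {"milk", "bread", "eggs"}, {"bread"}]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(antecedent, consequent):
    """support(A union B) / support(A) for the rule A => B."""
    return support(antecedent | consequent) / support(antecedent)

s = support({"milk", "bread"})        # 2 of 4 transactions contain both items
c = confidence({"milk"}, {"bread"})   # 2 of the 3 milk transactions have bread
```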
73. Why is data quality so important in a data warehouse environment?
Data quality is important in a data warehouse environment to facilitate
decision- making. In order to support decision-making, the stored data should provide
information from a historical perspective and in a summarized manner.

74. How can data visualization help in decision-making?
Data visualization helps the analyst gain intuition about the data being
observed. Visualization applications frequently assist the analyst in selecting display
formats, viewer perspectives, and data representation schemas that foster deep intuitive
understanding, thus facilitating decision-making.
75. What do you mean by high performance data mining?
Data mining refers to extracting or mining knowledge. It involves an
integration of techniques from multiple disciplines such as database technology, statistics,
machine learning, neural networks, etc. When it involves techniques from high-
performance computing, it is referred to as high-performance data mining.
76. What are the merits of a Data Warehouse?
The merits of data warehouse are the following
• Ability to make effective decisions from the database.
• To discover trends and correlations as they provide benefit to the business.
• Better analysis of data and decision support.
• It leads to better understanding of the business and handle huge amount of
data.
• There is a possibility of the customer being served better.
• Better understanding of the business risks.
• Improvement of the business process.
• Being able to make tailor made products and services.
77. What are the merits of spatial Data Warehouse?
The merits of spatial data warehouse are the following
• Make dynamic geographic queries on data.
• To aggregate data to geographic areas.
• To analyze data and reorganize it spatially.
• Visualization and presentation of data.
78. Describe the two common approaches of Tree Pruning?
In the pre-pruning approach, a tree is pruned by halting its construction early. The
second approach, post-pruning, removes branches from a fully grown tree. A tree node is
pruned by removing its branches.

79. What is clustering?
Clustering is the process of grouping the data into classes or clusters so that
objects within a cluster have high similarity in comparison to one another, but are very
dissimilar to objects in other clusters.
80. What are the requirements of clustering?
• Scalability
• Ability to deal with different types of attributes
• Ability to deal with noisy data
• Minimal requirements for domain knowledge to determine input parameters
• Constraint based clustering
• Interpretability and usability
81. State the categories of clustering methods?
• Partitioning methods
• Hierarchical methods
• Density based methods
• Grid based methods
• Model based methods
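k-means, a well-known partitioning method, illustrates the general idea. A minimal one-dimensional sketch (real implementations handle initialization, convergence tests, and higher dimensions more carefully):

```python
def kmeans_1d(points, centers, iterations=10):
    """Alternate assignment and update steps for a fixed number of rounds."""
    for _ in range(iterations):
        # assignment step: each point joins the cluster of its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # update step: each center moves to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 10.0])
```

The points separate into two clusters of three, with centers settling near 1.0 and 9.0: high similarity within each cluster, high dissimilarity between them.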
82. Differentiate between lazy learner and eager learner?
Nearest neighbor classifiers are lazy learners in that they store all of the training
samples and do not build a classifier until a new (unlabeled) sample needs to be
classified.
Eager learning methods, such as decision tree induction and backpropagation,
construct a generalized model before receiving a new sample to classify.
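The contrast can be made concrete with a 1-nearest-neighbor classifier, the canonical lazy learner. A minimal sketch on invented points:

```python
# stored training samples: ((x1, x2), label) -- invented for the example
train = [((1.0, 1.0), "a"), ((1.2, 0.9), "a"), ((8.0, 8.0), "b")]

def predict(x):
    """No model is built in advance; the stored samples are scanned per query."""
    def sq_dist(p):
        return sum((u - v) ** 2 for u, v in zip(p, x))
    return min(train, key=lambda sample: sq_dist(sample[0]))[1]

label = predict((1.1, 1.0))   # classified only when the query arrives
```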
83. What is network pruning?
The first step toward extracting rules from a neural network is network pruning. This
consists of removing weighted links that do not result in a decrease in the classification
accuracy of the given network.
84. List the various criteria of classification in data mining system?
• Kinds of databases mined
• Kinds of knowledge mined
• Kinds of techniques utilized
• Application adapted

85. Name some data mining techniques?
• Statistics
• Machine learning
• Decision trees
• Hidden markov model
• Artificial neural networks
• Genetic algorithms
• Meta learning
86. Explain DBMiner tool in data mining?
• System Architecture
• Input and Output
• Data mining tasks supported by the system
• Support for task and method selection
• Support of the KDD process
• Main applications
• Current status
87. Define Iceberg query?
It computes an aggregate function over an attribute or set of attributes in order to
find aggregate values above some specified threshold. Given a relation R with attributes
a1, a2, …, an and b, and an aggregate function agg_f, an iceberg query has the form
select R.a1, R.a2, …, R.an, agg_f(R.b)
from relation R
group by R.a1, R.a2, …, R.an
having agg_f(R.b) >= threshold
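The same group-aggregate-threshold pattern can be expressed procedurally. A Python sketch with SUM as agg_f, on invented rows:

```python
from collections import defaultdict

rows = [("east", 10), ("east", 25), ("west", 5), ("west", 4)]  # (a1, b) pairs
threshold = 20

totals = defaultdict(int)
for a1, b in rows:
    totals[a1] += b            # GROUP BY a1 with agg_f = SUM

# HAVING clause: keep only groups whose aggregate clears the threshold
iceberg = {k: v for k, v in totals.items() if v >= threshold}
```

Only the "east" group (total 35) survives the threshold; "west" (total 9) sinks below the waterline, which is where the iceberg metaphor comes from.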
88. Define DBMiner?
DBMiner is an online analytical mining system, developed for interactive
mining of multiple-level knowledge in large relational databases and data warehouses.
89. List out the DBMiner tasks?
• OLAP analyzer
• Association
• Classification
• Clustering
• Prediction
• Time series analysis.

90. Explain how data mining is used in Health care analysis?
• Healthcare data mining and its aims
• Healthcare data mining technique
• Segmenting patients into groups
• Identifying patients with recurring health problems
• Relation between disease and symptoms
• Curbing the treatment costs
• Predicting medical diagnosis
• Medical research
• Hospital administration
• Applications of data mining in Healthcare
91. Explain Data mining applications for financial data analysis?
• Loan payment prediction and customer credit policy analysis.
• Classification and clustering of customers for targeted marketing.
• Detection of money laundering and other financial crimes.
92. Explain Data mining applications for the Telecommunication industry?
• Multidimensional analysis of telecommunication data.
• Fraudulent pattern analysis and the identification of unusual patterns.
• Multidimensional association and sequential pattern analysis.
• Use of visualization tools in telecommunication data analysis.
93. Define Spatial Data Warehouse?
A spatial data warehouse is a subject-oriented, integrated, time-variant,
and non-volatile collection of both spatial and non-spatial data in support of spatial data
mining and spatial-data-related decision making processes.
94. What are the different types of dimensions in a spatial data cube?
• Non-spatial dimensions.
• Spatial-to-non-spatial dimensions.
• Spatial-to-spatial dimensions.
95. Define Spatial Association rule?
A spatial association rule is of the form A ⇒ B [s%, c%], where A and B are sets of
spatial or non-spatial predicates, s% is the support of the rule, and c% is the confidence of
the rule.

96. Define Horizontal Parallelism?
Horizontal parallelism means that the database is partitioned across
multiple disks, and parallel processing occurs within a specific task that is performed
concurrently on different processors against different sets of data.
97. Define Vertical Parallelism?
Vertical parallelism occurs among different tasks: all component
query operations are executed in parallel in a pipelined fashion.
98. What is the need for OLAP?
• To analyze data stored in a database.
• To analyze different dimensions in a multidimensional database.
99. Explain the various types of variables used in clustering?
• Interval scaled variables
• Binary variables
o Symmetric binary variables
o Asymmetric binary variables
• Nominal variables
• Ordinal variables
• Ratio-scaled variables
100. Explain the hierarchical method of clustering?
• Agglomerative and Divisive hierarchical clustering
• BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
• CURE(Clustering Using REpresentatives)
• Chameleon
