UNIT I
Unit1: Introduction: Data mining – Functionalities – Classification – Introduction to Data Warehousing – Data
Preprocessing : Preprocessing the Data – Data cleaning – Data Integration and Transformation – Data Reduction
PART A –2 MARKS
1. List out the process of KDD. CO-1 K-1
Data cleaning
Data integration
Data selection
Data transformation
Data mining
Pattern evaluation
Knowledge presentation.
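The steps above can be sketched on toy data. This is only a minimal illustration: the purchase records (store_a, store_b), the function name kdd and the min_count parameter are hypothetical, and a simple frequency count stands in for the data mining step.

```python
from collections import Counter

# Hypothetical toy data: two sources of (customer, item) purchase records.
store_a = [("ann", "milk"), ("bob", "bread"), ("ann", None)]
store_b = [("cat", "Milk"), ("bob", "milk")]

def kdd(sources, min_count=2):
    # Data cleaning: drop records with missing values in each source.
    cleaned = [[r for r in src if None not in r] for src in sources]
    # Data integration: combine the cleaned sources into one list.
    records = [r for src in cleaned for r in src]
    # Data selection: keep only the attribute relevant to the task.
    items = [item for _, item in records]
    # Data transformation: normalise values to a consistent form.
    items = [item.lower() for item in items]
    # Data mining: count how often each item occurs.
    counts = Counter(items)
    # Pattern evaluation: keep only patterns that are frequent enough.
    patterns = {item: n for item, n in counts.items() if n >= min_count}
    # Knowledge presentation: return a readable, sorted summary.
    return sorted(patterns.items())

result = kdd([store_a, store_b])
print(result)
```

Each comment marks the KDD step that the line stands in for; a real system would replace every step with far richer processing.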
2. Define Data mining. CO-1 K-1
Data mining is the process of sorting through large data sets to identify patterns
and relationships that can help solve business problems through data analysis.
Data mining techniques and tools enable enterprises to predict future trends and
make more-informed business decisions.
3. List out the applications of data mining. CO-1 K-2
Data Mining Applications
Financial Data Analysis.
Retail Industry.
Telecommunication Industry.
4. Differentiate data mining tools and query tools. CO-1 K-2
Query tools can be used to easily build and input queries to databases. ... On the
other hand, Data Mining is a technique or a concept in computer science, which
deals with extracting useful and previously unknown information from raw
data.
5. What is meant by machine learning? CO-1 K-1
Machine learning (ML) is a type of artificial intelligence (AI) that allows
software applications to become more accurate at predicting outcomes without
being explicitly programmed to do so. Machine learning algorithms use historical
data as input to predict new output values.
6. What are the techniques used in data mining?
There are numerous crucial data mining techniques to consider when entering
the data field, but some of the most prevalent methods include clustering, data
cleaning, association, data warehousing, machine learning, data visualization,
1 Data Mining
classification, neural networks, and prediction.
7. Define clustering. CO-1 K-1
Clustering is the task of dividing the population or data points into a number of
groups such that data points in the same groups are more similar to other data
points in the same group than those in other groups. In simple words, the aim is
to segregate groups with similar traits and assign them into clusters.
8. Define regression. CO-1 K-1
The term “Regression” refers to the process of determining the relationship
between one or more factors and the output variable. The outcome variable is
called the response variable, whereas the risk factors and confounders are known
as predictors or independent variables.
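As an illustration of relating one predictor to a response variable, the sketch below fits a straight line by ordinary least squares. The function name fit_line and the sample values are assumptions made for the example; least squares is one common way to estimate the relationship, not the only one.

```python
def fit_line(xs, ys):
    """Least-squares fit of ys = slope * xs + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying on the line y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted = slope * 5 + intercept  # response predicted for a new x = 5
```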
9. Give the types of regression. CO-1 K-1
Linear Regression. ...
Logistic Regression. ...
Polynomial Regression. ...
10. What is classification? CO-1 K-1
Classification is the process of finding a model that describes and
distinguishes data classes, so that the model can be used to predict the class
label of objects whose class is unknown.
11. What is an association rule? CO-1 K-1
Association rules take the form “If antecedent, then consequent,” along with a
measure of the support and confidence associated with the rule.
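The two measures can be computed directly from a set of transactions. The sketch below assumes a hypothetical list of market-basket transactions; the helper names support and confidence are chosen for the example.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

# Rule: if bread, then milk.
s = support({"bread", "milk"}, transactions)
c = confidence({"bread"}, {"milk"}, transactions)
```

Here the rule "if bread, then milk" holds in 2 of the 4 transactions (support 0.5) and in 2 of the 3 transactions that contain bread (confidence 2/3).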
12. Define prediction. CO-1 K-1
Prediction is the process of identifying the missing or unavailable numerical
data for a new observation. In classification, the accuracy depends on finding the
class label correctly. In prediction, the accuracy depends on how well a given
predictor can guess the value of a predicted attribute for new data.
In supervised learning, input data is provided to the model along with the
output. In unsupervised learning, only input data is provided to the model.
2. What is the need for data warehouses? CO-2 K-4
The need for Data Warehouse is to generate reports, feed data to Business
Intelligence (BI) tools, forecast trends, and train Machine Learning models. Data
Warehouse stores data from multiple sources such as APIs, Databases, Cloud
Storage, etc., using the ETL (Extract, Transform, Load) process.
3. Define OLAP. CO-2 K-1
OLAP (for online analytical processing) is software for performing
multidimensional analysis at high speeds on large volumes of data from a data
warehouse, data mart, or some other unified, centralized data store.
4. Define multidimensional data model. CO-2 K-1
The multi-Dimensional Data Model is a method which is used for ordering data
in the database along with good arrangement and assembling of the contents in
the database.
5. What is a data cube? CO-2 K-2
A data cube is a three-dimensional (3D) or higher-dimensional array of values.
It is a data abstraction for evaluating aggregated data from a variety of
viewpoints.
6. Define dimensions. CO-2 K-2
In a data warehouse, dimensions are the perspectives or entities with respect to
which an organization wants to keep records, such as time, item, and location.
Each dimension may have a dimension table that describes it further.
7. What are facts? CO-2 K-1
There are three types of facts. Summative facts: these are used with
aggregation functions such as sum(), average(), etc. Semi-summative facts:
only a small number of aggregation functions apply to these. For example,
consider bank account details.
10. Define dimension table. CO-2 K-1
A dimension table is a database table referencing defining pieces of information
or attributes for particular records in a primary database table. Experts may
describe a dimension table as part of a "database schema" or a database
conceptual map that shows the logical construction of the database.
11. Define fact table. CO-2 K-2
A fact table stores quantitative information for analysis and is often
denormalized. A fact table works with dimension tables. A fact table holds the
data to be analyzed, and a dimension table stores data about the ways in which
the data in the fact table can be analyzed.
12. What is a lattice of cuboids? CO-2 K-1
The data cubes, or cuboids, computed for different subsets of the dimensions
form a lattice. In the lattice, the base cuboid contains all N dimensions, and
moving up the hierarchy we reach the 0-dimensional cuboid, called the apex
cuboid. A new cuboid may be generated by roll-up through dimension reduction.
13. What is the apex of cuboids? CO-2 K-4
The apex cuboid, or 0-D cuboid, refers to the case where the group-by is empty.
It contains the total sum of all sales. The base cuboid is the least generalized
(most specific) of the cuboids. The apex cuboid is the most generalized (least
specific) of the cuboids, and is often denoted as all.
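The lattice can be enumerated by listing every subset of the dimensions, from the base cuboid (all dimensions) to the apex cuboid (empty group-by). The sketch below assumes three hypothetical dimensions; the function name cuboids is chosen for the example.

```python
from itertools import combinations

def cuboids(dimensions):
    """Return every group-by subset, from the base cuboid to the apex cuboid."""
    lattice = []
    for k in range(len(dimensions), -1, -1):
        # All cuboids that group by exactly k of the dimensions.
        lattice.extend(combinations(dimensions, k))
    return lattice

lattice = cuboids(["time", "item", "location"])
base, apex = lattice[0], lattice[-1]
```

For N dimensions without concept hierarchies this yields 2^N cuboids, so three dimensions give eight.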
14. List out the various OLAP operations. CO-2 K-1
There are five primary types of analytical OLAP operations in a data warehouse:
1) Roll-up 2) Drill-down 3) Slice 4) Dice and 5) Pivot.
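Three of these operations can be sketched on a tiny cube held in a dictionary. The sales figures, the dimension names in DIMS and the function names are all hypothetical; drill-down (the reverse of roll-up) and pivot (rotating the axes for presentation) are not shown.

```python
from collections import defaultdict

# Hypothetical sales facts keyed by (year, item, city).
DIMS = ("year", "item", "city")
cube = {
    (2023, "pen", "Delhi"): 10,
    (2023, "pen", "Pune"): 5,
    (2023, "ink", "Delhi"): 7,
    (2024, "pen", "Delhi"): 12,
    (2024, "ink", "Pune"): 3,
}

def roll_up(cube, drop):
    """Aggregate away one dimension (roll-up by dimension reduction)."""
    i = DIMS.index(drop)
    out = defaultdict(int)
    for key, value in cube.items():
        out[key[:i] + key[i + 1:]] += value
    return dict(out)

def slice_(cube, dim, value):
    """Select the cells where a single dimension has a single value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}

def dice(cube, **allowed):
    """Select the cells whose values fall in the allowed sets on several dimensions."""
    return {k: v for k, v in cube.items()
            if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())}

by_year_item = roll_up(cube, "city")
only_2023 = slice_(cube, "year", 2023)
pens_in_delhi = dice(cube, item={"pen"}, city={"Delhi"})
```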
15. Give the names of warehouse schemas. CO-2 K-1
Following are the three major types of schemas:
1. Star Schema.
2. Snowflake Schema.
3. Galaxy Schema.
16. Define star schema. CO-2 K-2
A star schema is a database organizational structure optimized for use in a data
warehouse or business intelligence that uses a single large fact table to store
transactional or measured data, and one or more smaller dimensional tables that
store attributes about the data.
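A star schema can be pictured with plain Python structures: one fact table whose rows hold foreign keys and a measure, and small dimension tables keyed by those foreign keys. All table names, keys and values below are hypothetical.

```python
# Hypothetical dimension tables: key -> descriptive attributes.
dim_item = {1: {"name": "pen", "category": "stationery"},
            2: {"name": "milk", "category": "grocery"}}
dim_city = {10: {"city": "Delhi"}, 20: {"city": "Pune"}}

# Fact table: foreign keys into each dimension plus a numeric measure.
fact_sales = [
    {"item_id": 1, "city_id": 10, "units": 4},
    {"item_id": 2, "city_id": 10, "units": 6},
    {"item_id": 1, "city_id": 20, "units": 5},
]

def units_by(attribute, dimension, key):
    """Join the fact table to one dimension and total the measure per attribute value."""
    totals = {}
    for row in fact_sales:
        label = dimension[row[key]][attribute]
        totals[label] = totals.get(label, 0) + row["units"]
    return totals

by_category = units_by("category", dim_item, "item_id")
by_city = units_by("city", dim_city, "city_id")
```

The fact table holds what is analyzed (units), while each dimension table supplies a way of analyzing it (by category, by city).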
17. Define snowflake schema. CO-2 K-4
A snowflake schema is a logical arrangement of tables in a multidimensional
database such that the entity relationship diagram resembles a snowflake shape.
The snowflake schema is represented by centralized fact tables which are
connected to multiple dimensions.
18. Define metadata. CO-2 K-1
Metadata is data about data. In a data warehouse, metadata describes the
warehouse objects: the structure of the warehouse, the sources of the data, the
transformations applied to it, and how it is summarized.
19. Define data mart. CO-2 K-1
A data mart is a simple form of data warehouse focused on a single subject or
line of business. With a data mart, teams can access data and gain insights faster,
because they don't have to spend time searching within a more complex data
warehouse or manually aggregating data from different sources
20. What are the applications of metadata? CO-2 K-2
1. ID and port used by a peer to share files.
2. Name and size of the uploaded/downloaded files.
3. Flow encryption level.
4. Software version.
UNIT 3
Mining Association Rules: Basic Concepts – Single Dimensional Boolean Association Rules From Transaction
Databases, Multilevel Association Rules from transaction databases – Multi dimension Association Rules from
Relational Database and Data Warehouses.
PART A –2 MARKS
1. Define association rule mining. CO-3 K-2
Association rule mining, at a basic level, involves the use of machine
learning models to analyze data for patterns, or co-occurrences, in a database. It
identifies frequent if-then associations, which themselves are the association
rules
3. What is classification of association rule mining? CO-3 K-2
Classification rule mining aims to discover a small set of rules in the database
that forms an accurate classifier. Association rule mining finds all the rules
existing in the database that satisfy some minimum support and minimum
confidence constraints.
4. What is the purpose of the Apriori algorithm? CO-3 K-2
Apriori is an algorithm for frequent item set mining and association rule
learning over relational databases. It proceeds by identifying the frequent
individual items in the database and extending them to larger and larger item sets
as long as those item sets appear sufficiently often in the database.
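The level-by-level extension described above can be sketched as follows. This is a minimal illustration, not an optimized implementation: the transactions are hypothetical, the function name apriori is chosen for the example, and min_support is assumed to be an absolute count of transactions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent item sets."""
    def count(candidates):
        return {c: sum(c <= t for t in transactions) for c in candidates}

    # Level 1: frequent individual items.
    items = {item for t in transactions for item in t}
    level = {c: n for c, n in count({frozenset([i]) for i in items}).items()
             if n >= min_support}
    frequent = dict(level)
    k = 2
    while level:
        # Join step: combine frequent (k-1)-item sets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = {c: n for c, n in count(candidates).items() if n >= min_support}
        frequent.update(level)
        k += 1
    return frequent

# Hypothetical transactions.
transactions = [frozenset(t) for t in (
    {"bread", "milk"}, {"bread", "butter"},
    {"bread", "milk", "butter"}, {"milk", "butter"},
)]
frequent = apriori(transactions, min_support=2)
```

With a minimum support count of 2, all three single items and all three pairs are frequent, while the three-item set appears only once and is dropped.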
5. Give two techniques to improve the Apriori algorithm. CO-3 K-2
Techniques to improve the efficiency of Apriori algorithm
13. What is minimum support and minimum confidence of association rule mining? CO-3 K-2
Minimum support is the user-specified threshold on how often an item set must
appear in the database to be considered frequent. Minimum confidence is the
user-specified threshold on how often the consequent appears in the transactions
that contain the antecedent. Association rule mining finds all the rules
existing in the database that satisfy these minimum support and minimum
confidence constraints.
Quantitative attributes are numeric and have an implicit ordering among their
values. Numeric attributes should be discretized. A multidimensional
association rule involves more than one dimension.
19. What is a hybrid-dimensional association rule? CO-3 K-2
A hybrid-dimensional association rule is a multidimensional association rule in
which some dimension occurs more than once. An Apriori-based hybrid-dimension
association rule mining algorithm finds the rules that satisfy the given
conditions on a multidimensional transaction database. A Boolean matrix based
approach has been employed to generate frequent item sets in multidimensional
transaction databases.
UNIT 4
Classification and Prediction: Introduction – Issues – Decision Tree Induction – Bayesian Classification.
Classification based on Concepts from Association Rule Mining – Other Methods. Prediction – Introduction –
Classifier Accuracy
PART A –2 MARKS
1. What are the steps involved in preparing the data for classification? CO-4 K-2
There are 7 steps to effective data classification:
Decision tree is the most powerful and popular tool for classification and
prediction. A Decision tree is a flowchart-like tree structure, where each internal
node denotes a test on an attribute, each branch represents an outcome of the
test, and each leaf node (terminal node) holds a class label.
An attribute selection measure is a heuristic for choosing the splitting test that
“best” separates a given data partition, D, of class-labelled training tuples into
single classes.
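Information gain is one widely used attribute selection measure; the sketch below computes it for two candidate attributes. The training tuples and attribute names are hypothetical, and the choice of information gain (rather than, say, the Gini index) is an assumption made for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Expected information (in bits) needed to classify a tuple."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the partition minus the weighted entropy after splitting."""
    labels = [r[target] for r in rows]
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical class-labelled training tuples.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]
gain_outlook = information_gain(rows, "outlook", "play")
gain_windy = information_gain(rows, "windy", "play")
```

Splitting on outlook separates the tuples into pure classes, so it has the higher gain and would be chosen as the splitting attribute; windy tells us nothing about the class here.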
Pre-pruning, also known as Early Stopping Rule, is the method where the
subtree construction is halted at a particular node after evaluation of some
measure. These measures can be the Gini Impurity or the Information Gain
Post-pruning considers the subtrees of the full tree and uses a cross-
validated metric to score each of the subtrees. To clarify, we are using subtree to
mean a tree with the same root as the original tree but without some branches.
10. Define centroid of the cluster. CO-4 K-2
A centroid is the imaginary or real location representing the center of the
cluster. Each data point is allocated to one of the clusters so as to reduce
the in-cluster sum of squares.
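The idea of allocating points to centroids and reducing the in-cluster sum of squares can be sketched with a small k-means loop. The points, the starting centroids and the function names are hypothetical; a real implementation would also choose the starting centroids and test for convergence.

```python
def kmeans(points, centroids, iterations=10):
    """Cluster 2-D points around the given starting centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return clusters, centroids

def within_cluster_ss(clusters, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum((x - cx) ** 2 + (y - cy) ** 2
               for cl, (cx, cy) in zip(clusters, centroids) for x, y in cl)

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters, centroids = kmeans(points, centroids=[(0, 0), (10, 10)])
wcss = within_cluster_ss(clusters, centroids)
```

Note that the centroids found here, (4/3, 4/3) and (25/3, 25/3), are "imaginary" locations: neither coincides with an actual data point.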
DBSCAN
DENCLUE
15. List out the partitioning methods. CO-4 K-2
Range Partitioning.
Hash Partitioning.
List Partitioning.
Composite Partitioning
16. Define attribute-oriented induction. CO-4 K-2
Attribute-oriented induction summarizes the information in a relational
database by repeatedly replacing specific attribute values with more general
concepts according to user-defined concept hierarchies.
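A single generalization step can be sketched as follows. The concept hierarchy (city to country), the student tuples and the function name generalize are hypothetical; identical generalized tuples are merged and counted.

```python
from collections import Counter

# Hypothetical user-defined concept hierarchy: specific value -> general concept.
hierarchy = {"Chennai": "India", "Pune": "India", "Paris": "France", "Lyon": "France"}

def generalize(rows, hierarchy):
    """Replace each city by its country and merge identical tuples with a count."""
    return Counter((hierarchy.get(city, city), major) for city, major in rows)

# Hypothetical (city, major) tuples from a relational table.
rows = [("Chennai", "CS"), ("Pune", "CS"), ("Paris", "CS"), ("Lyon", "Math")]
summary = generalize(rows, hierarchy)
```

Four specific tuples are summarized into three generalized ones, with the count recording how many original tuples each one covers.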
An outlier is a data object that deviates significantly from the rest of the
data objects and behaves in a different manner. Outliers can be caused by
measurement or execution errors.
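One simple way to flag such objects is a z-score test, sketched below. The threshold of 2 standard deviations and the sample readings are assumptions made for the example; other detection methods exist, and the function assumes the values are not all identical (otherwise the standard deviation is zero).

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than threshold standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical readings with one suspicious measurement.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
outliers = z_score_outliers(readings)
```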
UNIT 5
Cluster Analysis: Introduction – Types of Data in Cluster Analysis, Partitioning Methods – Hierarchical Methods –
Density Based Methods – GRID Based Method – Model based Clustering Method.
PART A –2 MARKS
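1. Define CLARA. CO-5 K-2
CLARA (Clustering Large Applications) clusters a large data set by applying a
k-medoids method to samples of the data. The related algorithm CLARANS
(Clustering Large Applications based on Randomized Search) is a Data Mining
algorithm designed to cluster spatial data.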
Agglomerative: This is a "bottom-up" approach: each observation starts in
its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
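The bottom-up merging can be sketched for one-dimensional data. This is a minimal illustration assuming single linkage (the distance between two clusters is the distance between their closest members); the data values and the function name agglomerative are hypothetical.

```python
def agglomerative(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    # Each observation starts in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i.
        clusters[i] += clusters.pop(j)
    return clusters

clusters = agglomerative([1, 2, 9, 10, 25], n_clusters=2)
```

Recording the order of the merges, rather than stopping at a fixed number of clusters, would give the full hierarchy.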
9. What is prediction? CO-5 K-2
Predictive data mining is data mining that is done for the purpose
of using business intelligence or other data to forecast or predict trends. This type
of data mining can help business leaders make better decisions and can add value
to the efforts of the analytics team.
Content mining is the scanning and mining of the text, images, and graphs of
a Web page to determine the relevance of the content to the search query. This
scanning is done after the clustering of web pages through structure mining and
provides results according to their relevance to the query.
15. Define web structure mining. CO-5 K-2
Web structure mining, one of three categories of web mining for data, is a tool
used to identify the relationship between Web pages linked by information or
direct link connection. It offers information about how different pages are linked
together to form this huge web.
16. Define web usage mining. CO-5 K-4
Web usage mining, a subset of Data Mining, is basically the extraction of
various types of interesting data that is readily available and accessible in
the vast ocean of web pages on the Internet, formally known as the World Wide
Web (WWW).
Spatial data mining is the application of data mining to spatial models. In spatial
data mining, analysts use geographical or spatial information to produce business
intelligence or other results. This requires specific techniques and resources to
get the geographical data into relevant and useful formats.
Sequence mining (sequential pattern mining) is the discovery of frequently
occurring ordered events or subsequences as patterns in sequence data, such as
customer purchase sequences or web click streams.
Graph Mining is the set of tools and techniques used to (a) analyze the properties
of real-world graphs, (b) predict how the structure and properties of a given
graph might affect some application, and (c) develop models that can generate
realistic graphs that match the patterns found in real-world graphs of interest.