
Data mining Questions and Answers

UNIT-1
1) What is data mining ?

Data mining is the non-trivial process of identifying valid, novel, potentially useful and ultimately
understandable patterns in data.
Or
Data mining is the process of sorting through large data sets to identify patterns and relationships to
solve problems through data analysis.

2) Define knowledge discovery database.

KDD stands for Knowledge Discovery in Databases. It is the process of identifying valid, potentially
useful and ultimately understandable structures in data.

3) Write the stages of KDD

Stages of KDD are:


1. Selection
2. Preprocessing
3. Transformation
4. Data mining
5. Interpretation and Evaluation
6. Data visualisation

4) What is unsupervised learning?


Unsupervised learning means learning from observation and discovery. In this mode of
learning, there is no training set of the classes; no supervision is needed.

5) Name the two fundamental goals of data mining?


➢ Prediction
➢ Description

6) What is clustering?
Cluster analysis is a data mining technique used to identify data that are similar to each other. This
process helps to understand the differences and similarities between the data.

7.Name the categories of issues in data mining .


➢ Limited information
➢ Noise or missing data.
➢ User interaction and prior knowledge
➢ Uncertainty
➢ Size, updates and irrelevant fields.

8.What is data warehouse?


A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in
support of the management’s decision-making process.

9. What are the three levels in architecture of datawarehouse?


➢ Tier 1: Bottom Tier (Data Warehouse Server)
➢ Tier 2: Middle Tier (OLAP Server)
➢ Tier 3: Top Tier (Front-end Tools)

10. What is data cube?


A data cube is essentially the multidimensional representation of data, wherein data are
partitioned into different cells based on the values of their dimension attributes. It is a
multidimensional structure used to store data.

11.Name the three relational schema in ROLAP


1)Star schema
2)Snowflake schema
3)Fact constellation schema

12. what is metadata?


It is data about data. It contains the description of all data within the data warehouse and
includes all three levels of the architecture. The objective of metadata is to keep a record of all activities
of the data warehouse.

13. what is datamart ?


It can be viewed as a smaller data warehouse, or a subset of the data warehouse customized for a
particular aspect of the business. It is a partition of the overall data warehouse.

14. List any four OLAP operations


➢ Slicing
➢ Dicing
➢ Drilling (drill-down / roll-up)
➢ Pivoting (rotation)

15. What is dimension modelling?


The concept of a dimension provides a lot of semantic information, especially about the
hierarchical relationship between its elements. Dimension modelling is a special technique for
structuring data around business concepts.
Unlike ER modelling, which describes entities and relationships, dimension modelling structures
the numeric measures and the dimensions. The dimension schema can represent the details of the
dimensional modelling. Derived data is a type of client-side processing or data aggregation that has
been performed on the stored master data.

16 . What is derived data?


The data that has been aggregated, selected or formatted for end-user and decision support
system applications from the reconciled data is called derived data.

17. State the difference between traditional data base and data warehouse.
----------------
18. Define slicing and dicing operations of OLAP
-----------------
19. What is operational data ?
It is the data collected and available in the transaction processing system. It resides in the source
database.
20 What is reconciled data?
The detailed transaction-level data that have been cleaned and confirmed for consistency are
called reconciled data. It serves as the base data for all warehouse activity.

UNIT-1 5 MARKS
1. Explain the stages of KDD.
KDD-Knowledge Discovery in Database
The KDD process tends to be highly iterative and interactive. Data mining is only one of the
steps involved in knowledge discovery in databases. The various steps in the knowledge discovery
process include data selection, data cleaning and pre-processing, data transformation and reduction,
data mining algorithm selection and finally the post-processing and the interpretation of the
discovered knowledge.
The stages of KDD, starting with the raw data and finishing with the extracted knowledge are
given below;
❖ Selection: This stage is concerned with selecting or segmenting the data that are relevant to
some criteria.
For example, for credit card customer profiling, we extract the type of transactions for each
type of customers and we may not be interested in details of the shop where the transaction
takes place.
❖ Pre-processing: It is the data cleaning stage where unnecessary information is removed. This
stage reconfigures the data to ensure a consistent format, as there is a possibility of
inconsistent formats.
❖ Transformation: The data is not merely transferred, but transformed in order to be suitable
for the task of data mining. In this stage, the data is made usable and navigable.
❖ Data mining: This stage is concerned with the extraction of patterns from the data. It refers to
the finding of relevant and useful information from databases.
❖ Interpretation and Evaluation: The patterns obtained in the data mining stage are converted into
knowledge, which in turn is used to support decision making.
❖ Data visualization: Data mining allows the analyst to focus on certain patterns and trends and
explore them in-depth using visualization. It helps users to examine large volumes of data
and detect the patterns visually. Visual displays of data such as maps, charts and other
representation allow data to be represented compactly to the users.
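To make the stages concrete, here is a minimal Python sketch of a KDD-style pipeline; the file name, column names, and the choice of clustering as the mining step are illustrative assumptions, not part of the text.

```python
# Illustrative KDD pipeline: selection -> preprocessing -> transformation ->
# data mining -> interpretation/evaluation. Assumes a hypothetical CSV file
# "transactions.csv" with numeric columns "amount", "frequency" and "age".
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Selection: keep only the attributes relevant to the analysis
data = pd.read_csv("transactions.csv")
selected = data[["amount", "frequency", "age"]]

# Preprocessing: clean inconsistent and missing values
cleaned = selected.dropna().drop_duplicates()

# Transformation: bring attributes to a comparable scale
scaled = StandardScaler().fit_transform(cleaned)

# Data mining: discover patterns (here, customer groups via clustering)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)

# Interpretation and evaluation: summarise each discovered group
cleaned = cleaned.assign(cluster=model.labels_)
print(cleaned.groupby("cluster").mean())
```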

2. State the differences between DBMS and Data mining.


Data Base Management System (DBMS) vs Data Mining
• A DBMS is a full-fledged system for housing and managing a set of digital databases; data mining
deals with extracting useful and previously unknown information from raw data.
• A DBMS is an organized collection of data, and most of the time these raw data are stored in very
large databases; data mining analyses data from different sources to discover useful knowledge.
• A DBMS supports a query language; data mining supports automatic data exploration.
• With a DBMS we know exactly what information we are looking for; with data mining we are not
clear about the possible correlations or patterns.
• A DBMS is loosely coupled; data mining may be loosely coupled or tightly coupled.

3. Explain briefly any five application areas of data mining


DM APPLICATION AREAS

1. Data Mining Applications in Healthcare.


2. Data Mining Applications in Marketing.
3. Data Mining Applications in Education.
4. Data Mining Applications in Manufacturing.
5. Data Mining Applications in CRM
6. Data Mining Applications in Fraud Detection
7. Data Mining Applications in Criminal Investigation.
8. Data Mining Applications in Customer Segmentation
9. Data Mining Applications in Financial Banking
10. Data Mining Applications in Corporate Surveillance
11. Data Mining Applications in Research Analysis
12. Data Mining Applications in Bio Informatics
1) Future Healthcare:

Data mining holds great potential to improve health systems. It uses data and analytics to identify best
practices that improve care and reduce costs. Researchers use data mining approaches like multi-
dimensional databases, machine learning, soft computing, data visualization and statistics. Mining
can be used to predict the volume of patients in every category. Processes are developed that make
sure that the patients receive appropriate care at the right place and at the right time. Data mining
can also help healthcare insurers to detect fraud and abuse.

2)Market Basket Analysis:

Market basket analysis is a modelling technique based upon a theory that if you buy a certain group
of items you are more likely to buy another group of items. This technique may allow the retailer to
understand the purchase behavior of a buyer. This information may help the retailer to know the
buyer’s needs and change the store’s layout accordingly. Using differential analysis, a comparison of
results between different stores, or between customers in different demographic groups, can be done.

3)Education:

There is a new emerging field, called Educational Data Mining, which is concerned with developing methods
that discover knowledge from data originating from educational environments. The goals of EDM are
identified as predicting students’ future learning behaviour, studying the effects of educational
support, and advancing scientific knowledge about learning. Data mining can be used by an
institution to take accurate decisions and also to predict the results of the student. With the results
the institution can focus on what to teach and how to teach. Learning pattern of the students can be
captured and used to develop techniques to teach them.
4)Manufacturing Engineering:

Knowledge is the best asset a manufacturing enterprise would possess. Data mining tools can be
very useful to discover patterns in complex manufacturing process. Data mining can be used in
system-level designing to extract the relationships between product architecture, product portfolio,
and customer needs data. It can also be used to predict the product development span time, cost,
and dependencies among other tasks.

5)CRM:

Customer Relationship Management is all about acquiring and retaining customers, also improving
customers’ loyalty and implementing customer focused strategies. To maintain a proper relationship
with a customer, a business needs to collect data and analyse the information. This is where data
mining plays its part. With data mining technologies the collected data can be used for analysis.
Instead of being confused about where to focus to retain customers, those seeking a solution get
filtered results.

6)Fraud Detection: Billions of dollars have been lost to the action of frauds. Traditional methods of
fraud detection are time consuming and complex. Data mining aids in providing meaningful patterns
and turning data into information. Any information that is valid and useful is knowledge. A perfect
fraud detection system should protect information of all the users. A supervised method includes
collection of sample records. These records are classified as fraudulent or non-fraudulent. A model is
built using this data and the algorithm is made to identify whether the record is fraudulent or not.

7)Intrusion Detection:

Any action that will compromise the integrity and confidentiality of a resource is an intrusion. The
defensive measures to avoid an intrusion include user authentication, avoiding programming errors,
and information protection. Data mining can help improve intrusion detection by adding a level of
focus to anomaly detection. It helps an analyst to distinguish an activity from common everyday
network activity. Data mining also helps extract data which is more relevant to the problem.

8)Customer Segmentation:

Traditional market research may help us to segment customers but data mining goes in deep and
increases market effectiveness. Data mining aids in aligning the customers into distinct segments
and helps tailor offerings to the needs of those customers. Marketing is always about retaining the
customers. Data mining allows finding a segment of customers based on vulnerability, and the
business could offer them special offers and enhance satisfaction.

9)Financial Banking:

With computerized banking everywhere, a huge amount of data is generated with new
transactions. Data mining can contribute to solving business problems in banking and finance by
finding patterns, causalities, and correlations in business information and market prices that are not
immediately apparent to managers, because the volume of data is too large or is generated too quickly
to screen by experts. The managers may use this information for better segmenting, targeting,
acquiring, retaining and maintaining a profitable customer base.

10)Corporate Surveillance:

Corporate surveillance is the monitoring of a person or group’s behaviour by a corporation. The data
collected is most often used for marketing purposes or sold to other corporations, but is also
regularly shared with government agencies. It can be used by the business to tailor their products
to what their customers desire. The data can be used for direct marketing purposes, such as the
targeted advertisements on Google and Yahoo, where ads are targeted to the user of the search
engine by analyzing their search history and emails.

11)Research Analysis:

History shows that we have witnessed revolutionary changes in research. Data mining is helpful in
data cleaning, data pre-processing and integration of databases. The researchers can find any similar
data from the database that might bring any change in the research. Identification of any
co-occurring sequences and the correlation between any activities can be known. Data visualisation
and visual data mining provide us with a clear view of the data.

Criminal Investigation:

Criminology is a process that aims to identify crime characteristics. Actually, crime analysis includes
exploring and detecting crimes and their relationships with criminals. The high volume of crime datasets and also
the complexity of relationships between these kinds of data have made criminology an appropriate
field for applying data mining techniques. Text-based crime reports can be converted into word
processing files. This information can be used to perform the crime matching process.

12)Bio Informatics:

Data Mining approaches seem ideally suited for Bioinformatics, since it is data-rich. Mining biological
data helps to extract useful knowledge from massive datasets gathered in biology, and in other
related life sciences areas such as medicine and neuroscience. Applications of data mining to
bioinformatics include gene finding, protein function inference, disease diagnosis, disease prognosis,
disease treatment optimization, protein and gene interaction network reconstruction, data
cleansing, and protein sub-cellular location prediction.

4. Explain the Issues and Challenges in DM .


Data mining systems depend on databases to supply the raw input and this raises problems, such as
those databases tend to be dynamic, incomplete, noisy and large. Other problems arise as a result of
the inadequacy and irrelevance of the information stored.
The difficulties in data mining can be categorized as
1. Limited information
2. Noise or missing data
3. User interaction and prior knowledge
4. Uncertainty
5. Size, updates and irrelevant fields

1. Limited information:
A database is often designed for purposes other than data mining and, sometimes,
some attributes which are essential for knowledge discovery of the application domain are
not present in the data. Thus, it may be very difficult to discover significant knowledge
about a given domain.

2. Noise and missing data :


Attributes that rely on subjective or measurement judgments can give rise to errors, such that some
examples may be misclassified. Missing data can be treated in a number of ways: simply disregarding
missing values, omitting corresponding records, inferring missing values from known values, and
treating missing data as a special value to be included additionally in the attribute domain. The data
should be cleaned so that it is free of errors and missing data.
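The treatment options listed above can be illustrated with a small pandas sketch; the DataFrame and its columns are invented for the example.

```python
# Illustrative strategies for missing data, matching the options in the text:
# disregard/omit records, infer missing values, or use a special value.
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 31],
                   "income": [30000, 45000, None, 52000]})

dropped = df.dropna()                             # omit records with missing values
imputed = df.fillna(df.mean(numeric_only=True))   # infer from known values (mean)
flagged = df.fillna(-1)                           # treat missing data as a special value

print(dropped, imputed, flagged, sep="\n\n")
```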

3. User interaction and prior knowledge:


An analyst is usually not a KDD expert but simply a person making use of the data by means
of the available KDD techniques. Since the KDD process is by definition interactive and iterative, it is
challenging to provide a high performance, rapid-response environment that also assists the users in
the proper selection and matching of the appropriate techniques, to achieve their goals. There
needs to be more human-computer interaction and less emphasis on total automation, which
supports both the novice and expert users. The use of domain knowledge is important in all steps of
the KDD process. It would be convenient to design a KDD tool which is both interactive and iterative.

4. Uncertainty:
This refers to the severity of error and the degree of noise in the data. Data precision is an important
consideration in a discovery system.

5. Size, updates and irrelevant fields:


Databases tend to be large and dynamic, in that their contents keep changing as information is
added, modified or removed. The problem with this, from the perspective of data mining, is how to
ensure that the rules are up-to-date and consistent with the most current information.

5. Explain the relations of data mining with different areas.


OTHER RELATED AREAS:
STATISTICS:
Statistics is a theory-rich approach for data analysis. Statistics, with its solid theoretical foundation,
generates results that can be difficult to interpret. Statistics is one of the foundational principles on which
data mining technology is built. Statistical analysis systems are used by analysts to detect unusual
patterns and explain patterns using statistical models, such as linear models.
MACHINE LEARNING:
Machine learning is the automation of a learning process and learning is equivalent to the
construction of rules based on observations. This is a broad field which includes not only learning
from examples but also reinforcement learning, learning with a teacher, etc. A learning algorithm
takes the data set and its additional information as the input and returns a statement, e.g. a concept,
representing the results of learning, as output.
Inductive learning, where the system concludes knowledge itself from observing its environment,
has two main strategies: Supervised learning and unsupervised learning.
1. SUPERVISED LEARNING:
Supervised learning means learning from examples, where a training set is given which acts as
examples for the classes. The system finds a description of each class. Once the description has been
formulated, it is used to predict the class of previously unseen objects.

2. UNSUPERVISED LEARNING:

Unsupervised learning is learning from observation and discovery. In this mode of


learning, there is no training set or prior knowledge of the classes. The system analyzes
the given set of data to observe similarities emerging out of the subsets of the data.
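A short scikit-learn sketch can make the contrast concrete; the toy observations and labels below are invented for illustration.

```python
# Supervised learning: a training set with class labels is given.
# Unsupervised learning: only the observations are given, no labels.
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X = [[1.0, 2.0], [1.2, 1.8], [8.0, 9.0], [7.5, 9.2]]   # observations
y = ["small", "small", "large", "large"]               # labels (supervised only)

clf = DecisionTreeClassifier().fit(X, y)               # learns from labelled examples
print(clf.predict([[1.1, 2.1]]))                       # predicts the class of a new object

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # discovers groups itself
print(km.labels_)                                      # similarities emerging from the data
```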
6. Explain ODS.

7. Explain the Three-Level Architecture of data warehouse


1. Operational data
2. Reconciled data
3. Derived data
1. Operational data :- It is the data collected and available in the transaction processing system. It
resides in the source database.
2. Reconciled data :- The detailed transaction-level data that have been cleaned and confirmed for
consistency are called reconciled data. It serves as the base data for all warehouse activity.
3. Derived data :- The data that has been aggregated, selected or formatted for end-user and decision
support system applications from the reconciled data are called derived data.

8. Explain the fundamental differences of a data warehouse from a database.


-----------
9. Explain OPERATIONAL VS. ANALYTICAL APPLICATIONS
Most business processes involve a series of events whose nature and frequency may differ. Examples
of such events are: a product is manufactured; one account is credited while another is debited; a
seat is reserved; an order is booked; an invoice is generated. The operational applications for such
processes are designed to be highly efficient-they optimize the execution of a large number of
atomic transactions with near fault-tolerant availability of data. We call such systems “Online
Transaction Processing” systems (OLTPs).
Primarily, operational applications differ from analytical applications in the type of questions
that the respective systems are capable of addressing. For instance, operational application queries
are of the “What” or “Who” type, whereas those of the analytical applications are of the “why” and
“what if” type. Analytical applications are able to provide insight into business patterns overlooked
by operational applications that deal with just routine business.

10. Explain data mining applications.


Repeated (3rd qstn)

11. Write the advantages and disadvantages of data warehouse.


Advantages:
1. Enhances end-user access to a wide variety of data.
2. Increases data consistency.
3. Increases productivity and decreases computing costs.
4. It is able to combine data from different sources in one place.
5. It provides an infrastructure that could support changes to data and replication of the changed
data back into the operational systems.
6. Clean data.
7. Query processing: multiple options.
8. Security: data and access.
Disadvantages:
1. Extracting, cleaning, and loading data could be time consuming.
2. Problems with compatibility with systems already in place, e.g. the transaction processing system.
3. Security could develop into a serious issue, especially if the data warehouse is web accessible.
4. Long initial implementation time and associated high cost.
5. Typically, data is static and dated.

12. Describe with example different warehouse schema.


STAR SCHEMA: A star schema is a modelling paradigm in which the data
warehouse contains a large, single, central Fact Table and a set of smaller Dimension
Tables, one for each dimension.
The Fact Table contains the detailed summary data. Its primary key has one key per dimension.
Each dimension is a single, highly denormalized table. Every tuple in the Fact Table consists of the
fact or subject of interest, and the dimensions that provide that fact. Each tuple of the Fact Table
consists of a (foreign) key pointing to each of the Dimension Tables that provide its
multidimensional coordinates. It also stores numerical values (non-dimensional attributes, and
results of statistical functions) for those coordinates. The Dimension Tables consist of columns that
correspond to the attributes of the dimension. So each tuple in the Fact Table corresponds to one
and only one tuple in each Dimension Table, whereas one tuple in a Dimension Table may
correspond to more than one tuple in the Fact Table. So we have a 1:N relationship between the Fact
Table and the Dimension Table. The advantages of a star schema are that it is easy to understand,
easy to define hierarchies, reduces the number of physical joins, and requires low maintenance and
very simple metadata.
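As an illustration (the table names, column names and values below are assumptions, not taken from the text), a tiny star schema can be sketched in pandas: one central fact table whose foreign keys point to two dimension tables, joined to answer a multidimensional query.

```python
# A toy star schema: a central fact table with foreign keys into dimension tables.
import pandas as pd

dim_product = pd.DataFrame({"product_key": [1, 2],
                            "product_name": ["TV", "Radio"],
                            "category": ["Electronics", "Electronics"]})
dim_time = pd.DataFrame({"time_key": [10, 11],
                         "quarter": ["Q1", "Q2"]})

# Each fact row stores the dimension keys plus the numeric measure (sales amount).
fact_sales = pd.DataFrame({"product_key": [1, 1, 2],
                           "time_key": [10, 11, 11],
                           "sales_amount": [500, 700, 300]})

# Joining the fact table with its dimensions gives the multidimensional view.
cube = (fact_sales.merge(dim_product, on="product_key")
                  .merge(dim_time, on="time_key"))
print(cube.groupby(["quarter", "product_name"])["sales_amount"].sum())
```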
SNOWFLAKE SCHEMA:
Star schema consists of a single fact table and a single denormalized dimension table for each
dimension of the multidimensional data model. To support attribute hierarchies, the dimension
tables can be normalized to create snowflake schemas. A snowflake schema consists of a single fact
table and multiple dimension tables. Like the star schema, each tuple of the fact table consists of a
(foreign) key pointing to each of the dimension tables that provide its multidimensional coordinates.
It also stores numerical values (nondimensional attributes, and results of statistical functions) for
those coordinates. Dimension Tables in a star schema are denormalized, while those in a snowflake
schema are normalized.
The advantage of the snowflake schema is as follows:
A Normalized table is easier to maintain.
Normalizing also saves storage space, since an un-normalized Dimension Table
tends to be large and may contain redundant information.

FACT CONSTELLATION:
A Fact Constellation is a kind of schema where we have more than one Fact Table sharing among
them some Dimension Tables. It is also called Galaxy Schema.
For example, let us assume that Deccan Electronics would like to have another Fact Table for supply
and delivery.

13. Explain virtual data warehouse.


VIRTUAL DATAWAREHOUSE:
This model creates a virtual view of databases, allowing the creation of a "virtual warehouse" as
opposed to a physical warehouse. In a virtual warehouse, we have a logical description of all the
databases and their structures, and individuals who want to get information from those databases
do not have to know anything about them. This approach creates a single "virtual database" from all
the data resources. The data resources can be local or remote. In this type of data warehouse, the
data is not moved from the sources. Instead, the users are given direct access to the data. The direct
access to the data is sometimes through simple SQL queries or view definitions. The virtual data
warehouse scheme lets a client application access data distributed across multiple data sources
through a single SQL statement, a single interface. All data sources are accessed as though they are
local; users and their applications do not even need to know the physical location of the data.
A virtual database is easy and fast, but it is not without problems. Since the queries must
compete with the production data transactions, its performance can be considerably degraded.
Since there is no metadata, no summary data or history, all the queries must be repeated, creating
an additional burden on the system. Above all, there is no clearing or refreshing process involved,
causing the queries to become very complex.

14. Describe with example any two warehouse schema.


repeated
15. Explain any two OLAP operations.
The basic OLAP operations for a multidimensional model are:
Slicing: This operation is used for reducing the data cube by one or more dimensions. The slice
operation performs a selection on one dimension of the given cube, resulting in a subcube.
Figure 1 shows a slice operation where the sales data are selected from the central cube for the
dimension time, using the criterion time = 'Q2'.
Slice(time = 'Q2') : C[quarter, city, product] → C[city, product]
Figure 1: Slice operation
Dicing: This operation is also used for reducing the data cube by one or more dimensions. This
operation is for selecting a smaller data cube and analyzing it from different perspectives. The dice
operation defines a subcube by performing a selection on two or more dimensions.
Figure 2 shows a dice operation on the central cube.
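The two operations can be sketched with a small pandas table standing in for the cube; the dimension values and amounts are invented for illustration.

```python
# A tiny sales "cube" with dimensions quarter, city and product.
import pandas as pd

sales = pd.DataFrame({
    "quarter": ["Q1", "Q2", "Q2", "Q2"],
    "city":    ["Pune", "Pune", "Delhi", "Delhi"],
    "product": ["TV", "TV", "TV", "Radio"],
    "amount":  [100, 150, 120, 80],
})

# Slice: fix one dimension (time = 'Q2'), leaving a subcube over city and product.
slice_q2 = sales[sales["quarter"] == "Q2"]

# Dice: select on two or more dimensions (time = 'Q2' and city = 'Delhi').
dice = sales[(sales["quarter"] == "Q2") & (sales["city"] == "Delhi")]

print(slice_q2.pivot_table(index="city", columns="product",
                           values="amount", aggfunc="sum"))
print(dice)
```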

16. Explain dimension tables.


A dimension table is a table in a star schema of a data warehouse. A dimension table stores
attributes, or dimensions, that describe the objects in a fact table. In data warehousing, a dimension
is a collection of reference information about a measurable event. These events are known as facts
and are stored in a fact table. Dimensions categorize and describe data warehouse facts and
measures in ways that support meaningful answers to business questions. They form the very core
of dimensional modeling.
A data warehouse organizes descriptive attributes as columns in dimension tables. For
example, a customer dimension’s attributes could include first and last name, birth date, gender,
etc., or a website dimension would include site name and URL attributes. A dimension table has a
primary key column that uniquely identifies each dimension record (row). The dimension table is
associated with a fact table using this key. Data in the fact table can be filtered and grouped (“sliced
and diced”) by various combinations of attributes.
Dimension tables are referenced by fact tables using keys. When creating a dimension table in a
data warehouse, a system-generated key is used to uniquely identify a row in the dimension. This
key is also known as a surrogate key. The surrogate key is used as the primary key in the dimension
table. The surrogate key is placed in the fact table and a foreign key is defined between the two
tables. When the data is joined, it does so just as any other join within the database.
Like fact tables, dimension tables are often highly de-normalized, because these structures are not
built to manage transactions; they are built to enable users to analyze data as easily as possible.
17. State the differences between KDD and data mining.
----------
18. Write a note on OLAP server
-----------

19. Explain discovery of association rules.


Association rules are if-then statements that help to show the probability of relationships
between data items within large data sets in various types of databases. Association rule
mining has a number of applications and is widely used to help discover sales correlations in
transactional data or in medical data sets.
Association rule mining, at a basic level, involves the use of machine learning models to
analyze data for patterns, or co-occurrence, in a database. It identifies frequent if-then
associations, which are called association rules. An association rule has two parts: an
antecedent (if) and a consequent (then). An antecedent is an item found within the data. A
consequent is an item found in combination with the antecedent.
Association rules are created by searching data for frequent if-then patterns and using
the criteria support and confidence to identify the important relationships. Support is an
indication of how frequently the items appear in the data. Confidence indicates the number of
times the if-then statements are found true. A third metric, called lift, can be used to compare
confidence with expected confidence. Most association rules are calculated from itemsets,
which are made up of two or more items. If rules are built from analyzing all the possible
itemsets, there could be so many rules that the rules hold little meaning. With that,
association rules are typically created from rules well-represented in data.
Methods to discover association rule
Association rule learning is a rule-based machine learning method for discovering interesting
relations between variables in large databases. It is intended to identify strong rules
discovered in databases using some measures of interestingness.
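As a minimal sketch of the three measures (the transaction list is hypothetical), support, confidence and lift for a rule A => B can be computed directly from counts:

```python
# Compute support, confidence and lift for the rule {milk} -> {bread}
# over a hypothetical list of transactions.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]

A, B = {"milk"}, {"bread"}
n = len(transactions)
count_A  = sum(A <= t for t in transactions)          # transactions containing A
count_B  = sum(B <= t for t in transactions)
count_AB = sum((A | B) <= t for t in transactions)    # containing both A and B

support    = count_AB / n
confidence = count_AB / count_A
lift       = confidence / (count_B / n)               # confidence vs. expected confidence

print(support, confidence, lift)   # 0.6, 0.75, 0.9375
```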

20. Explain lattice of cuboids.


-----------
UNIT -2

1. What is an association rule?


Association rules are if-then statements that help to show the probability of relationships
between data items within large data sets in various types of databases. Association
rule mining has a number of applications and is widely used to help discover sales
correlations in transactional data or in medical data sets.

2. Define support and confidence.


---------------
3. Write the features of efficient algorithm.
* Reduce the I/O operations.
* At the same time, be efficient in computing.
4. Define frequent set
FREQUENT SET: Let T be the transaction database and σ be the user-specified minimum
support. An itemset X ⊆ A is said to be a frequent itemset in T with respect to σ, if s(X)T ≥ σ.

5. Define downward closure property


-------------------
6. Define upward closure property
------------------
7. Define maximal frequent set .
A maximal frequent itemset is a frequent itemset for which none of its immediate
supersets is frequent. The itemsets in the lattice are thereby broken into two groups: those
that are frequent and those that are infrequent.

8. Define border set.


An itemset is a border set if it is not a frequent set, but all its proper subsets are frequent sets.
That is, a set X is a border set if all its proper subsets are frequent sets (sets with at least the minimum
support), but X itself is not a frequent set. Thus, the collection of border sets defines the borderline between
the frequent sets and the non-frequent sets in the lattice of attribute sets.

9. What is SVM?
-----------------
10. Write two advantages of decision trees
------------------
11. What are the two different methods of determining the goodness of a split?
----------------------
12. Define rough set.
Rough Set: A rough set is the pair (A_*(X), A^*(X)) of the lower and upper approximations of X. If the
boundary region is the empty set, then X is called a crisp set; else X is called a rough set.

13. Mention the two applications of neural network.


---------------
14. What is unsupervised learning?
--------------------
15. What is MLP?
A Multi-Layer Perceptron (MLP) defines the most complicated architecture of artificial neural networks. It
is substantially formed from multiple layers of perceptrons. MLP networks are usually used in a
supervised learning setting.

16. What is supervised learning?


It is a machine learning method in which models are trained using labelled data; it needs supervision
to train the model.
Eg: learning with a teacher

17. What is neural network


Neural networks are a different paradigm for computing, which draws its inspiration from
neuroscience. A neural network is a series of algorithms that endeavours to recognize underlying
relationships in a set of data through a process that mimics the way the human brain operates.

18. Define information system

19. What is CART?


Classification and Regression Trees (CART) is one of the popular methods of building a decision tree
by splitting the records at each node according to a function of a single attribute.

20. Write two disadvantages of an over fitted decision tree


----------------------------

5 MARKS
1. Explain Apriori algorithm with an example.
Apriori Algorithm
The Apriori algorithm is used for finding frequent itemsets in a dataset for Boolean association rules. The
algorithm is named Apriori because it uses prior knowledge of frequent itemset properties. We apply an
iterative approach, or level-wise search, where frequent k-itemsets are used to find (k+1)-itemsets. To improve
the efficiency of level-wise generation of frequent itemsets, an important property called the
Apriori property is used, which helps by reducing the search space.
Efficient Frequent Itemset Mining Methods:
Finding Frequent Itemsets Using Candidate Generation:
The Apriori Algorithm
Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent
itemsets for Boolean association rules. The name of the algorithm is based on
the fact that the algorithm uses prior knowledge of frequent itemset properties. Apriori employs an
iterative approach known as a level-wise search, where k-itemsets are used
to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found by scanning the database to
accumulate the count for each item, and collecting those items that satisfy minimum support. The
resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to
find L3, and so on, until no more frequent k-itemsets can be found. The finding
of each Lk requires one full scan of the database. A two-step process is followed in Apriori, consisting
of a join and a prune action.

Apriori Property
All non-empty subsets of a frequent itemset must be frequent. The key concept of the Apriori algorithm is
the anti-monotonicity of the support measure: Apriori assumes that all subsets of a frequent itemset
must also be frequent.
Consider the following dataset; we will find frequent itemsets and generate association rules for
them.
minimum support count is 2
minimum confidence is 60%
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset –
called C1 (candidate set).
(II) Compare the candidate set items’ support count with the minimum support count (here
min_support = 2; if the support_count of a candidate set item is less than min_support then remove those
items). This gives us itemset L1.

Step-2: K=2
Generate candidate set C2 using L1 (this is called the join step). The condition of joining
Lk-1 and Lk-1 is that they should have (K-2) elements in common.
Check whether all subsets of an itemset are frequent or not, and if not frequent remove
that itemset. (For example, the subsets of {I1, I2} are {I1}, {I2}; they are frequent. Check for each itemset.)
Now find the support count of these itemsets by searching in the dataset.
(II) Compare candidate (C2) support count with the minimum support count (here
min_support = 2; if the support_count of a candidate set item is less than min_support then remove those
items). This gives us itemset L2.
Step-3:
Generate candidate set C3 using L2 (join step). The condition of joining Lk-1 and
Lk-1 is that they should have (K-2) elements in common. So here, for L2, the first element should match.
So the itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5},
{I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}.
Check if all subsets of these itemsets are frequent or not and if not, then
remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3}, which are
frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly check for every itemset.)
Find the support count of the remaining itemsets by searching in the dataset.
(II) Compare candidate (C3) support count with the minimum support count (here
min_support = 2; if the support_count of a candidate set item is less than min_support then
remove those items). This gives us itemset L3.

Step-4:
Generate candidate set C4 using L3 (join step). The condition of joining Lk-1 and
Lk-1 (K = 4) is that they should have (K-2) elements in common. So here, for L3, the
first two elements (items) should match.
Check whether all subsets of these itemsets are frequent or not (here the itemset
formed by joining L3 is {I1, I2, I3, I5}, and its subsets contain {I1, I3, I5}, which is not frequent). So there
is no itemset in C4.
We stop here because no further frequent itemsets are found. Thus, we have discovered all the
frequent itemsets. Now the generation of strong association rules comes into the picture. For that we need
to calculate the confidence of each rule.
Confidence-
A confidence of 60% means that 60% of the customers who purchased milk and bread also bought
butter. Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
So here, by taking an example of any frequent itemset, we will show the rule generation.
Itemset {I1, I2, I3} //from L3
SO rules can be
[I1^I2]=>[I3] //confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100=50%
[I1^I3]=>[I2] //confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100=50%
[I2^I3]=>[I1] //confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100=50%
[I1]=>[I2^I3] //confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100=33%
[I2]=>[I1^I3] //confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100=28%
[I3]=>[I1^I2] //confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100=33%
So if the minimum confidence is 50%, then the first 3 rules can be considered as strong association rules.
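Assuming the third-party mlxtend library is installed (an assumption, not something the text requires), the level-wise mining and rule generation above can be reproduced in a few lines. The transaction list is a hypothetical reconstruction, chosen only so that its support counts match the ones quoted in the worked example.

```python
# Sketch of Apriori-based rule mining with mlxtend (assumed installed via pip).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical transactions over items I1..I5.
transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# A minimum support count of 2 over 9 transactions gives min_support = 2/9.
frequent = apriori(onehot, min_support=2/9, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```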

2. Explain frequent set .


FREQUENT SET: Let T be the transaction database and σ be the user-specified minimum
support. An itemset X ⊆ A is said to be a frequent itemset in T with respect to σ, if s(X)T ≥ σ.

3. Explain partition algorithm for association rule discovery


The partition algorithm is based on the observation that the frequent sets are normally very few in
number compared to the set of all itemsets. As a result, if we partition the set of transactions into
smaller segments such that each segment can be accommodated in main memory, then we can
compute the set of frequent sets of each of these partitions.
The algorithm executes in two phases. In the first phase, the partition algorithm logically divides the
database into a number of non-overlapping partitions. The partitions are considered one at a time
and all frequent itemsets for that partition are generated. Thus, if there are n partitions, Phase I of the
algorithm takes n iterations. At the end of Phase I, the frequent itemsets are merged to generate a set
of all potential frequent itemsets. In Phase II, the actual support for these itemsets is computed
and the frequent itemsets are identified.
The algorithm reads the entire database once during Phase I and once during Phase II. The partition
sizes are chosen such that each partition can be accommodated in main memory, so that the
partitions are read only once in each phase.
Partition algorithm:
P = partition_database(T); n = number of partitions
// Phase I
for i = 1 to n do begin
    read_in_partition(Ti ∈ P)
    Li = generate all frequent itemsets of Ti using the Apriori method in main memory
end
// Merge phase
for (k = 2; Lik ≠ ∅, i = 1, 2, ..., n; k++) do
    CkG = ∪i Lik    // candidate global k-itemsets
// Phase II
for i = 1 to n do begin
    read_in_partition(Ti ∈ P)
    for all candidates c ∈ CG compute s(c)Ti
end
LG = {c ∈ CG | s(c) ≥ σ}
Answer = LG
Example
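As an example, here is a minimal Python sketch of the two-phase idea under invented data (the transactions, partition split and minimum support are illustrative assumptions): local frequent itemsets are found per partition, merged into a global candidate set, and then recounted over the whole database.

```python
# Two-phase partition idea: local frequent itemsets per partition,
# then one global recount of the merged candidates.
from itertools import combinations

def local_frequent(partition, min_count):
    """Brute-force local frequent itemsets of a small in-memory partition."""
    items = sorted({i for t in partition for i in t})
    freq = set()
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):
            if sum(set(cand) <= t for t in partition) >= min_count:
                freq.add(cand)
    return freq

transactions = [{"A", "B"}, {"A", "C"}, {"A", "B", "C"},
                {"B", "C"}, {"A", "B"}, {"B"}]
min_support = 3                        # global minimum support count
partitions = [transactions[:3], transactions[3:]]

# Phase I: frequent sets of each partition (support scaled to partition size).
candidates = set()
for part in partitions:
    local_min = max(1, min_support * len(part) // len(transactions))
    candidates |= local_frequent(part, local_min)

# Phase II: count the merged candidates over the full database.
globally_frequent = {c for c in candidates
                     if sum(set(c) <= t for t in transactions) >= min_support}
print(globally_frequent)
```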

4 Write a note on Pincer Search algorithm


PINCER-SEARCH ALGORITHM: One can see that the Apriori algorithm operates in a bottom-up,
breadth-first search method. The computation starts from the smallest set of frequent
itemsets and moves upward till it reaches the largest frequent itemset. The number of
database passes is equal to the largest size of the frequent itemset. When any one of the
frequent itemsets becomes longer, the algorithm has to go through many iterations and, as
a result, the performance decreases. The pincer-search algorithm addresses this problem.
It attempts to find the frequent itemsets in a bottom-up manner but, at the same time, it
maintains a list of maximal frequent itemsets. While making a database pass, it also counts
the support of these candidate maximal frequent itemsets to see if any one of them is
actually frequent. In that event, it can conclude that all the subsets of these frequent sets
are going to be frequent and, hence, they are not verified for the support count in the next
pass. If we are lucky, we may discover a very large maximal frequent itemset very early in
the algorithm.

5 Write the advantages and disadvantages of decision trees


----------------
6 Explain the phases in the construction of the decision tree .
---------------------------
7 What is a neural network? Explain the neural network’s perceptron model
A perceptron is the most fundamental unit which is used to build a neural network. A
perceptron resembles a neuron in the human brain. In the case of a biological neuron,
multiple input signals are fed into a neuron via dendrite, and an output signal is fired
appropriately based on the strength of the input signal and some other mechanism. The
diagram below represents how input signals pass through the biological neuron.

Biological neuron
In the case of a perceptron, the input signals are combined with different weights and fed into the
artificial neuron or perceptron along with a bias element. Representationally, within the perceptron, the
net sum is calculated as the weighted sum of the input signals plus a bias element; then the net sum is
fed into a non-linear activation function. Based on the activation function, the output signal is sent
out. The diagram below represents a perceptron. Notice the bias element b and the weighted sum of the
input signals, represented using x and w. The threshold function represents the non-linear activation
function.

Perceptron representing a biological neuron


A perceptron can also be called a single-layer neural network, as it is part of the only layer in the
neural network where the computation occurs. The computation occurs on the summation of input
data that is fed into it. The way the Rosenblatt Perceptron differed from the
McCulloch-Pitts Neuron (1943) is that in the case of the Perceptron, the weights were learned based on
the input data pertaining to different categories.

The perceptrons laid out in the form of single-layer or multi-layer neural networks can
be used to perform both regression and classification tasks. The diagram below represents the
single-layer neural network (perceptron) that represents linear regression (left) and softmax
regression (each output o1, o2, and o3 represents logits). Recall that the Softmax regression is a
form of logistic regression that normalizes an input value into a vector of values that follows a
probability distribution whose total sums up to 1. The output values are in the range [0,1]. This
allows the neural network to predict as many classes or dimensions as required. This is why softmax
regression is sometimes referred to as a multinomial logistic regression.

Fig. Single-layer perceptron networks for linear regression and softmax regression

Here is a picture of a perceptron represented as the summation of inputs with weights
(w1, w2, w3, ..., wm) and a bias element, which is passed through the activation function so that the final
output is obtained. This can be very well used for both regression and binary classification problems.

Fig. Perceptron
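A minimal NumPy sketch of this forward pass and of the perceptron weight-update rule is given below; the training data (the AND function) and the learning rate are chosen only for illustration.

```python
# Perceptron: weighted sum of inputs plus bias, passed through a threshold
# activation; weights are learned from labelled examples (Rosenblatt rule).
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs
y = np.array([0, 0, 0, 1])                                   # AND labels

w = np.zeros(2)      # weights
b = 0.0              # bias element
lr = 0.1             # learning rate (illustrative)

def activate(net):
    return 1 if net >= 0 else 0   # threshold (step) activation function

for _ in range(20):               # a few passes over the training set
    for xi, target in zip(X, y):
        net = np.dot(w, xi) + b   # net sum = weights . inputs + bias
        error = target - activate(net)
        w += lr * error * xi      # adjust weights toward the correct class
        b += lr * error

print([activate(np.dot(w, xi) + b) for xi in X])   # expected: [0, 0, 0, 1]
```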

8. Write a note on decision tree


*What is a Decision Tree
A decision tree is a classification scheme which generates a tree and a set of rules representing
the model of different classes, from a given data set. The set of records available for developing
classification methods is divided into two disjoint subsets—a training set and a test set.
The training set is for deriving the classifier and the test set is used to measure the accuracy of the
classifier.
*Advantages of Decision Trees:
•Decision trees are able to generate understandable rules.
•Decision trees are able to handle both numerical and categorical attributes.
•Decision trees provide a clear indication of which fields are most important for prediction or
classification

*Disadvantages of Decision Trees:


•Some decision trees can only deal with binary valued target classes.
•They become error-prone when the number of training examples per class gets small.
• The process of growing a decision tree is expensive.

*Tree Construction Principle:


•Splitting Attribute: With every node of the decision tree, there is an associated attribute whose
values determine the partitioning of the data set when the node is expanded.
•Splitting Criterion: The qualifying criterion on the splitting attribute for data set splitting at a
node, is called the splitting criterion at that node. For a numeric attribute, the criterion can be
an equation or inequality.

*The construction of a decision tree involves the following three main phases.
•Construction phase: The initial decision tree is constructed in this phase, based on the
entire training set. It requires recursively partitioning the training set into two or more
sub-partitions using a splitting criterion until a stopping criterion is met.

• Pruning Phase: The tree constructed in the previous phase may not result in the best
possible set of rules due to over-fitting. The pruning phase removes some of the lower
branches and nodes to improve the performance.
•Processing the pruned tree: This is done to increase the understandability (a brief sketch of these phases follows below).

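As a hedged sketch of these phases, the example below uses scikit-learn, with cost-complexity pruning standing in for the pruning phase described above; the dataset and the pruning parameter are illustrative choices.

```python
# Construction phase: grow a tree on the training set; pruning phase:
# remove lower branches via cost-complexity pruning; then inspect the rules.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)

# The test set measures accuracy; over-fitting shows up as a gap between the two.
print("unpruned test accuracy:", full_tree.score(X_test, y_test))
print("pruned test accuracy:  ", pruned_tree.score(X_test, y_test))

# Processing the pruned tree: print it as understandable rules.
print(export_text(pruned_tree))
```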
9. Explain the typical artificial neurons with activation function.


----------------
10. What is unsupervised learning? Explain
-----------------
11. What is RBFN? Explain with a neat diagram
Radial Basis Function Networks (RBF): Radial Basis Function (RBF) networks are also feed-forward,
but have only one hidden layer. The primary difference between MLP and RBF is the hidden layer.
RBF hidden layer units have a receptive field which has a centre; that is, a particular input value at
which they have a maximum output. Generally, the hidden units have a Gaussian transfer function.
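A minimal sketch of a Gaussian hidden unit and an RBF forward pass is shown below; the centres, widths and output weights are assumed values, not trained.

```python
# RBF network forward pass: one hidden layer of Gaussian units, each with a
# centre (its receptive field) and a width; the output is a weighted sum.
import numpy as np

centres = np.array([[0.0, 0.0], [1.0, 1.0]])   # receptive-field centres (assumed)
widths = np.array([0.5, 0.5])                  # Gaussian widths (assumed)
weights = np.array([1.0, -1.0])                # output-layer weights (assumed)

def rbf_output(x):
    # Gaussian transfer function: maximum output when x is at the centre.
    dists = np.linalg.norm(centres - x, axis=1)
    hidden = np.exp(-(dists ** 2) / (2 * widths ** 2))
    return hidden @ weights

print(rbf_output(np.array([0.1, 0.0])))   # close to the first centre
print(rbf_output(np.array([0.9, 1.0])))   # close to the second centre
```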
12. Explain support vector machine
Support Vector Machines:
Support vector machines are learning machines that can perform binary classification and regression
estimation tasks. SVMs are also recognized as efficient tools for data mining and are popular
because of two important factors. First, unlike the other classification techniques, SVMs minimize
the expected error rather than minimizing the classification error. Second, SVMs employ the duality
theory of mathematical programming to get a dual problem that admits efficient computational
methods. SVMs represent a paradigm of classification techniques and include several classes of
techniques, which differ among themselves based on the kernel function and the linear or non-linear
separating surfaces between classes. The minimization of error or maximization of the
effectiveness of classification are the key ideas in building a decision tree or segmenting the data set
into clusters. Instead of minimizing the error, SVMs incorporate structured risk minimization into the
classification. By structured risk minimization, we mean minimizing an upper bound on the
generalization error. In supervised classification, the error is calculated based on the ratio (or
percentage) of misclassified test data items to the whole test data set. The generalization error is
less if the test data exhibit a similar classification as the training data. The generalization error is high
if these two data sets contradict each other in terms of representing class boundaries. SVMs create a
classifier with minimized expected error.
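A brief scikit-learn sketch of binary SVM classification follows; the synthetic data and kernel settings are illustrative assumptions.

```python
# Binary classification with a support vector machine; the kernel controls
# whether the separating surface is linear or non-linear.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)   # non-linear surface

# Error on held-out test data approximates the generalization error.
print("linear kernel accuracy:", linear_svm.score(X_test, y_test))
print("RBF kernel accuracy:   ", rbf_svm.score(X_test, y_test))
```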

13. Write a note on neural network perception model


------------------------
14. Describe the application areas of neural networking
--------------------
15. Describe the learning technique in Multi layer perceptron
------------------------------------

16. Explain RBF networks.


Radial Basis Function Networks (RBF): Radial Basis Function (RBF) networks are also feed-forward, but
have only one hidden layer. The primary difference between MLP and RBF is the hidden layer. RBF
hidden layer units have a receptive field which has a centre; that is, a particular input value at which
they have a maximum output. Generally, the hidden units have a Gaussian transfer function.

17. Write a note on best split

Best Split: To build an optimal decision tree it is necessary to select an attribute corresponding to the
best possible split. The main operations during tree building are:
1. Evaluation of splits for each attribute and selection of the best split; determination of the splitting attribute.
2. Determination of the splitting condition on the selected splitting attribute.
3. Partitioning the data using the best split.
The complexity lies in determining the best split for each attribute. The splitting also depends
on the domain of the attribute being numerical or categorical. The generic algorithm assumes that
the splitting attribute and splitting criteria are known. The desirable feature of splitting is that it
should do the best job of splitting at a given stage. The first task is to decide which of the
independent attributes is the best splitter. If the attribute takes multiple values, we sort it and then
use some evaluation function to measure its goodness. We compare the effectiveness of the split
provided by the best splitter from each attribute. The winner is chosen as the splitter for the root
node.
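As an illustration of measuring the goodness of a split, the sketch below compares candidate split points of one numeric attribute by weighted Gini impurity; the attribute values, labels and the choice of the Gini criterion are assumptions for the example (the text leaves the evaluation function open).

```python
# Evaluate candidate split points of a numeric attribute using weighted Gini
# impurity; the split with the lowest impurity is taken as the "best split".
from collections import Counter

def gini(labels):
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

values = [23, 31, 36, 42, 51, 60]                 # numeric attribute (e.g. age)
labels = ["no", "no", "yes", "yes", "yes", "no"]  # target class

best = None
for threshold in sorted(set(values))[:-1]:
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    if best is None or score < best[1]:
        best = (threshold, score)

print("best split: attribute <=", best[0], "with weighted Gini", round(best[1], 3))
```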

18. Define rough sets, Information system and indescrenibility Relation


19. Discuss the principle of rough set theory with a suitable example.
It is a tool of sets and relations for studying imprecision, vagueness and uncertainty in data
analysis, applied to derive rules, to provide reasoning and to discover relationships in qualitative,
incomplete or imprecise data.
The rough set is the pair (A_*(X), A^*(X)) of the lower and upper approximations of X. If the boundary
region is the empty set, then X is called a crisp set, else X is called a rough set.
So, X is a rough set. The measure of roughness (accuracy of approximation) is α = |A_*(X)| / |A^*(X)|.
Example :-
The rough set is the approximation of a vague concept by a pair of precise concepts, called the lower and
upper approximations. The foundation of rough set theory is based on the assumption that with every
object in the universe there is some associated information (a crisp concept).
A set is said to be rough if its boundary region is non-empty; otherwise the set is crisp.
Diagram :-
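A small Python sketch of the lower and upper approximations follows; the objects, attribute values and target set X form an invented information system used only for illustration.

```python
# Lower/upper approximation of a target set X from the equivalence classes
# induced by an attribute (the indiscernibility relation of rough set theory).
attribute = {"o1": "high", "o2": "high", "o3": "low", "o4": "low", "o5": "medium"}
X = {"o1", "o2", "o3"}                     # the vague concept to approximate

# Equivalence classes: objects indiscernible with respect to the attribute.
classes = {}
for obj, val in attribute.items():
    classes.setdefault(val, set()).add(obj)

lower = set().union(*(c for c in classes.values() if c <= X))   # surely in X
upper = set().union(*(c for c in classes.values() if c & X))    # possibly in X
boundary = upper - lower

accuracy = len(lower) / len(upper)          # |A_*(X)| / |A^*(X)|
print(lower, upper, boundary, accuracy)     # non-empty boundary -> X is rough
```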

20. Describe the competitive learning technique employed in Kohonen’s SOM


-------------
