
DATA MINING ASSIGNMENT

1. With a neat sketch explain the architecture of a data warehouse.


Ans.

1. The bottom tier is a warehouse database server that is almost always a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier from operational
databases or other external sources (such as customer profile information provided by external
consultants). These tools and utilities perform data extraction, cleaning, and transformation. The
data are extracted using application program interfaces known as gateways. A gateway is
supported by the underlying DBMS and allows client programs to generate SQL code to be
executed at a server. Examples of gateways include ODBC (Open Database Connection) and
OLEDB (Open Linking and Embedding for Databases) by Microsoft and JDBC (Java Database
Connection). This tier also contains a metadata repository, which stores information about the
data warehouse and its contents.

2. The middle tier is an OLAP server that is typically implemented using either
(1) a relational OLAP (ROLAP) model, that is, an extended relational DBMS that maps
operations on multidimensional data to standard relational operations; or
(2) a multidimensional OLAP (MOLAP) model, that is, a special-purpose server that directly
implements multidimensional data and operations.
3. The top tier is a front-end client layer, which contains query and reporting tools, analysis tools,
and/or data mining tools (e.g., trend analysis, prediction, and so on).

2.Discuss the typical OLAP operations with an example.
Ans. Roll-up
Generalises one or a few dimensions and performs appropriate aggregations on the corresponding
measures.
For non-spatial measures, aggregation is implemented in the same way as in non-spatial data cubes.
For spatial measures, the aggregate takes a collection of spatial pointers.
Used for map overlay.
Performs spatial aggregation operations such as region merge.

Drill-down
Specialises one or a few dimensions and presents low-level data.
Can be viewed as the reverse operation of roll-up.
Can be implemented by saving a low-level cube and performing a generalisation on it when
necessary.

Slicing and dicing
Selects a portion of the cube based on the constant(s) in one or a few dimensions.
Can be done with regular queries.

Pivoting
Presents the measures in different cross-tabular layouts.
Can be implemented in a similar way as in non-spatial cubes.
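These operations can also be illustrated on a small non-spatial cube. The sketch below is a minimal illustration assuming pandas; the region/city/quarter dimensions and the sales measure are made-up examples, not part of the question.

import pandas as pd

sales = pd.DataFrame({
    "region":  ["Asia", "Asia", "Europe", "Europe"],
    "city":    ["Delhi", "Tokyo", "Berlin", "Paris"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "sales":   [100, 150, 120, 90],
})

# Roll-up: generalise city to region and aggregate the measure.
rollup = sales.groupby(["region", "quarter"])["sales"].sum()

# Drill-down: go back to the more detailed (city, quarter) level.
drilldown = sales.groupby(["city", "quarter"])["sales"].sum()

# Slice: fix one dimension to a constant (quarter = Q1).
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select a sub-cube on two or more dimensions.
dice = sales[(sales["region"] == "Asia") & (sales["quarter"] == "Q1")]

# Pivot: present the same measure in a different cross-tabular layout.
pivot = sales.pivot_table(index="region", columns="quarter",
                          values="sales", aggfunc="sum")
print(pivot)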

3.Discuss how computations can be performed efficiently on data cubes.
Ans. General Strategies for Cube Computation
1: Sorting, hashing, and grouping.
2: Simultaneous aggregation and caching intermediate results.
3: Aggregation from the smallest child, when there exist multiple child cuboids.
4: The Apriori pruning method can be explored to compute iceberg cubes efficiently
Cube Computation Algorithms
Bottom-up:
first compute the base cuboid, then work up the lattice to the apex cuboid
e.g., the multiway array aggregation algorithm
Top-down:
start with the apex cuboid and work down to the base cuboid
e.g., the BUC algorithm
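A minimal sketch, assuming pandas and a made-up sales table: every cuboid in the lattice is just a group-by over a subset of the dimensions, and an iceberg cube would additionally prune groups below a minimum support, in the spirit of the Apriori pruning mentioned above.

from itertools import combinations
import pandas as pd

df = pd.DataFrame({
    "item":  ["milk", "milk", "beer", "beer"],
    "city":  ["Berlin", "Paris", "Berlin", "Berlin"],
    "year":  [2023, 2023, 2023, 2024],
    "sales": [10, 5, 7, 3],
})
dims = ["item", "city", "year"]

cuboids = {}
for k in range(len(dims) + 1):
    for group in combinations(dims, k):
        if group:
            # Base cuboid (all dimensions) and intermediate cuboids.
            cuboids[group] = df.groupby(list(group))["sales"].sum()
        else:
            # Apex cuboid: the grand total over all dimensions.
            cuboids[group] = df["sales"].sum()

print(len(cuboids))   # 2^3 = 8 cuboids in the lattice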

4.Write short notes on data warehouse meta data.
Ans. Metadata in a data warehouse contains the answers to questions about the data in the data
warehouse.
Here is a sample list of definitions:
Data about the data
Table of contents for the data
Catalog for the data
Data warehouse atlas
Data warehouse roadmap
Data warehouse directory
Glue that holds the data warehouse contents together
Tongs to handle the data
The nerve center
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database
management system. In the data dictionary, you keep information about the logical data
structures, the files and addresses, the indexes, and so on. The data dictionary contains data about
the data in the database. Similarly, the metadata component is the data about the data in the data
warehouse. This is a commonly used definition. Metadata in a data warehouse is similar to a data
dictionary, but it is much more than a data dictionary.
Types of Metadata
Metadata in a data warehouse fall into three major categories:
1. Operational Metadata
2. Extraction and Transformation Metadata
3. End-User Metadata

5.Explain various methods of data cleaning in detail
Ans. Parsing: Parsing in data cleansing is performed for the detection of syntax errors. A
parser decides whether a string of data is acceptable within the allowed data specification.
This is similar to the way a parser works with grammars and languages.
Data transformation: Data transformation allows the mapping of the data from its given
format into the format expected by the appropriate application. This includes value
conversions or translation functions, as well as normalizing numeric values to conform to
minimum and maximum values.
Duplicate elimination: Duplicate detection requires an algorithm for determining
whether data contains duplicate representations of the same entity. Usually, data is sorted
by a key that would bring duplicate entries closer together for faster identification.
Statistical methods: By analyzing the data using the values of mean, standard deviation,
range, or clustering algorithms, it is possible for an expert to find values that are
unexpected and thus erroneous. Although the correction of such data is difficult since the
true value is not known, it can be resolved by setting the values to an average or other
statistical value. Statistical methods can also be used to handle missing values which can
be replaced by one or more plausible values, which are usually obtained by extensive
data augmentation algorithms.
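A minimal sketch of two of these methods, assuming pandas and a made-up customer table: duplicate elimination after sorting on a key, and a simple statistical check for suspect values.

import pandas as pd

customers = pd.DataFrame({
    "cust_id": [1, 2, 2, 3, 4, 5, 6, 7],
    "name":    ["Ana", "Bo", "Bo", "Cy", "Dee", "Ed", "Flo", "Gus"],
    "age":     [34, 29, 29, 31, 36, 40, 28, 210],   # 210 looks like an entry error
})

# Duplicate elimination: sort by a key so duplicates are adjacent, then drop them.
deduped = customers.sort_values("cust_id").drop_duplicates(subset=["cust_id", "name"])

# Statistical method: flag values more than two standard deviations from the mean.
mean, std = deduped["age"].mean(), deduped["age"].std()
suspect = (deduped["age"] - mean).abs() > 2 * std

# One simple resolution: replace suspect values with a plausible statistical value.
deduped["age"] = deduped["age"].mask(suspect, round(mean))
print(deduped)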
6.Give an account on data mining Query language.
Ans. DMQL has been designed at Simon Fraser University, Canada. It has been designed to
support various rule mining extractions (e.g., classification rules, comparison rules, association
rules). In this language, an association rule is a relation between the values of two sets of
predicates that are evaluated on the relations of a database. These predicates are of the form
P(X, c), where P is a predicate taking the name of an attribute of a relation, X is a variable and c
is a value in the domain of the attribute. A typical example of an association rule that can be
extracted by DMQL is buy(X, milk) ∧ town(X, Berlin) ⇒ buy(X, beer). An important possibility
in DMQL is the definition of meta-patterns, i.e., a powerful way to restrict the syntactic aspect of
the extracted rules (expressive syntactic constraints).
For instance, the meta-pattern buy+(X, Y) ∧ town(X, Berlin) ⇒ buy(X, Z) restricts the search to
association rules concerning implications between bought products for customers living in Berlin.
The symbol + denotes that the predicate buy can appear several times in the left part of the rule.
Moreover, besides the classical frequency and confidence, DMQL also makes it possible to define
thresholds on the noise or novelty of the extracted rules. Finally, DMQL makes it possible to
define a hierarchy on attributes so that generalized association rules can be extracted. The
general syntax of DMQL for the extraction of association rules is the following:
Use database (database-name)
{Use hierarchy (hierarchy-name) for (attribute)}
Mine associations [as (pattern-name)]
[Matching (metapattern)]
From (relation(s)) [Where (condition)]
[Order by (order-list)]
[Group by (grouping-list)] [Having (condition)]
With (interest-measure) threshold = value

7.How is Attribute-Oriented Induction implemented? Explain in detail.
Ans. The attribute-oriented induction method has been implemented in a data mining system
prototype called DBMINER (previously called DBLearn) and has been tested successfully
against large relational databases and data warehouses for multidimensional purposes.
In this implementation, attribute-oriented induction is realized as an architecture in which
characteristic rules and classification rules can be learned directly from a transactional database
(OLTP) or a data warehouse (OLAP), with the help of concept hierarchies for knowledge
generalization. A concept hierarchy can be created directly from the OLTP database.
To keep the implementation simple, only non-rule-based concept hierarchies are used, and only
characteristic rules and classification/discriminant rules are learned.
1) A characteristic rule is an assertion that characterizes the concepts satisfied by all of the data
stored in the database. It provides generalized concepts about a property, which can help people
recognize the common features of the data in a class, for example the symptoms of a specific
disease.
2) A classification/discriminant rule is an assertion that discriminates the concepts of one class
from those of other classes. It gives a discriminant criterion that can be used to predict the class
membership of new data. For example, to distinguish one disease from others, a classification
rule should summarize the symptoms that discriminate this disease from the others.
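A minimal sketch of the generalization step behind characteristic rules, assuming pandas and a hypothetical city-to-country concept hierarchy: low-level values are replaced by higher-level concepts, identical generalized tuples are merged, and a count is accumulated.

import pandas as pd

students = pd.DataFrame({
    "name": ["Ana", "Bo", "Cy", "Dee"],
    "city": ["Vancouver", "Toronto", "Berlin", "Munich"],
    "gpa":  [3.9, 3.4, 3.6, 3.8],
})

# Non-rule-based concept hierarchy: city -> country (made-up values).
hierarchy = {"Vancouver": "Canada", "Toronto": "Canada",
             "Berlin": "Germany", "Munich": "Germany"}

generalized = students.drop(columns=["name"]).copy()       # attribute removal (distinct key)
generalized["city"] = generalized["city"].map(hierarchy)   # climb the concept hierarchy
generalized["gpa"] = pd.cut(generalized["gpa"], bins=[0, 3.5, 4.0],
                            labels=["good", "excellent"])  # generalize the numeric attribute

# Merge identical generalized tuples and accumulate a count; the resulting
# prime relation is the basis of a characteristic rule for this class.
prime = (generalized.groupby(["city", "gpa"], observed=True)
                    .size().reset_index(name="count"))
print(prime)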

8.Write and explain the algorithm for mining frequent item sets without candidate
generation.
Ans. FP-tree construction
Input: A transaction database DB and a minimum support threshold ξ.
Output: FP-tree, the frequent-pattern tree of DB.
Method: The FP-tree is constructed as follows.
Scan the transaction database DB once. Collect F, the set of frequent items, and the support of
each frequent item. Sort F in support-descending order as FList, the list of frequent items.
Create the root of an FP-tree, T, and label it as "null". For each transaction Trans in DB do the
following: select the frequent items in Trans and sort them according to the order of FList. Let
the sorted frequent-item list in Trans be [p | P], where p is the first element and P is the
remaining list. Call insert_tree([p | P], T).
The function insert_tree([p | P], T) is performed as follows. If T has a child N such that
N.item-name = p.item-name, then increment N's count by 1; else create a new node N with its
count initialized to 1, its parent link linked to T, and its node-link linked to the nodes with the
same item-name via the node-link structure. If P is nonempty, call insert_tree(P, N) recursively.
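A minimal Python sketch of the construction described above (node-links and the subsequent FP-growth mining step are omitted for brevity; the transactions in the example are made up):

from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 1
        self.children = {}            # item-name -> FPNode

def insert_tree(items, node):
    # Insert the sorted frequent-item list `items` as a path under `node`.
    if not items:
        return
    p, rest = items[0], items[1:]
    child = node.children.get(p)
    if child is not None:
        child.count += 1              # shared prefix: just increment the count
    else:
        child = FPNode(p, node)       # new branch for this prefix
        node.children[p] = child
    insert_tree(rest, child)

def build_fp_tree(db, min_support):
    # First scan: collect frequent items and sort them in support-descending order.
    support = Counter(item for trans in db for item in set(trans))
    flist = [item for item, cnt in support.most_common() if cnt >= min_support]
    order = {item: rank for rank, item in enumerate(flist)}
    root = FPNode(None, None)         # the root is labelled "null"
    # Second scan: insert each transaction's sorted frequent items into the tree.
    for trans in db:
        frequent = sorted((i for i in set(trans) if i in order), key=order.get)
        insert_tree(frequent, root)
    return root, flist

root, flist = build_fp_tree([{"a", "b"}, {"b", "c", "d"}, {"a", "b", "d"}], 2)
print(flist)                          # e.g. ['b', 'a', 'd'] for min_support = 2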


9.Discuss the approaches for mining multi level association rules from the transactional
databases. Give relevant example.
Ans. In general, a top-down strategy is employed, where counts are accumulated for the
calculation of frequent itemsets at each concept level, starting at concept level 1 and working
downwards towards the lower, more specific concept levels, until no more frequent itemsets can
be found. That is, once all frequent itemsets at concept level 1 are found, the frequent itemsets at
level 2 are found, and so on. For each level, any algorithm for discovering frequent itemsets may
be used, such as Apriori or its variations.
Using uniform minimum support for all levels (referred to as uniform support): The same
minimum support threshold is used when mining at each level of abstraction. When a uniform
minimum support threshold is used, the search procedure is simplified. The method is also simple
in that users are required to specify only one minimum support threshold. An optimization
technique can be adopted, based on the knowledge that an ancestor is a superset of its
descendants: the search avoids examining itemsets containing any item whose ancestors do not
have minimum support.
The uniform support approach, however, has some difficulties. It is unlikely that items at lower
levels of abstraction will occur as frequently as those at higher levels of abstraction. If the
minimum support threshold is set too high, it could miss several meaningful associations
occurring at low abstraction levels. If the threshold is set too low, it may generate many
uninteresting associations occurring at high abstraction levels. This provides the motivation for
the following approach.
Using reduced minimum support at lower levels (referred to as reduced support): Each level of
abstraction has its own minimum support threshold. The lower the abstraction level, the smaller
the corresponding threshold.
For mining multiple-level associations with reduced support, there are a number of alternative
search strategies:
Level-by-level independent: This is a full-breadth search, where no background knowledge of
frequent itemsets is used for pruning. Each node is examined, regardless of whether or not its
parent node is found to be frequent.
Level-cross-filtering by single item: An item at the ith level is examined if and only if its
parent node at the (i-1)th level is frequent. In other words, we investigate a more specific
association from a more general one. If a node is frequent, its children will be
examined; otherwise, its descendants are pruned from the search.
Level-cross-filtering by k-itemset: A k-itemset at the ith level is examined if and only if its
corresponding parent k-itemset at the (i-1)th level is frequent.
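A minimal sketch of level-wise mining with reduced minimum support and level-cross-filtering by single item, using made-up transactions and a hypothetical item-to-category hierarchy:

from collections import Counter

transactions = [
    {"2% milk", "wheat bread"},
    {"2% milk", "white bread"},
    {"skim milk", "wheat bread"},
    {"2% milk"},
]
parent = {"2% milk": "milk", "skim milk": "milk",
          "wheat bread": "bread", "white bread": "bread"}

min_support = {1: 3, 2: 2}        # reduced support: level 2 uses a smaller threshold

# Level 1: count each generalized item (category) at most once per transaction.
level1 = Counter(cat for t in transactions for cat in {parent[i] for i in t})
frequent1 = {c for c, n in level1.items() if n >= min_support[1]}

# Level 2: examine an item only if its parent category is frequent (cross-filtering).
level2 = Counter(i for t in transactions for i in t if parent[i] in frequent1)
frequent2 = {i for i, n in level2.items() if n >= min_support[2]}

print(frequent1)                  # {'milk', 'bread'}
print(frequent2)                  # {'2% milk', 'wheat bread'}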

10.Explain the algorithm for constructing a decision tree from training samples.
Ans. Top-Down Decision Tree Induction Schema:
BuildTree(Node n, data partition D, algorithm CL)
1. Apply CL to D to find crit(n), the splitting criterion at n
2. Let k be the number of children of n
3. If (k > 0)
4. Create k children c1, ..., ck of n
5. Use the best split to partition D into D1, ..., Dk
6. For (i = 1; i <= k; i++)
7. BuildTree(ci, Di)
8. endFor
9. endIf
RainForest Refinement:
1. For each predictor attribute p
2. Call CL.find_best_partition(AVC-set of p)
3. endFor
4. k = CL.decide_splitting_criterion();
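A minimal sketch of inducing a decision tree from training samples, assuming scikit-learn and a made-up toy training set; the library call plays the role of the BuildTree/CL schema above, choosing a splitting criterion at each node.

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training samples: [age, income] -> buys_computer (0/1); values are made up.
X = [[25, 30], [35, 60], [45, 80], [22, 20], [50, 90], [30, 40]]
y = [0, 1, 1, 0, 1, 0]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)  # information-gain style split
tree.fit(X, y)

print(export_text(tree, feature_names=["age", "income"]))  # the induced tree as text
print(tree.predict([[40, 70]]))                            # class prediction for a new sample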


11.Explain Bayes' theorem.
Ans. Bayes' theorem gives the relationship between the probabilities of A and B, P(A) and P(B),
and the conditional probabilities of A given B and B given A, P(A|B) and P(B|A). In its most
common form, it is:

P(A|B) = P(B|A) P(A) / P(B), provided that P(B) ≠ 0.
The meaning of this statement depends on the interpretation of probability ascribed to the terms.
When applied, the probabilities involved in Bayes' theorem may have any of a number of
probability interpretations. In one of these interpretations, the theorem is used directly as part of
a particular approach to statistical inference. In particular, with the Bayesian interpretation of
probability, the theorem expresses how a subjective degree of belief should rationally change to
account for evidence: this is Bayesian inference, which is fundamental to Bayesian statistics.
However, Bayes' theorem has applications in a wide range of calculations involving
probabilities, not just in Bayesian inference.
Bayes' theorem is to the theory of probability what Pythagoras's theorem is to geometry.
In the Bayesian (or epistemological) interpretation, probability measures a degree of belief.
Bayes' theorem then links the degree of belief in a proposition before and after accounting for
evidence. For example, suppose somebody proposes that a biased coin is twice as likely to land
heads than tails. Degree of belief in this might initially be 50%. The coin is then flipped a
number of times to collect evidence. Belief may rise to 70% if the evidence supports the
proposition.
For proposition A and evidence B,
P(A), the prior, is the initial degree of belief in A.
P(A|B), the posterior, is the degree of belief having accounted for B.
the quotient P(B|A)/P(B) represents the support B provides for A.
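A minimal numeric sketch of this updating, with made-up numbers for a disease-test scenario (A = the patient has the disease, B = a positive test result):

p_a = 0.01                      # prior P(A): 1% of patients have the disease
p_b_given_a = 0.95              # P(B|A): test sensitivity
p_b_given_not_a = 0.05          # P(B|not A): false-positive rate

# Total probability of a positive test, P(B).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior P(A|B) = P(B|A) * P(A) / P(B).
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))    # about 0.161: belief rises from 1% to roughly 16%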

12.Explain the following clustering methods in detail:
(i) BIRCH,
Ans.BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised
data mining algorithm used to perform hierarchical clustering over particularly large data-sets.
An advantage of Birch is its ability to incrementally and dynamically cluster incoming,
multi-dimensional metric data points in an attempt to produce the best quality clustering for a given set
of resources (memory and time constraints). In most cases, Birch only requires a single scan of
the database. In addition, Birch is recognized as the "first clustering algorithm proposed in the
database area to handle 'noise' (data points that are not part of the underlying pattern)
effectively".
Advantages with BIRCH
It is local in that each clustering decision is made without scanning all data points and currently
existing clusters. It exploits the observation that data space is not usually uniformly occupied and
not every data point is equally important. It makes full use of available memory to derive the
finest possible sub-clusters while minimizing I/O costs. It is also an incremental method that
does not require the whole data set in advance.
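A minimal sketch assuming scikit-learn's Birch implementation and a made-up two-blob data set; threshold and branching_factor control the CF-tree that BIRCH builds incrementally in memory.

import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),    # one dense blob around (0, 0)
               rng.normal(3, 0.3, (50, 2))])   # another blob around (3, 3)

birch = Birch(threshold=0.5, branching_factor=50, n_clusters=2)
labels = birch.fit_predict(X)                  # a single pass over the data
print(np.bincount(labels))                     # roughly 50 points per cluster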
(ii) CURE
Ans. CURE (Clustering Using REpresentatives) is an efficient data clustering algorithm for
large databases that is more robust to outliers and identifies clusters having non-spherical shapes
and wide variances in size. To avoid the problems with non-uniform sized or shaped clusters,
CURE employs a novel hierarchical clustering algorithm that adopts a middle ground between
the centroid-based and all-points extremes. In CURE, a constant number c of well-scattered points
of a cluster are chosen and they are shrunk towards the centroid of the cluster by a fraction α
(the shrinking factor).
The scattered points after shrinking are used as representatives of the cluster. The clusters with
the closest pair of representatives are the clusters that are merged at each step of CURE's
hierarchical clustering algorithm. This enables CURE to correctly identify the clusters and makes
it less sensitive to outliers.
The running time of the algorithm is O(n² log n) and its space complexity is O(n).
The algorithm cannot be directly applied to large databases, so the following enhancements are
used:
Random sampling
Partitioning for speed up
Labeling data on disk
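A minimal sketch of CURE's representative-selection and shrinking step, in plain NumPy with a made-up cluster; the function name and the parameters c and alpha are illustrative, not from a library.

import numpy as np

def cure_representatives(points, c=4, alpha=0.3):
    centroid = points.mean(axis=0)
    # Greedily pick c well-scattered points (farthest-point heuristic).
    reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(reps) < min(c, len(points)):
        dists = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[np.argmax(dists)])
    reps = np.array(reps)
    # Shrink each representative towards the centroid by the fraction alpha.
    return reps + alpha * (centroid - reps)

cluster = np.random.default_rng(1).normal(size=(30, 2))
print(cure_representatives(cluster, c=4, alpha=0.3))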

13.What is a multimedia database? Explain the methods of mining multimedia database?
Ans. Multimedia database systems are used when it is necessary to administer huge amounts of
multimedia data objects of different media types (optical storage, video tapes, audio records,
etc.) so that they can be used (that is, efficiently accessed and searched) for as many applications
as needed.

The stored objects are recordings, signals, etc., that are digitized and stored. Basic services
include a multi-user system (such as a program interface).
The methods of mining multimedia databases include the following.
Classification models
Machine learning (ML) and meaningful information extraction can only be realized when some
objects have been identified and recognized by the machine. The object recognition problem can
be regarded as a supervised labeling problem. Starting with the supervised models, we mention
decision trees. An overview of existing work on decision trees is available in the literature.
Decision trees can be translated into a set of rules by creating a separate rule for each path from
the root to a leaf in the tree. However, rules can also be directly induced from training data using
a variety of rule-based algorithms.
Clustering Models
In unsupervised classification, the problem is to group a given collection of unlabeled
multimedia
files into meaningful clusters according to the multimedia content without a priori knowledge.
Clustering algorithms can be categorized into partitioning methods, hierarchical methods,
density-based methods, grid-based methods, and model-based methods.
Association rules
Most association rule studies have focused on corporate data, typically in alphanumeric
databases. There are three measures of the association: support, confidence and interest. The
support factor indicates the relative occurrence of both X and Y within the overall
data set of transactions. It is defined as the ratio of the number of instances satisfying both X and
Y over the total number of instances. The confidence factor is the probability of Y given X and is
defined as the ratio of the number of instances satisfying both X and Y over the number of
instances satisfying X. The support factor indicates the frequencies of the occurring patterns in
the rule, and the confidence factor denotes the strength of implication of the rule.
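As a minimal numeric sketch with made-up transactions, for a rule X ⇒ Y with X = {bread} and Y = {milk}; the interest factor is computed here as the lift-style ratio support(X,Y)/(support(X)·support(Y)), which is an assumption since the text does not give its formula.

transactions = [
    {"bread", "milk"}, {"bread", "milk", "butter"},
    {"bread"}, {"milk"}, {"bread", "milk"},
]
n = len(transactions)

n_x  = sum(1 for t in transactions if "bread" in t)
n_y  = sum(1 for t in transactions if "milk" in t)
n_xy = sum(1 for t in transactions if {"bread", "milk"} <= t)

support    = n_xy / n        # instances satisfying both X and Y over all instances
confidence = n_xy / n_x      # instances satisfying both X and Y over those satisfying X
interest   = support / ((n_x / n) * (n_y / n))   # assumed lift-style interest factor
print(support, confidence, interest)             # 0.6 0.75 0.9375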

14.(i) Discuss the social impacts of data mining.
Ans. Profiling information is collected every time
You use your credit card, debit card, supermarket loyalty card, or frequent flyer card, or
apply for any of the above
You surf the Web, reply to an Internet newsgroup, subscribe to a magazine, rent a video,
join a club, fill out a contest entry form,
You pay for prescription drugs, or present your medical care number when visiting the
doctor.
While the collection of personal data may be beneficial for companies and consumers, there is
also potential for misuse.

(ii) Discuss spatial data mining.
Ans. Spatial data mining is the process of discovering interesting, useful, non-trivial patterns
from large spatial datasets. Spatial data mining is the application of data mining methods to
spatial data. The end objective of spatial data mining is to find patterns in data with respect to
geography. So far, data mining and Geographic Information Systems (GIS) have existed as two
separate technologies, each with its own methods, traditions, and approaches to visualization and
data analysis. Particularly, most contemporary GIS have only very basic spatial analysis
functionality. The immense explosion in geographically referenced data occasioned by
developments in IT, digital mapping, remote sensing, and the global diffusion of GIS emphasizes
the importance of developing data-driven inductive approaches to geographical analysis and
modeling. Challenges in Spatial mining: Geospatial data repositories tend to be very large.
Moreover, existing GIS datasets are often splintered into feature and attribute components that
are conventionally archived in hybrid data management systems. Algorithmic requirements
differ substantially for relational (attribute) data management and for topological (feature) data
management.

15.Write the A priori algorithm for discovering frequent item sets for mining single-
dimensional Boolean Association Rule and discuss various approaches to improve its
efficiency.
Ans. The first pass of the algorithm simply counts item occurrences to determine the frequent
1-itemsets. A subsequent pass, say pass k, consists of two phases. First, the frequent itemsets
Lk-1 found in the (k-1)th pass are used to generate the candidate itemsets Ck, using the Apriori
candidate generation function. Next, the database is scanned and the support of the candidates in
Ck is counted. For fast counting, we need to efficiently determine the candidates in Ck that are
contained in a given transaction T. The key sub-tasks are candidate generation, finding the
candidates that are subsets of a given transaction, and buffer management. The overall algorithm
is as follows:
L1 := frequent 1-itemsets;
k := 2; // k represents the pass number
while ( Lk-1 != ∅ ) do begin
  Ck := new candidates of size k generated from Lk-1;
  forall transactions T ∈ D do begin
    Increment the count of all candidates in Ck that are contained in T.
  end
  Lk := all candidates in Ck with minimum support;
  k := k + 1;
end
Answer := ∪k Lk;
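A minimal Python sketch of these passes, with made-up transactions; candidate generation here uses the Apriori property (every (k-1)-subset of a candidate must be frequent):

from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    # Pass 1: count item occurrences to find the frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    L = {s for s, c in counts.items() if c >= min_support}
    answer, k = set(L), 2
    while L:
        # Generate size-k candidates; prune any with an infrequent (k-1)-subset.
        items = sorted({i for s in L for i in s})
        C = {frozenset(c) for c in combinations(items, k)
             if all(frozenset(sub) in L for sub in combinations(c, k - 1))}
        # Scan the database and count the support of the candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in C}
        L = {c for c, cnt in counts.items() if cnt >= min_support}
        answer |= L
        k += 1
    return answer

print(apriori([{"milk", "bread"}, {"milk", "beer"}, {"milk", "bread", "beer"}], 2))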

16.Explain the different categories of clustering methods?
Ans.1. Partitioning Methods
The partitioning methods generally result in a set of M clusters, each object belonging to one
cluster. Each cluster may be represented by a centroid or a cluster representative; this is some
sort of summary description of all the objects contained in a cluster. The precise form of this
description will depend on the type of the object which is being clustered.
2. Hierarchical Agglomerative methods
The construction of a hierarchical agglomerative classification can be achieved by the following
general algorithm.
1.Find the 2 closest objects and merge them into a cluster
2.Find and merge the next two closest points, where a point is either an individual object or
a cluster of objects.
3.If more than one cluster remains, return to step 2

3.The Single Link Method (SLINK)
The single link method is probably the best known of the hierarchical methods and operates by
joining, at each step, the two most similar objects, which are not yet in the same cluster. The
name single link thus refers to the joining of pairs of clusters by the single shortest link between
them.
4. The Complete Link Method (CLINK)
The complete link method is similar to the single link method except that it uses the least similar
pair between two clusters to determine the inter-cluster similarity (so that every cluster member
is more like the furthest member of its own cluster than the furthest item in any other cluster).
This method is characterized by small, tightly bound clusters.
5.The Group Average Method
The group average method relies on the average value of the pairwise similarities within a cluster,
rather than the maximum or minimum similarity as with the single link or the complete link
methods. Since all objects in a cluster contribute to the inter-cluster similarity, each object is, on
average, more like every other member of its own cluster than the objects in any other cluster.
(A short sketch contrasting the single, complete and group average linkage methods is given at
the end of this answer.)
6.Text Based Documents
For text-based documents, clusters may be formed by using as the similarity criterion key words
that occur at least a minimum number of times in a document. When a query arrives for a
particular word, instead of checking the entire database, only the cluster that has that word in its
list of key words is scanned and the result is returned. The order of the documents in the result
depends on the number of times the key word appears in each document.
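The single link, complete link and group average methods above can be contrasted with a minimal sketch, assuming SciPy and a made-up two-blob data set:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (20, 2)),
               rng.normal(2, 0.2, (20, 2))])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)           # sequence of agglomerative merges
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, np.bincount(labels)[1:])  # cluster sizes for a 2-cluster cut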


17.Explain the back propagation algorithm for neural network- based classification of
data.
Ans. The backpropagation algorithm performs learning on a multilayer feed-forward neural
network. The inputs correspond to the attributes measured for each training sample. The inputs
are fed simultaneously into the units making up the input layer. The weighted outputs of
these units are, in turn, fed simultaneously to a second layer of neuron-like units, known as a
hidden layer. The hidden layer's weighted outputs can be input to another hidden layer, and so
on. The number of hidden layers is arbitrary, although in practice usually only one is used. The
weighted outputs of the last hidden layer are input to the units making up the output layer, which
emits the network's prediction for the given samples.
The units in the hidden layers and the output layer are sometimes referred to as neurodes, due to
their symbolic biological basis, or as output units. Multilayer feed-forward networks of
linear threshold functions, given enough hidden units, can closely approximate any function.
Backpropagation learns by iteratively processing a set of training samples, comparing the
network's prediction for each sample with the actual known class label. For each training
sample, the weights are modified so as to minimize the mean squared error between the
network's prediction and the actual class.
These modifications are made in the backwards direction, that is, from the output layer
through each hidden layer down to the first hidden layer (hence the name backpropagation).
Although it is not guaranteed, in general the weights will eventually converge, and the learning
process stops. The algorithm is summarized below.
Initialize the weights: the weights in the network are initialized to small random
numbers (e.g., ranging from -1.0 to 1.0, or -0.5 to 0.5).
Each unit has a bias associated with it; the biases are similarly initialized to small random
numbers.
Each training sample, X, is processed by the following steps:
Propagate the inputs forward.
Backpropagate the error.
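A minimal sketch of these two steps (propagate the inputs forward, backpropagate the error), assuming NumPy, a single hidden layer, sigmoid units, squared-error loss and made-up training samples; sizes and the learning rate are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((4, 3))                  # 4 training samples, 3 input attributes
y = np.array([[0.], [1.], [1.], [0.]])  # known class labels

# Initialize weights and biases to small random numbers.
W1, b1 = rng.uniform(-0.5, 0.5, (3, 5)), rng.uniform(-0.5, 0.5, 5)   # input -> hidden
W2, b2 = rng.uniform(-0.5, 0.5, (5, 1)), rng.uniform(-0.5, 0.5, 1)   # hidden -> output

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5                                # learning rate

for epoch in range(1000):
    # Propagate the inputs forward.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backpropagate the error (gradient of the squared error through the sigmoids).
    err_out = (out - y) * out * (1 - out)
    err_hid = (err_out @ W2.T) * h * (1 - h)

    # Modify the weights in the backwards direction, output layer first.
    W2 -= lr * (h.T @ err_out); b2 -= lr * err_out.sum(axis=0)
    W1 -= lr * (X.T @ err_hid); b1 -= lr * err_hid.sum(axis=0)

print(np.round(out, 2))                 # the network's predictions after training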
