Task-relevant data:
    Database or data warehouse name
    Database tables or data warehouse cubes
    Condition for data selection
    Relevant attributes or dimensions
Knowledge type to be mined:
    Characterization
    Discrimination
    Association
    Classification/prediction
    Clustering
Background knowledge:
    Concept hierarchies
    User beliefs about relationships in the data
Pattern interestingness measures:
    Simplicity
    Certainty (e.g., confidence)
    Utility
    Novelty
A popularly used visual representation of a distribution is the box plot. In a box plot, the ends
of the box are at the quartiles, so that the box length is the interquartile range (IQR).
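As a sketch of how those quantities are computed (the sample data below is purely illustrative), using the standard library's quantile function:

```python
import statistics

# Purely illustrative sample of 12 observations (already sorted)
data = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 52]

# "inclusive" gives the common linear-interpolation definition of quartiles
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # the box length in a box plot

# Whiskers typically extend to the most extreme values within 1.5 * IQR;
# points beyond those fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
```

For this sample the median is 40.5, the IQR is q3 − q1, and the two smallest values fall below the lower fence and would be plotted as outliers.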
c. Explain concept description.
The simplest kind of descriptive data mining is concept description. Concept description is not a
simple enumeration of the data. Instead, it generates descriptions for the characterization and
comparison of data. It is sometimes called class description when the concept to be described
refers to a class of objects. Characterization provides a concise and succinct summarization of
the given collection of data, while concept or class comparison provides descriptions comparing
two or more collections of data.
d. Define attribute-oriented induction.
Proposed in 1989 (KDD ’89 workshop)
Not confined to categorical data nor to particular measures
How is it done?
i. Collect the task-relevant data (initial relation) using a relational database query
ii. Perform generalization by attribute removal or attribute generalization
iii. Apply aggregation by merging identical, generalized tuples and accumulating
their respective counts
iv. Interactive presentation with users
Basic Principles of Attribute-Oriented Induction
i. Data focusing: task-relevant data, including dimensions; the result is the initial relation
ii. Attribute removal: remove attribute A if there is a large set of distinct values for A but (1)
there is no generalization operator on A, or (2) A’s higher-level concepts are expressed in terms
of other attributes
iii. Attribute generalization: if there is a large set of distinct values for A, and there exists a set
of generalization operators on A, then select an operator and generalize A
iv. Attribute-threshold control: typically 2-8; specified by the user or taken as a default
v. Generalized relation threshold control: controls the final relation/rule size
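The steps above can be sketched as follows; the relation, the city-to-country concept hierarchy, and all attribute values are hypothetical examples:

```python
from collections import Counter

# Hypothetical concept hierarchy: city -> country
city_to_country = {"Delhi": "India", "Mumbai": "India", "Boston": "US"}

# Step i: task-relevant initial relation (name, city, major), e.g. from a query
initial_relation = [
    ("Ann", "Boston", "CS"),
    ("Raj", "Delhi", "CS"),
    ("Vik", "Mumbai", "CS"),
    ("Meena", "Mumbai", "EE"),
]

# Step ii: attribute removal (drop 'name': many distinct values and no
# generalization operator) and attribute generalization (city -> country)
generalized = [(city_to_country[city], major) for _, city, major in initial_relation]

# Step iii: merge identical generalized tuples, accumulating their counts
prime_relation = Counter(generalized)

# Step iv: present the generalized relation to the user
for (country, major), count in sorted(prime_relation.items()):
    print(country, major, count)
```

After generalization, the two CS students from Indian cities collapse into a single tuple with count 2, which is exactly the merge-and-count step.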
II. Answer any three from the following.
1. Explain statistical measures in large databases.
For many data mining tasks, users would like to learn more about data characteristics
regarding both central tendency and data dispersion. Measures of central tendency include the
mean, median, mode, and midrange, while measures of data dispersion include quartiles, outliers,
and variance. These descriptive statistics are of great help in understanding the distribution of the data.
Measuring the central tendency:
The most common and most effective numerical measure of the “center” of a set of data
is the (arithmetic) mean. Let x1, x2, ..., xn be a set of n values or observations. The mean of this
set of values is

    x̄ = (1/n) Σ_{i=1..n} x_i
Sometimes, each value xi in a set may be associated with a weight wi, for i = 1, ..., n.
The weights reflect the significance, importance, or occurrence frequency attached to their
respective values. In this case, we can compute the weighted arithmetic mean:

    x̄ = ( Σ_{i=1..n} w_i x_i ) / ( Σ_{i=1..n} w_i )
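A minimal sketch of the weighted mean, with the plain arithmetic mean as the special case where every weight is 1 (the numbers are illustrative):

```python
def weighted_mean(values, weights):
    """x-bar = (sum of w_i * x_i) / (sum of w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# The plain arithmetic mean is the special case where every weight is 1
print(weighted_mean([2, 4, 6], [1, 1, 1]))  # 4.0
print(weighted_mean([2, 4, 6], [3, 1, 0]))  # (3*2 + 1*4 + 0*6) / 4 = 2.5
```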
of this preprocessing step into class characterization or comparison is referred to as analytical
characterization or analytical comparison, respectively.
The first limitation of class characterization for multidimensional data analysis in data
warehouses and OLAP tools is the handling of complex objects. The second limitation is the lack
of an automated generalization process: the user must explicitly tell the system which
dimensions should be included in the class characterization and to how high a level each
dimension should be generalized. In fact, each step of generalization or specialization on any
dimension must be specified by the user.
Methods should be introduced to perform attribute relevance analysis in order to filter out
statistically irrelevant or weakly relevant attributes, and retain or even rank the most relevant
attributes for the descriptive mining task at hand. Class characterization that includes the analysis
of attribute/dimension relevance is called analytical characterization. Class comparison that
includes such analysis is called analytical comparison.
Methods of Attribute Relevance Analysis:
The general idea behind attribute relevance analysis is to compute some measure that is
used to quantify the relevance of an attribute with respect to a given class or concept.
Suppose that there are m classes. Let S contain s_i samples of class C_i, for i = 1, ..., m. An
arbitrary sample belongs to class C_i with probability s_i/s, where s is the total number of samples
in set S. The expected information needed to classify a given sample is

    I(s1, s2, ..., sm) = − Σ_{i=1..m} (s_i / s) log2(s_i / s)
An attribute A with values {a1, a2, ..., av} can be used to partition S into the subsets {S1, S2, ..., Sv},
where Sj contains those samples in S that have value aj of A. Let Sj contain s_ij samples of class
C_i. The expected information based on this partitioning by A is known as the entropy of A. It is
the weighted average:

    E(A) = Σ_{j=1..v} ((s_1j + ... + s_mj) / s) · I(s_1j, ..., s_mj)
The information gain obtained by this partitioning on A is defined by
Gain(A) = I (s1,s2, . . . , sm ) − E(A)
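The measures I, E(A), and Gain(A) can be computed directly from labeled samples; the toy data set in the sketch below is hypothetical:

```python
from collections import Counter
from math import log2

def info(class_counts):
    """Expected information I(s1, ..., sm) = -sum (si/s) log2(si/s)."""
    s = sum(class_counts)
    return -sum((si / s) * log2(si / s) for si in class_counts if si > 0)

def gain(samples):
    """Information gain of an attribute A, where `samples` is a list of
    (attribute_value, class_label) pairs: Gain(A) = I(...) - E(A)."""
    s = len(samples)
    total = info(Counter(label for _, label in samples).values())
    entropy = 0.0  # E(A): weighted average of I over the partitions induced by A
    for value in {v for v, _ in samples}:
        subset = [label for v, label in samples if v == value]
        entropy += (len(subset) / s) * info(Counter(subset).values())
    return total - entropy

# Hypothetical toy data: the attribute value perfectly separates the classes,
# so E(A) = 0 and the gain equals I(2, 2) = 1 bit
data = [("a", "yes"), ("a", "yes"), ("b", "no"), ("b", "no")]
print(gain(data))  # 1.0
```

An attribute with higher gain is more relevant: it reduces the information still needed to classify a sample by a larger amount.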
Attribute relevance analysis for concept description is performed as follows:
a. Data Collection:
Collect data for both the target class and the contrasting class by query
processing. For class comparison, both the target class and the contrasting class are
provided by the user in the data mining query. For class characterization, the target class
is the class to be characterized, whereas the contrasting class is the set of comparable data
that are not in the target class.
    if support_count(l) / support_count(s) ≥ min_conf,

where min_conf is the minimum confidence threshold and s is a nonempty subset of the frequent
itemset l. Since the rules are generated from frequent itemsets, each one automatically satisfies
minimum support. Frequent itemsets can be stored ahead of time in hash tables along with their
counts so that they can be accessed quickly.
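A minimal sketch of this rule-generation test, assuming the support counts have already been computed and stored in a hash table (here a Python dict with hypothetical counts):

```python
from itertools import combinations

# Hypothetical support counts, as if precomputed and stored in a hash table
support_count = {
    frozenset({"A"}): 6,
    frozenset({"B"}): 7,
    frozenset({"A", "B"}): 4,
}

def rules_from_itemset(l, min_conf):
    """For each nonempty proper subset s of the frequent itemset l, output
    the rule s => (l - s) if support_count(l) / support_count(s) >= min_conf."""
    rules = []
    for size in range(1, len(l)):
        for s in map(frozenset, combinations(l, size)):
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                rules.append((set(s), set(l - s), conf))
    return rules

# From the frequent itemset {A, B}: only A => B reaches 0.6 confidence
# (4/6 ~ 0.67), while B => A falls short (4/7 ~ 0.57)
rules = rules_from_itemset(frozenset({"A", "B"}), min_conf=0.6)
```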
In constraint-based mining, mining is performed under the guidance of various kinds of
constraints provided by the users.
These constraints include the following:
Knowledge type constraints:
These specify the type of knowledge to be mined, such as association.
Data constraints:
These specify the set of task-relevant data.
Dimension/level constraints:
These specify the dimension of the data, or levels of the concept hierarchies, to be used.
Interestingness constraints:
These specify thresholds on statistical measures of rule interestingness, such as support
and confidence.
Rule constraints:
These specify the form of rules to be mined. Such constraints may be expressed as meta-
rules (rule templates), as the maximum or minimum number of predicates that can occur in the
rule antecedent or consequent, or as relationships among attributes, attribute values, and/or
aggregates.
The above constraints can be specified using a high-level declarative data mining query
language.
Constraint-based mining allows users to specify the rules to be mined according to their
intention, thereby making the data mining process more effective. In addition, a sophisticated
mining query optimizer can be used to exploit the constraints specified by the user; thereby
making the mining process more efficient. Constraint-based mining encourages interactive
exploratory mining and analysis.
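As an illustration of how such constraints restrict the output, the sketch below simply post-filters hypothetical candidate rules; a real constraint-based miner would push the constraints into the mining process itself rather than filter afterwards:

```python
# Hypothetical candidate rules: (antecedent, consequent, support, confidence)
candidate_rules = [
    ({"bread", "butter"}, {"milk"}, 0.40, 0.80),
    ({"bread", "butter", "jam", "tea"}, {"milk"}, 0.05, 0.90),
]

# Rule constraint: at most 3 predicates in the rule antecedent.
# Interestingness constraints: minimum support and confidence thresholds.
MAX_ANTECEDENT = 3
MIN_SUP, MIN_CONF = 0.10, 0.70

kept = [
    (antecedent, consequent, sup, conf)
    for antecedent, consequent, sup, conf in candidate_rules
    if len(antecedent) <= MAX_ANTECEDENT and sup >= MIN_SUP and conf >= MIN_CONF
]
# Only the first rule satisfies all three constraints
```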
3. How are graphical user interfaces designed based on a data mining query language?
A data mining query language provides necessary primitives that allow users to
communicate with data mining systems. However, inexperienced users may find data mining
query languages awkward to use and the syntax difficult to remember.
Instead, users may prefer to communicate with data mining systems through a GUI.
A data mining query language may serve as a core language for data mining system
implementations, providing a basis for the development of GUIs for effective data mining.
A data mining GUI may consist of the following functional components:
i. Data collection and data mining query composition:
This component allows the user to specify task-relevant data sets and to compose
data mining queries. It is similar to GUIs used for the specification of relational queries.
ii. Presentation of discovered patterns:
This component allows the display of the discovered patterns in various forms,
including tables, graphs, charts, curves, and other visualization techniques.
iii. Hierarchy specification and manipulation:
This component allows for concept hierarchy specification either manually by the
user or automatically. This component should allow concept hierarchies to be modified by the
user or adjusted automatically based on a given dataset distribution.
iv. Manipulation of data mining primitives:
This component may allow the dynamic adjustment of the data mining thresholds,
as well as the selection, display, and modification of concept hierarchies.
v. Interactive multilevel mining:
This component should allow roll-up or drill-down operations on discovered patterns.
vi. Other miscellaneous information:
This component may include online help manuals, indexed search, debugging,
and other interactive graphical facilities.
4. Suppose that the university course database for Big University includes the following
attributes: student name, address, status (undergraduate, graduate), major, and GPA.
a. Propose a concept hierarchy for the attributes address, status, major, and
GPA.
Ans.
Address: All (universe) → country {India, US}
Major:   All → branch {Science, Arts}
GPA:     All → grade {Good, Very Good, Excellent}
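These hierarchies can be represented as a simple child-to-parent mapping; the lower levels added below (cities, departments) are hypothetical illustrations:

```python
# Each hierarchy as a child -> parent map; cities and departments are
# hypothetical lower levels added for illustration
parent = {
    # address: city -> country -> all
    "Delhi": "India", "Boston": "US", "India": "all", "US": "all",
    # major: department -> branch -> all
    "Physics": "Science", "History": "Arts", "Science": "all", "Arts": "all",
    # GPA: grade range -> all
    "Good": "all", "Very Good": "all", "Excellent": "all",
}

def generalize(value, levels=1):
    """Climb the concept hierarchy the given number of levels (capped at 'all')."""
    for _ in range(levels):
        value = parent.get(value, "all")
    return value

print(generalize("Delhi"))     # India
print(generalize("Delhi", 2))  # all
```

This is the same structure attribute-oriented induction climbs when it generalizes an attribute value one level at a time.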