
1. Explain mining the complete set of frequent itemsets without candidate
generation, with an example.
FP-growth Algorithm:
- Bottlenecks of the Apriori approach
a. Breadth-first (i.e., level-wise) search
b. Candidate generation and test
Often generates a huge number of candidates
- The FP-Growth Approach
a. Depth-first search
b. Avoid explicit candidate generation
- Major philosophy: Grow long patterns from short ones using local
frequent items only
a. “abc” is a frequent pattern
b. Get all transactions having “abc”, i.e., project DB on abc: DB|abc
c. “d” is a local frequent item in DB|abc → abcd is a frequent pattern.
- Develop an efficient, FP-tree-based frequent pattern mining method
a. A divide-and-conquer methodology: decompose mining tasks into
smaller ones.
b. Avoid candidate generation: sub-database test only!
- Scan DB once, find frequent 1-itemsets (single-item patterns)
- Order frequent items in frequency-descending order
- Scan DB again, construct the FP-tree
- Mine the FP-tree by constructing its conditional pattern base, and
then its conditional FP-tree.
Example:
- The first scan of the database is the same as Apriori, which derives the
set of frequent items (1-itemsets) and their support counts. The set of
frequent items is sorted in the order of descending support count.
- This resulting set or list is denoted L. Thus, L = {{I2: 7}, {I1: 6}, {I3: 6},
{I4: 2}, {I5: 2}}
- An FP-tree is then constructed as follows.
- First, create the root of the tree, labeled with “null.”
- Scan database D a second time. The items in each transaction are
processed in L order, and a branch is created for each transaction.
- In general, when considering the branch to be added for a transaction,
the count of each node along a common prefix is incremented by 1, and
nodes for the items following the prefix are created and linked
accordingly.
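To make the two scans concrete, here is a minimal FP-tree construction sketch in Python. The transaction list below is hypothetical (chosen only to be consistent with the support counts quoted above, e.g. I2: 7, I1: 6); the notes themselves do not reproduce the transaction table.

```python
from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Scan 1: find frequent 1-itemsets, sort them in descending support order (the list L)
    counts = Counter(item for t in transactions for item in set(t))
    freq = {i: c for i, c in counts.items() if c >= min_support}
    order = sorted(freq, key=lambda i: (-freq[i], i))

    # Scan 2: insert each transaction (items taken in L order) into the tree
    root = FPNode(None, None)
    header = defaultdict(list)                   # item -> node links
    for t in transactions:
        node = root
        for item in [i for i in order if i in t]:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1                      # shared prefixes only increment counts
    return root, header, order

# Hypothetical transactions consistent with the support counts in the example.
transactions = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
                ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
                ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"]]
root, header, order = build_fp_tree(transactions, min_support=2)
print(order)    # ['I2', 'I1', 'I3', 'I4', 'I5']
```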
2. How will you solve a classification problem using Bayesian
Classification.
Based on Bayes’ Theorem. There are two types of Bayesian classification
methods.
- Naïve Bayesian Classifier: A simple Bayesian classifier that has comparable
performance with decision tree and selected neural network classifiers.
This classifier assumes that the effect of an attribute value on a given
class is independent of the values of the other attributes.
- Bayesian belief networks: These are graphical models, which allow the
representation of dependencies among subsets of attributes.
Naïve Bayesian Classification: To Derive the Maximum Posteriori
- Let D be a training set of tuples and their associated class labels, and
each tuple is represented by an n-D attribute vector X = {x1, x2, . . ., xn}
- Suppose there are m classes C1, C2, . . ., Cm
- Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X).
- This can be derived from Bayes’ theorem: P(Ci|X) = P(X|Ci) × P(Ci) / P(X)
- Since P(X) is constant for all classes, only P(X|Ci) × P(Ci) needs
to be maximized.
- A simplified assumption: attributes are conditionally independent:
P(X|Ci) = ∏(k = 1..n) P(xk | Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)
- This greatly reduces the computation cost: Only counts the class
distribution.
Training Dataset: a customer database of 14 tuples, each labeled with the
class buys_computer = “yes” or “no”.
- P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357
- Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
- X = (age <= 30, income = medium, student = yes, credit_rating = fair)
- P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 =
0.044
- P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
- P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”)
=0.643*0.044 = 0.028
- P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
- Therefore, X belongs to class (“buys_computer = yes”)
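The hand computation above can be checked with a short Python sketch; the prior and conditional probabilities are copied directly from the worked example.

```python
# Minimal naive Bayes scoring sketch for the example above.
priors = {"yes": 9 / 14, "no": 5 / 14}
conditionals = {
    "yes": {"age<=30": 2 / 9, "income=medium": 4 / 9, "student=yes": 6 / 9, "credit=fair": 6 / 9},
    "no":  {"age<=30": 3 / 5, "income=medium": 2 / 5, "student=yes": 1 / 5, "credit=fair": 2 / 5},
}

x = ["age<=30", "income=medium", "student=yes", "credit=fair"]
scores = {}
for c in priors:
    p = priors[c]
    for attr in x:
        p *= conditionals[c][attr]          # naive (conditional independence) assumption
    scores[c] = p

print(scores)                               # {'yes': ~0.028, 'no': ~0.007}
print(max(scores, key=scores.get))          # -> 'yes'
```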
Bayesian Belief Network:
- These provide a graphical model of causal relationships, and they allow
dependencies among subsets of variables.
- A belief network is defined by 2 components: (i) a directed acyclic graph and
(ii) a set of conditional probability tables (CPTs).
- Each node in the directed acyclic graph represents a random variable. Each
arc represents a probabilistic dependence. If an arc is drawn from node
Y to node Z, then Y is a parent of Z and Z is a descendant of Y.
- Each variable is conditionally independent of its non-descendants in the
graph, given its parents.
Incremental Network Construction:
a. Choose the set of relevant variables Xi that describes the domain.
b. Choose an ordering for the variables
c. While there are variables left:
- Pick a variable X and add a node for it
- Set parent (X) to some minimal set of existing nodes such that the
conditional independence property is satisfied.
- Define the CPT for X
3. What is clustering? Explain various types of clustering
methods.
- Assigning class labels to a large number of objects is a costly process.
- A cluster is a collection of data objects that are
a. Similar to one another within the same cluster
b. Dissimilar to the objects in other clusters
- Clustering of data is a method by which large sets of data are grouped into
clusters of smaller sets of similar data.
General Applications:
- Typical applications:
a. As a stand-alone tool to get insight into data distribution
b. As a pre-processing step for other algorithms
- Pattern Recognition
- Spatial Data Analysis
a. Create thematic maps in GIS by clustering feature spaces
b. Detect spatial clusters or for other spatial mining tasks
- Image Processing
- Economic Science
- WWW
a. Document classification
b. Cluster Weblog data to discover groups of similar access patterns.
Examples of Clustering Applications:
- Marketing: helps marketers discover distinct groups in their customer
bases, and then use this knowledge to develop target marketing
programs.
- Land use: Identification of areas of similar land use in an earth
observation database.
- Insurance: Identifying groups of motor insurance policy holders with a
high average claim cost.
- City-planning: Identifying groups of houses according to their house
type, value, and geographical location.
- Earthquake studies: Observed earthquake epicenters should be
clustered along continent faults.
Major Clustering Approaches:
- Partitioning approach:
a. Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors.
b. Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach:
a. Create a hierarchical decomposition of the set of data using some
criterion.
b. Typical methods: Diana, Agnes, BIRCH, CAMELEON
- Density-based approach:
a. Based on connectivity and density functions
b. Typical methods: DBSCAN, OPTICS, DenClue
- Grid-based approach:
a. Based on a multiple-level granularity structure
b. Typical method: STING, WaveCluster, CLIQUE
- Model-based:
a. A model is hypothesized for each of the clusters, and the aim is to find the
best fit of the data to the given model.
b. Typical methods: EM, SOM, COBWEB
- Frequent pattern-based:
a. Based on the analysis of frequent patterns
b. Typical methods: p-Cluster
- User-guided or constraint-based:
a. Clustering by considering user-specified or application-specific
constraints
b. Typical methods: COD (obstacles), constrained clustering
- Link-based clustering:
a. Objects are often linked together in various ways
b. Massive links can be used to cluster objects: SimRank, LinkClus
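As an illustration of the partitioning approach listed above, the following is a minimal k-means sketch in plain NumPy. The 2-D blobs are hypothetical; this is not tied to any dataset from these notes.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternately assign points to the nearest centroid
    and recompute each centroid as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # stop when assignments stabilize
            break
        centroids = new_centroids
    return labels, centroids

# Three hypothetical, well-separated 2-D blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in [(0, 0), (4, 4), (0, 5)]])
labels, centroids = kmeans(X, k=3)
print(centroids.round(2))
```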
4. Write in detail about DBSCAN algorithm.
Fundamentally, all clustering methods use the same approach, i.e., first we
calculate similarities and then we use them to cluster the data points into
groups or batches.
Density-based spatial clustering of applications with noise(DBSCAN):
Clusters are dense regions in the data space, separated by regions of
lower point density. The DBSCAN algorithm is based on this intuitive
notion of “clusters” and “noise”. The key idea is that for each point of a
cluster, the neighborhood of a given radius has to contain at least a minimum
number of points.
Why DBSCAN?
Partitioning methods and hierarchical clustering work for finding spherical-
shaped clusters or convex clusters. In other words, they are suitable only
for compact and well-separated clusters. Moreover, they are also severely
affected by the presence of noise and outliers in the data.
Real life data may contain irregularities, like:
- Clusters can be of arbitrary shape such as those shown in the figure
below.
- Data may contain noise.

The DBSCAN algorithm requires two parameters:
a. Eps: It defines the neighborhood around a data point, i.e., if the distance
between two points is less than or equal to ‘eps’, then they are considered
neighbors. One way to find the eps value is based on the k-distance
graph.
b. MinPts: Minimum number of neighbors within the eps radius. The larger the
dataset, the larger the value of MinPts that should be chosen. As a general rule,
the minimum MinPts can be derived from the number of dimensions D in the
dataset as MinPts >= D + 1, and MinPts should be chosen to be at least 3.
In this algorithm, we have 3 types of data points.
Core Point: A point is a core point if it has more than MinPts points within
eps.
Border Point: A point which has fewer than MinPts within eps but is in the
neighborhood of a core point.
Noise or outlier: A point which is not a core point or border point.

Algorithm:
a. Find all the neighboring points within eps of every point, and identify the
core points, i.e., points with more than MinPts neighbors.
b. For each core point, if it is not already assigned to a cluster, create a new
cluster.
c. Recursively find all of its density-connected points and assign them to the
same cluster as the core point.
d. Iterate through the remaining unvisited points in the dataset. Those
points that do not belong to any cluster are noise.
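A minimal sketch of these steps in plain NumPy (the names eps and min_pts mirror Eps and MinPts above; this is an illustrative sketch, not a tuned implementation):

```python
import numpy as np

def region_query(X, i, eps):
    # Indices of all points within eps of point i (including i itself).
    return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.full(n, -1)          # -1 = noise / not yet assigned
    visited = np.zeros(n, bool)
    cluster_id = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = region_query(X, i, eps)
        if len(neighbors) < min_pts:
            continue                 # not a core point; may later become a border point
        labels[i] = cluster_id       # core point: start a new cluster and expand it
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if not visited[j]:
                visited[j] = True
                j_neighbors = region_query(X, j, eps)
                if len(j_neighbors) >= min_pts:    # j is also a core point
                    seeds.extend(j_neighbors)
            if labels[j] == -1:                    # border or unclaimed point
                labels[j] = cluster_id
        cluster_id += 1
    return labels

# Hypothetical data: two blobs plus a stray point that should come out as noise.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(5, 0.3, (40, 2)), [[10.0, 10.0]]])
print(dbscan(X, eps=1.0, min_pts=5))
```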
5. Write in detail about Support Vector Machines.
Support Vector Machine is a supervised machine learning algorithm used for
both classification and regression. The objective of SVM algorithm is to find
a hyperplane in an N-dimensional space that distinctly classifies the data
points.
The dimension of the hyperplane depends upon the number of features. If
the number of input features is two, then the hyperplane is just a line. If
the number of input features is three, then the hyperplane becomes a 2-D
plane. It becomes difficult to imagine when the number of features exceeds
three.
Let’s consider two independent variables x1, x2 and one dependent variable
which is either a blue circle or a red circle.

From the figure above it is very clear that there are multiple lines that
segregate our data points, i.e., perform a classification between red and blue
circles. So how do we choose the best line, or in general the best hyperplane,
that segregates our data points?
Selecting the best hyper-plane:
One reasonable choice as the best hyperplane is the one that represents the
largest separation or margin between the two classes.
So we choose the hyperplane whose distance from it to the nearest data
point on each side is maximized. If such a hyperplane exists it is known as
the maximum-margin hyperplane/hard margin. So from the above figure, we
choose L2.
Let’s consider a scenario like shown below

The SVM algorithm has the characteristic of ignoring outliers while finding
the hyperplane that maximizes the margin; SVM is robust to outliers.
For this type of data, SVM finds the maximum margin as it did with the previous
data sets, but in addition it adds a penalty each time a point crosses the margin
(a soft margin).

Say our data is as shown in the figure above. SVM solves this by creating a
new variable using a kernel. We call a point xi on the line and we create a new
variable yi as a function of its distance from the origin o. If we plot this, we get
something like what is shown below.

In this case, the new variable y is created as a function of distance from the
origin. A non-linear function that creates such a new variable is referred to as a
kernel.
SVM Kernel:
The SVM kernel is a function that takes a low-dimensional input space and
transforms it into a higher-dimensional space, i.e., it converts a non-separable
problem into a separable problem. It is mostly useful in non-linear separation
problems. Simply put, the kernel does some extremely complex data
transformations and then finds the procedure to separate the data based on
the labels or outputs defined.
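A short example using scikit-learn (an assumed library choice, not part of the original notes) shows how an RBF kernel separates ring-shaped data that no linear hyperplane can:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Ring-shaped data: one class inside, one class around it; no straight line separates them.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_clf = SVC(kernel="linear", C=1.0).fit(X, y)            # linear hyperplane
rbf_clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)   # kernel maps to a higher-dim space

print("linear accuracy:", linear_clf.score(X, y))   # poor: data is not linearly separable
print("rbf accuracy:   ", rbf_clf.score(X, y))      # close to 1.0 on this data
```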
6. Explain Regression techniques with examples.
- Numerical prediction is similar to classification
a. Construct a model
b. Use model to predict continuous or ordered value for a given input
- Prediction is different from classification
a. Classification predicts categorical class labels.
b. Prediction models continuous-valued functions.
- Major method for prediction: regression
a. Model the relationship between one or more independent or predictor
variables and a dependent or response variable.
- Types of Regression
a. Linear (simple regression and multiple regression)
b. Non-linear regression
c. Other regression methods: generalized linear model, Poisson
regression, log-linear models, regression trees
Linear Vs Non-Linear:
Linear Regression – The degree of a polynomial is 1 in linear regression.
Non-linear Regression – The degree of a polynomial > 1
Linear Regression Example:
Linear Regression: Involves a response variable y and a single predictor
variable x.
Y = w0 + w1x
Where w0 (y-intercept) and w1(slope) are regression coefficients
Method of least squares: estimates the best-fitting straight line
w1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²   ... (1)
w0 = ȳ − w1 x̄   ... (2)
where x̄ and ȳ are the means of the xi and yi values, respectively.
Multiple Linear regression: involves more than one predictor variable
a. Training data is of the form (X1, y1), (X2, y2), . . ., (X|D|, y|D|)
b. Ex. For 2-D data, we may have: y = w0 + w1x1 + w2x2
c. Solvable by extension of the least squares method or using SAS, S-Plus
d. Many nonlinear functions can be transformed into the above.
Example: Straight-line regression using the method of least squares. Table
6.7 shows a set of paired data where x is the number of years of work
experience of a college graduate and y is the corresponding salary of the
graduate.
The 2-D data can be graphed on a scatter plot, as in Figure 6.26. The plot
suggests a linear relationship between the two variables, x and y
We model the relationship that salary may be related to the number of
years of work experience with the equation y = w0 + w1x.

Given the above data, we compute x̄ = 9.1 and ȳ = 55.4. Substituting these
values into equations (1) and (2), we get w1 = 3.5 and w0 = 55.4 − (3.5)(9.1) = 23.6.
Thus, the equation of the least squares line is estimated by y = 23.6 + 3.5x.
Using this equation, we can predict that the salary of a college graduate with
say, 10 years of experience is $58,600.
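A small NumPy sketch of equations (1) and (2). The (x, y) pairs below are illustrative values assumed to match the example's summary statistics (x̄ = 9.1, ȳ = 55.4); the original Table 6.7 is not reproduced in these notes.

```python
import numpy as np

def least_squares_fit(x, y):
    """Simple linear regression: slope and intercept from equations (1) and (2)."""
    x_bar, y_bar = x.mean(), y.mean()
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)   # slope, eq. (1)
    w0 = y_bar - w1 * x_bar                                             # intercept, eq. (2)
    return w0, w1

# Hypothetical (years of experience, salary in $1000s) pairs with x-bar = 9.1, y-bar = 55.4.
x = np.array([3, 8, 9, 13, 3, 6, 11, 21, 1, 16], dtype=float)
y = np.array([30, 57, 64, 72, 36, 43, 59, 90, 20, 83], dtype=float)

w0, w1 = least_squares_fit(x, y)
print(round(w1, 1), round(w0, 1))   # slope ~3.5; the notes' 23.6 comes from rounding w1 first
print(round(w0 + w1 * 10, 1))       # about 58.6, i.e. roughly $58,600 for 10 years of experience
```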
Non-Linear Regression:
- Some nonlinear models can be modeled by a polynomial function
- A polynomial regression can be transformed into linear regression model.
For example,
y = w0 + w1x + w2x² + w3x³
Convertible to linear with new variables: x2 = x², x3 = x³
y = w0 + w1x + w2x2 + w3x3
- Other functions, such as a power function, can also be transformed into a
linear model.
- Some models are intractably non-linear
a. It is possible to obtain least-squares estimates through extensive
calculation on more complex formulae.
7. What is outlier detection? Explain distance-based outlier
detection.
- What are outliers?
a. Objects that are considerably dissimilar from the remainder of
the data.
- Problem: Define and find outlier in large data sets
- Applications:
a. Credit card fraud detection
b. Telecom fraud detection
c. Customer segmentation
d. Medical analysis
Statistical Approaches:

- Assume a model of the underlying distribution that generates the data set
- Use discordancy tests depending on
a. Data distribution
b. Distribution parameter
c. Number of expected outliers
- Drawbacks
a. Most tests are for single attribute
b. In many cases, data distribution may not be known.
Outlier Discovery: Distance-Based Approach
- Introduced to counter the main limitations imposed by statistical
methods
a. We need multi-dimensional analysis without knowing data distribution
- Distance-based outlier: A DB(p, D)-outlier is an object O in a dataset T
such that at least a fraction p of the objects in T lies at a distance
greater than D from O.
- Algorithms for mining distance-based outliers
a. Index-based algorithm
b. Nested-loop algorithm
c. Cell-based algorithm
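A minimal sketch of the nested-loop idea for DB(p, D)-outliers in plain NumPy (names and data are illustrative):

```python
import numpy as np

def db_outliers(X, p, D):
    """DB(p, D)-outliers via a simple nested-loop scan: an object is an outlier
    if at least a fraction p of the other objects lie at a distance greater
    than D from it."""
    n = len(X)
    outliers = []
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)   # distance from object i to every object
        far = np.sum(dists > D)                    # the object itself (distance 0) is never counted
        if far / (n - 1) >= p:
            outliers.append(i)
    return outliers

# Hypothetical data: a tight cluster plus one far-away point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, size=(50, 2)), [[10.0, 10.0]]])
print(db_outliers(X, p=0.95, D=5.0))   # expected to flag the last index (50)
```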
Density-Based Local Outlier Detection
- Distance-based outlier detection is based on the global distance distribution
- It encounters difficulties in identifying outliers if the data is not uniformly
distributed.
- Ex. C1 contains 400 loosely distributed points, C2 has 100 tightly
condensed points, and there are 2 outlier points o1, o2
- A distance-based method cannot identify o2 as an outlier
- Hence we need the concept of a local outlier.
- Local outlier factor (LOF)
a. Assume outlier is not crisp
b. Each point has a LOF
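A hedged example of LOF using scikit-learn's LocalOutlierFactor (an assumed library choice); the C1/C2 data below are synthetic stand-ins for the loose and tight clusters described above:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
C1 = rng.normal(0.0, 2.0, size=(400, 2))          # loosely distributed cluster
C2 = rng.normal(10.0, 0.2, size=(100, 2))         # tightly condensed cluster
candidates = np.array([[5.0, 5.0], [10.0, 1.5]])  # two candidate outlier points
X = np.vstack([C1, C2, candidates])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                 # -1 marks points judged to be local outliers
scores = -lof.negative_outlier_factor_      # larger score = more outlying relative to its neighbors
print(labels[-2:], scores[-2:].round(2))
```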
8. Explain Hierarchical Clustering Algorithms with neat diagram.
A Hierarchical clustering method works via grouping data into a tree of
clusters. Hierarchical clustering begins by treating every data point as a
separate cluster. Then, it repeatedly executes the subsequent steps:
a. Identify the 2 clusters which are closest together, and
b. Merge the 2 most similar clusters. We need to continue these steps
until all the clusters are merged together.
In Hierarchical Clustering, the aim is to produce a hierarchical series of
nested clusters. A diagram called a dendrogram graphically represents this
hierarchy; it is an inverted tree that describes the order in which clusters
are merged (bottom-up) or broken up (top-down).
a. Agglomerative: Initially consider every data point as an individual cluster
and at every step, merge the nearest pair of clusters. At first, every
data point is considered as an individual cluster. At every
iteration, the clusters merge with other clusters until one cluster is
formed.
The algorithm for Agglomerative Hierarchical Clustering is:
- Consider every data point as an individual cluster
- Calculate the similarity (proximity matrix) of each cluster with all the other clusters
- Merge the clusters which are highly similar or close to each other
- Recalculate the proximity matrix for each cluster
- Repeat the previous two steps until only a single cluster remains.
Let’s see the graphical representation of this algorithm using a dendrogram.
Let’s say we have six data points A, B, C, D, E, and F.

b. Divisive:
We can say that Divisive Hierarchical clustering is precisely the
opposite of Agglomerative Hierarchical clustering. In Divisive
Hierarchical clustering, we take into account all of the data points as a
single cluster and, in every iteration, we separate out the data points
that are not comparable to the rest. In the end, we are left with N
clusters.
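Both variants can be explored with SciPy's hierarchical-clustering utilities; the six 2-D points below are hypothetical stand-ins for A, B, C, D, E, and F:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
import matplotlib.pyplot as plt

# Six hypothetical 2-D points standing in for A, B, C, D, E, F.
points = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0],
                   [5.2, 4.8], [9.0, 1.0], [9.3, 1.2]])

Z = linkage(points, method="single")               # agglomerative: merge the closest pair first
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the dendrogram into 3 clusters
print(labels)

dendrogram(Z, labels=list("ABCDEF"))               # the inverted-tree diagram described above
plt.show()
```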

9. What is information gain? Construct a decision tree using an
example dataset.
Information Gain:
When we use a node in a decision tree to partition the training instances into
smaller subsets, the entropy changes. Information gain is a measure of this
change in entropy.
Definition: Suppose S is a set of instances, A is an attribute, Sv is the subset
of S with A = v, and Values(A) is the set of all possible values of A. Then
Gain(S, A) = Entropy(S) − Σ (v ∈ Values(A)) (|Sv| / |S|) × Entropy(Sv)
Entropy is the measure of uncertainty of a random variable; it characterizes
the impurity of an arbitrary collection of examples:
Entropy(S) = − Σi pi log2(pi), where pi is the proportion of class i in S.
The higher the entropy, the more the information content.
Building a Decision Tree using Information Gain, the essentials:
- Start with all training instances associated with the root node
- Use info gain to choose which attribute to label each node with
- Note: No root-to-leaf path should contain the same discrete attribute
twice
- Recursively construct each subtree on the subset of training instances
that would be classified down that path in the tree
- If all positive or all negative training instances remain, label that
node “yes” or “no” accordingly
- If no attributes remain, label with a majority vote of training instances
left at that node
- If no instances remain, label with a majority vote of the parent’s training
instances.
Example: Now, let us draw a Decision Tree for the following data using
Information gain. Training set: 3 features and 2 classes.
X Y Z C
1 1 1 I
1 1 0 I
0 0 1 II
1 0 0 II
Here, we have 3 features and 2 output classes. To build a decision tree using
information gain, we will take each of the features and calculate the
information gain for each feature.

Split on feature X
Split on feature Y

Split on feature Z:
From the above calculations, we can see that the information gain is maximum
when we make a split on feature Y. So, for the root node, the best-suited feature
is feature Y. Now we can see that, when splitting the dataset by feature Y, each
child contains a pure subset of the target variable, so we don’t need to split the
dataset further. The final tree for the above dataset is therefore a single split
on Y: Y = 1 predicts class I and Y = 0 predicts class II.
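The calculations summarized above can be verified with a short script; the four tuples are exactly the training set from the table:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, attr_index, class_index=-1):
    classes = [r[class_index] for r in rows]
    gain = entropy(classes)
    for v in set(r[attr_index] for r in rows):
        subset = [r[class_index] for r in rows if r[attr_index] == v]
        gain -= (len(subset) / len(rows)) * entropy(subset)   # weighted child entropy
    return gain

# The four training tuples (X, Y, Z, C) from the table above.
data = [(1, 1, 1, "I"), (1, 1, 0, "I"), (0, 0, 1, "II"), (1, 0, 0, "II")]
for name, idx in [("X", 0), ("Y", 1), ("Z", 2)]:
    print(name, round(info_gain(data, idx), 3))   # Y gives the maximum gain (1.0)
```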
10. Explain Spatial Data Mining.
- Geometric, geographic or spatial data: space-related data
a. Example: Geographic space, VLSI design, model of human brain, 3-D
space representing the arrangement of chains of protein molecule.
- Spatial database system vs. image database systems.
a. Image database system: handling digital raster image, may also
contain techniques for object analysis and extraction from images and
some spatial database functionality.
b. Spatial database system: handling objects in space that have identity
and well-defined extents, locations and relationships.
- Spatial databases contain spatial-related information.
a. Ex:- Geographic databases(maps), VLSI, CAD design.
- Spatial data may also be represented in raster format, consisting of n-
dimensional bit maps or pixel maps.
a. Ex:- weather data, which is generally represented in raster
format, where each pixel registers the rainfall in a given area.
- Data mining techniques can be used to describe the characteristics of
particular locations where rainfall is heavy or light.
- A spatial database that stores spatial objects that change over time
is called a spatiotemporal database.
GIS (Geographic Information System)
- GIS
a. Analysis and visualization of geographic data
- Common analysis function of GIS
a. Search
b. Location analysis
c. Terrain analysis
d. Flow analysis
e. Distribution
f. Spatial analysis/statistics
g. Measurements
Modeling Spatial Objects:
- Two important alternative views
a. Single objects: distinct entities arranged in space each of which has
its own geometric description
Ex: modeling cities, forests, rivers
b. Spatially related collection of objects: describe space itself
Ex: modeling land use, partition of a country into districts
Example: British Columbia Weather Pattern Analysis
- Input:
a. A map with about 3,000 weather probes scattered in B.C.
b. Daily data for temperature, precipitation, wind velocity, etc.
c. Data warehouse using star schema
- Output:
a. A map that reveals patterns: merged (similar) regions
- Goals:
a. Interactive analysis
b. Fast response time
c. Minimizing storage space used
- Challenge:
a. A merged region may contain hundreds of “primitive” regions
Star Schema of the BC Weather Warehouse
11. Explain Text Mining.
- Text databases are databases that contain word descriptions of
objects. These descriptions are long sentences / paragraphs / documents. Ex:-
summary reports, product specifications, any other documents.
- Text Databases are rapidly growing due to the increasing amount of
information available in electronic form, such as electronic publications,
various kinds of electronic documents, e-mail and the world wide web.
- Data stored in most text databases are semi-structured, in that
they are neither completely unstructured nor completely structured.
Text Data Analysis and Information Retrieval
- IR is a field that has been developing in parallel with database systems for
many years. Unlike database systems, which have focused on query and
transaction processing of structured data, information retrieval is concerned
with the organization and retrieval of information from a large number of
text-based documents.
- Some database system problems are usually not present in information
retrieval systems, such as concurrency control, transaction management,
and update.
- A typical information retrieval problem is to locate relevant documents in
a document collection based on a user’s query, which is often some
keywords describing an information need, although it could also be an
example relevant document.
- Basic Measure for Text Retrieval: Precision and Recall
Precision: This is the percentage of retrieved documents that are in fact
relevant to the query. It is formally defined as
Precision= |{Relevant} ∩ {Retrieved}|/|{Retrieved}|
Recall: This is the percentage of documents that are relevant to the query
and were, in fact, retrieved. It is formally defined as
Recall = |{Relevant} ∩ {Retrieved}|/|{Relevant}|
An information retrieval system often needs to trade off recall for
precision or vice versa. One commonly used trade-off is the F-score, which is
defined as the harmonic mean of recall and precision:
F_score = 2 × precision × recall / (precision + recall)
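A small sketch computing precision, recall, and F-score for hypothetical relevant/retrieved document sets:

```python
def precision_recall_f(relevant, retrieved):
    """Compute precision, recall, and F-score from two sets of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved
    precision = len(hits) / len(retrieved)
    recall = len(hits) / len(relevant)
    f_score = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_score

# Hypothetical document ids.
print(precision_recall_f(relevant={1, 2, 3, 4, 5}, retrieved={3, 4, 5, 6}))
# -> precision 0.75, recall 0.6, F-score ~0.667
```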

Text Retrieval Methods:


- Two categories: document selection problem or document ranking
problem.
- In document selection methods, the query is regarded as specifying
constraints for selecting relevant documents. A typical method of this
category is the Boolean retrieval model.
- Document ranking methods use the query to rank all documents in the
order of relevance. For ordinary users and exploratory queries, these
methods are more appropriate than document selection methods.
Text Indexing Techniques:
- There are several popular text retrieval indexing techniques, including
inverted indices and signature files.
- An inverted index is an index structure that maintains two hash indexed
or B+ tree indexed tables: document table and term table.
- A signature file is a file that stores a signature record for each
document in the database.
- Each signature has a fixed size of b bits representing terms. A simple
encoding scheme goes as follows. Each bit of a document signature is
initialized to 0. A bit is set to 1 if the term it represents appears in the
document.
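A toy inverted index in Python (hypothetical documents), illustrating the term-to-document mapping described above:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Minimal inverted index: maps each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {1: "data mining finds patterns",
        2: "text mining mines text databases",
        3: "frequent patterns in data streams"}
index = build_inverted_index(docs)
print(index["mining"])     # {1, 2}
print(index["patterns"])   # {1, 3}
```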
Applications:
- News article classification
- Automatic email filtering
- Webpage classification
- Word sense disambiguation
12. Demonstrate the methodologies used for Data Stream
mining.
Data Streams:
- Data Streams – continuous, ordered, changing, fast, huge amount
- Traditional DBMS – data stored in finite, persistent data sets
Characteristics:
- Huge volumes of continuous data, possibly infinite
- Fast changing and require fast, real-time response
- Data stream captures nicely our data processing needs of today
- Random access is expensive – single scan algorithm
- Store only the summary of the data seen thus far.
- Most stream data are at a rather low level of abstraction or multi-dimensional
in nature, and need multi-level and multi-dimensional processing
Methodology:
- Synopses (trade-off between accuracy and storage)
- Use synopsis data structure, much smaller than their base data set
- Compute an approximate answer within a small error range (factor ε of
the actual answer)
Major methods:
- Random Sampling
- Histogram
- Sliding windows
- Multi-resolution model
- Sketches
- Randomized algorithms
Random sampling (but without knowing the total length in advance)
- Reservoir sampling: maintain a set of s candidates in the reservoir, which
forms a true random sample of the elements seen so far in the stream. As
the data stream flows, every new element has a certain probability (s/N)
of replacing an old element in the reservoir.
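A sketch of reservoir sampling matching the s/N replacement probability described above (parameter names are illustrative):

```python
import random

def reservoir_sample(stream, s, seed=None):
    """Keep a uniform random sample of size s from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            reservoir.append(item)       # fill the reservoir with the first s elements
        else:
            j = rng.randrange(n)         # with probability s/n, replace a random slot
            if j < s:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), s=10, seed=42))
```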
Sliding windows
- Make decisions based only on recent data of sliding window size w
- An element arriving at time t expires at time t + w
Histograms
- Approximate the frequency distribution of element values in a stream
- Partition data into set of contiguous buckets
- Equal-width (equal value range for buckets) vs. V-optimal (minimizing
frequency variance within each bucket)
Multi-resolution models
- Popular models: balanced binary trees, micro-clusters, and wavelets
Sketches:
- Histograms and wavelets require multi-passes over the data but sketches
can operate in a single pass
- Frequency moments of a stream A = {a1, . . ., aN}: Fk = Σ (i = 1..v) (mi)^k,
where v is the universe or domain size and mi is the frequency of value i in
the sequence
- Given N elements and v values, sketches can approximate F0, F1, F2 in
O(log v + log N) space
Randomized algorithms
- Monte Carlo algorithm: bound on running time but may not return correct
result
- Chebyshev’s inequality: P(|X − μ| ≥ k) ≤ σ²/k²
a. Let X be a random variable with mean μ and standard deviation σ
- Chernoff bound: P[X < (1 − δ)μ] < e^(−μδ²/2)
a. Let X be the sum of independent Poisson trials X1, . . ., Xn, δ in (0, 1]
b. The probability decreases exponentially as we move away from the mean.
Applications:
- Telecommunication calling records
- Business: credit card transaction flows
- Network monitoring and traffic engineering
- Financial market: stock exchange
- Engineering & industrial process: power supply & manufacturing
- Sensor, monitoring & surveillance: video streams, RFIDs
- Security monitoring
- Web logs and Web page click streams
- Massive data sets
13. Show the functionalities of Time series Data
Time-series database:
- Consists of sequences of values or events obtained over repeated
measurements of time.
- Data is recorded at regular intervals
- Characteristic time-series components
a. Trend
b. Cycle
c. Seasonal
d. Irregular
Applications:
- Financial: stock price, inflation
- Industry: power consumption
- Scientific: experiment results
- Meteorological: precipitation
Mining Time-Series Data:

- A time series can be illustrated as a time-series graph, which describes a
point moving with the passage of time.
Estimation of Trend Curve
- The freehand method
a. Fit the curve by looking at the graph
b. Costly and barely reliable for large-scaled data mining
- The least-squared method
a. Find the curve minimizing the sum of the squares of the deviation of
points on the curve from the corresponding data points
- The moving-average method (of order n): smooth the series by replacing each
value with the mean of a window of n consecutive values, and use the smoothed
series to estimate the trend.
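A minimal moving-average sketch in NumPy (the monthly figures are hypothetical):

```python
import numpy as np

def moving_average(series, n):
    """Moving average of order n: each output value is the mean of n consecutive inputs."""
    series = np.asarray(series, dtype=float)
    return np.convolve(series, np.ones(n) / n, mode="valid")

sales = [12, 15, 14, 18, 20, 19, 23, 25]     # hypothetical monthly series
print(moving_average(sales, n=3))            # smoothed series used to estimate the trend
```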

Trend Discovery in Time-series


- Estimation of cyclic variations
a. If periodicity of cycles occurs, cyclic index can be constructed in
much the same manner as seasonal indexes.
- Estimation of irregular variations
a. By adjusting the data for trend, seasonal and cyclic variations
- With the systematic analysis of the trend, cyclic, seasonal, and irregular
components, it is possible to make long or short-term predictions with
reasonable quality.
14. Write about Web Usage Mining.
- In this Internet-based technology, data objects are linked together to
provide interactive access. Capturing interesting user access patterns in such
distributed environments is known as web mining.
- Relevance Ranking
- Page Ranking
a. DOM-based Page Segmentation
b. Vision-based Page Segmentation
- Block-based Web Search
a. Index block instead of whole page
b. Block retrieval
Combining DocRank and BlockRank
c. Block query expansion
Select expansion term from relevant blocks
Search Engine – Two Rank Functions
Relevance Ranking:
- Inverted index
a. A data structure for supporting text queries, like the index in a book

Motivation for VIPS (Vision-based Page Segmentation)


- Problems of treating a web page as an atomic unit
a. Web page usually contains not only pure content
Noise: navigation, decoration, interaction
b. Multiple topics
c. Different parts of a page are not equally important
- Web page has internal structure
a. Two-dimensional logical structure & Visual layout presentation
b. More structured than a free-text document
c. Less structured than a fully structured document
- Layout – the 3rd dimension of web page
a. 1st dimension: content
b. 2nd dimension: hyperlink
15. What is multimedia mining?
- A multimedia database system stores and manages a large collection of
multimedia data, such as audio, video, image, graphics, speech, text,
document, and hypertext data, which contain text, text markups, and
linkages.
- Multimedia data mining focuses on image data mining.
- Multimedia data mining methods:
a. Similarity search, multidimensional analysis, classification and
prediction analysis, and mining associations in multimedia data.
Similarity Searching in Multimedia Data
- Consider two main families of multimedia indexing and retrieval systems:
a. Description-based retrieval systems, which build indices and
perform object retrieval based on image descriptions, such as
keywords, captions, size, and time of creation
b. Content-based retrieval systems, which support retrieval based on
the image content, such as color histogram, texture, pattern, image
topology, and the shape of objects and their layouts and locations
within the image.
- Description-based retrieval is labor-intensive if performed manually. If
automated, the results are typically of poor quality
- In a content-based image retrieval system, there are often two kinds of
queries:
a. Image-sample-based queries find all of the images that are similar
to given image sample
b. Image feature specification queries specify or sketch image
features like color, texture, or shape, which are translated into a
feature vector to be matched with the feature vectors of the images
in the database.
Multidimensional Analysis of Multimedia Data
- Multimedia databases store image, audio, text and video data.
Ex: voice-mail systems, speech based user interface, video-on-demand
- Multimedia data cube
a. Design and construction similar to that of traditional data cubes from
relational data
b. Contain additional dimensions and measures for multimedia
information, such as color, texture, and shape.
- The database does not store images but their descriptors
a. Feature descriptor: a set of vectors for each visual characteristic
b. Color vector: contains the color histogram
c. MFC (Most Frequent Color) vector: five color centroids
d. MFO (Most Frequent Orientation) vector: five edge orientation
centroids
e. Layout descriptor: contains a color layout vector and an edge layout vector
Classification and prediction, Association of Multimedia Data
- Classification and predictive modeling have been used for mining
multimedia data, especially in scientific research, such as astronomy,
seismology, and geoscientific research
- In general, all of the classification methods can be used in image analysis
and pattern recognition.
- Moreover, in-depth statistical pattern analysis methods are popular for
distinguishing subtle features and building high-quality models.
- Association rules involving multimedia objects can be mined in image and
video databases. At least three categories can be observed:
a. Associations between image content and non-image content features
b. Associations among image contents that are not related to spatial
relationships
c. Associations among image contents related to spatial relationships
