
1. INTRODUCTION

1.1 Project Overview

Data Mining

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analysis offered by data mining moves beyond the analysis of past events provided by the retrospective tools typical of decision support systems. Data mining tools can answer business questions that were traditionally too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

Data mining is seen as an increasingly important tool by modern business to transform data into an informational advantage. It is currently used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery. The amount of data kept in computer files and databases is growing rapidly. At the same time, the users of these data are expecting more sophisticated information from the datasets.

Basic Data mining tasks

 Classification

A classification task begins with a data set in which the class assignments are
known. For example, a classification model that predicts credit risk could be developed
based on observed data for many loan applicants over a period of time. In addition to the
historical credit rating, the data might track employment history, home ownership or
rental, years of residence, number and type of investments, and so on. Credit rating would be the target, the other attributes would be the predictors, and the data for each customer would constitute a case.
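As a small illustration of this task, the following MATLAB sketch trains a decision-tree classifier on a toy version of such data (it assumes the Statistics and Machine Learning Toolbox; the variable names and values are invented for the example and are not part of the project):

% Toy classification sketch (assumes the Statistics and Machine Learning
% Toolbox; variable names and values are invented for illustration).
years  = [1; 4; 10; 2; 7];                         % years of residence (predictor)
owner  = [0; 1; 1; 0; 1];                          % home ownership (predictor)
rating = {'bad'; 'good'; 'good'; 'bad'; 'good'};   % credit rating (target)
model  = fitctree(table(years, owner, rating), 'rating');
predict(model, table(3, 1, 'VariableNames', {'years', 'owner'}))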

 Regression

A Regression task begins with a data set in which the target values are known. For
example, a regression model that predicts house values could be developed based on
observed data for many houses over a period of time. In addition to the value, the data
might track the age of the house, square footage and number of rooms and so on.
House value would be the target, the other attributes would be the predictors, and the data
for each house would constitute a case.

 Clustering

Clustering is a data mining (machine learning) technique used to place data elements into related groups without advance knowledge of the group definitions. Popular clustering techniques include k-means clustering and fuzzy c-means clustering.

Clustering is the process of grouping abstract objects into classes of similar objects.

 A cluster of data objects can be treated as one group.

 While doing the cluster analysis, we first partition the set of data into groups based on data similarity and then assign labels to the groups.

 The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.

 Summarization

Summarization is the abstraction or generalization of data. A set of task-relevant data is summarized and abstracted, resulting in a smaller set which gives a general overview of the data, usually with aggregate information. For example, the long-distance calls of a customer can be summarized into total minutes, total spending and total calls; such high-level summary information, instead of detailed call records, is presented to the sales manager for customer analysis.

The summarization can go up to the different levels of abstraction and can be viewed from different angles. Different combinations of abstraction levels and dimensions reveal various kinds of patterns and regularities.

 Dependency Modeling

Dependency Modeling consists of finding a model which describes significant dependencies between variables. Dependency models exist at two levels:

 The structural level of the model specifies (often graphically) which variables are locally dependent on each other, and
 The quantitative level of the model specifies the strengths of the dependencies using some numerical scale.

CLUSTERING

Clustering is the process of dividing the data elements into classes or clusters so
that the items in the same class are as similar as possible, and the items in different
classes are as dissimilar as possible. We divide them depending on the time series i.e., the
behavior of the user.

Clustering analysis finds clusters of data objects that are similar in some sense to
one another. The members of a cluster are more like each other than they are like
members of other clusters. The goal of clustering analysis is to find high-quality clusters
such that the inter-cluster similarity is low and the intra-cluster similarity is high.
Clustering, like classification, is used to segment the data. Unlike classification,
clustering models segment data into groups that were not previously defined.
Classification models segment data by assigning it to previously-defined classes, which
are specified in a target. Clustering models do not use a target.

Clustering is useful for exploring data. If there are many cases and no obvious
groupings, clustering algorithms can be used to find natural groupings. Clustering can
also serve as a useful data pre-processing step to identify homogeneous groups to build
supervised models.
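As a minimal illustration of such exploratory grouping, the following MATLAB sketch (assuming the Statistics and Machine Learning Toolbox; the data are synthetic and only for illustration) recovers two natural groups without any predefined classes:

% Exploratory clustering sketch (assumes the Statistics and Machine Learning
% Toolbox; the data are synthetic, generated only for illustration).
X   = [randn(50,2); randn(50,2) + 5];   % two natural groupings, no labels
idx = kmeans(X, 2);                     % cluster assignments, no target used
gscatter(X(:,1), X(:,2), idx);          % visualize the discovered groups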

Requirements of clustering

In data mining, efforts have focused on finding methods for efficient and effective cluster analysis in large databases. Active themes of research focus on:

 The scalability of clustering method.


 The effectiveness of the methods for clustering complex shapes and types of data.
 The high dimensional clustering techniques.
 The methods for clustering the mixed, numerical and categorical data in large
databases.

Clustering is a challenging field of research in which its potential applications pose their own special requirements. The following are typical requirements of clustering in data mining.

 Scalability: Many clustering algorithms work well on small data sets containing
fewer than several hundred data objects; however a large database may contain
millions of objects. Highly scalable clustering algorithms are needed.
 Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based data. However, applications may require clustering for other types of data, such as binary, nominal, and ordinal data, or mixtures of these data types.
 Discovery of clusters with arbitrary shapes: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.

Advantages of Clustering

 Automatic recovery from failure.


 Ability to perform maintenance and upgrades with limited downtime.
 Retrieving the data from the database is easy.

Disadvantages of Clustering

 Complexity and inability to recover from the database corruption.


 Clustering doesn’t provide protection from network service failures.

1.2 Existing System

Efficient partitioning of high-dimensional data sets into clusters is a fundamental problem in data mining. The standard hierarchical clustering methods provide no solution for this problem due to their computational inefficiency. Moreover, the k-means based methods are inefficient in processing high-dimensional data sets.

Fuzzy association rules can also be derived from high-dimensional numerical datasets, like image datasets, in order to train fuzzy associative classifiers. But, because of the peculiarity of such datasets, traditional fuzzy ARM algorithms are not able to mine rules from them efficiently, since such algorithms are meant to deal with datasets with a relatively small number of attributes/dimensions; moreover, classification of such datasets is also very difficult.

1.2.1 Disadvantages
 The k-means algorithm cannot handle high-dimensional datasets such as images, videos, etc.
 Traditional fuzzy ARM algorithms are not able to mine large amounts of data efficiently.
 The traditional algorithms work with only a very small number of attributes/dimensions.
 Classification of large amounts of data is also very difficult.

1.3 Proposed System
1.3.1 Description of Proposed System

The proposed system is an extension of the existing system which improves the Fuzzy ARM algorithm and builds a framework of fuzzy-based clustering for accurate results.

We are currently working on the SURF algorithm for object recognition and 3D reconstruction. It is partly inspired by the SIFT descriptor. The standard version of SURF is several times faster than SIFT. The task of finding correspondences between two images of the same scene or object is part of many computer vision applications, such as camera calibration, 3D reconstruction, image registration, and object recognition.

After that we perform the clustering by using Fuzzy c-mean. Fuzzy c-means is a
method of clustering which allows one piece of data to belong to two or more clusters.
This method is frequently used in pattern recognition. It is based on minimization of the
objective function.

1.3.2 Working of Proposed System

SURF Algorithm:

SURF (Speeded Up Robust Features) is a robust local feature detector, first presented by Herbert Bay et al. at the 9th European Conference on Computer Vision (ECCV), held in Austria in May 2006, that can be used in computer vision tasks like object recognition or 3D reconstruction. It is partly inspired by the SIFT descriptor. The standard version of SURF is several times faster than SIFT and is claimed by its authors to be more robust against different image transformations than SIFT. SURF is based on sums of 2D Haar wavelet responses and makes an efficient use of integral images.

It uses an integer approximation to the determinant-of-Hessian blob detector, which can be computed extremely quickly with an integral image. For features, it uses the sum of the Haar wavelet responses around the point of interest. Again, these can be computed with the aid of the integral image. This information is used to perform operations such as locating and recognizing certain objects, people or faces, reconstructing 3D scenes, tracking objects and extracting points of interest. This algorithm is part of artificial intelligence, able to train a system to interpret images and determine their content.

Algorithm and features

Detection:

The SURF algorithm is based on the same principles and steps as SIFT, but it uses a different scheme and should provide better results: it works much faster. In order to detect characteristic points in a scale-invariant manner, the SIFT approach uses cascaded filters, where the Difference of Gaussians (DoG) is calculated on progressively rescaled images.

Integral Image

Instead of Gaussian averaging of the image, box filters (squares) are used as an approximation. Computing the convolution of the image with a box filter is much faster if the integral image is used. The integral image is defined as:

S(x, y) = Σ_{i=0..x} Σ_{j=0..y} I(i, j)

The sum of the original image I(x, y) over any rectangular area D can then be evaluated quickly using this integral image: it requires only four evaluations of S(x, y), at the corners A, B, C and D of the rectangle.
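A minimal MATLAB sketch of this idea (illustrative only; the project's OpenSurf code in Section 5.2 uses its own IntegralImage_IntegralImage helper):

% Integral image and constant-time box sum (illustrative sketch, plain MATLAB).
I = magic(6);                               % any greyscale image / matrix
S = zeros(size(I) + 1);                     % one row/column of zero padding
S(2:end, 2:end) = cumsum(cumsum(I, 1), 2);  % S(x,y) = sum of I over the top-left rectangle
% Sum of I over rows r1..r2 and columns c1..c2 using four lookups (A, B, C, D):
r1 = 2; r2 = 4; c1 = 3; c2 = 5;
boxSum = S(r2+1, c2+1) - S(r1, c2+1) - S(r2+1, c1) + S(r1, c1);
isequal(boxSum, sum(sum(I(r1:r2, c1:c2))))  % returns true (logical 1)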

Points of interest in the Hessian matrix

SURF uses a BLOB detector based on the Hessian to find points of interest. The
determinant of the Hessian matrix expresses the extent of the response and is an
expression of a local change around the area.

The detector is based on the Hessian matrix because of its high accuracy. More precisely, blob structures are detected at locations where the determinant is at a maximum. In contrast to the Hessian-Laplace detector of Mikolajczyk and Schmid, SURF also relies on the determinant of the Hessian for selecting the scale, as is done by Lindeberg. Given a point x = (x, y) in an image I, the Hessian matrix H(x, σ) at x and scale σ is defined as follows:

          | Lxx(x, σ)   Lxy(x, σ) |
H(x, σ) = |                       |
          | Lxy(x, σ)   Lyy(x, σ) |

where Lxx(x, σ) is the convolution of the second-order Gaussian derivative with the image I at the point x, and similarly for Lxy(x, σ) and Lyy(x, σ).

Gaussian filters are optimal for scale-space analysis, but in practice they have to be quantized and cropped. This leads to a loss of repeatability under image rotations around odd multiples of π/4. This weakness holds for Hessian-based detectors in general: repeatability peaks around multiples of π/2, due to the square shape of the filter. However, the detectors still perform well, and the discretization has only a slight effect on performance. As real filters are non-ideal in any case, and given Lowe's success with LoG approximations, the approximation of the Hessian matrix is pushed further with box filters. These approximations of the second-order Gaussian derivatives can be evaluated at a very low computational cost using integral images, so the calculation time is independent of the filter size. Examples of these filters are Gyy and Gxy.

The 9×9 box filters are approximations of a Gaussian with σ = 1.2 and represent the lowest level (highest spatial resolution) of the blob-response maps; they are denoted Dxx, Dyy and Dxy. The weights applied to the rectangular regions are kept simple for CPU efficiency.

The following images are calculated:

- Dxx(x, y) from I(x, y) and Gxx(x, y)

- Dxy(x, y) from I(x, y) and Gxy(x, y)

- Dyy(x, y) from I(x, y) and Gyy(x, y)

Then, the following image is generated:

det(Happrox) = Dxx·Dyy − (w·Dxy)²

The relative weighting w of the filter responses is used to balance the expression for the Hessian determinant. It is necessary for the conservation of energy between the Gaussian kernels and the approximated Gaussian kernels:

w = ( ||Lxy(1.2)||_F · ||Dyy(9)||_F ) / ( ||Lyy(1.2)||_F · ||Dxy(9)||_F ) ≈ 0.9

where ||·||_F denotes the Frobenius norm. The 0.9 factor thus appears as a correction factor for using box filters instead of Gaussians. Several det(H) images can be generated for several filter sizes; this is called multi-resolution analysis.

In theory the weighting w depends on the scale σ, but in practice it is kept constant. How is it kept constant? By normalizing the filter response with respect to its size, which ensures a constant Frobenius norm for any filter size.

The approximated determinant of the Hessian matrix represents the blob response of the image at location x. These responses are stored in a blob-response map over different scales, and the local maxima are then searched for.
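A small sketch of this response computation (illustrative only; it assumes that the box-filter response images Dxx, Dyy and Dxy for one scale have already been computed, and it uses imregionalmax from the Image Processing Toolbox; the threshold value is an assumed example):

% Approximate Hessian determinant response for one scale (illustrative sketch;
% Dxx, Dyy, Dxy are assumed to be precomputed box-filter response images).
w = 0.9;                                        % relative weight from the text above
detH = Dxx .* Dyy - (w * Dxy).^2;               % blob response at every pixel
peaks = imregionalmax(detH) & (detH > 0.0002);  % local maxima above an assumed threshold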

Scale-space representation & location of points of interest

Interest points can be found at different scales, partly because the search for correspondences often requires comparing images in which they are seen at different scales. Scale spaces are generally implemented as an image pyramid: the images are repeatedly smoothed with a Gaussian filter and then subsampled to reach the next, higher level of the pyramid. Therefore, several layers of det(H) are computed with masks of various sizes:

σ_approx = Current filter size × (Base filter scale / Base filter size)
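For example, with the 9×9 base filter of scale 1.2, a 27×27 filter corresponds to σ_approx = 27 × (1.2 / 9) = 3.6.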

The scale-space is divided into a number of octaves, where an octave refers to a series of response maps covering a doubling of scale. In SURF, the lowest level of the scale-space is obtained from the output of the 9×9 filters.

Scale spaces are implemented by applying box filters of different size. Therefore,
the scale space is analyzed by up-scaling the filter size rather than iteratively reducing the
image size. The output of the above 9×9 filter is considered as the initial scale layer, to which we will refer as scale s = 1.2 (corresponding to Gaussian derivatives with σ = 1.2). The following layers are obtained by filtering the image with gradually bigger masks, taking into account the discrete nature of integral images and the specific structure of our filters. Specifically, this results in filters of size 9×9, 15×15, 21×21, 27×27, etc. In order
to localize interest points in the image and over scales, non-maximum suppression in a
3*3*3 neighborhood is applied. The maxima of the determinant of the Hessian matrix are
then interpolated in scale and image space with the method proposed by Brown et al.
Scale space interpolation is especially important in our case, as the difference in scale
between the first layers of every octave is relatively large.

The 3D maxima are then searched for at (x, y, σ) using a 3×3×3 cube neighborhood, and from there the maxima are interpolated. Lowe, in contrast, subtracts adjacent layers of the pyramid to obtain the DoG (Difference of Gaussians) images, in which contours and blob-like structures are found.

Specifically, a fast variant of non-maximum suppression introduced by Neubeck and Van Gool is used. The maxima of the determinant of the Hessian matrix are interpolated in scale and image space with the method proposed by Brown and Lowe. The approximation of the determinant of the Hessian matrix represents the blob response of the image at the location x, and these responses are stored in the blob-response map over different scales. The resulting interest points have the principal property of repeatability, which means that if a point is considered reliable, the detector will find the same point under a different perspective (scale, orientation, rotation, etc.). Each interest point has one position (x, y).

Fuzzy C-means Clustering

In fuzzy clustering, each point has a degree of belonging to the clusters, as in fuzzy logic, rather than belonging completely to just one cluster. Thus, points on the edge of a cluster may belong to the cluster to a lesser degree than points in the center of the cluster. For each point x we have a coefficient giving the degree of being in the k-th cluster, uk(x). Usually, the sum of those coefficients is defined to be 1:

for every x:  Σ_{k=1..num. clusters} uk(x) = 1.

With fuzzy c-means, the centroid of a cluster is the mean of all points, weighted by their
degree of belonging to the cluster:

center_k = ( Σ_x uk(x)^m · x ) / ( Σ_x uk(x)^m )

The degree of belonging is related to the inverse of the distance to the cluster center:

uk(x) = 1 / d(center_k, x)

Then the coefficients are normalized and fuzzified with a real parameter m > 1 so that their sum is 1:

uk(x) = 1 / Σ_j ( d(center_k, x) / d(center_j, x) )^(2/(m−1))

For m equal to 2, this is equivalent to normalizing the coefficients linearly to make their sum 1. When m is close to 1, the cluster center closest to the point is given much more weight than the others, and the algorithm is similar to k-means.

The fuzzy c-means algorithm is very similar to the k-means algorithm:

 Choose a number of clusters.


 Assign randomly to each point coefficients for being in the clusters.
 Repeat until the algorithm has converged (that is, the coefficients’ change
between two iterations is no more than ε, the given sensitivity threshold).

Compute the centroid for each cluster, using the formula above. For each point, compute
its coefficients of being in the clusters, using the formula.

The algorithm minimizes intra-cluster variance as well, but it has the same problems as k-means: the minimum is a local minimum, and the results depend on the initial choice of weights. The expectation-maximization algorithm is a more statistically formalized method which includes some of these ideas, such as partial membership in classes; it has better convergence properties and is in general preferred to fuzzy c-means.
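As a concrete illustration of the update equations above, a bare-bones fuzzy c-means loop can be sketched in MATLAB as follows (illustrative only, on synthetic data; it assumes implicit expansion, available from MATLAB R2016b, and pdist2 from the Statistics and Machine Learning Toolbox — the project itself uses the built-in fcm() function shown in Section 5.2):

% Minimal fuzzy c-means sketch following the formulas above (not the project code).
X = [randn(50,2); randn(50,2) + 4];            % synthetic data, one row per point
c = 2; m = 2; nIter = 100;                     % clusters, fuzzifier, iterations
U = rand(size(X,1), c); U = U ./ sum(U, 2);    % random memberships summing to 1
for it = 1:nIter
    W = U.^m;
    centers = (W' * X) ./ sum(W, 1)';          % weighted cluster centres
    D = pdist2(X, centers) + eps;              % distances d(x, center_k)
    U = 1 ./ (D.^(2/(m-1)) .* sum(D.^(-2/(m-1)), 2));   % membership update
end
[~, labels] = max(U, [], 2);                   % hard assignment for inspection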

1.3.3 Advantages:

 We are using SURF algorithm which is several times faster than the SIFT
algorithm.
 The SURF algorithm detects structures or very significant points in an image and gives a discriminating description of these areas relative to their neighbouring points.
 The recognition of images or objects, one of the most important applications of computer vision, is based on local descriptors such as SIFT.

 Unlike k-means, where a data point must belong exclusively to one cluster center, here each data point is assigned a membership to every cluster center, as a result of which a data point may belong to more than one cluster.

 Fuzzy c-means gives better results for overlapping data sets and is comparatively better than the k-means algorithm.

2. LITERATURE SURVEY

2.1 Fuzzy Apriori


To implement the fuzzy association rule mining procedure, we used a modified version of the Apriori algorithm. The algorithm is much more economical than straightforwardly applying Apriori and treating negative items as new database attributes. It is also very much preferable to the approach for mining negative association rules which involves the costly generation of infrequent as well as frequent item sets. Regarding the quality of the mined association rules, we observed that most of them are negative. This can be explained as follows: for each transaction t and each collection L1, L2, ..., Lp of [0, 1]-valued positive attributes corresponding to a quantitative attribute Q, it holds that L1(t) + L2(t) + ... + Lp(t) = 1, and then at the same time (1 − L1)(t) + (1 − L2)(t) + ... + (1 − Lp)(t) = p − 1. In other words, the overall support associated with positive items will be 1, while that associated with negative items will be p − 1, which accounts for the dominance of the latter. Since typically p is between 3 and 5, the problem manifests itself on a much smaller scale than in supermarket databases. To tackle it, we can, for example, use different thresholds for positive rules and for rules that contain at least one negative item. However, this second threshold should apparently differ for every quantitative attribute, since it depends on the number of fuzzy sets used in the partition. A more robust, and only slightly more time-consuming, approach is to impose additional filtering conditions and interestingness measures to prune away the least valuable negative patterns.
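For instance, with p = 3 fuzzy sets and memberships 0.2, 0.5 and 0.3 in a given transaction, the positive items contribute 0.2 + 0.5 + 0.3 = 1 to the support, while the corresponding negative items contribute 0.8 + 0.5 + 0.7 = 2 = p − 1, which illustrates why negative patterns dominate.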

2.2 Mining Fuzzy Association Rules in Large High-Dimensional Datasets

Fuzzy Association Rule Mining (ARM) has been extensively used in relational or transactional datasets having a low to medium number of attributes/dimensions. The mined fuzzy association rules (patterns) are not only used for
manual analysis by domain experts, but are also leveraged to drive further mining tasks
like classification and clustering which automate decision-making. Such fuzzy

association rules can also be derived from high-dimensional numerical datasets,
like image datasets, in order to train fuzzy associative classifiers or clustering
algorithms. Traditional Fuzzy ARM algorithms are not able to mine rules from them efficiently, since such algorithms are meant to deal with datasets with a relatively small number of attributes/dimensions. Hence FAR-HD, a Fuzzy ARM algorithm designed specifically for large high-dimensional datasets, was proposed. FAR-HD processes
fuzzy frequent item sets in a DFS manner using a two-phased multiple-partition tidlist-
based strategy. It also uses a byte-vector representation of tidlists, with the tidlists
stored in the main memory in a compressed form (using a fast generic compression
method). Additionally, FAR-HD uses Fuzzy Clustering to convert each numerical vector
of the original input dataset to a fuzzy-cluster based representation, which is ultimately
used for the actual Fuzzy ARM process. FAR-HD has been compared experimentally with Fuzzy Apriori, the most popular Fuzzy ARM algorithm, and found to be 7–15 times faster.
The important features of FAR-HD are that it uses a two-phased processing technique and a tidlist approach for calculating the frequency of item sets. It also uses a generic compression algorithm (zlib) to compress tidlists while processing them, in order to fit more tidlists in the same amount of memory allocated/available; zlib provides a very good compression ratio on all kinds of data and datasets. The distinctive feature of datasets with high dimensions is that they have association rules with many items, i.e. the average rule length is very high. In order to deal with such association rules, the item set generation and processing in FAR-HD is done in a DFS-like fashion as in ARMOR, as opposed to the BFS-like fashion of Apriori, which is optimized for large datasets with a small number of attributes/dimensions.
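A tiny illustration of the tidlist idea mentioned above (a sketch with invented data, not FAR-HD itself): the support of an item set is obtained by intersecting the transaction-id lists of its items, here represented as logical vectors.

% Tidlist-based support counting (illustrative sketch with invented data).
tidA  = logical([1 1 0 1 0 1 0 1]);     % transactions containing item A
tidB  = logical([1 0 0 1 1 1 0 0]);     % transactions containing item B
tidAB = tidA & tidB;                    % tidlist of the item set {A, B}
support = nnz(tidAB) / numel(tidAB);    % relative support = 3/8 = 0.375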

2.3 Fuzzy Cluster-Based Association Rules (FCBAR)

The FCBAR method creates cluster tables by scanning the database once and then clustering the transaction records into the k-th cluster table, where the length of a record is k. Moreover, the fuzzy large item sets are generated by contrasts with the partial cluster tables. This prunes a considerable amount of data, reduces the time needed to perform data scans and requires fewer contrasts. Experiments with a real-life database show that FCBAR outperforms the fuzzy-Apriori-like algorithm, a well-known and widely used association rule algorithm. The FCBAR method discovers the fuzzy large item sets, and its main characteristics are as follows. FCBAR requires only a single scan of the transaction database, followed by contrasts with the partial cluster tables. Not only does this prune a considerable amount of data, reducing the time needed to perform data scans and requiring fewer contrasts, but it also ensures the correctness of the mined results.

2.4 SIFT:

For any object there are many features, interesting points on the object that can be
extracted to provide a “feature” description of the object. This description can then be
used when attempting to locate the object in an image containing many other objects.
There are many considerations when extracting these features and how to record them.
SIFT image features provide a set of features of an object that are not affected by many
of the complications experienced in other methods, such as object scaling and rotation.

While allowing for an object to be recognised in a larger image SIFT image features also
allow for objects in multiple images of the same location, taken from different positions
within the environment, to be recognised. SIFT features are also very resilient to the
effects of “noise” in the image.

The SIFT approach, for image feature generation, takes an image and transforms it into a
large collection of local feature vectors. Each of these feature vectors is invariant to any
scaling, rotation or translation of the image. To aid the extraction of these features the
SIFT algorithm applies a 4 stage filtering approach:

Step 1: Scale-Space Extrema Detection

This stage of the filtering attempts to identify those locations and scales that are
identifiable from different views of the same object. This can be efficiently
achieved using a “scale space” function. Further it has been shown under
reasonable assumptions it must be based on the Gaussian function. The scale
space is defined by the function:

L(x, y, σ) = G(x, y, σ) * I(x, y)

Where * is the convolution operator, G(x, y, σ) is a variable-scale Gaussian and I(x, y) is the input image.

Various techniques can then be used to detect stable keypoint locations in the
scale-space. Difference of Gaussians is one such technique, locating scale-space
extrema, D(x, y, σ) by computing the difference between two images, one with
scale k times the other. D(x, y, σ) is then given by:

D(x, y, σ) = L(x, y, kσ) – L(x, y, σ)

To detect the local maxima and minima of D(x, y, σ), each point is compared with its 8 neighbours at the same scale, and its 9 neighbours up and down one scale. If this value is the minimum or maximum of all these points, then this point is an extremum.
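A rough MATLAB sketch of this step (illustrative only, on an arbitrary built-in test image, with assumed parameter values; it uses fspecial, imfilter, imdilate and imerode from the Image Processing Toolbox):

% Scale-space extrema via Difference of Gaussians (illustrative sketch).
I = im2double(rgb2gray(imread('peppers.png')));   % any greyscale test image
k = sqrt(2); sigma0 = 1.6;                        % assumed base scale and step
L = zeros([size(I) 4]);
for s = 1:4
    sigma = sigma0 * k^(s-1);
    h = fspecial('gaussian', 2*ceil(3*sigma)+1, sigma);
    L(:,:,s) = imfilter(I, h, 'replicate');       % L(x, y, sigma) = G * I
end
D  = L(:,:,2:4) - L(:,:,1:3);                     % three DoG layers
nb = ones(3,3,3); nb(2,2,2) = 0;                  % the 26 scale-space neighbours
isMax = D > imdilate(D, nb);                      % larger than every neighbour
isMin = D < imerode(D, nb);                       % smaller than every neighbour
extrema = (isMax | isMin) & abs(D) > 0.01;        % assumed contrast threshold
[r, c] = find(extrema(:,:,2));                    % candidate keypoints, middle layer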

Step 2: Keypoint Localisation

This stage attempts to eliminate more points from the list of keypoints by finding
those that have low contrast or are poorly localised on an edge. This is achieved
by calculating the Laplacian.

If the function value is below a threshold value, then this point is excluded. This removes extrema with low contrast. To eliminate extrema based on poor localisation, it is noted that in these cases there is a large principal curvature across the edge but a small curvature in the perpendicular direction in the difference-of-Gaussian function. If the ratio of the largest to the smallest eigenvalue of the 2x2 Hessian matrix, at the location and scale of the keypoint, is above a threshold, the keypoint is rejected.

Step 3: Orientation Assignment

This step aims to assign a consistent orientation to the keypoints based on local
image properties. The keypoint descriptor, described below, can then be
represented relative to this orientation, achieving invariance to rotation. The
approach taken to find an orientation is:

 Compute gradient magnitude


 Compute orientation
 Locate the highest peak in the histogram. Use this peak and any other
local peak within 80% of the height of this peak to create a key point
with that orientation.
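A compact sketch of the gradient computations behind the steps above (illustrative only; L is assumed to be the Gaussian-smoothed image at the keypoint's scale, and a 36-bin histogram is used as in SIFT):

% Gradient magnitude / orientation and an orientation histogram (sketch).
[Gx, Gy] = gradient(L);                     % finite-difference image gradients
mag   = sqrt(Gx.^2 + Gy.^2);                % gradient magnitude
theta = atan2(Gy, Gx);                      % orientation in radians, (-pi, pi]
edges = linspace(-pi, pi, 37);              % 36 orientation bins
bin   = discretize(theta(:), edges);
hist36 = accumarray(bin, mag(:), [36 1]);   % magnitude-weighted histogram
[~, peakBin] = max(hist36);                 % dominant orientation bin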
Step 4: Keypoint Descriptor

The local gradient data, used above, is also used to create keypoint descriptors.
The gradient information is rotated to line up with the orientation of the keypoint
and then weighted by a Gaussian with variance of 1.5 * keypoint scale. This data
is then used to create a set of histograms over a window centred on the keypoint.

Keypoint descriptors typically use a set of 16 histograms, aligned in a 4x4 grid, each with 8 orientation bins, one for each of the main compass directions and one for each of the mid-points of these directions. This results in a feature vector containing 128 elements.

These resulting vectors are known as SIFT keys and are used in a nearest-neighbours approach to identify possible objects in an image.

3. SYSTEM ANALYSIS

3.1 System requirement specifications

A requirement is a feature that the system must have or a constraint that it must satisfy to be accepted by the client. Requirement engineering aims at defining the requirements of the system under construction. Requirement engineering includes two main activities: requirement elicitation, which results in a specification of the system that the client understands, and analysis, which results in an analysis model that the developer can unambiguously interpret. A requirement is a statement about what the proposed system will do. Requirements can be divided into two major categories: functional requirements and non-functional requirements.

3.1.1 Functional Requirements

Functional requirements describe the interactions between the system and its environment independent of its implementation. The environment includes the user and any other external system with which the system interacts. Functional requirements capture the intended behavior of the system; this behavior may be expressed as services, tasks or functions the system is required to perform.

In product development, it is useful to distinguish between the baseline functionality necessary for any system to compete in that product domain, and features that differentiate the system from competitors' products and from variants in your company's own product line/family. Features may be additional functionality, or may differ from the basic functionality along some quality attribute (such as performance or memory utilization).

Consider two images such that image2 is similar to image1. We have to find the interesting points in these images by applying the SURF algorithm. The SURF algorithm is used for object recognition. The standard version of SURF is several times faster than SIFT.
After that we perform the clustering by using fuzzy c-means. In fuzzy clustering, each point has a degree of belonging to the clusters, as in fuzzy logic, rather than belonging to just one cluster. FAR-HD is the process used for high-dimensional datasets to generate frequent item sets. Traditional fuzzy ARM algorithms have failed to mine rules from high-dimensional data efficiently, since they are meant to deal with a relatively small number of attributes, so we use FAR-HD, which processes frequent item sets using a two-phased multiple-partition approach designed especially for large high-dimensional datasets. By using the fuzzy associative classifier (FAC) we perform the classification of the above clusters.

3.1.2 Non Functional Requirement

Non-functional requirements describe the aspects of the system that are not
directly related to the functional behavior of the system. Non-functional requirements
include a broad variety of requirements that apply to many different aspects of the
system, from usability to performance.

 Portability: Our application can easily be ported to target systems (e.g., Windows, Unix, etc.).
 Efficiency: Our application uses CPU cycles, memory and disk space efficiently.
 Understandability: The UI is easily understandable by everyone.
 Accuracy: Our application gives the accurate results that clients expect.
 Robustness: Since we use the SURF algorithm, the application is robust to common image transformations.
 Usability: It is very easy to learn and operate the system.
 Cost and development time: The cost and development time are very low.

3.2 Object Oriented Analysis

In the case of object-oriented analysis the process varies, but both approaches are identical up to use-case analysis. The steps involved in the analysis phase are:

• Identify the actors.

• Develop a simple business process model using UML activity diagram.

• Develop use cases.

• Prepare interaction diagrams.

• Classification – develop a static UML class diagram

Identify classes, relationships, attributes, methods.

System models

• Scenarios

• Use Case Model

A Use Case is a description of the behavior of the system. That description is written from the point of view of a user who has just told the system to do something particular.

UML Diagrams

UML includes a set of graphic notation techniques to create visual models of object-oriented software-intensive systems. UML is used to specify, visualize, modify, construct and document the artifacts of an object-oriented software-intensive system under development.

3.2.1 Use-Case Diagram

An important part of the Unified Modeling Language (UML) is its facilities for drawing use case diagrams. Use cases are used during the analysis phase of a project to identify and partition system functionality. They separate the system into actors and use cases. Actors represent roles that can be played by users of the system. Those users can be humans, other computers, pieces of hardware, or even other software systems. Use cases describe the behavior of the system.

Table 3.1 Graphical Notations for Use Case Diagram

Actor: An Actor, as mentioned, is a user of the system and is depicted using a stick figure, with the role of the user written beneath the icon. Actors are not limited to humans; if a system communicates with another application and expects input or delivers output, then that application can also be considered an actor.

Use case: A Use Case is the functionality provided by the system, typically described as verb + object (e.g. Register Car, Delete User). Use Cases are depicted with an ellipse, and the name of the Use Case is written within the ellipse.

Directed Association: Associations are used to link Actors with Use Cases and indicate that an actor participates in the Use Case in some form. A Directed Association is the same as an association, but it is represented by a line having an arrowhead.

System boundary boxes: A rectangle drawn around the use cases, called the system boundary box, indicates the scope of the system. Anything within the box represents functionality that is in scope, and anything outside the box is not.

In our use case diagram the actor is the administrator and the functionalities are the images, SURF, FCM and FAR-HD. The responsibility of the user is to upload two images; by using the SURF algorithm he gets the interesting points, and by using FCM he clusters the similar data.

[Use case diagram: the actor Administrator is linked to the use cases Image1, Image2 (similar to Image1), Apply SURF, Apply FCM and Apply FAR-HD, inside the system boundary "Fuzzy Associative Classifier using HD datasets".]

Figure 3.1 Use Case Diagram

3.3 System Requirements

3.3.1 Hardware Requirements

 Processor : Pentium IV or above


 Hard Disk : 80 GB minimum.
 RAM : 512MB or more.

3.3.2 Software Requirements

 Operating System : Windows 7

 Programming language : MATLAB

4. SYSTEM DESIGN

4.1 Introduction

Systems design is the process of defining the architecture, components, modules, interfaces, and data for a system to satisfy specified requirements. Systems design could be seen as the application of systems theory to product development.

4.1.1 Class Diagram

The class diagram is a static diagram. It represents the static view of an application. The class diagram is not only used for visualizing, describing and documenting different aspects of a system but also for constructing executable code of the software application. It describes the attributes and operations of a class and also the constraints imposed on the system. Class diagrams are widely used in the modeling of object-oriented systems because they are the only UML diagrams which can be mapped directly to object-oriented languages.

The class diagram shows a collection of classes, interfaces, associations, collaborations and constraints. It is also known as a structural diagram.

Purpose

The purpose of the class diagram is to model the static view of an application. Class diagrams are the only diagrams which can be directly mapped to object-oriented languages and are thus widely used at the time of construction.

UML diagrams like the activity diagram and sequence diagram can only give the sequence flow of the application, but the class diagram is a bit different. It is the most popular UML diagram in the coder community. So the purpose of the class diagram can be summarized as:

 Analysis and design of the static view of an application.


 Describe the responsibilities of a system.
 Base for component and deployment diagrams.

 Forward and reverse engineering.

Active Class

Active classes initiate and control the flow of activity, while passive classes store
data and serve other classes. Illustrate active classes with a thicker border.

Visibility

Use visibility markers to signify who can access the information contained in a class. Private visibility hides information from anything outside the class partition. Public visibility allows all other classes to view the marked information. Protected visibility allows child classes to access information which is inherited from a parent class.

Associations

Associations represent static relationships between the classes. Place the association names above, on, or below the association line. Use a filled arrow to indicate the direction of the relationship. Place roles at the ends of an association. Roles represent how the two classes see each other.

Multiplicity (Cardinality)

Place multiplicity notations at the ends of an association. These symbols indicate the number of instances of one class linked to one instance of the other class.

Constraint

Constraints are placed inside the curly braces {}.

Composition and Aggregation

Composition and aggregation link two classes with a semantic association in a UML diagram. They are used in class diagrams and differ only in their symbols.

Generalization

Generalization is a specialization relationship in which objects of the specialized element (the child) are substitutable for objects of the generalized element (the parent). It is used in class diagrams.

In our class diagram the classes are Admin, SURF algorithm, Fuzzy c-mean, Feature Detector, Descriptor Extractor, Feature 2D and SURF. The responsibility of the admin is to load the images. Feature Detector has methods such as Detect and Detect Implementation, which are used to detect the features in an image and to implement the detection. Descriptor Extractor computes these feature points and implements the computed values. By using SURF we implement the detect and compute values. After we get the interesting points, we group them into clusters.

[Class diagram: Admin (+Upload Images()) applies the SURF algorithm; Fuzzy C-mean (+Select the cluster centers(), +Calculate the Fuzzy centers()); Feature Detector (+Detect(), #Detect Implementation()); Descriptor Extractor (+Compute(), #Compute Implementation()); Feature 2D; SURF (~Detect Implementation(), ~Compute Implementation()).]

Figure 4.1 Class Diagram

4.1.2 Sequence Diagram

A sequence diagram shows object interactions arranged in time sequence. It depicts the objects and classes involved in the scenario and the sequence of messages exchanged between the objects needed to carry out the functionality of the scenario. Sequence diagrams are typically associated with use case realizations in the Logical View of the system under development. Sequence diagrams are sometimes called event diagrams or event scenarios.

A sequence diagram shows, as parallel vertical lines, different processes or objects that live simultaneously, and, as horizontal arrows, the messages exchanged between them, in the order in which they occur.

Table 4.1 Graphical Notations for Sequence Diagram

Object: Objects are instances of classes and are arranged horizontally. The pictorial representation of an object is a class (a rectangle) with the name prefixed by the object name (optional).

Actor: Actors can also communicate with objects, so they too can be listed as a column. An Actor is modeled using the stick figure.

Lifeline: The lifeline identifies the existence of the object over time. The notation for a lifeline is a vertical dotted line extending from an object.

Activation: Activations, modeled as rectangular boxes on the lifeline, indicate when the object is performing an action.

Message: Messages, modeled as horizontal arrows between activations, indicate the communication between the objects.

In our sequence diagram we have four objects: Administrator, Apply SURF, Fuzzy Clustering and Fuzzy Association. First, the administrator uploads two images. Then the SURF algorithm is applied and we get the interesting points; after that we apply fuzzy c-means so that the similar points are grouped into clusters, and finally we apply the FAR-HD algorithm to generate the frequent item sets.

Administrator Apply SURF Fuzzy Clustering Fuzzy Association

1 : Upload Image1()

2 : Upload Image2()

3 : Get Intresting Points()

4 : Get SURF Values()

5 : Apply FCM()

6 : Get Cluster Results()

7 : Apply FAR-HD()

8 : Apply FAR-HD Algorithm()

9 : Generate Frequent Itemsets()

Figure 4.2 Sequence Diagram

4.1.3 Activity Diagram

Activity diagrams are graphical representations of workflows of stepwise activities and actions with support for choice, iteration and concurrency. In the Unified Modeling Language, activity diagrams are intended to model both computational and organizational processes (i.e. workflows). Activity diagrams show the overall flow of control.

Table 4.2 Graphical Notations for Activity Diagram

Action: An action state represents a single step within an activity, that is, one not further decomposed into individual elements that are actions. Action states have sets of incoming and outgoing edges that specify control flow and data flow from and to other nodes.

Initial state: An initial node is a control node at which flow starts when the activity is invoked. An activity may have more than one initial node. The initial state in an activity is represented as a filled circle.

Fork: For the branching of a flow into two or more parallel flows we use a synchronization bar, which is depicted as a thick horizontal or vertical line.

Control Flow: A control flow is an edge that starts an activity node after the previous one is finished.

Final state: An activity may have more than one final node. The first one reached stops all flow in the activity.

In our activity diagram the flow starts from loading the two similar images; later we apply the SURF algorithm to get the interesting points. After that we apply FCM and get the clustered values, and finally we apply FAR-HD to generate the frequent item sets.

START

Upload Image1 and Image2

Apply SURF

Get Interesting points

Apply FCM

Get clustered Values

Apply FAR-HD

Generate Frequent Itemsets

Figure 4.3 Activity Diagram

4.1.4 StateChart Diagram

A state diagram is used to describe the behavior of systems. This behavior is analyzed and represented as a series of events that could occur in one or more possible states. State diagrams require that the system described is composed of a finite number of states. Sometimes this is indeed the case, while at other times it is a reasonable abstraction. Many forms of state diagrams exist, which differ slightly and have different semantics.

Table 4.3 Graphical Notations for StateChart Diagram

Initial State: The initial state represents the source of all objects. A filled circle followed by an arrow represents the object's initial state.

State: States represent situations during the life of an object. Rectangular boxes with curved edges represent a state.

Transition: A transition represents the change from one state to another. A solid arrow represents the path between different states of an object.

Final State: The final state represents the end of an object's existence. A final state is not a real state, because objects in this state do not exist any more. A filled circle represents the final state.

In our state chart diagram we have four states, starting from the idle state; later we upload the images and apply the SURF algorithm, after which we get the interesting points, and finally we apply FCM to make the clusters.

[State chart: Idle → Upload Images (entry: upload two similar images; exit: images uploaded) → Apply SURF (entry: SURF algorithm applied on images; exit: get interesting points) → Apply FCM (entry: apply clustering; exit: clusters are generated).]

Figure 4.4 State Chart Diagram

4.2 Architecture Design
System architecture is the conceptual model that defines the structure, behavior,
and more views of a system. An architecture description is a formal description and
representation of a system, organized in a way that supports reasoning about
the structures of the system. System architecture can comprise system components, the
externally visible properties of those components, the relationships (e.g. the behavior)
between them.

An architectural design is the design of the entire software system; it gives a high-
level overview of the software system, such that the reader can more easily follow the
more detailed descriptions in the later sections. It provides information on the
decomposition of the system into modules (classes), dependencies between modules,
hierarchy and partitioning of the software modules.

[System architecture: high-dimensional data sets (images, videos, etc.) → pre-processing → SURF → interesting points → FCM → large number of clusters → FAR-HD → frequent item sets.]

Figure 4.5 System Architecture

Representation of SURF: In this section we first consider high-dimensional data sets such as images, videos, etc.; by applying the SURF algorithm we get the interesting points. Traditionally, SIFT is used to get the interesting points, but SURF is an extension of SIFT which is several times faster than SIFT.

Fuzzy C-Mean: The fuzzy c-means (FCM) algorithm is commonly used for clustering. The performance of the FCM algorithm depends on the selection of the initial clusters: if the initial clusters are good, then the final clusters can be found very quickly and the processing time can be drastically reduced. It is a data clustering technique in which a dataset is grouped into n clusters, with every data point in the dataset belonging to every cluster to a certain degree. For example, a data point that lies close to the center of a cluster will have a high degree of belonging to that cluster, and another data point that lies far away from the center of a cluster will have a low degree of belonging to that cluster.

Fuzzy Association Rule in High Dimensional Datasets (FAR-HD): FAR-HD is a process used for high-dimensional data sets to generate frequent item sets. FAR-HD uses fuzzy clustering to convert each numerical vector of the original input dataset to a fuzzy-cluster-based representation, which is ultimately used for the actual fuzzy ARM process. By using FAR-HD we can mine the data from large datasets to generate the frequent item sets.

Pre-processing techniques: Data pre-processing is an often neglected but important step in data mining. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. If there is much irrelevant and redundant information present, or noisy and unreliable data, then knowledge discovery during the mining of the data is more difficult. We apply some pre-processing techniques to remove the noisy, incomplete and inconsistent data.

4.3 User Interface Design

In information technology, the user interface (UI) is everything designed into an information device with which a human being may interact -- including display screen, keyboard, mouse, light pen, the appearance of a desktop, illuminated characters, help messages, and how an application program or a Web site invites interaction and responds to it. In early computers, there was very little user interface except for a few buttons at an operator's console. The user interface was largely in the form of punched card input and report output.

Later, a user was provided the ability to interact with a computer online and the user
interface was a nearly blank display screen with a command line, a keyboard, and a set of
commands and computer responses that were exchanged. This command line interface
led to one in which menus (list of choices written in text) predominated. And, finally, the
graphical user interface (GUI) arrived, originating mainly in Xerox's Palo Alto Research
Centre, adopted and enhanced by Apple Computer, and finally effectively standardized
by Microsoft in its Windows operating systems.

The user interface can arguably include the total "user experience," which may include
the aesthetic appearance of the device, response time, and the content that is presented to
the user within the context of the user interface.

 Clarity: The information content is conveyed accurately.


 Discriminability: Our displayed information can be distinguished.
 Conciseness: Users are not confused with unrelated information.
 Consistency: A unique design, conformity with user’s expectation.
 Detectability: The user’s attention is directed towards information required.
 Legibility: Information is easy to read.

5. IMPLEMENTATION

5.1 Technology Description

About MATLAB

MATLAB is a programming language developed by MathWorks. It started out as a matrix programming language where linear algebra programming was simple. It can be run both under interactive sessions and as a batch job.

MATLAB has an excellent set of graphic tools. Plotting a given data set or the
results of computation is possible with very few commands. You are highly encouraged
to plot mathematical functions and results of analysis as often as possible. Trying to
understand mathematical equations with graphics is an efficient way of learning
mathematics. Being able to plot mathematical functions and data freely is the most
important step.

MATLAB is a high-level language and interactive environment for numerical computation, visualization, and programming. Using MATLAB, you can analyze data, develop algorithms, and create models and applications. The language, tools, and built-in math functions enable you to explore multiple approaches and reach a solution faster than with spreadsheets or traditional programming languages, such as C/C++ or Java. You can use MATLAB for a range of applications, including signal processing and communications, image and video processing, control systems, test and measurement, computational finance, and computational biology. More than a million engineers and scientists in industry and academia use MATLAB, the language of technical computing.
Key Features

 High-level language for numerical computation, visualization, and application development
 Interactive environment for iterative exploration, design, and problem solving

 Mathematical functions for linear algebra, statistics, Fourier analysis, filtering,
optimization, numerical integration, and solving ordinary differential equations
 Built-in graphics for visualizing data and tools for creating custom plots
 Development tools for improving code quality and maintainability and
maximizing performance
 Tools for building applications with custom graphical interfaces

Advantages

 A very large (and growing) database of built-in algorithms for image processing
and computer vision applications
 MATLAB allows you to test algorithms immediately without recompilation. You
can type something at the command line or execute a section in the editor and
immediately see the results, greatly facilitating algorithm development.
 The MATLAB Desktop environment, which allows you to work interactively
with your data, helps you to keep track of files and variables, and simplifies
common programming/debugging tasks
 The ability to read in a wide variety of both common and domain-specific image
formats.
 The ability to call external libraries, such as OpenCV
 Clearly written documentation with many examples, as well as online resources
such as web seminars ("webinars").
 Bi-annual updates with new algorithms, features, and performance enhancements
 If you are already using MATLAB for other purposes, such as simulation,
optimization, statistics, or data analysis, then there is a very quick learning curve
for using it in image processing.
 The ability to process both still images and video.
 Technical support from a well-staffed, professional organization (assuming your
maintenance is up-to-date)
 A large user community with lots of free code and knowledge sharing

 The ability to auto-generate C code, using MATLAB Coder, for a large (and
growing) subset of image processing and mathematical functions, which you
could then use in other environments, such as embedded systems or as a
component in other software.
 MATLAB is a software development environment that offers high-performance
numerical computation, data analysis, visualization capabilities and application
development tools.
 MATLAB’s built-in graphing tools and GUI builder ensure that you customise
your data and models to help you interpret your data more easily for quicker
decision making.

5.2 Sample Source Code

Sample Code for Surf Algorithm

function ipts=OpenSurf(img,Options)

% This function OPENSURF, is an implementation of SURF (Speeded Up Robust

% Features). SURF will detect landmark points in an image, and describe

% the points by a vector which is robust against (a little bit) rotation

% ,scaling and noise. It can be used in the same way as SIFT (Scale-invariant

% feature transform) which is patented. Thus to align (register) two

% or more images based on corresponding points, or make 3D reconstructions.

% inputs,

% I : The 2D input image color or greyscale

% (optional)

% Options : A struct with options (see below)

% outputs,

% Ipts : A structure with the information about all detected Landmark points

% Ipts.x , ipts.y : The landmark position

% Ipts.scale : The scale of the detected landmark

% Ipts.laplacian : The laplacian of the landmark neighborhood

% Ipts.orientation : Orientation in radians

% Ipts.descriptor : The descriptor for corresponding point matching

% Add subfunctions to Matlab Search path

functionname='OpenSurf.m';

functiondir=which(functionname);

functiondir=functiondir(1:end-length(functionname));

addpath([functiondir '/SubFunctions'])

% Process inputs

defaultoptions=struct('tresh',0.0002,'octaves',5,'init_sample',2,'upright',false,'extended',false,'verbose',false);

if(~exist('Options','var')),

Options=defaultoptions;

else

tags = fieldnames(defaultoptions);

for i=1:length(tags)

if(~isfield(Options,tags{i})), Options.(tags{i})=defaultoptions.(tags{i}); end

end

if(length(tags)~=length(fieldnames(Options))),

warning('register_volumes:unknownoption','unknown options found');

end

end

% Create Integral Image

iimg=IntegralImage_IntegralImage(img);

% Extract the interest points

FastHessianData.thresh = Options.tresh;

FastHessianData.octaves = Options.octaves;

FastHessianData.init_sample = Options.init_sample;

FastHessianData.img = iimg;

ipts = FastHessian_getIpoints(FastHessianData,Options.verbose);

% Describe the interest points

if(~isempty(ipts))

ipts = SurfDescriptor_DecribeInterestPoints(ipts,Options.upright,Options.extended,iimg,Options.verbose);

end

Sample Code to Get Interesting Points in Two Similar Images

% Calculation of Interesting Points

% Load images

I1=im2double(imread('TestImages/11KD1A0549.jpg'));

I2=im2double(imread('TestImages/11KD1A0550.jpg'));

% Get the Key Points

Options.upright=true;

Options.tresh=0.0001;

Ipts1=OpenSurf(I1,Options);

Ipts2=OpenSurf(I2,Options);

% Put the landmark descriptors in a matrix

D1 = reshape([Ipts1.descriptor],64,[]);

D2 = reshape([Ipts2.descriptor],64,[]);

% Find the best matches

err=zeros(1,length(Ipts1));

cor1=1:length(Ipts1);

46
cor2=zeros(1,length(Ipts1));

for i=1:length(Ipts1),distance=sum((D2-repmat(D1(:,i),[1 length(Ipts2)])).^2,1);

distance=sum((D2-repmat(D1(:,i),[1 length(Ipts2)])).^2,1);

[err(i),cor2(i)]=min(distance);

end

% Sort matches on vector distance

[err, ind]=sort(err);

cor1=cor1(ind);

cor2=cor2(ind);

% Make vectors with the coordinates of the best matches

Pos1=[[Ipts1(cor1).y]',[Ipts1(cor1).x]'];

Pos2=[[Ipts2(cor2).y]',[Ipts2(cor2).x]'];

Pos1=Pos1(1:30,:);

Pos2=Pos2(1:30,:);

no1=(numel(Pos1));

no2=(numel(Pos2));

% Show both images

I = zeros([size(I1,1) size(I1,2)*2 size(I1,3)]);

47
I(:,1:size(I1,2),:)=I1; I(:,size(I1,2)+1:size(I1,2)+size(I2,2),:)=I2;

figure, imshow(I); hold on;

% Show the best matches

plot([Pos1(:,2) Pos2(:,2)+size(I1,2)]',[Pos1(:,1) Pos2(:,1)]','-');

plot([Pos1(:,2) Pos2(:,2)+size(I1,2)]',[Pos1(:,1) Pos2(:,1)]','o');

% Calculate affine matrix

Pos1(:,3)=1; Pos2(:,3)=1;

M=Pos1'/Pos2';

% Add subfunctions to Matlab Search path

functionname='OpenSurf.m';

functiondir=which(functionname);

functiondir=functiondir(1:end-length(functionname));

addpath([functiondir '/WarpFunctions'])

% Warp the image

I1_warped=affine_warp(I1,M,'bicubic');

48
% Show the result

figure,

subplot(1,3,1), imshow(I1);title('Figure 1');

subplot(1,3,2), imshow(I2);title('Figure 2');

subplot(1,3,3), imshow(I1_warped);title('Warped Figure 1');

Sample Code for Clustering

% Fuzzy c-means clustering of the matched interest points of image 1
[center, U, obj_fcm] = fcm(Pos1, 2);
maxU = max(U);
index1 = find(U(1, :) == maxU);
index2 = find(U(2, :) == maxU);
line(Pos1(index1,1), Pos1(index1,2), 'linestyle','none','marker','o','color','g');
line(Pos1(index2,1), Pos1(index2,2), 'linestyle','none','marker','x','color','r');
hold on
plot(center(1,1),center(1,2),'ko','markersize',15,'LineWidth',2)
plot(center(2,1),center(2,2),'kx','markersize',15,'LineWidth',2)
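
As a side note (an illustrative addition, not part of the original project code), the fuzzy partition matrix U returned by fcm has one row per cluster and one column per data point, so hard cluster labels can be derived by taking, for each point, the cluster with the largest membership:

% Convert fuzzy memberships into hard cluster labels (illustrative sketch)
[~, labels] = max(U, [], 1);      % 1 x N vector: cluster index with the highest membership
nCluster1 = sum(labels == 1);     % number of points assigned to cluster 1
nCluster2 = sum(labels == 2);     % number of points assigned to cluster 2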

Sample Code for Warping of Images

function Iout=affine_warp(Iin,M,mode)
% Affine transformation function (rotation, translation, resize)
% This function transforms an image with a 3x3 transformation matrix
%
% inputs,
%   Iin : The input image
%   M   : The (inverse) 3x3 transformation matrix
%   mode: If 0: linear interpolation and outside pixels set to nearest pixel
%            1: linear interpolation and outside pixels set to zero
%            (cubic interpolation only supported by the compiled mex file)
%            2: cubic interpolation and outside pixels set to nearest pixel
%            3: cubic interpolation and outside pixels set to zero
%
% output,
%   Iout: The transformed image

% Make all x,y indices
[x,y]=ndgrid(0:size(Iin,1)-1,0:size(Iin,2)-1);

% Calculate center of the image
% mean= size(Iin)/2;
% Make center of the image coordinates 0,0
%xd=x-mean(1);
%yd=y-mean(2);
xd=x;
yd=y;

% Calculate the transformed coordinates
Tlocalx = mean(1) + M(1,1) * xd + M(1,2) *yd + M(1,3) * 1;
Tlocaly = mean(2) + M(2,1) * xd + M(2,2) *yd + M(2,3) * 1;

switch(mode)
    case 0
        Interpolation='bilinear';
        Boundary='replicate';
    case 1
        Interpolation='bilinear';
        Boundary='zero';
    case 2
        Interpolation='bicubic';
        Boundary='replicate';
    otherwise
        % a non-numeric mode such as the string 'bicubic' also falls through to this branch
        Interpolation='bicubic';
        Boundary='zero';
end

Iout=image_interpolation(Iin,Tlocalx,Tlocaly,Interpolation,Boundary);

function Iout=image_interpolation(Iin,Tlocalx,Tlocaly,Interpolation,Boundary,ImageSize)
% This function is used to transform a 2D image
%
% inputs,
%   Iin : 2D greyscale or color input image
%   Tlocalx,Tlocaly : (Backwards) transformation coordinates for all image pixels
%   Interpolation:
%     'nearest'  - nearest-neighbor interpolation
%     'bilinear' - bilinear interpolation
%     'bicubic'  - cubic interpolation; the default method
%   Boundary:
%     'zero'      - pixels outside the input image are implicitly assumed to be zero
%     'replicate' - input array values outside the bounds of the array
%                   are assumed to equal the nearest array border value
%   (optional)
%   ImageSize : size of the output image
%
% outputs,
%   Iout : The transformed image

if(~isa(Iin,'double')), Iin=double(Iin); end
if(nargin<6), ImageSize=[size(Iin,1) size(Iin,2)]; end
if(ndims(Iin)==2), lo=1; else lo=3; end

switch(lower(Interpolation))
    case 'nearest'
        xBas0=round(Tlocalx);
        yBas0=round(Tlocaly);
    case 'bilinear'
        xBas0=floor(Tlocalx);
        yBas0=floor(Tlocaly);
        xBas1=xBas0+1;
        yBas1=yBas0+1;
        % Linear interpolation constants (percentages)
        tx=Tlocalx-xBas0;
        ty=Tlocaly-yBas0;
        perc0=(1-tx).*(1-ty);
        perc1=(1-tx).*ty;
        perc2=tx.*(1-ty);
        perc3=tx.*ty;
    case 'bicubic'
        xBas0=floor(Tlocalx);
        yBas0=floor(Tlocaly);
        tx=Tlocalx-xBas0;
        ty=Tlocaly-yBas0;
        % Determine the t vectors
        vec_tx0= 0.5; vec_tx1= 0.5*tx; vec_tx2= 0.5*tx.^2; vec_tx3= 0.5*tx.^3;
        vec_ty0= 0.5; vec_ty1= 0.5*ty; vec_ty2= 0.5*ty.^2; vec_ty3= 0.5*ty.^3;
        % The t vector multiplied with the 4x4 bicubic kernel gives the q vectors
        vec_qx0= -1.0*vec_tx1 + 2.0*vec_tx2 - 1.0*vec_tx3;
        vec_qx1=  2.0*vec_tx0 - 5.0*vec_tx2 + 3.0*vec_tx3;
        vec_qx2=  1.0*vec_tx1 + 4.0*vec_tx2 - 3.0*vec_tx3;
        vec_qx3= -1.0*vec_tx2 + 1.0*vec_tx3;
        vec_qy0= -1.0*vec_ty1 + 2.0*vec_ty2 - 1.0*vec_ty3;
        vec_qy1=  2.0*vec_ty0 - 5.0*vec_ty2 + 3.0*vec_ty3;
        vec_qy2=  1.0*vec_ty1 + 4.0*vec_ty2 - 3.0*vec_ty3;
        vec_qy3= -1.0*vec_ty2 + 1.0*vec_ty3;
        % Determine 1D neighbour coordinates
        xn0=xBas0-1; xn1=xBas0; xn2=xBas0+1; xn3=xBas0+2;
        yn0=yBas0-1; yn1=yBas0; yn2=yBas0+1; yn3=yBas0+2;
    otherwise
        error('image_interpolation:inputs','unknown interpolation method');
end

% Limit indexes to the image boundaries
switch(lower(Interpolation))
    case 'nearest'
        check_xBas0=(xBas0<0)|(xBas0>(size(Iin,1)-1));
        check_yBas0=(yBas0<0)|(yBas0>(size(Iin,2)-1));
        xBas0=min(max(xBas0,0),size(Iin,1)-1);
        yBas0=min(max(yBas0,0),size(Iin,2)-1);
    case 'bilinear'
        check_xBas0=(xBas0<0)|(xBas0>(size(Iin,1)-1));
        check_yBas0=(yBas0<0)|(yBas0>(size(Iin,2)-1));
        check_xBas1=(xBas1<0)|(xBas1>(size(Iin,1)-1));
        check_yBas1=(yBas1<0)|(yBas1>(size(Iin,2)-1));
        xBas0=min(max(xBas0,0),size(Iin,1)-1);
        yBas0=min(max(yBas0,0),size(Iin,2)-1);
        xBas1=min(max(xBas1,0),size(Iin,1)-1);
        yBas1=min(max(yBas1,0),size(Iin,2)-1);
    case 'bicubic'
        check_xn0=(xn0<0)|(xn0>(size(Iin,1)-1));
        check_xn1=(xn1<0)|(xn1>(size(Iin,1)-1));
        check_xn2=(xn2<0)|(xn2>(size(Iin,1)-1));
        check_xn3=(xn3<0)|(xn3>(size(Iin,1)-1));
        check_yn0=(yn0<0)|(yn0>(size(Iin,2)-1));
        check_yn1=(yn1<0)|(yn1>(size(Iin,2)-1));
        check_yn2=(yn2<0)|(yn2>(size(Iin,2)-1));
        check_yn3=(yn3<0)|(yn3>(size(Iin,2)-1));
        xn0=min(max(xn0,0),size(Iin,1)-1);
        xn1=min(max(xn1,0),size(Iin,1)-1);
        xn2=min(max(xn2,0),size(Iin,1)-1);
        xn3=min(max(xn3,0),size(Iin,1)-1);
        yn0=min(max(yn0,0),size(Iin,2)-1);
        yn1=min(max(yn1,0),size(Iin,2)-1);
        yn2=min(max(yn2,0),size(Iin,2)-1);
        yn3=min(max(yn3,0),size(Iin,2)-1);
end

Iout=zeros([ImageSize(1:2) lo]);
for i=1:lo % Loop in case of an RGB image
    Iin_one=Iin(:,:,i);
    switch(lower(Interpolation))
        case 'nearest'
            % Get the intensities
            intensity_xyz0=Iin_one(1+xBas0+yBas0*size(Iin,1));
            % Set pixels outside the image
            switch(lower(Boundary))
                case 'zero'
                    intensity_xyz0(check_xBas0|check_yBas0)=0;
                otherwise
            end
            % Combine the weighted neighbour pixel intensities
            Iout_one=intensity_xyz0;
        case 'bilinear'
            % Get the intensities
            intensity_xyz0=Iin_one(1+xBas0+yBas0*size(Iin,1));
            intensity_xyz1=Iin_one(1+xBas0+yBas1*size(Iin,1));
            intensity_xyz2=Iin_one(1+xBas1+yBas0*size(Iin,1));
            intensity_xyz3=Iin_one(1+xBas1+yBas1*size(Iin,1));
            % Set pixels outside the image
            switch(lower(Boundary))
                case 'zero'
                    intensity_xyz0(check_xBas0|check_yBas0)=0;
                    intensity_xyz1(check_xBas0|check_yBas1)=0;
                    intensity_xyz2(check_xBas1|check_yBas0)=0;
                    intensity_xyz3(check_xBas1|check_yBas1)=0;
                otherwise
            end
            % Combine the weighted neighbour pixel intensities
            Iout_one=intensity_xyz0.*perc0+intensity_xyz1.*perc1+intensity_xyz2.*perc2+intensity_xyz3.*perc3;
        case 'bicubic'
            % Get the intensities
            Iy0x0=Iin_one(1+xn0+yn0*size(Iin,1)); Iy0x1=Iin_one(1+xn1+yn0*size(Iin,1));
            Iy0x2=Iin_one(1+xn2+yn0*size(Iin,1)); Iy0x3=Iin_one(1+xn3+yn0*size(Iin,1));
            Iy1x0=Iin_one(1+xn0+yn1*size(Iin,1)); Iy1x1=Iin_one(1+xn1+yn1*size(Iin,1));
            Iy1x2=Iin_one(1+xn2+yn1*size(Iin,1)); Iy1x3=Iin_one(1+xn3+yn1*size(Iin,1));
            Iy2x0=Iin_one(1+xn0+yn2*size(Iin,1)); Iy2x1=Iin_one(1+xn1+yn2*size(Iin,1));
            Iy2x2=Iin_one(1+xn2+yn2*size(Iin,1)); Iy2x3=Iin_one(1+xn3+yn2*size(Iin,1));
            Iy3x0=Iin_one(1+xn0+yn3*size(Iin,1)); Iy3x1=Iin_one(1+xn1+yn3*size(Iin,1));
            Iy3x2=Iin_one(1+xn2+yn3*size(Iin,1)); Iy3x3=Iin_one(1+xn3+yn3*size(Iin,1));
            % Set pixels outside the image
            switch(lower(Boundary))
                case 'zero'
                    Iy0x0(check_yn0|check_xn0)=0; Iy0x1(check_yn0|check_xn1)=0;
                    Iy0x2(check_yn0|check_xn2)=0; Iy0x3(check_yn0|check_xn3)=0;
                    Iy1x0(check_yn1|check_xn0)=0; Iy1x1(check_yn1|check_xn1)=0;
                    Iy1x2(check_yn1|check_xn2)=0; Iy1x3(check_yn1|check_xn3)=0;
                    Iy2x0(check_yn2|check_xn0)=0; Iy2x1(check_yn2|check_xn1)=0;
                    Iy2x2(check_yn2|check_xn2)=0; Iy2x3(check_yn2|check_xn3)=0;
                    Iy3x0(check_yn3|check_xn0)=0; Iy3x1(check_yn3|check_xn1)=0;
                    Iy3x2(check_yn3|check_xn2)=0; Iy3x3(check_yn3|check_xn3)=0;
                otherwise
            end
            % Combine the weighted neighbour pixel intensities
            Iout_one=vec_qy0.*(vec_qx0.*Iy0x0+vec_qx1.*Iy0x1+vec_qx2.*Iy0x2+vec_qx3.*Iy0x3)+...
                     vec_qy1.*(vec_qx0.*Iy1x0+vec_qx1.*Iy1x1+vec_qx2.*Iy1x2+vec_qx3.*Iy1x3)+...
                     vec_qy2.*(vec_qx0.*Iy2x0+vec_qx1.*Iy2x1+vec_qx2.*Iy2x2+vec_qx3.*Iy2x3)+...
                     vec_qy3.*(vec_qx0.*Iy3x0+vec_qx1.*Iy3x1+vec_qx2.*Iy3x2+vec_qx3.*Iy3x3);
    end
    Iout(:,:,i)=reshape(Iout_one,ImageSize);
end

6. TESTING

Testing is the major quality control measure employed during software
development. Its basic function is to detect errors in the software. During requirements
analysis and design, the output is a document which is usually textual and non-executable.
After the coding phase, computer programs are available that can be executed for testing
purposes, which implies that testing also has to uncover errors introduced during coding.
Thus the goal of testing is to uncover requirement, design, and coding errors in the program.
Unit testing exercises the different parts of the module code to detect coding errors.
After this, the modules are gradually integrated into subsystems, which are then
integrated to eventually form the entire system. During module integration, integration
testing is performed; its goal is to detect design errors, focusing on the interconnections
between modules. After the system is put together, system testing is performed, in which
the system is tested against the system requirements to see whether all requirements are
met and the system performs as specified. Finally, acceptance testing is performed to
demonstrate the operation of the system to the client.

For testing to be successful, proper selection of test cases is essential.
There are two different approaches to selecting test cases. In functional testing, the
software or the module to be tested is treated as a black box, and the test cases are
decided based on the specifications of the system or module. For this reason, this form of
testing is also called “black box testing”.

The focus here is on testing the external behavior of the system. In structural
testing, the test cases are decided based on the logic of the module to be tested. A
common approach here is to achieve some type of coverage of the statements in the
code. The two forms of testing are complementary: one tests the external behavior, the
other tests the internal structure. Often structural testing is used for lower levels of
testing, while functional testing is used for higher levels.

Testing is an extremely critical and time-consuming activity, and it requires proper
planning of the overall testing process. Frequently the testing process starts with a test
plan. This plan identifies all testing-related activities that must be performed, specifies
the schedule, allocates the resources, and lays down guidelines for testing. The test plan
specifies the conditions that should be tested, the different units to be tested, and the
manner in which the modules will be integrated. Then, for each test unit, a test case
specification document is produced which lists all the test cases, together with the
expected outputs, that will be used for testing. During the testing of a unit, the specified
test cases are executed and the actual results are compared with the expected outputs.
The final output of the testing phase is the test report and the error report, or a set of
such reports. Each test report contains a set of test cases and the result of executing the
code with those test cases. The error report describes the errors encountered and the
actions taken to remove them.

6.1 Testing approach

Testing is a process which reveals the errors in a program. It is the major quality
measure employed during software development. During testing, the program is executed
with a set of conditions known as test cases, and the output is evaluated to determine
whether the program is performing as expected. To make sure that the system does not
contain errors, different levels of testing strategies are applied at different phases of
software development, as follows.

6.1.1 Unit Testing

Unit testing is done on individual modules as they are completed and become
executable. It is confined only to the designer's requirements.

In our project (FUZZY BASED CLUSTERING OF HIGH DIMENSIONAL
DATA) there are two modules: the GETTING INTERESTING POINTS module, which
uses the SURF algorithm, and the CLUSTERING module. Each module is tested
individually.
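
As an illustration only (not part of the project deliverables), unit tests for the two modules could be sketched in MATLAB as follows; the image path, the threshold value, and the use of assert are assumptions rather than project requirements:

% Unit-test sketch for the GETTING INTERESTING POINTS module
I = im2double(imread('TestImages/11KD1A0549.jpg'));   % any image from the test set
Opts.upright = true;
Opts.tresh = 0.0001;                                  % option name as used by OpenSurf above
Ipts = OpenSurf(I, Opts);
assert(~isempty(Ipts), 'No interest points detected');
assert(numel(Ipts(1).descriptor) == 64, 'Unexpected descriptor length');

% Unit-test sketch for the CLUSTERING module (requires the Fuzzy Logic Toolbox)
data = [randn(20,2); randn(20,2) + 5];                % two well-separated synthetic groups
[center, U] = fcm(data, 2);
assert(isequal(size(center), [2 2]), 'Unexpected size of the cluster centers');
assert(isequal(size(U), [2 size(data,1)]), 'Unexpected size of the membership matrix');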

6.1.2 Each module can be tested using the following two strategies

6.1.2.1 Black Box Testing

Internal system design is not considered in this type of testing. Tests are based on
the requirements and the functionality. Black box testing is used to find errors in the
following categories:

 Incorrect or missing functions
 Interface errors
 Errors in data structures
 Performance errors
 Initialization and termination errors

In this type of testing, only the output is checked for correctness; the logical flow of
the data is not examined.

6.1.2.2 White Box Testing

This testing is based on knowledge of the internal logic of an application’s code;
it is also known as glass box testing. The internal workings of the software and code must
be known for this type of testing. Tests are based on the coverage of code statements,
branches, and paths. It is used to generate test cases that:

 Guarantee that all the independent paths have been executed.
 Execute all the logical decisions on their true and false sides.
 Execute all the loops at their boundaries and within their operational bounds.
 Exercise the internal data structures to ensure their validity.
6.1.3 Integration Testing

Integration testing ensures that the software and the subsystems work together as
a whole. It tests the interfaces of all the modules to check whether the modules behave
properly when integrated together.
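
For this project, an integration check could, for example, feed the output of the interest-point module directly into the clustering module. The sketch below is illustrative only; the file name and the checks are assumptions:

% Integration sketch: SURF interest points feeding fuzzy c-means clustering
I1 = im2double(imread('TestImages/11KD1A0549.jpg'));
Opts.upright = true; Opts.tresh = 0.0001;
Ipts1 = OpenSurf(I1, Opts);                  % module 1: interest point detection
Pos1 = [[Ipts1.y]', [Ipts1.x]'];             % coordinates handed over to module 2
[center, U] = fcm(Pos1, 2);                  % module 2: fuzzy c-means clustering
assert(size(U,2) == size(Pos1,1), 'Every interest point should receive a membership value');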

6.1.4 System Testing

It involves in-house testing of the entire system before delivery to the user.
Its aim is to satisfy the user by verifying that the system meets all the requirements of
the client's specification.

6.1.5 Acceptance Testing

It is pre-delivery testing in which the entire system is tested at the client's site
on real-world data to find errors.

6.1.6 Validation Testing

The system is tested to ensure that all the requirements listed in the software
requirements specification are completely fulfilled. In the case of erroneous input,
corresponding error messages are displayed.

6.1.7 Compiling test

Stress testing was done early because it gave us time to fix some of the
unexpected exceptions and stability problems that only occurred when the components
were exposed to very high transaction volumes.

6.1.8 Execution test

Finally, the program was successfully loaded and executed.

6.1.9 Output test

The successful output screens are placed in the output screens section.

6.2 Test cases

Table 6.1 Test Cases

S No. | Description                       | Expected Value  | Input Value          | Actual Value               | Result
1     | Image 1 is similar to image 2     | Identical match | Two similar images   | Correct warping of image   | Pass
2     | Image 1 is different from image 2 | Less identical  | Two different images | Incorrect warping of image | Fail
3     | Image 1 and image 2 are the same  | Exact match     | Two identical images | Correct warping of image   | Pass
4     | Image 2 is different from image 1 | Less identical  | Two different images | Incorrect warping of image | Fail
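
In principle, the Pass/Fail outcome in Table 6.1 can be automated by thresholding the descriptor-matching error computed in the sample code of Section 5.2. The sketch below is only illustrative; the threshold value 0.1 is an assumption that would need tuning on the actual image set:

% err is the sorted vector of best-match descriptor distances from Section 5.2
meanErr = mean(err(1:min(30,end)));    % average error over (at most) the 30 best matches
if meanErr < 0.1                       % assumed threshold, to be tuned experimentally
    disp('Images judged similar   -> expect correct warping (Pass)');
else
    disp('Images judged different -> expect incorrect warping (Fail)');
end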

7. SCREEN SHOTS

7.1 Getting Interesting Points for Similar Images

7.2 Warping of Images

7.3 Generating Frequent Itemsets

7.4 Clustering of Images

7.5 Analysis of Similarity between Images

7.6 Getting Interesting Points for Different Images

7.7 Warping of Images

7.8 Generating Frequent Itemsets

7.9 Clustering of Images

7.10 Analysis of Similarity between Images

8. CONCLUSION

This project provides a solution for organizing high dimensional data, such as
images, into clusters. A fuzzy mechanism is used so that only similar data are placed in
the same cluster.

The purpose of the current study was to present a method to cluster high dimensional
data efficiently. Association rule mining was studied for very large, high dimensional
data in the image domain. The SURF algorithm is capable of finding interesting points
in two similar image datasets, and clustering techniques are applied to these datasets to
increase speed and efficiency on large, high dimensional data. At present, images are
treated as the data; in the future the approach can also be applied to videos.

REFERENCES

1. A. Mangalampalli and V. Pudi, "FAR-miner: a fast and efficient algorithm for
fuzzy association rule mining," IJBIDM, vol. 7, no. 4, pp. 288-317, 2012.
2. A. Mangalampalli and V. Pudi, "Fuzzy association rule mining algorithm for fast
and efficient performance on very large datasets," in FUZZ-IEEE, 2009,
pp. 1163-1168.
3. R. Agrawal, T. Imielinski, and A. N. Swami, "Mining association rules between
sets of items in large databases," in SIGMOD Conference, 1993, pp. 207-216.
4. H. Bay, A. Ess, T. Tuytelaars, and L. J. V. Gool, "Speeded-up robust features
(SURF)," Computer Vision and Image Understanding, vol. 110, no. 3,
pp. 346-359, 2008.
5. D. G. Lowe, "Object recognition from local scale-invariant features," in
Proceedings of the International Conference on Computer Vision, 1999, vol. 2,
pp. 1150-1157.
6. M. N. Ahmed, S. M. Yamany, N. Mohamed, A. A. Farag, and T. Moriarty,
"A modified fuzzy c-means algorithm for bias field estimation and segmentation
of MRI data," IEEE Transactions on Medical Imaging, vol. 21, no. 3,
pp. 193-199, 2002. doi:10.1109/42.996338. PMID 11989844.
7. G. Ward, "Hiding seams in high dynamic range panoramas," in Proceedings of
the 3rd Symposium on Applied Perception in Graphics and Visualization, ACM
International Conference Proceeding Series 153, ACM, 2006.
doi:10.1145/1140491.1140527. ISBN 1-59593-429-4.

