
A RAPID HYBRID CLUSTERING ALGORITHM FOR LARGE VOLUMES OF HIGH-DIMENSIONAL DATA


Abstract
Clustering large volumes of high-dimensional data is a challenging task. Many clustering algorithms have been developed to address either handling datasets with a very large sample size or with a very high number of dimensions, but they are often impractical when the data is large in both aspects. To simultaneously overcome both the ‘curse of dimensionality’ problem due to high dimensions and scalability problems due to large sample size, we propose a new fast clustering algorithm called FensiVAT. FensiVAT is a hybrid, ensemble-based clustering algorithm which uses fast data-space reduction and an intelligent sampling strategy. In addition to clustering, FensiVAT also provides visual evidence that is used to estimate the number of clusters (cluster tendency assessment) in the data. In our experiments, we compare FensiVAT with nine state-of-the-art approaches which are popular for large sample size or high-dimensional data clustering. Experimental results suggest that FensiVAT, which can cluster large volumes of high-dimensional datasets in a few seconds, is the fastest and most accurate method of the ones tested.


Acknowledgment
List of figures
List of abbreviations

1. Privacy-preserving data publishing
2. Customized privacy protection
3. Personalization
4. Ranking-based recommendation
5. Social media
6. Location-based social networks
Table of contents:
- Abstract
- Introduction
  - Objectives
  - Overview of the system
- System analysis
  - Existing system
  - Proposed system
- Feasibility study
  - Technical feasibility
  - Operational feasibility
  - Economical feasibility
- System requirements
  - Modules description
  - SDLC methodology
  - Software requirement
  - Hardware requirement
- System design
  - UML
- Technology description
- Coding
- Testing
- Output screens
- Conclusion
- Bibliography
- References


INTRODUCTION
Objectives
 Data mining refers to the process of extracting or mining knowledge from large amounts of data. It is the process of discovering useful patterns by scanning huge amounts of data. Storing enormous quantities of data is useful for extracting precious knowledge. To seek out constructive patterns within the data, there are different kinds of algorithms which can categorize the data either automatically or semi-automatically. These patterns are used to obtain sets of rules. The patterns discovered must be meaningful so that they lead to advantages such as decision making, market analysis, financial growth, and business intelligence. To obtain such meaningful patterns, a significantly large amount of data is required.
 In data mining, clustering is a framework in which data objects are grouped together without consulting a known class label. In clustering, data groupings are not pre-defined; instead, they are generated by finding the similarities between the data objects according to the characteristics found in the actual data. Based on this similarity, the dataset is partitioned into several groups or clusters in such a way that objects within a cluster have high similarity in comparison with one another but are very dissimilar to objects in other clusters. In other words, a good clustering algorithm should maximize the intra-cluster similarity and minimize the inter-cluster similarity. Big data is one of the new challenges in data mining because large volumes of high-dimensional data and different varieties must be taken into account. Common methods and tools for data processing and analysis are unable to manage such amounts of data, even if powerful computer clusters are used. To analyze big data, many new data mining and machine learning algorithms, as well as technologies, have been developed. So big data does not only yield new data types and storage mechanisms, but also new methods of analysis.
 When dealing with big data, the data clustering problem is one of the most important issues. Often data sets, especially big data sets, consist of some groups (clusters), and it is necessary to find these groups. Clustering methods have been applied to many important problems: for example, to discover healthcare trends in patient records, to eliminate duplicate entries in address lists, to identify new classes of stars in astronomical data, to divide data into groups that are meaningful and useful, and to cluster millions of documents or web pages. To address these applications and many others, a variety of clustering algorithms has been developed. There are limitations in the existing clustering methods; most algorithms require scanning the data set several times, making them unsuitable for big data clustering. There are many applications in which extremely large or big data sets need to be explored, but these are much too large to be processed by traditional clustering methods.
 To deal with large amounts of high-dimensional data, this paper introduces a rapid Enhanced Fuzzy based Linkage Clustering Algorithm (EFCA), which efficiently integrates (i) a cluster assignment technique and (ii) linkage clustering. The objective of this paper is an effective aggregation of cluster partitions, which are obtained using the EFCA on synthetic and real-world datasets.

Overview of the system


Data clustering is an essential method of exploratory data analysis in which data are partitioned
into several subsets of similar objects. With the rapid advancement of the Internet of Things
(IoT) technologies, and social network services, we witness tremendous growth of data not only
in the volume of the data, but also in the number of features collected for each data object. In
many applications such as biomedical imaging, sequencing, and time series matching, the dataset
may consist of millions of instances in hundreds to thousands of dimensions. The two most
important ways a dataset can be big are: (1) it has a very large number (N) of instances, and (2)
each instance has many features (p), i.e., it is high-dimensional data. A variety of clustering
algorithms have been developed for a dataset that has either (1) large N but small p, or (2) small
N but large p, but most clustering algorithms are impractical for handling datasets that are large
jointly in N and p. Most existing clustering algorithms encounter serious problems related to
computational complexities and/or cluster quality for big datasets. Many papers and surveys
discuss different clustering approaches for big datasets. The most popular algorithms are based
on partitioning and hierarchical techniques. Among them single pass k-means, mini-batch k-
means, CLARA (Clustering Large Applications) and CURE (Clustering Using Representatives)
are the most widely known for big datasets. A single linkage (SL) type algorithm called
clustering with improved visual assessment of tendency (clusiVAT) has shown promising results
for big datasets. Most of these clustering algorithms use sampling based strategies to reduce
computational time. However, they still take a lot of time to cluster very large volumes of high-
dimensional data...
SYSTEM ANALYSIS
Existing system
A variety of clustering algorithms have been developed for a dataset that has either large N but
small p, or small N but large p, but most clustering algorithms are impractical for handling
datasets that are large jointly in N and p. Most existing clustering algorithms encounter serious
problems related to computational complexities and/or cluster quality for big datasets. Many
papers and surveys discuss different clustering approaches for big datasets. The most popular
algorithms are based on partitioning and hierarchical techniques.

DISADVANTAGES

 The siVAT scheme does not involve any sensitive threshold parameter, but it requires the user to supply two parameters: n, the desired sample size, and k0, an overestimate of k, the assumed number of clusters, to obtain k0 distinguished objects in the sample.
 Subspace clustering methods do not suffer from nearest-neighbor problems in high-dimensional space. PROCLUS is a subspace clustering approach which first samples the data, then selects a set of k medoids, and iteratively improves the clustering. PROCLUS is capable of discovering arbitrarily shaped clusters in high-dimensional datasets. However, PROCLUS is very sensitive to input parameters and is not efficient for very large N.

Proposed system

To deal with large amounts of high-dimensional data, this paper introduces a rapid, hybrid clustering algorithm, which efficiently integrates (i) a new random projection (RP) based ensemble technique; (ii) an improved visual assessment of cluster tendency (iVAT) algorithm; and (iii) a smart sampling strategy called Maximin and Random Sampling (MMRS). The proposed method achieves fast clustering by combining ensembles of random projections with a scalable version of iVAT, hence we call it FensiVAT. FensiVAT aggregates multiple distance matrices, computed in a lower-dimensional space, to obtain the iVAT image in a fast and efficient manner, which provides visual evidence about the number of clusters to seek in the original dataset. MMRS sampling picks distinguished objects from the dataset, hence it requires relatively few samples compared to random sampling to yield a diverse subset of the big data that represents the cluster structure in the original (big) dataset...
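As a concrete illustration of the Maximin part of MMRS, the following Java sketch picks k0 distinguished objects by repeatedly selecting the point farthest from everything chosen so far. The class and method names are our own, and Euclidean distance over a plain double[][] dataset is assumed; this is a sketch of the idea, not the paper's implementation.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Illustrative sketch of Maximin sampling: choose k0 "distinguished" objects,
// each as far as possible from all objects selected before it.
public class MaximinSampler {

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) {
            double d = a[j] - b[j];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    // Returns the indices of k0 maximin objects chosen from data (N x p).
    static List<Integer> maximinSample(double[][] data, int k0, long seed) {
        int n = data.length;
        List<Integer> chosen = new ArrayList<>();
        double[] minDist = new double[n];          // distance to nearest chosen object
        Arrays.fill(minDist, Double.POSITIVE_INFINITY);

        int current = new Random(seed).nextInt(n); // arbitrary starting object
        for (int t = 0; t < k0; t++) {
            chosen.add(current);
            int farthest = 0;
            double best = -1;
            for (int i = 0; i < n; i++) {
                minDist[i] = Math.min(minDist[i], dist(data[i], data[current]));
                if (minDist[i] > best) { best = minDist[i]; farthest = i; }
            }
            current = farthest;                    // next pick: farthest from all chosen
        }
        return chosen;
    }
}

In MMRS proper, the remaining n − k0 sample points are then drawn randomly from the points nearest to each distinguished object; that random step is omitted here.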
ADVANTAGES OF PROPOSED SYSTEM

 Existing approaches either take hours for large datasets having hundreds to thousands of dimensions, and/or sacrifice accuracy for faster computation time; FensiVAT avoids both problems. Moreover, the datasets used in those papers are not considered large in today’s computing environment.
 We use a statistical measure to compare the cluster distributions in samples obtained from three sampling strategies: random sampling, MMRS sampling in the (high-dimensional) up space, and MMRS sampling in the (reduced-dimensional) down space (we will call this type of sampling Near-MMRS). Our experiments show that Near-MMRS samples accurately portray the distribution of the original data in lower dimensions.

FEASIBILITY STUDY

PRELIMINARY INVESTIGATION

The first and foremost strategy for the development of a project starts from the idea of designing a mail-enabled platform for a small firm in which it is easy and convenient to send and receive messages; there is a search engine, an address book, and also some entertaining games. When it is approved by the organization and our project guide, the first activity, i.e., preliminary investigation, begins. The activity has three parts:

 Request Clarification

 Feasibility Study

 Request Approval

REQUEST CLARIFICATION

After the approval of the request by the organization and project guide, with an investigation being considered, the project request must be examined to determine precisely what the system requires. Here our project is basically meant for users within the company whose systems can be interconnected by the Local Area Network (LAN). In today’s busy schedule, people need everything to be provided in a ready-made manner. So, taking into consideration the vast use of the Internet in day-to-day life, the corresponding portal was developed.

FEASIBILITY ANALYSIS

An important outcome of the preliminary investigation is the determination that the system request is feasible. This is possible only if it is feasible within limited resources and time. The different feasibilities that have to be analyzed are:

 Operational Feasibility
 Economic Feasibility
 Technical Feasibility

Operational Feasibility
Operational feasibility deals with the study of the prospects of the system to be developed. This system operationally eliminates all the tensions of the admin and helps him in effectively tracking the project progress. This kind of automation will surely reduce the time and energy which were previously consumed by manual work. Based on the study, the system proves to be operationally feasible.

Economic Feasibility

Economic feasibility, or cost-benefit analysis, is an assessment of the economic justification for a computer-based project. As the hardware was installed from the beginning and serves many purposes, the hardware cost of the project is low. Since the system is network based, any number of employees connected to the LAN within the organization can use this tool at any time. The Virtual Private Network is to be developed using the existing resources of the organization, so the project is economically feasible.

Technical Feasibility
According to Roger S. Pressman, technical feasibility is the assessment of the technical resources of the organization. The organization needs IBM-compatible machines with a graphical web browser connected to the Internet and intranet. The system is developed for a platform-independent environment. Java Server Pages, JavaScript, HTML, SQL Server and WebLogic Server are used to develop the system. The technical feasibility study has been carried out; the system is technically feasible for development and can be developed with the existing facilities.
SYSTEM REQUIREMENTS
Modules description
Distance Matrix using Ensemble Method:

The third (previous) step provides n samples in the down space, S_d ⊂ R^q, which can be used to build an n × n distance matrix D_{n,d}. We need a reliable iVAT image in order to select the number of clusters obtained by SL in the penultimate steps of FensiVAT. The VAT/iVAT image provides a subjective visual assessment of potential cluster substructure based on how distinctive the dark blocks (clusters) appear in the image. However, the quality of the image of the reordered distance matrix D'_{n,d}, obtained by applying VAT/iVAT to D_{n,d}, often turns out to be very poor due to the unstable nature of random projection. Hence, we turned to an ensemble-based approach to obtain a good quality iVAT image from multiple reordered distance matrices {D'_{d,i}} (i = 1, ..., Q) in the down space. Since the ordering of the data in every reordered matrix D'_{d,i} may be different, it is not feasible to directly aggregate the multiple reordered distance matrices. Therefore, we devised a new method to aggregate the ensemble of Q n × n distance matrices to obtain a better quality iVAT image.
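The aggregation method devised in the paper is its own; purely as an illustration of the ensemble idea, the sketch below computes Q distance matrices from independent Gaussian random projections and averages them element-wise before a single VAT/iVAT pass. All class and method names here are illustrative.

import java.util.Random;

// Illustrative sketch (not the paper's exact aggregation): project the sample
// Q times into R^q with Gaussian random projections, compute one distance
// matrix per projection, and average them before a single VAT/iVAT pass.
public class RpDistanceEnsemble {

    static double[][] randomProject(double[][] x, int q, Random rng) {
        int n = x.length, p = x[0].length;
        double[][] r = new double[p][q];
        for (int i = 0; i < p; i++)
            for (int j = 0; j < q; j++)
                r[i][j] = rng.nextGaussian() / Math.sqrt(q);   // JL-style scaling
        double[][] y = new double[n][q];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < p; k++)
                for (int j = 0; j < q; j++)
                    y[i][j] += x[i][k] * r[k][j];
        return y;
    }

    static double[][] distanceMatrix(double[][] y) {
        int n = y.length;
        double[][] d = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {
                double s = 0;
                for (int k = 0; k < y[0].length; k++) {
                    double t = y[i][k] - y[j][k];
                    s += t * t;
                }
                d[i][j] = d[j][i] = Math.sqrt(s);
            }
        return d;
    }

    // Element-wise mean of Q distance matrices, one per independent projection.
    static double[][] ensembleDistance(double[][] x, int q, int Q, long seed) {
        Random rng = new Random(seed);
        int n = x.length;
        double[][] mean = new double[n][n];
        for (int t = 0; t < Q; t++) {
            double[][] d = distanceMatrix(randomProject(x, q, rng));
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    mean[i][j] += d[i][j] / Q;
        }
        return mean;   // feed this to VAT/iVAT to obtain the reordered image
    }
}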

Clustering:

All single linkage partitions are aligned partitions in the VAT/iVAT ordered matrices, so SL is an obvious choice for the clustering algorithm in Step 9. Having the estimate of the number of clusters, k, from the previous step, we cut the k−1 longest edges in the iVAT-built MST, resulting in k single linkage clusters. If the dataset is complex and clusters are intermixed, cutting the k−1 longest edges may not always be a good strategy, as the data points (outliers), which are typically furthest from normal clusters, might comprise most of the k−1 longest edges of the MST, leading to misleading partitions. Such data points need to be partitioned (usually in their own cluster) before a reliable partition can be found via the SL criterion. However, the iVAT image provides visual evidence as to how large the clusters should be. Thus, if the size of the SL-clusters does not match the visual evidence well, then the partition can be discarded (perhaps choosing a different clustering algorithm to partition the sample of feature vectors in R^p, or throwing out data from small clusters).
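A minimal sketch of this edge-cutting step follows, assuming the iVAT-built MST is available as an edge list with weights; a union-find structure recovers the k single-linkage clusters once the k−1 heaviest edges are dropped. The helper names are ours, and the outlier handling discussed above is not included.

import java.util.Arrays;

// Sketch of Step 9: keep all MST edges except the k-1 heaviest; the resulting
// connected components are the k single-linkage (SL) clusters.
public class SingleLinkageCut {

    static int[] parent;
    static int find(int i) { return parent[i] == i ? i : (parent[i] = find(parent[i])); }
    static void union(int a, int b) { parent[find(a)] = find(b); }

    // edges: each row {u, v}; weights[i] is the length of edge i; n = #points.
    static int[] cutLongestEdges(int[][] edges, double[] weights, int n, int k) {
        Integer[] order = new Integer[edges.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(weights[a], weights[b]));

        parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;

        // keep the n-k lightest edges, i.e. drop the k-1 heaviest of the n-1
        for (int i = 0; i < edges.length - (k - 1); i++) {
            int e = order[i];
            union(edges[e][0], edges[e][1]);
        }
        int[] label = new int[n];                // connected components = clusters
        int[] seen = new int[n];
        Arrays.fill(seen, -1);
        int next = 0;
        for (int i = 0; i < n; i++) {
            int r = find(i);
            if (seen[r] < 0) seen[r] = next++;
            label[i] = seen[r];
        }
        return label;
    }
}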
Extension

In the extension step (Step 10) of FensiVAT, we label the remaining Ñ = N − n data points in O by giving them the label of their nearest object in S_d. This requires the computation of an n × Ñ matrix, D̂, with computational complexity O(qnÑ). In this step, we use the sample S_d and the feature vectors Y in R^q (obtained in Step 2) to compute the distance matrix D̂. This further reduces the computation time which would be needed for the equivalent operation in R^p. Next, the remaining Ñ data points in O are labeled using this distance matrix, based on the label of the nearest object in S_d. Although a single random projection (RP) might be sufficient to achieve comparable accuracy in the NOPR labeling step, several RPs are used to best ensure a robust nearest neighbor search in NOPR. First, multiple RPs are applied on the full dataset to get multiple Ys. Then, the sample labels are extended to each of these Ys using NOPR, which gives multiple sets of labels {Û^(i)} (i = 1, ..., Q) for the full dataset. The final labels (U) are selected using voting, based on the labels cast by each voter from each RP, for each remaining data point in O.
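A hedged sketch of the two ingredients of this step is given below: nearest-sample labeling under one projection, and majority voting across Q projections. The method names are illustrative, and squared Euclidean distance in the down space is assumed.

// Sketch of the extension step (NOPR): each unsampled point takes the label of
// its nearest sampled object; with Q projections, the final label is decided
// by majority vote over the Q per-projection labels.
public class NoprExtension {

    // Label of the nearest sample to one point (squared Euclidean distance).
    static int nearestSampleLabel(double[] point, double[][] samples, int[] sampleLabels) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int s = 0; s < samples.length; s++) {
            double d = 0;
            for (int j = 0; j < point.length; j++) {
                double t = point[j] - samples[s][j];
                d += t * t;
            }
            if (d < bestDist) { bestDist = d; best = sampleLabels[s]; }
        }
        return best;
    }

    // votes[i][t]: label assigned to point i under projection t; k = #clusters.
    static int[] majorityVote(int[][] votes, int k) {
        int[] result = new int[votes.length];
        for (int i = 0; i < votes.length; i++) {
            int[] count = new int[k];
            for (int t = 0; t < votes[i].length; t++) count[votes[i][t]]++;
            int arg = 0;
            for (int c = 1; c < k; c++) if (count[c] > count[arg]) arg = c;
            result[i] = arg;
        }
        return result;
    }
}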

Partition Accuracy:

For all datasets except US Census 1990, the quality of the output crisp partition obtained by the various clustering algorithms is assessed using ground truth information, U_gt. The similarity of computed partitions with respect to the ground truth labels is measured using the partition accuracy (PA). The PA of a clustering algorithm is the ratio of the number of samples with matching ground truth and algorithmic labels to the total number of samples in the dataset. The value of PA ranges from 0 to 1, and a higher value implies a better match to the ground truth partition.
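For instance, PA can be computed as below, under the assumption that the algorithmic cluster labels have already been matched to the ground-truth labels (in general, an optimal relabeling step is required before counting matches).

// Sketch of partition accuracy (PA): the fraction of points whose algorithmic
// label agrees with the ground truth, assuming labels are already aligned.
public class PartitionAccuracy {
    static double pa(int[] groundTruth, int[] predicted) {
        int matches = 0;
        for (int i = 0; i < groundTruth.length; i++)
            if (groundTruth[i] == predicted[i]) matches++;
        return (double) matches / groundTruth.length;  // in [0, 1]; higher is better
    }
}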
SDLC methodology
INPUT DESIGN

Input design plays a vital role in the life cycle of software development; it requires very careful attention from developers. The goal of input design is to feed data to the application as accurately as possible, so inputs are supposed to be designed effectively so that errors occurring while feeding data are minimized. According to software engineering concepts, the input forms or screens are designed to provide validation control over the input limit, range and other related validations.

This system has input screens in almost all the modules. Error messages are developed to alert the user whenever he commits some mistake and to guide him in the right way so that invalid entries are not made. Let us see more about this under module design.

Input design is the process of converting user-created input into a computer-based format. The goal of input design is to make data entry logical and free from errors. Errors in the input are controlled by the input design. The application has been developed in a user-friendly manner. The forms have been designed in such a way that during processing the cursor is placed in the position where data must be entered. The user is also provided with an option to select an appropriate input from various alternatives related to the field in certain cases.

Validations are required for each data item entered. Whenever a user enters erroneous data, an error message is displayed, and the user can move on to the subsequent pages only after completing all the entries in the current page.

OUTPUT DESIGN
The output from the computer is required mainly to create an efficient method of communication within the company, primarily among the project leader and his team members; in other words, the administrator and the clients. The output of the VPN is a system which allows the project leader to manage his clients in terms of creating new clients and assigning new projects to them, maintaining a record of project validity, and providing folder-level access to each client on the user side depending on the projects allotted to him. After completion of a project, a new project may be assigned to the client. User authentication procedures are maintained at the initial stages itself. A new user may be created by the administrator himself, or a user can register as a new user, but the task of assigning projects and validating a new user rests with the administrator only.

The application starts running when it is executed for the first time. The server has to be started, and then Internet Explorer is used as the browser. The project will run on the local area network, so the server machine will serve as the administrator while the other connected systems can act as the clients. The developed system is highly user friendly and can be easily understood by anyone using it, even for the first time.

FUNCTIONAL REQUIREMENTS:

A functional requirement defines a function of a software system or its component. A function is described as a set of inputs, processing and outputs.

Input: the system requires dataset information as input, which is used for evaluation.

Process: the processing depends on the analysis performed by the algorithms; the data is analyzed and searched using the chosen mining algorithms.

Store: the given input is stored in, and retrieved from, the database.

Output: the output is displayed depending on the mining algorithms used.

NON-FUNCTIONAL REQUIREMENTS:

Usability: This should be given the leading priority. Users should be able to log into the system with ease and should be able to access all grants. A user can learn to operate the system, prepare inputs for it, and interpret its outputs.

Reliability: This is the ability of a system component to perform its required functions under stated conditions for a specified period of time. Reliability includes the mean time to security attacks or failures. It is one of the main factors used to determine the important requirements of any application.

Performance: This is concerned with quantifiable attributes of the system. The system must have an internet facility to maintain an accurate date and time and to perform transfer operations.

Supportability: As this application is made up of Java resources, it should not be a problem moving to other server operating systems.

Implementation: The client is implemented in Java, so it can run on any browser where the user will be able to operate the system.

Operations: The operations requirements are constraints on the Boolean keywords and query conditions.

Extensibility: The system should be flexible in such a way that it can be easily extended in order to add more modules in the future.
Hardware Constraints:

Processor : Any processor above 500 MHz
RAM : 128 MB
Hard disk : 10 GB
Compact disk : 650 MB
Input device : Standard keyboard and mouse
Output device : VGA and high-resolution monitor

Software Constraints:

Operating system : Windows 2000 Server family
Techniques : Java
Front end : JSP
IDE : NetBeans
Database : MySQL
SYSTEM DESIGN

Identifying Design Goals

There are several reasons to identify the design goals of any system. These goals help to design the system in an efficient manner. There are several criteria to identify these goals; some of them are explained below:

Performance criteria:

a) Response time: The response time of the system is very low because of its simple design, developed on a high-performance system.

b) Throughput: The throughput of the system is high.

c) Memory: The memory used by the system is very low.

Dependability criteria:

a) Robustness: The system should be designed to work efficiently on images of any format without any problem.

b) Availability: The system should be ready to accept commands from the user at any point of time.

c) Fault tolerance: The system should not allow the user to work with faulty input. It displays error messages for every specific fault that occurs.

Maintenance criteria:

a) Portability: The system should work on all platforms, such as Linux and Windows.

b) Readability: The generated code should make the purpose of the project easy to understand, so that the user can make modifications easily.

c) Traceability: The generated code should be easy to map to the functions and operations selected by the user.
End-user criteria:

a) Utility: The system should be made to operate on all inputs of the end-user under any kind of circumstances. It should complete all the commands or instructions given by the user without any interruptions.

b) Usability: The user interface is to be defined with all the options which make the work of the end-user easier.

UML Diagrams

UML stands for Unified Modeling Language. This object-oriented system of notation has evolved from the work of Grady Booch, James Rumbaugh, Ivar Jacobson, and the Rational Software Corporation. These renowned computer scientists fused their respective technologies into a single, standardized model. Today, UML is accepted by the Object Management Group (OMG) as the standard for modeling object-oriented programs.

There are two broad categories of diagrams, and they are again divided into sub-categories:

• Structural Diagrams

• Behavioral Diagrams

Structural Diagrams:

The structural diagrams represent the static aspect of the system. These static aspects represent those parts of a diagram which form the main structure and are therefore stable. These static parts are represented by classes, interfaces, objects, components and nodes. The four structural diagrams are:
structural diagrams are:

• Class diagram
• Object diagram
• Component diagram
• Deployment diagram
Class Diagram:

Class diagrams are the most common diagrams used in UML. A class diagram consists of classes, interfaces, associations and collaborations. Class diagrams basically represent the object-oriented view of a system, which is static in nature. An active class is used in a class diagram to represent the concurrency of the system. A class diagram represents the object orientation of a system, so it is generally used for development purposes. This is the most widely used diagram at the time of system construction.

Object Diagram:

Object diagrams can be described as instances of class diagrams, so these diagrams are closer to real-life scenarios where we implement a system. Object diagrams are a set of objects and their relationships, just like class diagrams, and also represent the static view of the system. The usage of object diagrams is similar to that of class diagrams, but they are used to build a prototype of a system from a practical perspective.

Component Diagram:

Component diagrams represent a set of components and their relationships. These components consist of classes, interfaces or collaborations. So component diagrams represent the implementation view of a system.

During the design phase, software artifacts (classes, interfaces, etc.) of a system are arranged in different groups depending upon their relationships. These groups are known as components. Finally, component diagrams are used to visualize the implementation.

Deployment Diagram:

Deployment diagrams are a set of nodes and their relationships. These nodes are the physical entities where the components are deployed. Deployment diagrams are used for visualizing the deployment view of a system and are generally used by the deployment team.

Behavioral Diagrams: Any system can have two aspects, static and dynamic. So a model is considered complete when both aspects are covered fully. Behavioral diagrams basically capture the dynamic aspect of a system. The dynamic aspect can be further described as the changing/moving parts of a system.

UML has the following five types of behavioral diagrams:

• Use case diagram

• Sequence diagram

• Collaboration diagram

• State chart diagram

• Activity diagram

Use case Diagram:

Use case diagrams are a set of use cases, actors and their relationships. They represent the use case view of a system. A use case represents a particular functionality of a system. So a use case diagram is used to describe the relationships among the functionalities and their internal/external controllers. These controllers are known as actors.

Sequence Diagram:

A sequence diagram is an interaction diagram. From the name it is clear that the diagram deals with sequences, namely the sequences of messages flowing from one object to another. Interaction among the components of a system is very important from an implementation and execution perspective. So a sequence diagram is used to visualize the sequence of calls in a system that perform a specific functionality.

Collaboration Diagram:

A collaboration diagram is another form of interaction diagram. It represents the structural organization of a system and the messages sent/received. The structural organization consists of objects and links.

The purpose of a collaboration diagram is similar to that of a sequence diagram, but its specific purpose is to visualize the organization of objects and their interaction.
State chart Diagram:

Any real-time system is expected to react to some kind of internal/external events. These events are responsible for state changes in the system. A state chart diagram is used to represent the event-driven state change of a system. It basically describes the state change of a class, interface, etc. A state chart diagram is used to visualize the reaction of a system to internal/external factors.

Activity Diagram:

An activity diagram describes the flow of control in a system, so it consists of activities and links. The flow can be sequential, concurrent or branched. Activities are nothing but the functions of a system. A number of activity diagrams are prepared to capture the entire flow in a system. Activity diagrams are used to visualize the flow of controls in a system and are prepared to give an idea of how the system will work when executed.

Architecture Diagram
USE CASE DIAGRAM:

To model a system, the most important aspect is to capture its dynamic behaviour. To clarify a bit in detail, dynamic behaviour means the behaviour of the system when it is running/operating. So static behaviour alone is not sufficient to model a system; dynamic behaviour is more important than static behaviour.

In UML, there are five diagrams available to model the dynamic nature of a system, and the use case diagram is one of them. Since the use case diagram is dynamic in nature, there should be some internal or external factors for making the interaction. These internal and external agents are known as actors. So use case diagrams consist of actors, use cases and their relationships.

The diagram is used to model the system/subsystem of an application. A single use case diagram captures a particular functionality of a system, so to model the entire system a number of use case diagrams are used. A use case diagram at its simplest is a representation of a user's interaction with the system, depicting the specifications of a use case. A use case diagram can portray the different types of users of a system and the cases, and will often be accompanied by other types of diagrams as well.
[Use case diagram: actors LBS Provider, LBS User and Admin; use cases include Register, Login, View Query, Generate Key, Send to User, Receive Secret Key, Decrypt Key, Query Location and Store Location.]
CLASS DIAGRAM:

In software engineering, a class diagram in the Unified Modeling Language (UML) is a type of
static structure diagram that describes the structure of a system by showing the system's classes,
their attributes, operations (or methods), and the relationships among the classes. It explains
which class contains information.

[Class diagram: classes User, Provider and Admin, with attributes such as register, login and store location, and operations including register(), login(), view user(), view query(), send query(), generate key(), decrypt() and view location().]
SEQUENCE DIAGRAM:

A sequence diagram in Unified Modeling Language (UML) is a kind of interaction diagram that
shows how processes operate with one another and in what order.

[Sequence diagram among Provider, User and Admin: register, login, register, login, send query, view query, decrypt key, view query location, view source/destination, store location.]
COLLABORATION DIAGRAM

[Collaboration diagram among Provider, User and Admin, with messages: 1. register, 2. login, 3. register, 4. login, 5. send query, 6. view query, 7. decrypt key, 8. view query location, 9. view source/destination, 10. store location.]
ACTIVITY DIAGRAM:

Activity diagrams are graphical representations of workflows of stepwise activities and actions
with support for choice, iteration and concurrency. In the Unified Modeling Language, activity
diagrams can be used to describe the business and operational step-by-step workflows of
components in a system. An activity diagram shows the overall flow of control.

[Activity diagram across Provider, User and Admin lanes: register, login, view provider, send query, query user, decrypt, store location, logout.]
Technology description

Java Technology

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all
of the following buzzwords:

i. Simple
ii. Architecture neutral
iii. Object oriented
iv. Portable
v. Distributed
vi. High performance
vii. Interpreted
viii. Multithreaded
ix. Robust
With most programming languages, you either compile or interpret a program so that you
can run it on your computer. The Java programming language is unusual in that a program is
both compiled and interpreted. With the compiler, first you translate a program into an
intermediate language called Java byte codes, the platform-independent codes interpreted by
the interpreter on the Java platform. The interpreter parses and runs each Java byte code
instruction on the computer. Compilation happens just once; interpretation occurs each time the
program is executed. The following figure illustrates how this works.

FIGURE 3.1- WORKING OF JAVA


You can think of Java byte codes as the machine code instructions for the Java Virtual
Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser
that can run applets, is an implementation of the Java VM. Java byte codes help make “write
once, run anywhere” possible. You can compile your program into byte codes on any platform
that has a Java compiler. The byte codes can then be run on any implementation of the Java VM.
That means that as long as a computer has a Java VM, the same program written in the Java
programming language can run on Windows 2000, a Solaris workstation, or on an iMac.
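For example, the following minimal program, compiled once with javac into platform-independent byte codes, runs unchanged on any implementation of the Java VM:

// HelloWorld.java: compile once with "javac HelloWorld.java", which produces
// byte codes in HelloWorld.class; run with "java HelloWorld" on any Java VM.
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, world!");
    }
}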

The Java Platform

A platform is the hardware or software environment in which a program runs. The Java
platform differs from most other platforms in that it’s a software-only platform that runs on top
of other hardware-based platforms.

The Java platform has two components:

a. The Java Virtual Machine (Java VM)

b. The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is
ported onto various hardware-based platforms. The Java API is a large collection of ready-made
software components that provide many useful capabilities, such as graphical user interface
(GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these
libraries are known as packages. The following figure depicts a program that’s running on the
Java platform. As the figure shows, the Java API and the virtual machine insulate the program
from the hardware.

FIGURE 3.2- THE JAVA PLATFORM

Native code is code that, after you compile it, runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can bring performance close to that of native code without threatening portability.

Every full implementation of the Java platform gives you the following features:

i. The essentials: Objects, strings, threads, numbers, input and output, data structures,
system properties, date and time, and so on.

ii. Applets: The set of conventions used by applets.

iii. Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.

iv. Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.

v. Security: Both low level and high level, including electronic signatures, public and
private key management, access control, and certificates.

vi. Software components: Known as JavaBeans™, these can plug into existing component architectures.

vii. Object serialization: Allows lightweight persistence and communication via Remote
Method Invocation (RMI).

viii. Java Database Connectivity (JDBC™): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers,
collaboration, telephony, speech, animation, and more. The following figure depicts what is
included in the Java 2 SDK.
ODBC

Microsoft Open Database Connectivity (ODBC) is a standard programming interface for


application developers and database systems providers. Before ODBC became a de facto
standard for Windows programs to interface with database systems, programmers had to use
proprietary languages for each database they wanted to connect to. Now, ODBC has made the
choice of the database system almost irrelevant from a coding perspective, which is as it should
be. Application developers have much more important things to worry about than the syntax that
is needed to port their program from one database to another when business needs suddenly
change.

Through the ODBC Administrator in Control Panel, you can specify the particular
database that is associated with a data source that an ODBC application program is written to
use. Think of an ODBC data source as a door with a name on it. Each door will lead you to a
particular database. For example, the data source named Sales Figures might be a SQL Server
database, whereas the Accounts Payable data source could refer to an Access database. The
physical database referred to by a data source can reside anywhere on the LAN.

The ODBC system files are not installed on your system by Windows 95. Rather, they
are installed when you set up a separate database application, such as SQL Server Client or
Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called
ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-
alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program,
and each maintains a separate list of ODBC data sources.

The advantages of this scheme are so numerous that you are probably thinking there must
be some catch. The only disadvantage of ODBC is that it isn’t as efficient as talking directly to
the native database interface. ODBC has had many detractors make the charge that it is too slow.
Microsoft has always claimed that the critical factor in performance is the quality of the driver
software that is used. In our humble opinion, this is true. The availability of good ODBC drivers
has improved a great deal recently. And anyway, the criticism about performance is somewhat
analogous to those who said that compilers would never match the speed of pure assembly
language. Maybe not, but the compiler (or ODBC) gives you the opportunity to write cleaner
programs, which means you finish sooner. Meanwhile, computers get faster every year.

JDBC Goals:

1. SQL Level API

The designers felt that their main goal was to define a SQL interface for Java. Although
not the lowest database interface level possible, it is at a low enough level for higher-level tools
and APIs to be created. Conversely, it is at a high enough level for application programmers to
use it confidently.

Attaining this goal allows for future tool vendors to “generate” JDBC code and to hide many of
JDBC’s complexities from the end user.

2. SQL Conformance

SQL syntax varies as you move from database vendor to database vendor. In an effort to
support a wide variety of vendors, JDBC will allow any query statement to be passed through it
to the underlying database driver. This allows the connectivity module to handle non-standard
functionality in a manner that is suitable for its users.

3. JDBC must be implementable on top of common database interfaces

The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal allows JDBC to use existing ODBC level drivers by the use of a software interface. This interface would translate JDBC calls to ODBC and vice versa.

4. Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they
should not stray from the current design of the core Java system.

5. Keep it simple

This goal probably appears in all software design goal listings. JDBC is no exception.
Sun felt that the design of JDBC should be very simple, allowing for only one method of
completing a task per mechanism. Allowing duplicate functionality only serves to confuse the
users of the API.

6. Use strong, static typing wherever possible

Strong typing allows for more error checking to be done at compile time; also, fewer errors appear at runtime.

7. Keep the common cases simple

Because, more often than not, the usual SQL calls used by the programmer are simple, JDBC keeps these common cases correspondingly simple to write.
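The following is a minimal sketch of the SQL-level API in use. The connection URL, credentials, and the users table are placeholders, not part of any particular system.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Minimal JDBC usage sketch: open a connection, run a parameterized query,
// and iterate over the result set. All connection details are placeholders.
public class JdbcExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/testdb";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     "SELECT id, name FROM users WHERE id = ?")) {
            ps.setInt(1, 1);                       // bind the query parameter
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + " " + rs.getString("name"));
                }
            }
        }
    }
}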

NetBeans:

NetBeans is a software development platform written in Java. The NetBeans Platform


allows applications to be developed from a set of modular software components called modules.
Applications based on the NetBeans Platform, including the NetBeans integrated development
environment (IDE), can be extended by third party developers.

The NetBeans IDE is primarily intended for development in Java, but also supports other
languages, in particular PHP, C/C++ and HTML5.

NetBeans is cross-platform and runs on Microsoft Windows, Mac OS X, Linux, Solaris and
other platforms supporting a compatible JVM.

History:

NetBeans began in 1996 as Xelfi (a word play on Delphi), a Java IDE student project under the guidance of the Faculty of Mathematics and Physics at Charles University in Prague. In 1997, Roman Staněk formed a company around the project and produced commercial versions of the NetBeans IDE until it was bought by Sun Microsystems in 1999. Sun open-sourced the NetBeans IDE in June of the following year. Since then, the NetBeans community has continued to grow. In 2010, Sun (and thus NetBeans) was acquired by Oracle Corporation.

NetBeans Platform:
The NetBeans Platform is a framework for simplifying the development of Java Swing desktop
applications. The NetBeans IDE bundle for Java SE contains what is needed to start developing
NetBeans plugins and NetBeans Platform based applications; no additional SDK is required.
Applications can install modules dynamically. Any application can include the Update Center
module to allow users of the application to download digitally signed upgrades and new features
directly into the running application. Reinstalling an upgrade or a new release does not force
users to download the entire application again. The platform offers reusable services common to
desktop applications, allowing developers to focus on the logic specific to their application.
Among the features of the platform are:
i. User interface management (e.g. menus and toolbars)
ii. User settings management
iii. Storage management (saving and loading any kind of data)
iv. Window management
v. Wizard framework (supports step-by-step dialogs)

vi. NetBeans Visual Library

vii. Integrated development tools

NetBeans IDE :

NetBeans IDE is an open-source integrated development environment. NetBeans IDE supports


development of all Java application types (Java SE (including JavaFX), Java ME, web, EJB and mobile applications) out of the box. Among other features are an Ant-based project system, Maven support, refactorings, and version control (supporting CVS, Subversion, Git, Mercurial and ClearCase).

All the functions of the IDE are provided by modules. Each module provides a well-defined
function, such as support for the Java language, editing, or support for versioning systems such as CVS and SVN. NetBeans contains all the modules needed for Java development in a single
download, allowing the user to start working immediately. Modules also allow NetBeans to be
extended. New features, such as support for other programming languages, can be added by
installing additional modules. For instance, Sun Studio, Sun Java Studio Enterprise, and Sun
Java Studio Creator from Sun Microsystems are all based on the NetBeans IDE.

JavaScript and Ajax Development


JavaScript is an object-oriented scripting language primarily used in client-side interfaces for
web applications. Ajax (Asynchronous JavaScript and XML) is a Web 2.0 technique that allows
changes to occur in a web page without the need to perform a page refresh. JavaScript toolkits
can be leveraged to implement Ajax-enabled components and functionality in web pages.

Web Server and Client


A web server is software that can process a client request and send the response back to the client. For example, Apache is one of the most widely used web servers. A web server runs on some physical machine and listens to client requests on a specific port.
A web client is software that helps in communicating with the server. Some of the most widely used web clients are Firefox, Google Chrome, Safari, etc. When we request something from a server (through a URL), the web client takes care of creating the request, sending it to the server, and then parsing the server response and presenting it to the user.

HTML and HTTP


The web server and web client are two separate pieces of software, so there should be some common language for communication. HTML is that common language between server and client; it stands for HyperText Markup Language.
The web server and client also need a common communication protocol; HTTP (HyperText Transfer Protocol) is the communication protocol between server and client. HTTP runs on top of the TCP/IP communication protocol.
Some of the important parts of HTTP Request are:
 HTTP Method – action to be performed, usually GET, POST, PUT etc.
 URL – Page to access
 Form Parameters – similar to arguments in a java method, for example the user and password details from the login page.
Sample HTTP Request:
GET /FirstServletProject/jsps/hello.jsp HTTP/1.1
Host: localhost:8080
Cache-Control: no-cache
Some of the important parts of HTTP Response are:
 Status Code – an integer to indicate whether the request was successful or not. Some of the well-known status codes are 200 for success, 404 for Not Found and 403 for Access Forbidden.
 Content Type – text, html, image, pdf etc. Also known as MIME type
 Content – actual data that is rendered by client and shown to user.
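A matching sample HTTP response might look like the following (the header values and body are illustrative):

HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 55

<html><body><h1>Hello from hello.jsp</h1></body></html>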

MIME Type or Content Type: If you look at the sample HTTP response header above, it contains the tag “Content-Type”. It is also called the MIME type, and the server sends it to the client to let it know the kind of data being sent. It helps the client in rendering the data for the user. Some of the most used MIME types are text/html, text/xml, application/xml etc.

Understanding URL
URL is an acronym for Uniform Resource Locator, and it is used to locate the server and the resource. Every resource on the web has its own unique address. Let’s see the parts of a URL with an example.
http://localhost:8080/FirstServletProject/jsps/hello.jsp

http:// – This is the first part of the URL, and it specifies the communication protocol to be used in server-client communication.

localhost – The unique address of the server; most of the time it’s the hostname of the server that maps to a unique IP address. Sometimes multiple hostnames point to the same IP address, and the web server’s virtual host takes care of sending the request to the particular server instance.

8080 – This is the port on which the server is listening; it’s optional, and if we don’t provide it in the URL then the request goes to the default port of the protocol. Port numbers 0 to 1023 are reserved for well-known services, for example 80 for HTTP, 443 for HTTPS, 21 for FTP, etc.

FirstServletProject/jsps/hello.jsp – The resource requested from the server. It can be a static HTML page, a PDF, a JSP, a servlet, PHP, etc.
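As a small illustration, the standard java.net.URL class exposes exactly these parts of a URL:

import java.net.URL;

// Prints the protocol, host, port and resource parts described above.
public class UrlParts {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8080/FirstServletProject/jsps/hello.jsp");
        System.out.println("Protocol: " + url.getProtocol()); // http
        System.out.println("Host: " + url.getHost());         // localhost
        System.out.println("Port: " + url.getPort());         // 8080
        System.out.println("Resource: " + url.getPath());     // /FirstServletProject/jsps/hello.jsp
    }
}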
Why we need Servlet and JSPs?
Web servers are good for static content such as HTML pages, but they don’t know how to generate dynamic content or how to save data into databases, so we need another tool that we can use to generate dynamic content. There are several programming languages and frameworks for dynamic content, such as PHP, Python, Ruby on Rails, and Java Servlets and JSPs.
Java Servlets and JSPs are server-side technologies that extend the capability of web servers by providing support for dynamic responses and data persistence.
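A minimal servlet sketch is shown below, using the classic javax.servlet API; the URL mapping (via web.xml or the @WebServlet annotation) is omitted, and the class name is illustrative.

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// The container calls doGet() for HTTP GET requests; the servlet writes a
// dynamically generated page into the response.
public class HelloServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        response.setContentType("text/html");
        response.getWriter().println(
                "<html><body>Hello at " + new java.util.Date() + "</body></html>");
    }
}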

Web Container
Tomcat is a web container. When a request is made from a client to the web server, the server passes the request to the web container, and it is the web container's job to find the correct resource to handle the request (a servlet or JSP), then use the response from that resource to generate the response and provide it to the web server. The web server then sends the response back to the client.
When the web container gets a request for a servlet, the container creates two objects, HttpServletRequest and HttpServletResponse. It then finds the correct servlet based on the URL and creates a thread for the request. It then invokes the servlet's service() method and, based on the HTTP method, service() invokes the doGet() or doPost() method. The servlet methods generate the dynamic page and write it to the response. Once the servlet thread is complete, the container converts the response to an HTTP response and sends it back to the client.
Some of the important work done by the web container includes:
 Communication Support – The container provides an easy way of communication between the web server and the servlets and JSPs. Because of the container, we don't need to build a server socket to listen for requests from the web server, parse the request, and generate the response. All these important and complex tasks are done by the container, and all we need to focus on is the business logic of our applications.
 Lifecycle and Resource Management – The container takes care of managing the life cycle of a servlet: loading servlets into memory, initializing them, invoking their methods, and destroying them. The container also provides utilities such as JNDI for resource pooling and management.
 Multithreading Support – The container creates a new thread for every request to the servlet, and when the request is processed, the thread dies. So servlets are not re-initialized for each request, which saves time and memory.
 JSP Support – JSPs don't look like normal Java classes, and the web container provides support for them. Every JSP in the application is compiled by the container and converted to a servlet, and then the container manages it like other servlets.
 Miscellaneous Tasks – The web container manages the resource pool, performs memory optimizations, runs the garbage collector, provides security configurations, and supports multiple applications, hot deployment, and several other tasks behind the scenes that make our life easier.

Coding
package performance.evaluation;

import weka.classifiers.Classifier;
import weka.classifiers.Sourcable;
import weka.classifiers.trees.j48.BinC45ModelSelection;
import weka.classifiers.trees.j48.C45ModelSelection;
import weka.classifiers.trees.j48.C45PruneableClassifierTree;
import weka.classifiers.trees.j48.ClassifierTree;
import weka.classifiers.trees.j48.ModelSelection;
import weka.classifiers.trees.j48.PruneableClassifierTree;
import weka.core.AdditionalMeasureProducer;
import weka.core.Capabilities;
import weka.core.Drawable;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.Matchable;
import weka.core.Option;
import weka.core.OptionHandler;
import weka.core.RevisionUtils;
import weka.core.Summarizable;
import weka.core.TechnicalInformation;
import weka.core.TechnicalInformationHandler;
import weka.core.Utils;
import weka.core.WeightedInstancesHandler;
import weka.core.TechnicalInformation.Field;
import weka.core.TechnicalInformation.Type;
import java.util.Enumeration;
import java.util.Vector;

/**
<!-- globalinfo-start -->
* Class for generating a pruned or unpruned C4.5 decision tree. For more information, see<br/>
* <br/>
* Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers,
San Mateo, CA.
* <p/>
<!-- globalinfo-end -->
*
<!-- technical-bibtex-start -->
* BibTeX:
* <pre>
* &#64;book{Quinlan1993,
* address = {San Mateo, CA},
* author = {Ross Quinlan},
* publisher = {Morgan Kaufmann Publishers},
* title = {C4.5: Programs for Machine Learning},
* year = {1993}
*}
* </pre>
* <p/>
<!-- technical-bibtex-end -->
*
<!-- options-start -->
* Valid options are: <p/>
*
* <pre> -U
* Use unpruned tree.</pre>
*
* <pre> -C &lt;pruning confidence&gt;
* Set confidence threshold for pruning.
* (default 0.25)</pre>
*
* <pre> -M &lt;minimum number of instances&gt;
* Set minimum number of instances per leaf.
* (default 2)</pre>
*
* <pre> -R
* Use reduced error pruning.</pre>
*
* <pre> -N &lt;number of folds&gt;
* Set number of folds for reduced error
* pruning. One fold is used as pruning set.
* (default 3)</pre>
*
* <pre> -B
* Use binary splits only.</pre>
*
* <pre> -S
* Don't perform subtree raising.</pre>
*
* <pre> -L
* Do not clean up after the tree has been built.</pre>
*
* <pre> -A
* Laplace smoothing for predicted probabilities.</pre>
*
* <pre> -Q &lt;seed&gt;
* Seed for random data shuffling (default 1).</pre>
*
<!-- options-end -->
*
* @author Eibe Frank (eibe@cs.waikato.ac.nz)
* @version $Revision: 1.9 $
*/
public class C45
extends Classifier
implements OptionHandler, Drawable, Matchable, Sourcable,
WeightedInstancesHandler, Summarizable, AdditionalMeasureProducer,
TechnicalInformationHandler {

/** for serialization */


static final long serialVersionUID = -217733168393644444L;

/** The decision tree */


private ClassifierTree m_root;

/** Unpruned tree? */


private boolean m_unpruned = false;

/** Confidence level */


private float m_CF = 0.25f;

/** Minimum number of instances */


private int m_minNumObj = 2;

/** Determines whether probabilities are smoothed using


Laplace correction when predictions are generated */
private boolean m_useLaplace = false;
/** Use reduced error pruning? */
private boolean m_reducedErrorPruning = false;

/** Number of folds for reduced error pruning. */


private int m_numFolds = 3;

/** Binary splits on nominal attributes? */


private boolean m_binarySplits = false;

/** Subtree raising to be performed? */


private boolean m_subtreeRaising = true;

/** Cleanup after the tree has been built. */


private boolean m_noCleanup = false;

/** Random number seed for reduced-error pruning. */


private int m_Seed = 1;

/**
* Returns a string describing classifier
* @return a description suitable for
* displaying in the explorer/experimenter gui
*/
public String globalInfo() {

return "Class for generating a pruned or unpruned C4.5 decision tree. For more "
+ "information, see\n\n"
+ getTechnicalInformation().toString();
}

/**
* Returns an instance of a TechnicalInformation object, containing
* detailed information about the technical background of this class,
* e.g., paper reference or book this class is based on.
*
* @return the technical information about this class
*/
public TechnicalInformation getTechnicalInformation() {
TechnicalInformation result;

result = new TechnicalInformation(Type.BOOK);


result.setValue(Field.AUTHOR, "Ross Quinlan");
result.setValue(Field.YEAR, "1993");
result.setValue(Field.TITLE, "C4.5: Programs for Machine Learning");
result.setValue(Field.PUBLISHER, "Morgan Kaufmann Publishers");
result.setValue(Field.ADDRESS, "San Mateo, CA");

return result;
}

/**
* Returns default capabilities of the classifier.
*
* @return the capabilities of this classifier
*/
public Capabilities getCapabilities() {
Capabilities result;

try {
if (!m_reducedErrorPruning)
result = new C45PruneableClassifierTree(null, !m_unpruned, m_CF, m_subtreeRaising, !
m_noCleanup).getCapabilities();
else
result = new PruneableClassifierTree(null, !m_unpruned, m_numFolds, !m_noCleanup,
m_Seed).getCapabilities();
}
catch (Exception e) {
result = new Capabilities(this);
}

result.setOwner(this);

return result;
}

/**
* Generates the classifier.
*
* @param instances the data to train the classifier with
* @throws Exception if classifier can't be built successfully
*/
public void buildClassifier(Instances instances)
throws Exception {

ModelSelection modSelection;

if (m_binarySplits)
modSelection = new BinC45ModelSelection(m_minNumObj, instances);
else
modSelection = new C45ModelSelection(m_minNumObj, instances);
if (!m_reducedErrorPruning)
m_root = new C45PruneableClassifierTree(modSelection, !m_unpruned, m_CF,
m_subtreeRaising, !m_noCleanup);
else
m_root = new PruneableClassifierTree(modSelection, !m_unpruned, m_numFolds,
!m_noCleanup, m_Seed);
m_root.buildClassifier(instances);
if (m_binarySplits) {
((BinC45ModelSelection)modSelection).cleanup();
} else {
((C45ModelSelection)modSelection).cleanup();
}
}

/**
* Classifies an instance.
*
* @param instance the instance to classify
* @return the classification for the instance
* @throws Exception if instance can't be classified successfully
*/
public double classifyInstance(Instance instance) throws Exception {

return m_root.classifyInstance(instance);
}

/**
* Returns class probabilities for an instance.
*
* @param instance the instance to calculate the class probabilities for
* @return the class probabilities
* @throws Exception if distribution can't be computed successfully
*/
public final double [] distributionForInstance(Instance instance)
throws Exception {

return m_root.distributionForInstance(instance, m_useLaplace);
}

/**
* Returns the type of graph this classifier
* represents.
* @return Drawable.TREE
*/
public int graphType() {
return Drawable.TREE;
}
/**
* Returns graph describing the tree.
*
* @return the graph describing the tree
* @throws Exception if graph can't be computed
*/
public String graph() throws Exception {

return m_root.graph();
}

/**
* Returns tree in prefix order.
*
* @return the tree in prefix order
* @throws Exception if something goes wrong
*/
public String prefix() throws Exception {

return m_root.prefix();
}

/**
* Returns tree as an if-then statement.
*
* @param className the name of the Java class
* @return the tree as a Java if-then type statement
* @throws Exception if something goes wrong
*/
public String toSource(String className) throws Exception {

StringBuffer [] source = m_root.toSource(className);
return
"class " + className + " {\n\n"
+" public static double classify(Object[] i)\n"
+" throws Exception {\n\n"
+" double p = Double.NaN;\n"
+ source[0] // Assignment code
+" return p;\n"
+" }\n"
+ source[1] // Support code
+"}\n";
}

/**
* Returns an enumeration describing the available options.
*
* Valid options are: <p>
*
* -U <br>
* Use unpruned tree.<p>
*
* -C confidence <br>
* Set confidence threshold for pruning. (Default: 0.25) <p>
*
* -M number <br>
* Set minimum number of instances per leaf. (Default: 2) <p>
*
* -R <br>
* Use reduced error pruning. No subtree raising is performed. <p>
*
* -N number <br>
* Set number of folds for reduced error pruning. One fold is
* used as the pruning set. (Default: 3) <p>
*
* -B <br>
* Use binary splits for nominal attributes. <p>
*
* -S <br>
* Don't perform subtree raising. <p>
*
 * -L <br>
 * Do not clean up after the tree has been built. <p>
 *
 * -A <br>
 * If set, Laplace smoothing is used for predicted probabilities. <p>
*
* -Q <br>
* The seed for reduced-error pruning. <p>
*
* @return an enumeration of all the available options.
*/
public Enumeration listOptions() {

Vector newVector = new Vector(9);

newVector.
addElement(new Option("\tUse unpruned tree.",
"U", 0, "-U"));
newVector.
addElement(new Option("\tSet confidence threshold for pruning.\n" +
"\t(default 0.25)",
"C", 1, "-C <pruning confidence>"));
newVector.
addElement(new Option("\tSet minimum number of instances per leaf.\n" +
"\t(default 2)",
"M", 1, "-M <minimum number of instances>"));
newVector.
addElement(new Option("\tUse reduced error pruning.",
"R", 0, "-R"));
newVector.
addElement(new Option("\tSet number of folds for reduced error\n" +
"\tpruning. One fold is used as pruning set.\n" +
"\t(default 3)",
"N", 1, "-N <number of folds>"));
newVector.
addElement(new Option("\tUse binary splits only.",
"B", 0, "-B"));
newVector.
addElement(new Option("\tDon't perform subtree raising.",
"S", 0, "-S"));
newVector.
addElement(new Option("\tDo not clean up after the tree has been built.",
"L", 0, "-L"));
newVector.
addElement(new Option("\tLaplace smoothing for predicted probabilities.",
"A", 0, "-A"));
newVector.
addElement(new Option("\tSeed for random data shuffling (default 1).",
"Q", 1, "-Q <seed>"));

return newVector.elements();
}

/**
* Parses a given list of options.
*
<!-- options-start -->
* Valid options are: <p/>
*
* <pre> -U
* Use unpruned tree.</pre>
*
* <pre> -C &lt;pruning confidence&gt;
* Set confidence threshold for pruning.
* (default 0.25)</pre>
*
* <pre> -M &lt;minimum number of instances&gt;
* Set minimum number of instances per leaf.
* (default 2)</pre>
*
* <pre> -R
* Use reduced error pruning.</pre>
*
* <pre> -N &lt;number of folds&gt;
* Set number of folds for reduced error
* pruning. One fold is used as pruning set.
* (default 3)</pre>
*
* <pre> -B
* Use binary splits only.</pre>
*
* <pre> -S
* Don't perform subtree raising.</pre>
*
* <pre> -L
* Do not clean up after the tree has been built.</pre>
*
* <pre> -A
* Laplace smoothing for predicted probabilities.</pre>
*
* <pre> -Q &lt;seed&gt;
* Seed for random data shuffling (default 1).</pre>
*
<!-- options-end -->
*
* @param options the list of options as an array of strings
* @throws Exception if an option is not supported
*/
public void setOptions(String[] options) throws Exception {

// Other options
String minNumString = Utils.getOption('M', options);
if (minNumString.length() != 0) {
m_minNumObj = Integer.parseInt(minNumString);
} else {
m_minNumObj = 2;
}
m_binarySplits = Utils.getFlag('B', options);
m_useLaplace = Utils.getFlag('A', options);

// Pruning options
m_unpruned = Utils.getFlag('U', options);
m_subtreeRaising = !Utils.getFlag('S', options);
m_noCleanup = Utils.getFlag('L', options);
if ((m_unpruned) && (!m_subtreeRaising)) {
throw new Exception("Subtree raising doesn't need to be unset for unpruned tree!");
}
m_reducedErrorPruning = Utils.getFlag('R', options);
if ((m_unpruned) && (m_reducedErrorPruning)) {
throw new Exception("Unpruned tree and reduced error pruning can't be selected " +
"simultaneously!");
}
String confidenceString = Utils.getOption('C', options);
if (confidenceString.length() != 0) {
if (m_reducedErrorPruning) {
throw new Exception("Setting the confidence doesn't make sense " +
"for reduced error pruning.");
} else if (m_unpruned) {
throw new Exception("Doesn't make sense to change confidence for unpruned "
+"tree!");
} else {
m_CF = (new Float(confidenceString)).floatValue();
if ((m_CF <= 0) || (m_CF >= 1)) {
throw new Exception("Confidence has to be greater than zero and smaller " +
"than one!");
}
}
} else {
m_CF = 0.25f;
}
String numFoldsString = Utils.getOption('N', options);
if (numFoldsString.length() != 0) {
if (!m_reducedErrorPruning) {
throw new Exception("Setting the number of folds" +
" doesn't make sense if" +
" reduced error pruning is not selected.");
} else {
m_numFolds = Integer.parseInt(numFoldsString);
}
} else {
m_numFolds = 3;
}
String seedString = Utils.getOption('Q', options);
if (seedString.length() != 0) {
m_Seed = Integer.parseInt(seedString);
} else {
m_Seed = 1;
}
}
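
// Example (hypothetical) programmatic use of setOptions, assuming
// weka.core.Utils is available on the classpath; the option strings
// below are illustrative, not prescribed by the project:
//
//   C45 tree = new C45();
//   tree.setOptions(Utils.splitOptions("-C 0.25 -M 2"));   // C4.5 pruning, min. 2 instances per leaf
//   tree.setOptions(Utils.splitOptions("-R -N 5 -Q 42"));  // reduced-error pruning with 5 folds
//
// Incompatible combinations such as "-U -R" are rejected by the checks above.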

/**
* Gets the current settings of the Classifier.
*
* @return an array of strings suitable for passing to setOptions
*/
public String [] getOptions() {

String [] options = new String [14];
int current = 0;

if (m_noCleanup) {
options[current++] = "-L";
}
if (m_unpruned) {
options[current++] = "-U";
} else {
if (!m_subtreeRaising) {
options[current++] = "-S";
}
if (m_reducedErrorPruning) {
options[current++] = "-R";
options[current++] = "-N"; options[current++] = "" + m_numFolds;
options[current++] = "-Q"; options[current++] = "" + m_Seed;
} else {
options[current++] = "-C"; options[current++] = "" + m_CF;
}
}
if (m_binarySplits) {
options[current++] = "-B";
}
options[current++] = "-M"; options[current++] = "" + m_minNumObj;
if (m_useLaplace) {
options[current++] = "-A";
}

while (current < options.length) {
options[current++] = "";
}
return options;
}

/**
* Returns the tip text for this property
* @return tip text for this property suitable for
* displaying in the explorer/experimenter gui
*/
public String seedTipText() {
return "The seed used for randomizing the data " +
"when reduced-error pruning is used.";
}

/**
* Get the value of Seed.
*
* @return Value of Seed.
*/
public int getSeed() {

return m_Seed;
}

/**
* Set the value of Seed.
*
* @param newSeed Value to assign to Seed.
*/
public void setSeed(int newSeed) {

m_Seed = newSeed;
}

/**
* Returns the tip text for this property
* @return tip text for this property suitable for
* displaying in the explorer/experimenter gui
*/
public String useLaplaceTipText() {
return "Whether counts at leaves are smoothed based on Laplace.";
}

/**
* Get the value of useLaplace.
*
* @return Value of useLaplace.
*/
public boolean getUseLaplace() {

return m_useLaplace;
}

/**
* Set the value of useLaplace.
*
* @param newuseLaplace Value to assign to useLaplace.
*/
public void setUseLaplace(boolean newuseLaplace) {

m_useLaplace = newuseLaplace;
}

/**
* Returns a description of the classifier.
*
* @return a description of the classifier
*/
public String toString() {

if (m_root == null) {
return "No classifier built";
}
if (m_unpruned)
return "J48 unpruned tree\n------------------\n" + m_root.toString();
else
return "J48 pruned tree\n------------------\n" + m_root.toString();
}

/**
* Returns a superconcise version of the model
*
* @return a summary of the model
*/
public String toSummaryString() {

return "Number of leaves: " + m_root.numLeaves() + "\n"


+ "Size of the tree: " + m_root.numNodes() + "\n";
}

/**
* Returns the size of the tree
* @return the size of the tree
*/
public double measureTreeSize() {
return m_root.numNodes();
}

/**
* Returns the number of leaves
* @return the number of leaves
*/
public double measureNumLeaves() {
return m_root.numLeaves();
}

/**
* Returns the number of rules (same as number of leaves)
* @return the number of rules
*/
public double measureNumRules() {
return m_root.numLeaves();
}

/**
* Returns an enumeration of the additional measure names
* @return an enumeration of the measure names
*/
public Enumeration enumerateMeasures() {
Vector newVector = new Vector(3);
newVector.addElement("measureTreeSize");
newVector.addElement("measureNumLeaves");
newVector.addElement("measureNumRules");
return newVector.elements();
}

/**
* Returns the value of the named measure
* @param additionalMeasureName the name of the measure to query for its value
* @return the value of the named measure
* @throws IllegalArgumentException if the named measure is not supported
*/
public double getMeasure(String additionalMeasureName) {
if (additionalMeasureName.compareToIgnoreCase("measureNumRules") == 0) {
return measureNumRules();
} else if (additionalMeasureName.compareToIgnoreCase("measureTreeSize") == 0) {
return measureTreeSize();
} else if (additionalMeasureName.compareToIgnoreCase("measureNumLeaves") == 0) {
return measureNumLeaves();
} else {
throw new IllegalArgumentException(additionalMeasureName
+ " not supported (j48)");
}
}

/**
* Returns the tip text for this property
* @return tip text for this property suitable for
* displaying in the explorer/experimenter gui
*/
public String unprunedTipText() {
return "Whether pruning is performed.";
}

/**
* Get the value of unpruned.
*
* @return Value of unpruned.
*/
public boolean getUnpruned() {

return m_unpruned;
}

/**
* Set the value of unpruned. Turns reduced-error pruning
* off if set.
* @param v Value to assign to unpruned.
*/
public void setUnpruned(boolean v) {

if (v) {
m_reducedErrorPruning = false;
}
m_unpruned = v;
}

/**
* Returns the tip text for this property
* @return tip text for this property suitable for
* displaying in the explorer/experimenter gui
*/
public String confidenceFactorTipText() {
return "The confidence factor used for pruning (smaller values incur "
+ "more pruning).";
}

/**
* Get the value of CF.
*
* @return Value of CF.
*/
public float getConfidenceFactor() {

return m_CF;
}

/**
* Set the value of CF.
*
* @param v Value to assign to CF.
*/
public void setConfidenceFactor(float v) {

m_CF = v;
}

/**
* Returns the tip text for this property
* @return tip text for this property suitable for
* displaying in the explorer/experimenter gui
*/
public String minNumObjTipText() {
return "The minimum number of instances per leaf.";
}

/**
* Get the value of minNumObj.
*
* @return Value of minNumObj.
*/
public int getMinNumObj() {

return m_minNumObj;
}

/**
* Set the value of minNumObj.
*
* @param v Value to assign to minNumObj.
*/
public void setMinNumObj(int v) {

m_minNumObj = v;
}

/**
* Returns the tip text for this property
* @return tip text for this property suitable for
* displaying in the explorer/experimenter gui
*/
public String reducedErrorPruningTipText() {
return "Whether reduced-error pruning is used instead of C.4.5 pruning.";
}

/**
* Get the value of reducedErrorPruning.
*
* @return Value of reducedErrorPruning.
*/
public boolean getReducedErrorPruning() {

return m_reducedErrorPruning;
}

/**
* Set the value of reducedErrorPruning. Turns
* unpruned trees off if set.
*
* @param v Value to assign to reducedErrorPruning.
*/
public void setReducedErrorPruning(boolean v) {

if (v) {
m_unpruned = false;
}
m_reducedErrorPruning = v;
}

/**
* Returns the tip text for this property
* @return tip text for this property suitable for
* displaying in the explorer/experimenter gui
*/
public String numFoldsTipText() {
return "Determines the amount of data used for reduced-error pruning. "
+ " One fold is used for pruning, the rest for growing the tree.";
}

/**
* Get the value of numFolds.
*
* @return Value of numFolds.
*/
public int getNumFolds() {

return m_numFolds;
}
/**
* Set the value of numFolds.
*
* @param v Value to assign to numFolds.
*/
public void setNumFolds(int v) {

m_numFolds = v;
}

/**
* Returns the tip text for this property
* @return tip text for this property suitable for
* displaying in the explorer/experimenter gui
*/
public String binarySplitsTipText() {
return "Whether to use binary splits on nominal attributes when "
+ "building the trees.";
}

/**
* Get the value of binarySplits.
*
* @return Value of binarySplits.
*/
public boolean getBinarySplits() {

return m_binarySplits;
}

/**
* Set the value of binarySplits.
*
* @param v Value to assign to binarySplits.
*/
public void setBinarySplits(boolean v) {

m_binarySplits = v;
}

/**
* Returns the tip text for this property
* @return tip text for this property suitable for
* displaying in the explorer/experimenter gui
*/
public String subtreeRaisingTipText() {
return "Whether to consider the subtree raising operation when pruning.";
}

/**
* Get the value of subtreeRaising.
*
* @return Value of subtreeRaising.
*/
public boolean getSubtreeRaising() {

return m_subtreeRaising;
}

/**
* Set the value of subtreeRaising.
*
* @param v Value to assign to subtreeRaising.
*/
public void setSubtreeRaising(boolean v) {

m_subtreeRaising = v;
}

/**
* Returns the tip text for this property
* @return tip text for this property suitable for
* displaying in the explorer/experimenter gui
*/
public String saveInstanceDataTipText() {
return "Whether to save the training data for visualization.";
}

/**
* Check whether instance data is to be saved.
*
* @return true if instance data is saved
*/
public boolean getSaveInstanceData() {

return m_noCleanup;
}

/**
* Set whether instance data is to be saved.
* @param v true if instance data is to be saved
*/
public void setSaveInstanceData(boolean v) {
m_noCleanup = v;
}

/**
* Returns the revision string.
*
* @return the revision
*/
public String getRevision() {
return RevisionUtils.extract("$Revision: 1.9 $");
}

/**
* Main method for testing this class
*
* @param argv the commandline options
*/
public static void main(String [] argv) {
// Default to the sample dataset when no options are supplied;
// runClassifier expects the training file via the -t option.
// (The original unconditional argv[0] assignment failed on an
// empty argument list and silently discarded the first option.)
if (argv.length == 0) {
argv = new String[] {"-t", "weather3.arff"};
}
runClassifier(new C45(), argv);
}
}
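
For reference, the following minimal driver is a sketch of how the classifier above can be trained and queried from code. It assumes the WEKA 3.x API (weka.core, weka.core.converters) and an ARFF file named weather3.arff whose last attribute is the class; the class name C45Demo is illustrative.

import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class C45Demo {
  public static void main(String[] args) throws Exception {
    // Load the training data and mark the last attribute as the class
    Instances data = DataSource.read("weather3.arff");
    data.setClassIndex(data.numAttributes() - 1);

    // Build a C4.5 tree, explicitly setting the default pruning parameters
    C45 tree = new C45();
    tree.setConfidenceFactor(0.25f);
    tree.setMinNumObj(2);
    tree.buildClassifier(data);

    // Classify the first training instance and print the model
    Instance first = data.instance(0);
    double label = tree.classifyInstance(first);
    System.out.println("Predicted class: " + data.classAttribute().value((int) label));
    System.out.println(tree);
  }
}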
TESTING

Software testing is an investigation conducted to provide stakeholders with information about the quality of the product or service under test. Software testing can also provide an objective, independent view of the software, allowing the business to appreciate and understand the risks of software implementation. Test techniques include, but are not limited to, the process of executing a program or application with the intent of finding software bugs (errors or other defects).

Software testing can be stated as the process of validating and verifying that a computer program/application/product:

• meets the requirements that guided its design and development,

• works as expected,

• can be implemented with the same characteristics, and

• satisfies the needs of stakeholders.

Software testing, depending on the testing method employed, can be implemented at any time in
the software development process.

Testing levels

There are generally four recognized levels of tests: unit testing, integration testing,
system testing, and acceptance testing. Tests are frequently grouped by where they are added in
the software development process, or by the level of specificity of the test.

Unit testing

Unit testing, also known as component testing, refers to tests that verify the functionality of a specific section of code, usually at the function level. In an object-oriented environment, this is usually at the class level, and the minimal unit tests include the constructors and destructors. These tests are usually written by developers as they work on code (white-box style), to ensure that the specific function works as expected. One function might have multiple tests, to catch corner cases or other branches in the code. Unit testing alone cannot verify the functionality of a piece of software; rather, it is used to ensure that the building blocks of the software work independently of each other.

Unit testing is a software development process that involves the synchronized application of a broad spectrum of defect prevention and detection strategies in order to reduce software development risks, time, and costs. It is performed by the software developer or engineer during the construction phase of the software development lifecycle. Rather than replacing traditional QA, it augments it. Unit testing aims to eliminate construction errors before code is promoted to QA; this strategy is intended to increase both the quality of the resulting software and the efficiency of the overall development and QA process.
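
As a concrete illustration, the following unit-test sketch exercises the C45 class in isolation. It assumes JUnit 4 and the WEKA 3.x API; the test-class name, the weather3.arff file, and the specific assertions are illustrative assumptions, not recorded project artifacts.

import static org.junit.Assert.*;

import org.junit.Before;
import org.junit.Test;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class C45UnitTest {
  private Instances data;

  @Before
  public void loadData() throws Exception {
    // Small, fixed dataset so the tests are deterministic
    data = DataSource.read("weather3.arff");
    data.setClassIndex(data.numAttributes() - 1);
  }

  @Test
  public void buildsATreeWithDefaults() throws Exception {
    C45 tree = new C45();
    tree.buildClassifier(data);
    // A built tree must report a positive size and at least one leaf
    assertTrue(tree.measureTreeSize() >= 1);
    assertTrue(tree.measureNumLeaves() >= 1);
  }

  @Test
  public void predictionsAreValidClassIndices() throws Exception {
    C45 tree = new C45();
    tree.buildClassifier(data);
    for (int i = 0; i < data.numInstances(); i++) {
      double label = tree.classifyInstance(data.instance(i));
      assertTrue(label >= 0 && label < data.numClasses());
    }
  }

  @Test(expected = Exception.class)
  public void rejectsUnprunedWithReducedErrorPruning() throws Exception {
    // setOptions must refuse -U together with -R (see setOptions above)
    new C45().setOptions(weka.core.Utils.splitOptions("-U -R"));
  }
}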

Integration testing

Integration testing is any type of software testing that seeks to verify the interfaces
between components against a software design. Software components may be integrated in an
iterative way or all together. Normally the former is considered a better practice since it allows
interface issues to be located more quickly and fixed.

Integration testing works to expose defects in the interfaces and interaction between integrated
components (modules). Progressively larger groups of tested software components
corresponding to elements of the architectural design are integrated and tested until the software
works as a system.
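
For this project, an integration-level test can exercise the complete load-train-evaluate path through WEKA's Evaluation class, so that the loader, the model-selection code, and the tree builder are verified against each other. The sketch below assumes the WEKA 3.x API; the class name, the dataset file, and the 50% accuracy floor are illustrative assumptions.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class C45IntegrationTest {
  public static void main(String[] args) throws Exception {
    // Integrates the ARFF loader, the model-selection code, and the tree builder
    Instances data = DataSource.read("weather3.arff");
    data.setClassIndex(data.numAttributes() - 1);

    C45 tree = new C45();
    Evaluation eval = new Evaluation(data);
    // 10-fold cross-validation exercises repeated build/classify cycles
    eval.crossValidateModel(tree, data, 10, new Random(1));

    System.out.println(eval.toSummaryString("\n=== 10-fold CV ===\n", false));
    // Fail loudly if the integrated pipeline degrades badly
    if (eval.pctCorrect() < 50.0) {
      throw new AssertionError("Accuracy below illustrative 50% threshold");
    }
  }
}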

System testing

System testing, or end-to-end testing, tests a completely integrated system to verify that it meets its requirements. For example, a system test might involve testing a logon interface, then creating and editing an entry, plus sending or printing results, followed by summary processing or deletion (or archiving) of entries, then logoff.

In addition, software testing should ensure that the program, as well as working as expected, does not destroy or partially corrupt its operating environment, or cause other processes within that environment to become inoperative; this includes not corrupting shared memory, not consuming or locking up excessive resources, and leaving any parallel processes unharmed by its presence.
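
A system-level check for this project can drive the full workflow end to end, including persisting and reloading the trained model as a user would between sessions. The sketch below assumes WEKA's SerializationHelper; the model path c45.model and the class name are illustrative assumptions.

import weka.core.Instance;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class C45SystemTest {
  public static void main(String[] args) throws Exception {
    // End-to-end: load -> train -> persist -> reload -> predict
    Instances data = DataSource.read("weather3.arff");
    data.setClassIndex(data.numAttributes() - 1);

    C45 tree = new C45();
    tree.buildClassifier(data);

    // Persist the model to disk and read it back
    SerializationHelper.write("c45.model", tree);
    C45 restored = (C45) SerializationHelper.read("c45.model");

    // The restored model must reproduce the original predictions
    for (int i = 0; i < data.numInstances(); i++) {
      Instance inst = data.instance(i);
      if (tree.classifyInstance(inst) != restored.classifyInstance(inst)) {
        throw new AssertionError("Prediction mismatch after reload at row " + i);
      }
    }
    System.out.println("System test passed: model survives save/reload.");
  }
}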

Testing Types:

Installation testing

An installation test assures that the system is installed correctly and works on the actual customer's hardware.

Compatibility testing

A common cause of software failure (real or perceived) is a lack of compatibility with other application software, operating systems (or operating-system versions, old or new), or target environments that differ greatly from the original (such as a terminal or GUI application intended to be run on the desktop now being required to become a web application, which must render in a web browser). For example, in the case of a lack of backward compatibility, this can occur because the programmers develop and test software only on the latest version of the target environment, which not all users may be running. This results in the unintended consequence that the latest work may not function on earlier versions of the target environment, or on older hardware that earlier versions of the target environment were capable of using.

Smoke and Sanity Testing

Sanity testing determines whether it is reasonable to proceed with further testing.

Smoke testing consists of minimal attempts to operate the software, designed to determine whether there are any basic problems that will prevent it from working at all. Such tests can be used as build verification tests.
Regression testing

Regression testing focuses on finding defects after a major code change has occurred. Specifically, it seeks to uncover software regressions, such as degraded or lost features, including old bugs that have come back. Such regressions occur whenever software functionality that was previously working correctly stops working as intended. Typically, regressions occur as an unintended consequence of program changes, when the newly developed part of the software collides with the previously existing code. Common methods of regression testing include rerunning previous sets of test cases and checking whether previously fixed faults have re-emerged.
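
One common way to automate this is to replay a stored input set and compare predictions against previously recorded ("golden") outputs. The sketch below assumes a file expected_labels.txt holding one recorded class label per instance; that file and the class name are illustrative conventions, not part of the original project.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class C45RegressionTest {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("weather3.arff");
    data.setClassIndex(data.numAttributes() - 1);

    C45 tree = new C45();
    tree.buildClassifier(data);

    // expected_labels.txt holds one previously recorded class label per row
    List<String> expected = Files.readAllLines(Paths.get("expected_labels.txt"));
    for (int i = 0; i < data.numInstances(); i++) {
      String predicted =
          data.classAttribute().value((int) tree.classifyInstance(data.instance(i)));
      if (!predicted.equals(expected.get(i))) {
        throw new AssertionError(
            "Regression at row " + i + ": expected " + expected.get(i) + ", got " + predicted);
      }
    }
    System.out.println("No regressions detected.");
  }
}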

Acceptance Testing

Acceptance testing can mean one of two things:

1. A smoke test is used as an acceptance test prior to introducing a new build to the main testing process, i.e. before integration or regression.
2. Acceptance testing performed by the customer, often in their lab environment on their own hardware, is known as user acceptance testing (UAT). Acceptance testing may be performed as part of the hand-off process between any two phases of development.

Alpha testing

Alpha testing is simulated or actual operational testing by potential users/customers or an independent test team at the developers' site.

Beta Testing

Beta testing comes after alpha testing and can be considered a form of external user acceptance testing. Versions of the software, known as beta versions, are released to a limited audience outside of the programming team. The software is released to groups of people so that further testing can ensure the product has few faults or bugs. Sometimes, beta versions are made available to the open public to increase the feedback field to a maximal number of future users.

Functional vs. Non-Functional Testing

Functional testing refers to activities that verify a specific action or function of the code. These are usually found in the code requirements documentation, although some development methodologies work from use cases or user stories. Functional tests tend to answer the question of "can the user do this?" or "does this particular feature work?"

Non-functional testing refers to aspects of the software that may not be related to a specific function or user action, such as scalability or other performance characteristics, behavior under certain constraints, or security. Testing will determine the breaking point: the point at which extremes of scalability or performance lead to unstable execution.
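
A simple non-functional check for this project is to time tree construction as the training set grows. The sketch below uses WEKA's Resample filter to draw subsets of increasing size; the sampling percentages and the one-second budget are illustrative assumptions.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.Resample;

public class C45PerformanceTest {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("weather3.arff");
    data.setClassIndex(data.numAttributes() - 1);

    for (double pct : new double[] {25, 50, 100}) {
      // Draw a sample of the given size to observe scaling behaviour
      Resample sample = new Resample();
      sample.setSampleSizePercent(pct);
      sample.setInputFormat(data);
      Instances subset = Filter.useFilter(data, sample);

      long start = System.nanoTime();
      new C45().buildClassifier(subset);
      long ms = (System.nanoTime() - start) / 1_000_000;
      System.out.printf("%5.0f%% of data (%d rows): %d ms%n", pct, subset.numInstances(), ms);

      if (ms > 1000) {
        throw new AssertionError("Build exceeded illustrative 1s budget at " + pct + "%");
      }
    }
  }
}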

Test Case Reports

Test Case #: 1                              Priority (H, L): High
Test Objective: Select a dataset.
Test Description: The browse-dataset control is checked.
Requirements Verified: The dataset is checked in the database.
Test Environment: Internet Explorer/Firefox
Test Setup or Pre-conditions: The user initiates a dataset control such as the Browse button.
Action: The selected dataset already exists.
Expected Result: A message "dataset already exists." is displayed if all the required fields are entered correctly.
Pass: Yes        Conditional Pass:        Fail:
Problems or issues: Nil

Test Case #: 2                              Priority (H, L): High
Test Objective: Select a dataset for prediction testing.
Test Description: The browse-dataset control is checked.
Requirements Verified: The dataset is checked in the database.
Test Environment: JFrame, Swing
Test Setup or Pre-conditions: The user initiates a dataset control such as the Browse button.
Action: The selected dataset already exists.
Expected Result: A message "dataset already exists." is displayed if all the required fields are entered correctly.
Pass: Yes        Conditional Pass:        Fail:
Problems or issues: Nil

Test Case #: 3                              Priority (H, L): High
Test Objective: Select the datasets before running predictions.
Test Description: The datasets of both parties are checked.
Requirements Verified: The datasets of both parties are checked in the database.
Test Environment: Internet Explorer/Firefox
Test Setup or Pre-conditions: The user initiates a dataset control such as the Browse button.
Action: Both parties' datasets are selected (or not).
Expected Result: A message "2 dataset in both parties data." is displayed if all the required fields are entered correctly.
Pass: Yes        Conditional Pass:        Fail:
Problems or issues: Nil
Output screens
Conclusion
This project introduces a new, fast clustering algorithm, called FensiVAT, which can be used to cluster large volumes of high-dimensional data. FensiVAT integrates a new random-projection-based distance matrix ensemble method with Maximin and Random Sampling (MMRS) and a visual assessment of cluster tendency method. We showed that the samples obtained using MMRS sampling in the down-space dimension (Near-MMRS sampling) retain the same geometry in the down space as samples in the up space. This enables us to use random projection effectively with MMRS sampling in our ensemble method to reduce the computation time. We demonstrated the superiority of our FensiVAT approach by comparing it with nine state-of-the-art approaches on two Gaussian mixture datasets and six real datasets which have both large sample size and high dimensions. Our experimental results on these eight large, high-dimensional datasets show that FensiVAT almost always outperforms the nine comparison approaches. FensiVAT is an order of magnitude faster than clusiVAT, and several orders of magnitude faster than the remaining approaches (except MBKM), without compromising accuracy.
Bibliography

Good teachers are worth more than a thousand books; we have them in our department.

References Made From:

1. User Interfaces in C#: Windows Forms and Custom Controls, by Matthew MacDonald.
2. Applied Microsoft .NET Framework Programming (Pro-Developer), by Jeffrey Richter.
3. Practical .NET2 and C#2: Harness the Platform, the Language, and the Framework, by Patrick Smacchia.
4. Data Communications and Networking, by Behrouz A. Forouzan.
5. Computer Networking: A Top-Down Approach, by James F. Kurose.
6. Operating System Concepts, by Abraham Silberschatz.
7. M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "Above the clouds: A Berkeley view of cloud computing," University of California, Berkeley, Tech. Rep. UCB/EECS-2009-28, Feb. 2009.
8. "The Apache Cassandra project," http://cassandra.apache.org/.
9. L. Lamport, "The part-time parliament," ACM Transactions on Computer Systems, vol. 16, pp. 133–169, 1998.
10. N. Bonvin, T. G. Papaioannou, and K. Aberer, "Cost-efficient and differentiated data availability guarantees in data clouds," in Proc. of the ICDE, Long Beach, CA, USA, 2010.
11. O. Regev and N. Nisan, "The POPCORN market: Online markets for computational resources," Decision Support Systems, vol. 28, no. 1–2, pp. 177–189, 2000.
12. A. Helsinger and T. Wright, "Cougaar: A robust configurable multi-agent platform," in Proc. of the IEEE Aerospace Conference, 2005.

Sites Referred:

http://www.sourcefordgde.com
http://www.networkcomputing.com/
http://www.ieee.org
http://www.emule-project.net/

REFERENCES
[1] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
[2] F. Höppner, F. Klawonn, R. Kruse, and T. Runkler, Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition. John Wiley & Sons, New York, NY, 2000.
[3] H. Gunadi, "Comparing nearest neighbor algorithms in high-dimensional space," 2011.
[4] T. C. Havens and J. C. Bezdek, "An efficient formulation of the improved visual assessment of cluster tendency (iVAT) algorithm," IEEE Trans. Knowl. Data Eng., vol. 24, no. 5, pp. 813–822, May 2012.
[5] D. Kumar, M. Palaniswami, S. Rajasegarar, C. Leckie, J. C. Bezdek, and T. C. Havens, "clusiVAT: A mixed visual/numerical clustering algorithm for big data," in Proc. IEEE Int. Conf. Big Data, pp. 112–117, 2013.
[6] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, I. Khalil, A. Y. Zomaya, S. Foufou, and A. Bouras, "A survey of clustering algorithms for big data: Taxonomy and empirical analysis," IEEE Trans. Emerging Topics Comput., vol. 2, no. 3, pp. 267–279, Sep. 2014.
[7] M. Popescu, J. Keller, J. Bezdek, and A. Zare, "Random projections fuzzy c-means (RPFCM) for big data clustering," in Proc. IEEE Int. Conf. Fuzzy Syst., pp. 1–6, 2015.
