
Software Requirements Specification

for
Document Clustering based on Similarity Measure Using Multi-Reference points

Version 1.

Prepared by
Maram Nagarjuna Reddy
$11%&1'()11*
maram.nreddy@gmail.com
,-S College of Engineering And Technology

Under the esteemed guidance of
Prof. 1. Lakshmi Tulasi, M.Tech., (Ph.D)

45 February 4 17

Software Requirements Specification for Document Clustering based on Similarity Measure Using Multi-Reference points


Table of Contents
1. Introduction
2. Overall Description
3. External Interface Requirements
4. System Features
5. Other Nonfunctional Requirements
6. Other Requirements

Revision History
Name Date Reason For Changes Version


1. Introduction

The main goal of the requirements phase is to produce the software requirements specification (SRS), which accurately captures the client's requirements. The SRS is a document that describes what the software should do. Its basic purpose is to bridge the communication gap between the clients, the end users and the software developers. Another purpose is to help users understand their own needs.

Clustering is the classification of objects into different groups or, more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often proximity according to some defined distance measure. Data clustering is a common technique for statistical data analysis and is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. The computational task of partitioning the data set into k clusters is often referred to as k-clustering.

In recent years, due to the increased availability of large document collections and the need to operate on them efficiently (e.g., navigate, analyze, query, and summarize), there has been an increased emphasis on developing efficient and effective clustering algorithms for large document collections. To a large extent, this research has assumed that each document is part of a single topic. This assumption is generally true for short documents (e.g., web pages), but it does not hold for many of the large documents to which clustering algorithms have been increasingly applied.

1.1 Purpose

The purpose of this Software Requirements Specification (SRS) document is to describe the external behavior of Document Clustering based on Similarity Measure Using Multi-Reference points. The SRS contains a brief description of the project and specifies all the information required to design, develop and test the software.

The purpose of the project itself is to group a given document set into clusters, in an unsupervised way, such that documents within each cluster are more similar to each other than to those in different clusters. In other words, the project organizes a collection of patterns into clusters based on a similarity measure that uses multiple reference points.

1.2 Document Conventions

In general, this document follows the IEEE formatting requirements. Use Arial font size 11 or 12 throughout the document for text. Use italics for comments. Document text should be single spaced and maintain the 1" margins found in this template. For section and subsection titles, please follow the template. The template standards are published in "IEEE Standards Collection" and can be downloaded from www.csc.villanova.edu/~tway/courses/csc%1)1/.../srs_template-1.doc

1.3 Intended Audience and Reading Suggestions

This SRS document is intended for users, developers, testers and documentation writers. The rest of the SRS is organized as follows. Section 2 gives the overall description of the product, including the design constraints to be considered when the system is designed and other factors necessary for a complete and comprehensive description of the software requirements. Section 3 describes the external interface requirements, such as the user, hardware, software and communications interfaces. Section 4 presents the system features and their descriptions. Section 5 describes the nonfunctional requirements, such as performance requirements and safety requirements. Together, these sections form the requirements specification, which defines and describes the operations, interfaces, performance and quality assurance requirements of Document Clustering Using Multi-Reference points.


1.4 Product Scope

The aim of clustering is to find intrinsic structures in documents and organize them into meaningful subgroups for further study and analysis. Many clustering algorithms are published every year. The main work of this project is to develop two similarity measures for document clustering that provide maximum efficiency and performance.

Our first objective is to derive a novel method for measuring similarity between data objects in sparse and high-dimensional domains, particularly text documents. From the proposed similarity measure, we then formulate new clustering criterion functions and introduce their respective clustering algorithms, which are fast and scalable like k-means but are also capable of providing high-quality and consistent performance. The main goal is to perform document clustering by optimizing the two similarity measures.

Document clustering is an enabling technique for a wide range of information retrieval tasks, such as efficient organization, browsing and summarization of large volumes of text documents. Cluster analysis aims to organize a collection of patterns into clusters based on similarity. Clustering has its roots in many fields, such as mathematics, computer science, statistics, biology and economics. In different application domains, a variety of clustering techniques have been developed, depending on the methods used to represent data, the measures of similarity between data objects, and the techniques for grouping data objects into clusters.

1.5 References

[1] Duc Thang Nguyen, Lihui Chen and Chee Keong Chan, "Clustering with Multiviewpoint-Based Similarity Measure," IEEE Transactions on Knowledge and Data Engineering, 2012.

[2] Y. Zhao and G. Karypis, "Criterion Functions for Document Clustering: Experiments and Analysis," technical report, Dept. of Computer Science, Univ. of Minnesota, 2002.

[3] S. Zhong and J. Ghosh, "A Comparative Study of Generative Models for Document Clustering," Proc. SIAM Int'l Conf. Data Mining Workshop Clustering High Dimensional Data and Its Applications, 2003.

2. Overall Description

2.1 Product Perspective

We are facing an ever-increasing volume of text documents. The abundant texts flowing over the Internet, huge collections of documents in digital libraries and repositories, and digitized personal information such as blog articles and emails are piling up quickly every day. These have brought challenges for the effective and efficient organization of text documents.

Clustering in general is an important and useful technique that automatically organizes a collection with a substantial number of data objects into a much smaller number of coherent groups. In the particular scenario of text documents, clustering has proven to be an effective approach for quite some time, and an interesting research problem as well. It is becoming even more interesting and demanding with the development of the World Wide Web and the evolution of Web 2.0. For example, results returned by search engines are clustered to help users quickly identify and focus on the relevant set of results. Customer comments are clustered in many online stores, such as Amazon.com, to provide collaborative recommendations. In collaborative bookmarking or tagging, clusters of users that share certain traits are identified by their annotations.

Document clustering has become an increasingly important technique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering. For example, a web search engine often returns thousands of pages in response to a broad query, making it difficult for users to browse or to identify relevant information. Clustering methods can be used to automatically group the retrieved documents into a list of meaningful categories, as is achieved by search engines such as Google News. Similarly, a large database of documents can be pre-clustered to facilitate query processing by searching only the cluster that is closest to the query. Certain concepts used in this project need to be explained briefly.


2.2 Product Functions

The main purpose of this project is to maximize user utility. The product works with the following:

Documents: The abundant texts flowing over the Internet, huge collections of documents in digital libraries and repositories, and digitized personal information such as blog articles and emails are piling up quickly every day. These have brought challenges for the effective and efficient organization of text documents.

The type of Similarity Measure: This project uses a similarity measure based on multiple reference points for document clustering. The clustering algorithm uses this similarity measure to form the clusters.
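As an illustration only, the idea of measuring similarity from several reference points can be sketched as follows. The sketch assumes the formulation we understand reference [1] to use: the similarity of two documents is the average dot product of their difference vectors, taken from each reference document outside their cluster. The SRS itself does not fix this formula, and the function name `multi_reference_similarity` and the toy vectors are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def multi_reference_similarity(di: Vector, dj: Vector,
                               reference_points: List[Vector]) -> float:
    """Average similarity of di and dj as seen from each reference point.

    Assumption: each reference point dh is a document outside the cluster
    of di and dj, and similarity from dh is (di - dh) . (dj - dh).
    """
    if not reference_points:
        raise ValueError("at least one reference point is required")
    total = 0.0
    for dh in reference_points:
        diff_i = [x - h for x, h in zip(di, dh)]
        diff_j = [x - h for x, h in zip(dj, dh)]
        total += dot(diff_i, diff_j)
    return total / len(reference_points)


if __name__ == "__main__":
    # Two documents viewed from two reference documents (toy data).
    print(multi_reference_similarity([1, 0], [1, 0], [[0, 1], [0, 0]]))
```

With a single reference point at the origin, the sketch reduces to the plain dot product, which is the single-viewpoint case the multi-reference measure generalizes.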

2.3 User Classes and Characteristics

The users are assumed to have basic knowledge of computers and more extensive knowledge of data mining. The user needs to know the exact nature of the submitted job, such as its execution time and the resources it requires, and must possess the technical know-how to use the interface for submitting jobs. Users should be able to rectify the small problems that may arise from disk crashes, power failures and other catastrophes in order to maintain the system. A proper user interface, user's manual, online help, and installation and maintenance guide must be sufficient to educate users on how to use the system without problems.

2.4 Operating Environment

The target operating system is Windows XP Professional. The product also requires the Java runtime and compile-time environments, along with the NetBeans IDE. The hardware requirements are the minimum needed to run the application. A database is needed to store the collection of documents.


2.5 Design and Implementation Constraints

Hardware limitations: The developers do not have enough storage for keeping the document dataset. They may also have timing constraints.

The current constraints on the project relate to the provision of hardware resources to implement and test a high-performance cluster. At present, a network of four Pentium-IV workstations with 128 MB RAM serves as the cluster. For better performance analysis, a larger number of dedicated resources would be beneficial.

2.6 User Documentation

A user manual and guide will be made available for troubleshooting and help. The user manual will contain detailed information about the usage of the product, from a basic user's perspective up to that of an expert network/system user. The manual and a summary of the application shall also be made available online.

2.7 Assumptions and Dependencies

It is assumed that:
• The users have sufficient knowledge of computers.
• The users have sufficient knowledge of information retrieval.
• The computer has all the tools needed to run the application.
• The users know the English language, as the user interface will be provided in English.
• The product can access the document database.


3. External Interface Requirements

3.1 User Interfaces

The user interface should be simple, interactive, and easy to understand and use. The system should prompt the user for proper input criteria. The software provides a graphical interface through which the user can operate the system and perform the required tasks, such as uploading documents and viewing the details of the result. The minimal requirement is that the cluster user is able to interact with the system through the prompt, or through the interface provided by Doris Sturzenberger. There will be a different command for each of the following actions:
• submit text documents
• display the clusters as result

Input design considers the following:
• What data should be given as input?
• How should the data be arranged or coded?
• The dialog to guide the operating personnel in providing input.


3.2 Hardware Interfaces

The system requires various hardware components and includes the following hardware interfaces:
• CPU usage
• Memory usage
• Text file creation

3.3 Software Interfaces

This product requires the following specific software components:
• Java language
• NetBeans IDE 7.0.1
• Windows XP / Windows 2000
• Large databases

3.4 Communications Interfaces

The web browser performs the following tasks. Parsing is the first step when a document enters the processing state. Parsing is defined as the separation or identification of meta tags in an HTML document. Here, the raw HTML file is read and parsed through all the nodes in its tree structure.
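The parsing step described above could look roughly like the following sketch. It is written in Python for brevity, although the system itself is specified in Java, and it uses the standard-library `html.parser` module. The class name `MetaTagCollector`, the helper `parse_html`, and the decision to skip script and style content are assumptions made for illustration, not requirements from this SRS.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Walks the HTML nodes, collecting <meta name=...> tags and visible text."""

    def __init__(self):
        super().__init__()
        self.meta = {}          # meta name (lower-cased) -> content
        self.text_parts = []    # visible text fragments, in document order
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = attributes.get("name")
        if tag == "meta" and name:
            self.meta[name.lower()] = attributes.get("content") or ""
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def parse_html(raw_html):
    """Return (meta tags, visible text) for one raw HTML document."""
    collector = MetaTagCollector()
    collector.feed(raw_html)
    collector.close()
    return collector.meta, " ".join(collector.text_parts)
```

The extracted text would then be passed on to the pre-processing steps, while the meta tags can be kept as document attributes.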

4. System Features

Document pre-processing steps

Tokenization: A document is treated as a string (or bag of words) and then partitioned into a list of tokens.

Removing stop words: Stop words are frequently occurring, insignificant words. This step eliminates the stop words.

Stemming: This step conflates tokens to their root form (connection -> connect).
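The three pre-processing steps above can be sketched as a small pipeline. This is a minimal illustration in Python rather than the Java implementation the project targets. The stop-word list is a tiny hypothetical sample, and the suffix-stripping `stem` function is a crude stand-in for a real stemmer such as Porter's algorithm.

```python
import re

# Illustrative stop-word sample only; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Suffixes tried in order; a simplistic substitute for a real stemmer.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop frequently occurring, insignificant words."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [stem(t) for t in remove_stop_words(tokenize(text))]


if __name__ == "__main__":
    print(preprocess("The connection of the documents is connected"))
```

Keeping each step as its own function makes it easy to swap the toy stemmer or stop-word list for a proper one later.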

Document representation

The N distinct words in the corpus are generated and called index terms (or the vocabulary). Each document in the collection is then represented as an N-dimensional vector in term space.
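A minimal sketch of this representation, assuming each document has already been reduced to a list of tokens: the vocabulary is the sorted set of distinct terms, and each document becomes a vector of raw term counts over that vocabulary. The function names `build_vocabulary` and `to_count_vector` are hypothetical.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Return the N distinct index terms of the corpus, in sorted order."""
    return sorted({term for doc in tokenized_docs for term in doc})


def to_count_vector(tokens, vocabulary):
    """Represent one document as an N-dimensional vector of term counts."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


if __name__ == "__main__":
    docs = [["connect", "document"], ["document", "cluster", "document"]]
    vocabulary = build_vocabulary(docs)
    for doc in docs:
        print(to_count_vector(doc, vocabulary))
```

Real document collections are sparse, so a production implementation would likely store only the non-zero entries; dense lists are used here only to keep the sketch short.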

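Computing Term Weights

• Term Frequency (TF): how often a term occurs in a document.
• Inverse Document Frequency (IDF): a weight that decreases as the term occurs in more documents.
• Compute the TF-IDF weighting.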
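The weighting steps listed above can be sketched as follows. The SRS does not say which TF-IDF variant is intended, so this sketch assumes a common one: term frequency is the raw count and inverse document frequency is log(N / df), where N is the number of documents and df is the number of documents containing the term. The function names are illustrative.

```python
import math


def document_frequencies(count_vectors):
    """For each term, count the documents in which it occurs at least once."""
    n_terms = len(count_vectors[0]) if count_vectors else 0
    return [sum(1 for vec in count_vectors if vec[i] > 0)
            for i in range(n_terms)]


def tf_idf(count_vectors):
    """Weight raw term counts by log(N / df) (assumed variant)."""
    n_docs = len(count_vectors)
    df = document_frequencies(count_vectors)
    # A term that occurs in no document gets weight 0 to avoid division by zero.
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[tf * w for tf, w in zip(vec, idf)] for vec in count_vectors]


if __name__ == "__main__":
    for row in tf_idf([[0, 1, 1], [1, 0, 2]]):
        print(row)
```

Under this variant, a term that appears in every document receives weight zero, which reflects that it does not help to tell documents apart.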

Measuring similarity between two documents

The similarity of two documents is captured using the cosine similarity measure, which is calculated as the cosine of the angle between the two document vectors.
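For two document vectors a and b, the cosine of the angle between them is the dot product divided by the product of the vector lengths. A minimal sketch follows; the function name and the choice to return 0.0 for an all-zero vector are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length document vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an empty document is treated as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity([1, 0], [0, 1]))  # orthogonal documents
    print(cosine_similarity([3, 4], [3, 4]))  # identical documents
```

Because term weights are non-negative, the result lies between 0 (no terms in common) and 1 (same direction in term space), regardless of document length.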

Use Case Diagram for functioning

Clustering

Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. Similar documents are grouped together in a cluster if their similarity reaches a specified threshold (equivalently, if the distance between them is below the corresponding limit).
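One simple way to group documents against a threshold is a single-pass "leader" scheme, sketched below. This is not the clustering algorithm of the project, only an illustration of threshold-based grouping. It assumes that a document joins a cluster when its cosine similarity to that cluster's first member is at or above the threshold, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def threshold_clustering(vectors, threshold):
    """Assign each document to the first cluster whose leader is similar enough.

    Each cluster is a list of document indices; its first index is the leader.
    A document that matches no leader starts a new cluster.
    """
    clusters = []
    for index, vector in enumerate(vectors):
        for cluster in clusters:
            leader = vectors[cluster[0]]
            if cosine_similarity(vector, leader) >= threshold:
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return clusters


if __name__ == "__main__":
    documents = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
    print(threshold_clustering(documents, 0.8))
```

The result of such a scheme depends on the order of the documents and on the chosen threshold, which is one reason criterion-function based algorithms are usually preferred for the final clustering.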


5. Other Nonfunctional Requirements

5.1 Performance Requirements

The performance of the software depends on the capability of the computer. The software can take any number of inputs provided the storage size is large enough; this depends on the available memory space.

Response Time
A page or information page should be displayed within a few seconds. The system shall respond to the member in no more than two seconds from the time the request is submitted. The system shall be allowed to take more time when doing large processing tasks.

Throughput
The number of clusters is directly dependent on the number of users.

Resource Utilization
The resources are modified according to the user requirements and also according to the latest similarity measures.

5.2 Safety Requirements

There are no specific safety requirements associated with the proposed system. Document Clustering based on Similarity Measure Using Multi-Reference points is composed of well-known and commonly used hardware and software, which do not cause any safety hazards. The level of security provided for this product is that no parameter within the similarity measure of the algorithm may be modified.


5.3 Security Requirements

Only expert personnel who have gone through the selection procedures are allowed to use the product. Similarly, the features and data within the documents of the corpus must not be changed at runtime.

5.4 Software Quality Attributes

Maintainability: There will be no maintenance requirement for the software. The database is provided by the end user and is therefore maintained by that user.

Portability: The system is portable.

Availability: The system will be available only while the machine on which it is installed is running.

Scalability: Applicable.

Usability
• The system shall allow the users to access the system from the Internet using HTML or its derivative technologies. The system uses a web browser as an interface.
• Since all users are familiar with the general usage of browsers, no specific training is required.
• The system is user friendly and self-explanatory.

Reliability
The system has to be very reliable due to the importance of the similarity measure used.

Availability
The system is available to the user 100% of the time: it shall be operational 24 hours a day, 7 days a week, 365 days a year.

Mean Time Between Failures (MTBF)
The system will be developed in such a way that it fails no more than once in 2 years.


Accuracy
The accuracy of the system is higher than that of the other similarity measures it is compared with, such as the cosine and spherical k-means similarity measures.

Access Reliability
The system shall provide 100% access reliability.

5.5 Business Rules

Document Clustering based on Similarity Measure Using Multi-Reference points is most suitable for marketing managers and knowledge analysts of large enterprises who need to analyze the most frequently retrieved documents from huge data repositories. The product should be used carefully, without loss of data. Its major advantage is that it gives more accurate results than the other similarity measures it is compared with.

6. Other Requirements

There are no other requirements.

Appendix A: Glossary
Clustering

A common descriptive task where one seeks to identify a finite set of categories or clusters to describe the data.

Cluster Analysis

In multivariate analysis, cluster analysis refers to methods used to divide up objects into similar groups or, more precisely, groups whose members are all close to one another on the various dimensions being measured. In cluster analysis, one does not start with any a priori notion of group characteristics. The definition of clusters emerges entirely from the cluster analysis, i.e. from the process of identifying "clumps" of objects.

Similarity

A method which determines the strength of the relationship between variables, and/or a means to test whether the relationship is stronger than expected under the null hypothesis. Usually, we are interested in the relationship between two variables, x and y. The correlation coefficient r is one measure of the strength of the relationship.

Data

Data is the raw material of a system, supplied by data producers and used by information consumers to create information.

Data mining

A technique using software tools geared for the user who typically does not know exactly what they are searching for, but is looking for particular patterns or trends. Data mining is the process of sifting through large amounts of data to produce data content relationships. It can predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. This is also known as data surfing.

MTBF

Mean Time Between Failures

TF-IDF

Term Frequency-Inverse Document Frequency


Appendix B: Analysis Models


The data flow diagram (DFD) is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

Data Flow Diagram for Document Clustering

Appendix C: To Be Determined List


<Collect a numbered list of the TBD (to be determined) references that remain in the SRS so they can be tracked to closure.>