ISSN : 0973-8215
IK International Publishing House Pvt. Ltd., New Delhi, India
The Software Development Life Cycle (SDLC) starts with eliciting the requirements of the customers in the form of a Software Requirement Specification (SRS). The SRS document needed for software development is mostly written in Natural Language (NL), which is convenient for the client. From the SRS document alone, the class names, their attributes and the functions incorporated in the body of each class are traced, based on the prior knowledge of the analyst. This paper presents a review of Object Oriented (OO) analysis using Natural Language Processing (NLP) techniques. The analysis can be manual, where a domain expert helps to generate the required diagram, or automated, where the system generates the required diagram from the input in the form of an SRS.
Keywords : Natural Language, Natural Language Processing, Object Oriented, Parts Of Speech, Software
Development Life Cycle, Software Requirement Specification.
Object Oriented Analysis using Natural Language Processing concepts: A Review 39
mated tools, others use a manual approach, and a few combine both tool and manual approaches to obtain the different elements of OO analysis.

In order to examine each proposal, the following dimensions may be considered:

• Steps to modify the SRS into the required form: As discussed above, the SRS is written in a form that is convenient to the user. During this step the SRS is modified to a format in which finding the keywords becomes easier for analysts.

• Finding the candidates for class and object from the modified SRS: After the SRS is transformed to the format required for analysis, the candidates for the class names and their details are traced out. The identification of a class name and its details can be a manual or an automated process. In the manual process, the domain expert analyzes the text to bring out an “intermediate output”; an automated process then takes the intermediate output and generates the desired output.

The objective of this work is two-fold:

1. This analysis provides information on the available NLP techniques that can further be considered for OO analysis.

2. It provides an overview of the current state of the use of NLP in OO analysis, focusing on the strengths and weaknesses of the existing proposals. Researchers can thus gain broad knowledge of the work that has already been done and that can still be carried out in this field.

The paper is organized as follows: Section 2 presents the existing proposals for the use of NLP in OO analysis. Section 3 presents an analysis of all the proposals. Finally, Section 4 offers some concluding remarks and highlights future trends.

2. PROPOSAL OF THE USE OF NLP IN OO ANALYSIS

2.1. Historical review on NLP

Jones presented a review of NLP based on a historical study [4]. The paper reviews the study of NLP from the late 1940s to the present by distinguishing four phases in the history of NLP. The phases are characterized, respectively, by the impact of machine translation, the influence of artificial intelligence, the adoption of a logico-grammatical style, and the attack on massive language data.

Ph 1: The early phase of the study of NLP ran from the late 1940s to the late 1960s, and in this period the focus was mainly on Machine Translation (MT). A noticeable amount of work was done in the USSR, the USA, Europe and Japan during this period, and the languages considered for research were mostly Russian and English [14]. Language syntax was the main area of research, as syntactic processing was manifestly necessary, and it was partly carried out through implicit or explicit endorsement of the idea of syntax-driven processing. Although the use of computers for literary and linguistic study began during this period, it was never linked with NLP.

Ph 2: The next phase of study ran from the late 1960s to the late 1970s. The work mainly focused on the use of artificial intelligence (AI) in NLP, with much more priority on world knowledge and on its implementation in the construction and manipulation of meaning representations. AI was mainly considered during this period for the construction and addressing of a knowledge base or data. In the late 1960s, the prevalent theory of linguistics was transformational grammar, which provides the semantic information about NLP.

Ph 3: This phase mainly concerns the period from the late 1970s to the late 1980s
40 Abinash Tripathy and Santanu Kumar Rath
Ph 4: The fourth and final phase covers the study carried out from the late 1980s onwards. During this phase, the main area of research is statistical language data processing. The identification of linguistic occurrences and patterns in corpora, for both syntactic and semantic analysis, drew interest in this period. Present attention is on the lexicon, on retrieving statistical information, and on a restored interest in MT.

Table 1 compares the NLP research based on time period.

Table 1
Comparison of NLP Research Based on Time

Phase | Time period | NLP research
Phase 1 | Late 1940s to late 1960s | Machine Translation (MT), language syntax analysis.
Phase 2 | Late 1960s to late 1970s | Use of AI; application of world knowledge to the construction of a knowledge base or data.
Phase 3 | Late 1970s to late 1980s | Grammatico-logical analysis, transformation of dictionaries to machine readable form, and text corpora validation.
Phase 4 | Late 1980s onwards | Statistical data processing, both syntactic and semantic analysis of corpora, and restoration of MT.

2.2. OO Analysis using NLP approaches

In this section, the studies made by various authors are analyzed on the basis of how they transform the SRS and how the candidates for the class names and their details are found from the transformed SRS.

2.2.1. Abbott, 1983 [5]

i. Development of an informal strategy for the problem: The informal strategy should suggest the problem solution at the conceptual level. This step expresses the solution of the problem in terms of the problem domain.

ii. Formalize the Strategy: The second step is formalizing the solution by finding its data types, objects, operators, and control constructs. The steps of formalization are:

(a) Identify the data types: The data types are suggested by the common nouns. A common noun is the name of a class of beings or things.

(b) Identify the objects of those types: Objects are suggested by proper nouns or direct references. A proper noun is the name of a specific being or thing. A direct reference refers to a specific, previously identified being or thing without necessarily referring to it by name.

(c) Identify the operators for the objects: Operators are suggested by a verb, attribute, predicate or descriptive expression. An attribute is a property, association, characteristic or component of something. A predicate designates a property or relation that can be considered true or false, i.e., to hold or not to hold. A descriptive expression is a characterization for which there may be some particular object.
(d) The control structures are directly provided by the English language.

iii. Segregate the solution into two parts, a package and a subprogram: The package contains the formalization of the problem domain, that is, the data types and their operators. The subprogram(s) contain the specific steps (expressed in terms of the data types and operators defined in the package) for solving the particular problem.

During the course of the paper, different types of nouns are analyzed, i.e., the differences between common nouns, proper nouns, direct references and mass nouns are provided. Classes of objects are referred to by common nouns, whereas specific, individual objects are referred to by proper nouns. Mass nouns are names of qualities, substances, and activities that do not have an a priori organization into individual units or instances.

In this paper Abbott takes the example of “Calculating the days between two given dates” to explain his technique. As per the analysis in the paper, the process is divided into three steps:

• In step 1, an informal strategy for the problem is provided. In this step the process of arriving at the detailed solution is analyzed.

• In step 2, the data types, objects, operators, and control structures for the specific problem are found from the informal specification.

• In step 3, the final solution is prepared. The package for the problem is assigned and the subprogram details are provided in this step.

Although this approach makes it comparatively easy to find the candidates for class names and their details, the process is manual. A software engineer with good knowledge of the domain is required to provide the step-wise solution of the problem.

2.2.2. Saeki et al., 1989 [6]

The paper by Saeki et al. discusses the derivation of a formal specification from an informal specification written in natural language. The informal specification contains important information leading to the formal specification or the prototype program. The similarity between the structure of the words and the structure of software components is then analyzed. In this paper, the “Lift Control System” example is used as an informal specification to explain the process. The process consists of three major steps:

i. Design Activity: The purpose of this step is to construct a module design document from the informal specification. The module design document presents the modular structure of a formal specification, i.e., the external design of class modules, which contains class names, method names and message protocols. This design activity consists of several sub-activities, each of which produces an intermediate product from the informal specification. The intermediate products obtained are as follows:

• Noun table: This table contains information about extracted nouns. According to the authors, nouns can be classified into different groups: a Class noun identifies an object or set of objects, a Value noun identifies a value or set of values, an Attribute noun identifies an attribute of the objects, and an Action noun identifies an action to be carried out.

• Verb table: This table contains information about extracted verbs, i.e., verb names, their categories, their subjects and objects. According to the authors, verbs can also be classified into different groups. A Relation verb specifies the relation between objects or between objects and their attributes. A State verb specifies the internal state of the object or the attribute values of the object. Action
verb specifies the action applied to the objects, and an Action relational verb specifies the relation between the actions.

• Action table: This table presents the extracted actions, their agents, target objects and the input and output parameters associated with them. For each action verb, there is always an agent and its target object. The action verb changes the state of the target object. In order to extract a target object from a sentence, verb patterns that appear in various kinds of natural language specification are examined.

• Action Relation table: The informal specification, its verb table and its action verbs are needed to identify the relationships between the actions. For every action to be performed, the sender, the receiver and the message transmitted between them need to be identified. In this paper, the authors use a rule called the “action relation rule” to generate a few candidates for a sender-receiver pair.

• Module Design Document: A module design document is constructed using the above-mentioned tables. The noun table helps to identify objects and their attributes. The verb table is used to extract the relationships among objects and the kinds of attributes each object possesses. The module design document can have both graphical and textual representations. Both of them are based on the syntax of the formal specification language TELL and the object oriented language Smalltalk80.

ii. Elaborate - Design Activity Cycle: The task of this activity is to refine and rewrite the informal specification as per the module design document. The output of this activity is a natural language description which is accurate, detailed and structured, yet informal. Whenever the elaborate-design cycle is carried out, a pair consisting of an informal specification and its module design document is obtained. The elaborate activity consists of sub-activities such as paste and refine, and an intermediate product, a paste document, is generated.

• Paste: During this phase, the sentences of the informal specification may be paraphrased into accurate sentences, which then replace the original ones. From the updated expressions, the nouns and verbs are extracted as classes and as attributes or methods, respectively.

• Refine: The informal specification of each pasted module is constructed during the refine activity. During this phase, the internal behavior and properties of each class module are rewritten, and these are used for the construction of each class.

• Design activity for elaborate Informal Specification: Before this activity, the module design document for each class is constructed from the elaborated informal specification. During this activity, each class module and method module is decomposed into small sub-modules to realize the internal behavior and properties.

The cycle continues until a formal specification is obtained from the informal specification. The requirements need to be refined and made simpler and smaller during this cycle.

iii. Software process based on Natural Language: During this process the design and elaborate processes are embedded. Before this phase, the informal specification has already been converted to a formal specification; the remaining steps are as follows:

• Analyze activity: This step acquires a problem description by means of interaction between customer and de-
This paper only supports the static behavior/relationships of the OOAM present in the NL SRS. It does not manage the modeling behavior.

2.2.4. Ibrahim and Ahmad, 2010 [10]

This paper by Ibrahim and Ahmad proposes a method to facilitate the requirement analysis process and the extraction of a class diagram from requirements using NLP and Domain Ontology. The authors propose a tool named “Requirements Analysis and Class Diagram Extraction (RACE)” that analyzes the textual requirements, finds the relationships and finally extracts the class diagram.

The RACE system consists of different internal and external components or sub-systems. These can be described as follows:

i. OpenNLP Parser: The OpenNLP parser is used in this paper for lexical and syntactic parsing. The parser takes English text as input and provides the corresponding POS tag for each word as output.

ii. RACE Stemming Algorithm: Stemming is the process of removing affixes (prefixes and suffixes) from a word to generate the base word. The generated base word reduces redundancy and increases efficiency.

iii. WordNet: It is used to validate the semantics of the sentences generated after syntactic analysis. It also helps to display hyponyms for a selected noun, which helps to identify the “a kind of” relationship.

iv. Concept Extraction Engine: This module is used to extract concepts from the requirement document. The algorithm for this module is as follows:

Step 1. The requirement document is taken as input.

Step 2. Stop words are identified and stored in the Stop-words Found list.

Step 3. Calculate the frequency of each word in the document, except the stop words.

Step 4. Use the RACE stemming algorithm to stem each word and store the stems in a list.

Step 5. Use OpenNLP to parse the whole document.

Step 6. From the parsed output, extract the words with POS Proper Nouns (NN), Noun Phrases (NP) and verb (VB), and store them in the Concept-list.

Step 7. For each concept in the concept-list, if any other concept is a synonym of the present one, then it can be conveyed that both are semantically related.

Step 8. For each concept in the concept-list, if any other concept is a hyponym of the present one, i.e., lexically the same, then it can be conveyed that the former is a kind of the latter, and the pair is saved in the Generalization-list.
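The concept extraction algorithm above can be sketched as a small, self-contained Python function. This is a minimal sketch under loud assumptions, not the RACE implementation: the stop-word list is invented, `stem` is a naive plural stripper rather than the RACE stemming algorithm, the POS tags are supplied by hand in place of OpenNLP output, and the `SYNONYMS` and `HYPONYM_OF` tables are tiny hypothetical stand-ins for WordNet lookups.

```python
from collections import Counter

STOP_WORDS = {"a", "an", "the", "is"}          # assumed stop-word list
SYNONYMS = {frozenset({"member", "patron"})}   # stand-in for WordNet synonyms
HYPONYM_OF = {"novel": "book"}                 # stand-in for WordNet hyponyms


def stem(word):
    """Naive plural stripping; NOT the RACE stemming algorithm."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def extract_concepts(tagged_tokens):
    """Return stop words, frequencies, concepts, synonym pairs and generalizations."""
    words = [(w.lower(), tag) for w, tag in tagged_tokens]
    # Identify stop words and keep the remaining content words.
    stop_found = [w for w, _ in words if w in STOP_WORDS]
    content = [(w, tag) for w, tag in words if w not in STOP_WORDS]
    # Frequency of each word, excluding stop words.
    frequency = Counter(w for w, _ in content)
    # Stem each word and keep nouns, noun phrases and verbs as concepts.
    concepts = []
    for w, tag in content:
        s = stem(w)
        if tag in {"NN", "NP", "VB"} and s not in concepts:
            concepts.append(s)
    # Concepts that are synonyms of each other are semantically related.
    related = [(a, b) for i, a in enumerate(concepts)
               for b in concepts[i + 1:] if frozenset({a, b}) in SYNONYMS]
    # A concept that is a hyponym of another concept is "a kind of" it.
    generalizations = [(c, HYPONYM_OF[c]) for c in concepts
                       if HYPONYM_OF.get(c) in concepts]
    return stop_found, frequency, concepts, related, generalizations


# Hand-tagged tokens standing in for parser output.
tagged = [
    ("A", "DT"), ("member", "NN"), ("borrows", "VB"), ("books", "NN"),
    ("A", "DT"), ("patron", "NN"), ("returns", "VB"), ("books", "NN"),
    ("A", "DT"), ("novel", "NN"), ("is", "VB"), ("a", "DT"), ("book", "NN"),
]
stop_found, frequency, concepts, related, generalizations = extract_concepts(tagged)
print(concepts, related, generalizations)
```

In a real pipeline the hand-written tables would be replaced by calls to a parser and a lexical database; the sketch only shows how the lists described in the algorithm relate to one another.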
• Relationship Identification Rules: The rules for relationship identification are as follows:

• The final output is in the form of full text, word-list and UML model in parallel, so the user can compare all of them.
• The Key Word In Context (‘KWIC’) view displays words or groups of words in sentences.

• A hypertext description of the model helps in the documentation of the model.

• A completed model can be exported to any CASE tool, or any model can be imported from any CASE tool into LIDA.

LIDA consists of the following components:

i. Text analysis environment: This is the main component of LIDA, as it provides the central functionality. The main functions this component performs are:

• It takes text input in RTF and ASCII format.
• It then assigns a POS tag to each word. For POS tagging, it uses MXPOST, a software tool developed at the University of Pennsylvania, USA.
• The base word is obtained from each word and its frequency is calculated.
• Multi-word phrases are checked for a given base word.
• Users are allowed to mark words or phrases as candidate model elements, and these words are highlighted in the text.
• The textual context of marked words is retrieved.

Model editing environment: This component offers the functionality required to generate a model from the proposed model elements marked in LIDA's Text Analysis Environment. The functional features of this component are:

• Display the list of marked candidate model elements and add them to the model editing environment. Transfer of information between the text analysis environment and the model editing environment helps the developer to analyze the problem in detail.
• Suggest operations for combining elements into a class model.
• Add textual context helpful to the model builder.
• Generate a textual description of the model for documentation and for validation of the model with a domain expert.

ii. LIDA text description: LIDA uses ModelExplainer, an integrated tool, which generates the hypertext description document for the object model. This document is generated from customized text which includes class information such as superclass, subclass, attributes, operations and associations with other classes. These descriptions help to obtain additional information about the final result.

Table 2 provides a comparative analysis of the approaches to obtaining the elements of OO analysis from the SRS.

3. ANALYSIS OF APPROACHES

At present, object oriented systems are widely used for software development [15-19]. The customer states all its requirements in a document called the Software Requirement Specification. This SRS document is written in NL, which is understandable on the customer side, but it is sometimes incomplete and ambiguous. The development team needs to go through this document, generate UML diagrams and analyze them on the basis of OO analysis. The UML, being very often used for OOA, tries to fix the class diagram, where the class is the basic element of an OO system.

During the course of the paper, it can be found that different approaches are adopted to generate the class diagram and its corresponding details. These approaches can be summarized as follows:

• The software requirement document is considered as an input for the analysis.

• As it is written in NL, it contains some ambiguities or unwanted informa-
Table 2
Comparison of generation of OO elements from SRS using manual approach

Abbott [5]
  Approach: This paper analyzes the English statements of the SRS and generates elements of OO analysis by identifying data types, objects, operators and control structures.
  Strengths: Comparatively easy to find the candidates for classes and their details.
  Weaknesses: Domain knowledge is required for the analysis.

Saeki et al. [6]
  Approach: This paper derives a formal specification from an informal specification in English and from that obtains elements of OO. It generates a Noun table, Verb table, Action table, Action Relation table and Module Design document from the informal specification.
  Strengths: The informal SRS document is refined and rewritten into a formal document understandable by everyone.
  Weaknesses: The large size of the informal specification may be a concern, and further analysis of nouns is needed, as the proposed approach is mainly verb oriented.

Nanduri and Rugaber [7]
  Approach: This paper extracts the candidate objects, methods and their associations from the requirement document and then composes them to generate an object model. It uses a link grammar based parser to parse sentences and generates the object diagram from the knowledge gained.
  Strengths: A graphical model is generated from the requirement document.
  Weaknesses: Parser inadequacy, ambiguous and incomplete specifications, and lack of domain knowledge make the final result unsatisfactory.

Juristo et al. [8]
  Approach: This paper uses the linguistic information in the informal specification. It analyzes the information semantically and syntactically and finally applies a semi-formal procedure to obtain the OO system components.
  Strengths: The proposed approach prevents incorrect modeling constructs, and the modeling is repeatable.
  Weaknesses: As the process depends entirely on the requirement specification, it is assumed that the textual document is correct.

D. Popescu et al. [9]
  Approach: This paper identifies the ambiguities in an NL SRS. The proposed “Dowser” tool uses a constraining grammar, NL parsing and transformation rules to generate the object model.
  Strengths: The OOAM diagram is generated using a tool and then verified by a human analyst for better accuracy.
  Weaknesses: Only the static behavior of the SRS is considered; it does not manage modeling behavior.

Ibrahim and Ahmad [10]
  Approach: This paper uses the requirement analysis process and extracts a class diagram using NLP and Domain Ontology. The proposed RACE tool analyzes textual requirements, finds the relationships among them and finally generates a class diagram.
  Strengths: RACE finds the concepts based on noun, noun phrase and verb analysis. It is able to find generalization, association, composition, aggregation and dependency relationships.
  Weaknesses: It cannot find one to one, one to many or many to one relationships, and RACE is not platform independent; it works only on the Windows platform.

Harmain and Gaizauskas [11]
  Approach: This paper uses the CM-Builder tool for OO analysis. After analysis of the software requirement document, a discourse model is designed, from which the object classes and relationships are generated.
  Strengths: The proposed model uses different linguistic techniques to analyze the text and defines rules to generate candidates for the class model, so the ambiguities present in the software requirement document do not hamper the result.
  Weaknesses: The final output is obtained in CDIF form, which is not understandable by everyone, and a CASE tool supporting CDIF is needed to generate the class diagram graphically.

Overmyer and Rambow [12]
  Approach: The paper uses the LIDA tool to provide linguistic assistance in the model development process. This assistance facilitates the analysis and extends to the creation of the class model.
  Strengths: It provides a graphical approach to analyzing the text and has features that can simplify the process of class generation.
  Weaknesses: The text analysis carried out is mostly manual, so it is time consuming, and the analyst should be a domain expert, who can be quite difficult to find.

Mich [13]
  Approach: This paper uses the NL-OOPS tool, based on LOLITA. The OO modeling module uses algorithms that filter the entity and event nodes generated by LOLITA and identify classes and associations.
  Strengths: It provides a graphical interface which makes the process of generation quite easy. The tool can also be easily integrated with other CASE tools to support lower level development.
  Weaknesses: In order to make the analysis fully automated, a senior analyst has to control the output, and the final output class model is not on par with the UML class diagram.
tion. So, in order to remove these, different steps are carried out in all the papers.

• Each word in the text is tagged with a POS. The words are then combined depending upon their POS.

• The noun- and verb-tagged words are mainly used for class names and their operations, respectively. These words are therefore stemmed to obtain the root word and its suffixes.

• After the root nouns are obtained, the higher-frequency nouns are considered, as they are the most eligible ones for fixing class names.

• For operations in a class, the verbs present in the sentence are the best candidates.

• The adjectives present in the text act as attributes of the class for the noun they modify.

• In order to find the relationship between
classes, the relation between the subject and object of a sentence is found out.

• For other rules such as multiplicity, determiners are used that specify relationships like one-one, one-many, many-one and many-many.

4. CONCLUSIONS AND FUTURE SCOPE

Different tools have been developed to analyze the text, but there is no exhaustive dictionary that provides the POS for each word. Although a few tools generate the class diagram, different authors suggest that manual intervention is needed to improve the final result. Unless there are specific rules for writing the SRS document, ambiguities will continue to be present in it and will cause issues in compiling the SRS. Though many approaches have been proposed and are used to obtain the elements of OO analysis, there is still scope for research in this area. Automated understanding of an SRS written in informal NL also remains a research issue.

REFERENCES

1. J Rumbaugh, I Jacobson and G Booch. Unified Modeling Language Reference Manual, Pearson Higher Education, 2004.
2. R S Pressman. Software Engineering: A Practitioner’s Approach, McGraw-Hill, New York, 7, 2010.
3. E Kumar. Natural Language Processing, IK International Pvt Ltd, 2011.
4. K S Jones. Natural Language Processing: A Historical Review, in Current Issues in Computational Linguistics: in Honour of Don Walker, Springer, pages 3–16, 1994.
5. R J Abbott. Program Design by Informal English Descriptions, Commun. ACM, 26(11):882–894, Nov. 1983.
6. M Saeki, H Horai and H Enomoto. Software Development Process from Natural Language Specification, in Proceedings of the 11th International Conference on Software Engineering, ser. ICSE ’89, New York, NY, USA: ACM, pages 64–73, 1989.
7. S Nanduri and S Rugaber. Requirements Validation via Automated Natural Language Parsing, in Proceedings of the Twenty-Eighth Hawaii International Conference on System Sciences, 3, IEEE, pages 362–368, 1995.
8. N Juristo, A M Moreno and M López. How to Use Linguistic Instruments for Object-Oriented Analysis, IEEE Software, 17(3):80–89, 2000.
9. D Popescu, S Rugaber, N Medvidovic and D M Berry. Reducing Ambiguities in Requirements Specifications via Automatically Created Object-Oriented Models, in Innovations for Requirement Analysis. From Stakeholders Needs to Formal Designs, Springer, pages 103–124, 2008.
10. M Ibrahim and R Ahmad. Class Diagram Extraction from Textual Requirements using Natural Language Processing Techniques, in Proceedings of IEEE 2010 Second International Conference on Computer Research and Development, pages 200–204, 2010.
11. H M Harmain and R Gaizauskas. CM-Builder: An Automated NL-based CASE Tool, in Proceedings of 15th IEEE International Conference on Automated Software Engineering, pages 45–53, 2000.
12. S P Overmyer, B Lavoie and O Rambow. Conceptual Modeling through Linguistic Analysis using LIDA, in Proceedings of the 23rd International Conference on Software Engineering, IEEE Computer Society, pages 401–410, 2001.
13. L Mich and R Garigliano. NL-OOPS: A Requirements Analysis Tool Based on Natural Language Processing, in Proceedings of Third International Conference on Data Mining Methods and Databases for Engineering, Bologna, Italy, 2002.
14. A D Booth. Machine Translation, North-Holland Publishing Company, 1967.
15. J Rumbaugh, M Blaha, W Premerlani, F Eddy, W E Lorensen et al. Object-oriented Modeling and Design, Prentice-Hall, Englewood Cliffs, NJ, 199, 1991.
16. F N Paulisch and W F Tichy. Edge: An Extendible Graph Editor, Software: Practice and Experience, 20(1):S63–S88, 1990.
17. M Jackson. Developing Ada programs using the Vienna Development Method, Software: Practice and Experience, 15(3):305–318, 1985.
18. R Gaizauskas, K Humphreys, H Cunningham and Y Wilks. University of Sheffield: Description of the LaSIE System as used for MUC-6, in Proceedings of the 6th Conference on Message