
KRISHNA INSTITUTE OF TECHNOLOGY NATURAL LANGUAGE PROCESSING ( UNIT 2)

UNIT -2

 Introduction to Semantics and Knowledge Representation

 Some applications like Machine Translation.

 NLP Database Interface.

B.TECH(CSE IV YEAR) MR. ANUJ KHANNA( Asstt. Professor)



SHORT ANSWER TYPE QUESTIONS

Ques 1. What do you mean by semantics in NLU?

Ans. The terms ‘semantics’ and ‘semantic interpretation’ usually refer to methods of representing the meanings of
natural language expressions, and of computing such meaning representations. A system for semantic analysis
determines the meaning of words in text. Semantics gives a deeper understanding of the text in sources such as
a blog post, comments in a forum, documents, group chat applications, chatbots, etc.

Ques 2. Mention a few elements of language that help in semantic analysis.


Ans. Elements that help in understanding the semantics of a language are:
 Hyponymy: the relationship between a generic term (the hypernym) and a specific instance of it, e.g. colour and red.
 Homonymy: two or more lexical terms with the same spelling but different, unrelated meanings.
 Polysemy: a single lexical term with two or more related meanings.
 Synonymy: two or more lexical terms with different spellings and similar meanings.
 Antonymy: a pair of lexical terms with contrasting meanings.
 Meronymy: the part-whole relationship between a lexical term and a larger entity, e.g. wheel and car.

Ques 3. Mention some of the issues in knowledge representation of NLP systems.


Ans. Semantic representation in NLP systems still confronts many specific difficulties, including the following:
 The representation of tense and aspect, and of adjectival and adverbial modification.
 Nominalization, generic sentences, propositional attitudes, counterfactual conditionals, comparatives, and generalized quantifiers.
 Many aspects of the disambiguation process, including word-sense disambiguation.
 The inference of implicit causal connections, plans, goals, reasons, and so on remains a refractory problem.

Ques 4. What are semantic nets?


Ans : The semantic network based knowledge representation mechanism is useful where an object or concept is
associated with many attributes and where relationships between objects are important. Semantic nets have also been used
in natural language research to represent complex sentences expressed in English. A semantic network is a simple
representation scheme that uses a graph of labeled nodes and labeled directed arcs to encode knowledge.
 Nodes: objects, concepts, events.
 Arcs: relationships between nodes. Arcs define binary relations that hold between the objects denoted by the nodes.
Non-binary relation: the generic give event can be represented as a relation involving three things:
an agent, a beneficiary, and an object.
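The node-and-arc scheme above can be sketched directly as a set of labeled triples. This is only an illustrative toy, not a standard library; the entity and relation names are invented for the examples in this unit.

```python
# A minimal sketch of a semantic net: each arc is a (node, relation, node)
# triple, i.e. a labeled directed edge between two labeled nodes.

triples = {
    ("Sima", "isa", "Girl"),
    ("Girl", "isa", "Person"),
    ("Ram", "taller_than", "Hari"),
}

def related(node, relation):
    """Return all nodes reachable from `node` along arcs labeled `relation`."""
    return {obj for (subj, rel, obj) in triples
            if subj == node and rel == relation}

print(related("Sima", "isa"))  # the class that the node "Sima" belongs to
```

Each arc here encodes exactly one binary relation; a non-binary event such as give would need its own event node with agent, beneficiary, and object arcs hanging off it.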


Consider the following examples:


1. Suppose we have to represent the sentence “Sima is a girl”.

2. “Ram is taller than Hari”.

Ques 5. What are the conceptual tenses and conceptual states proposed by Schank?
Ans (i) Conceptual Tenses:

past: p
future: f
negation: /
start of a transition: ts
end of a transition: tf
present: nil
conditional: c
continuous: k
interrogative: ?
timeless: ∞


(ii) Conceptual States:

Ques 6 : Differentiate between semantics and pragmatics.


Ans.
1. Semantics: looks at the literal meaning of words and the meanings that are created by the relationships between linguistic expressions.
   Pragmatics: recognizes how important context can be when interpreting the meaning of discourse, and also considers things such as irony, metaphors, idioms, and implied meanings.
2. Semantics: looks at the literal meanings of words.
   Pragmatics: looks at the intended meaning of words.
3. Semantics: is limited to the relationship between words.
   Pragmatics: covers the relationships between words, interlocutors (people engaged in the conversation), and contexts.

Ques 7. Difference between semantic nets and partitioned nets.


Ans :
1. Semantic nets: a representation of knowledge in the form of a graph of interconnected nodes.
   Partitioned nets: also a graph representation of real-world knowledge, but with emphasis on quantifiers as well (both existential and universal).
2. Semantic nets: can take a long time to answer a question.
   Partitioned nets: take less time.
3. Semantic nets: the graph is drawn as a single whole entity, with IS-A and KIND-OF relations between nodes.
   Partitioned nets: the graph is partitioned based on the agent who creates the event and the entity who benefits from that event.

LONG ANSWER TYPE QUESTIONS

Ques 8. How knowledge is represented using semantic nets? Explain with example. Mention few
advantages and disadvantages also.

Ans : Semantic networks structure the knowledge of a specific part of any information. It uses real-world
meanings that are easy to understand. For example, it uses "is a" and "is a part" inheritance hierarchies. Besides
this, they can be used to represent events and natural language sentences. The semantic network based
knowledge representation mechanism is useful where an object or concept is associated with many attributes
and where relationships between objects are important. Semantic nets have also been used in natural language
research to represent complex sentences expressed in English.


 It represents knowledge in the form of graphs with the help of interconnected nodes. It's a widely
popular idea in artificial intelligence and natural language processing because it supports reasoning.
 A semantic network is an alternative way to represent knowledge in a logical way.
 Arcs show the relationships between objects.

A semantic net showing the relationships and attributes of a baseball player


To store knowledge, a semantic net has the following components:
 Lexical component: consists of nodes, links, and labels. Nodes represent physical entities, the links show the
relationship between the objects, and labels denote particular objects and their relationships.
 Structural component: the links and nodes form a directed diagram.
 Semantic component: definitions are related only to the links and labels of nodes, while facts depend on the
approved areas.
 Procedural part: has constructors and destructors. Constructors allow the creation of new links and
nodes, while destructors permit the removal of links and nodes.

ADVANTAGES OF SEMANTIC NET


 The semantic network is a natural way to represent knowledge about the world.
 It communicates information in a transparent form.
 It’s easy to decode.

DISADVANTAGES OF SEMANTIC NET


 The semantic network can take a long time to answer a question. For instance, we may need to traverse the
whole network to get a simple answer, and in the worst case we may end up with no answer at all.
 It stores information the way a human brain does, but in reality it is not possible to create that sort of vast
network.
 These representations can be confusing, as quantifiers like “for all”, “for some”, “none”, etc. seem to be
missing.


Ques 9. Explain partitioned nets and their properties. How is information deduced through
partitioned nets?

Ans. Partitioned Semantic Network : There are some complex sentences which cannot be represented by simple
semantic nets, and for these we follow the technique of partitioned semantic networks. Partitioned semantic
nets allow:
1. Propositions to be made without commitment to truth.
2. Expressions to be quantified (universal or existential quantification).
In a partitioned semantic network, the network is broken into spaces which consist of groups of nodes and arcs,
and each space is regarded as a node.
Examples: (i) Seema is eating apples.

(ii) Every Dog has bitten a shopkeeper.

In the above examples, GS stands for a general statement about the real world that is true for all instances of a class
which hold some property. GS has the form of another semantic net which shows the relationship between
objects with respect to some event that occurred.
The node g is an instance of GS which universally quantifies the event.


Information in partitioned nets can be deduced by the following techniques:


(a) Recognizing Text Entailment : Given pairs of texts as input, the task of recognizing text entailment is to
identify whether the semantic meaning of one text is entailed by, or can be inferred from, the other text. The output is
binary (entailment or non-entailment).

(b) Paraphrasing Task: Given pairs of texts as input, the task of paraphrasing is to recognize whether each pair of
texts captures a paraphrase/semantic-equivalence relationship. A paraphrase is the restatement of the meaning of a text
using different words. The output is binary (paraphrase or non-paraphrase).

(c) Fact-checking : Given some claims as input, the task of fact-checking is to check the factuality of these claims.
Factuality indicates the degree to which a claim is actually right or wrong. The predefined output can be binary (e.g., true
or false) or multi-class (e.g., true, false, half-true, mostly-true, etc.).

(d) Relation Extraction : Given some texts and entities from the texts as input, the task of relation extraction is to
identify the semantic relations between two or more entities. The semantic relations are predefined and take direction
into account; the direction of a relation indicates who modifies what.
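Each of these tasks takes text pairs in and produces a discrete label. As an illustration of that input/output shape only, here is a naive lexical-overlap baseline for the paraphrase task; real systems use far richer semantic features, and the threshold here is an arbitrary assumption.

```python
# Toy paraphrase detector: label a pair of texts as a paraphrase when the
# Jaccard similarity of their word sets crosses a threshold. This is a
# crude baseline, shown only to illustrate the binary-output task shape.

def jaccard(a: str, b: str) -> float:
    """Word-set overlap: |A ∩ B| / |A ∪ B|."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def is_paraphrase(a: str, b: str, threshold: float = 0.5) -> bool:
    return jaccard(a, b) >= threshold

print(is_paraphrase("the cat sat on the mat", "on the mat the cat sat"))
```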

Ques 10. How can humans help a machine translation system produce better quality output? What are the
various methods of machine translation?

Ans. We can adopt the following practices to produce better quality output from machine translation systems:

 Use short sentences


 Make sure your sentence structure is well-written
 Aim for simple sentence structure
 Use adverbs concisely
 Avoid industry jargon
 Stay away from slang
 Avoid compound words
 Don’t use ambiguous words
Methods or approaches of machine translation applications are as described below:

(a). RULE BASED MACHINE TRANSLATION (RBMT): RBMT, also called Knowledge Based Machine
Translation, retrieves rules from bilingual dictionaries and grammars based on linguistic information about the source and
target languages. RBMT generates target sentences on the basis of the syntactic, morphological and semantic regularities of
each language. It converts source language structures to target language structures, and it is extensible and maintainable.
There are three types of RBMT systems:
(i) Direct translation systems


(ii) Transfer based systems


(iii) Interlingua based systems.

Direct method (Dictionary Based Machine Translation): Source language text is translated without passing through
an intermediary representation. The words are translated as a dictionary does, word by word, usually without much
correlation of meaning between them. Dictionary lookups may be done with or without morphological analysis.
Anusaarka, developed by the Indian Institute of Information Technology, Hyderabad, is an example of a system that
uses the direct approach.

Transfer Rules Based Machine Translation Systems: Morphological and syntactic analysis are the
fundamental approaches in transfer based systems. Here, source language text is converted into a less language
specific representation, and a target language representation at the same level of abstraction is generated with the
help of grammar rules and bilingual dictionaries.


In the transfer approach, translation divergence is handled by transfer rules that transform a source language (SL)
sentence into the target language (TL) by performing lexical and structural manipulations. Mantra is a transfer
based tool, a project funded by the Government of India.

Interlingual RBMT Systems (Interlingua): This model is intended to achieve linguistic homogeneity across languages. In
this method, the source language is translated into an intermediary representation that does not depend on any language.
The target language text is then derived from this auxiliary form of representation.

Challenge with Interlingual Rules Based Machine Translation


 Hard to handle exceptions to rules in the interlingua.
 The number of rules grows drastically in general translation systems.

Corpus based machine translation : One of the main methods of machine translation is corpus based machine
translation, because a high level of accuracy is achieved at translation time by this method. Large volumes of
translations become available once a corpus based system is developed, and these are used in various computer-aided
translation applications. The following are the different types of corpus based machine translation models.

(a). Statistical Machine Translation (SMT) : Statistical models are applied in this method to create translated output
with the assistance of bilingual corpora. The concept of statistical machine translation comes from information theory.
The important feature of this method is that no customization work is required from linguists, because the tool learns
translation patterns through statistical analysis of bilingual corpora.

(b). Example based Machine Translation System: This method is also called memory based translation: a set
of sentences from the source language is given, and corresponding translations in the target language are generated with
point to point mapping. Examples are used to translate similar types of sentences; when a previously translated sentence
recurs, the same translation is likely to be correct again. The main advantages of this model are that it works well with a
small data set and that output can be generated quickly by training the translation program. The example based method is
mainly used to translate between two very different languages, such as Japanese and English. It is not possible to apply
deep linguistic analysis, which is one of the main drawbacks of an example based engine. PanEBMT is an example of an
EBMT tool.

(c). Hybrid Machine Translation: HMT takes advantage of both RBMT and statistical machine translation. It uses
RBMT as a baseline and refines the rules through statistical models. Rules are used to pre-process data in an attempt to
better guide the statistical engine. Hybrid models differ in various ways.

 Rules Post-Processed by Statistics : A rule based tool is used for translation first. A statistical model is then
applied to adjust the translated output of the rule based tool.
 Statistics Guided by Rules : In this method, rules are applied to pre-process the input, which gives better guidance
to the statistical tool. Rules are also used to post-process the statistical output, which results in normalized output. This
method offers more flexibility, power and control at translation time.

Challenges with Hybrid Machine Translation:


 Speech agreement mistakes.
 Extra punctuation marks.
 Wrong capitalization.

Ques 11. What are optimization techniques in semantic nets? Explain.

Ans. Frame based techniques are used to optimize semantic networks by incorporating reusability into the language
system via inheritance. Frame based representation is a development of semantic nets that allows us to express the idea
of inheritance. A frame system consists of a set of frames (or nodes), which are connected together by relations. Each
frame describes either an instance or a class. Each frame has one or more slots, which are assigned slot values. This is the
way in which the frame system is built up. Rather than simply having links between frames, each relationship is expressed
by a value being placed in a slot. Example:

Diagrammatic representation of frames is as mentioned below:

When we say, “Tommy is a dog”, we really mean “Tommy is an instance of the class dog” or “Tommy is a
member of the class of dogs”. Why are frames useful? The main advantage of using frame-based systems for
expert systems is that all information about a particular object is stored in one place (this is where the optimization takes place).
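The slot-and-inheritance idea above can be sketched in a few lines. The class, slot, and value names (Dog, legs, owner, etc.) are invented for illustration; this is a minimal sketch, not a full frame language.

```python
# A minimal frame system sketch: each frame has named slots, and an
# instance frame inherits slot values from its parent class frame.

class Frame:
    def __init__(self, name, parent=None, **slots):
        self.name = name
        self.parent = parent      # class frame to inherit from, if any
        self.slots = slots        # locally stored slot values

    def get(self, slot):
        """Look up a slot, falling back to parent frames (inheritance)."""
        if slot in self.slots:
            return self.slots[slot]
        if self.parent is not None:
            return self.parent.get(slot)
        return None

dog = Frame("Dog", legs=4, sound="bark")            # class frame
tommy = Frame("Tommy", parent=dog, owner="Ravi")    # instance frame

print(tommy.get("legs"), tommy.get("owner"))
```

Asking Tommy for `legs` walks up to the Dog frame, which is exactly the "all information in one place, reused via inheritance" optimization described above.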


Ques 12. Explain the goal of conceptual dependency. What are its components and primitive actions?

Ans : Conceptual Dependency was originally developed to represent knowledge acquired from natural language
input. It was proposed by Schank. The goals of this theory are:

 To provide automated reasoning/ inference from sentences.


 To be independent of the words used in the original input, i.e. different words and structures can represent
the same concept.
 To provide a language-independent meaning representation, which means that for any two or more sentences that are
identical in meaning there should be only one representation of that meaning.

Components of Conceptual Dependency Graph.

 A structure into which nodes representing information can be placed.


 A specific set of primitives.
 Information of Tenses and Moods.
 Sentences are represented as a series of diagrams depicting actions using both abstract and real physical
situations.

 The agent and the objects are represented.


 The actions are built up from a set of primitive acts which can be modified by tense.
 Focuses on concepts instead of syntax.
 Focuses on understanding instead of structure.

(A) Five Primitives for Physical Actions

INGEST: to take something inside an animate object.


EXPEL: to take something from inside an animate object and force it out.
GRASP: to physically grasp an object.
MOVE: to move a body part.
PROPEL: to apply a force to an object.

(B) Other Primitive Actions

 State Changes (physical and abstract transfers)

PTRANS: to change the location of a physical object.


ATRANS: to change an abstract relationship of a physical object.
 Mental acts

MTRANS: to transfer information mentally.


MBUILD: to create or combine thoughts.
(C) Instruments for other ACTs

SPEAK: to produce a sound.


ATTEND: to direct a sense organ or focus an organ towards a stimulus.
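To see how the primitives combine with the conceptual tenses, the CD analysis of "John gave Mary a book" (a past ATRANS, a transfer of possession) can be sketched as a plain data structure. The dictionary layout is an illustrative assumption, not a standard CD serialization.

```python
# Sketch of Schank's conceptual dependency for "John gave Mary a book":
# giving is an ATRANS (abstract transfer of a possession relationship).

cd = {
    "act": "ATRANS",
    "tense": "p",        # p = past, per Schank's conceptual tenses
    "actor": "John",
    "object": "book",
    "from": "John",      # previous possessor
    "to": "Mary",        # new possessor (the recipient)
}

def recipient(event):
    """The receiver of the object, for the transfer acts."""
    return event["to"] if event["act"] in ("ATRANS", "PTRANS") else None

print(recipient(cd))
```

Note that "Mary received a book from John" would produce this same structure, which is exactly the language-independence goal stated above.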


Ques 13. What are the various conceptual categories, conceptual roles and conceptual syntax rules applied
in conceptual dependency?

Ans . Conceptual Categories


PP (picture producer) : a physical object. Actors must be an animate PP or a natural force.
ACT : One of eleven primitive actions.
LOC : Location.
T : Time.
AA (action aider) : modifications of features of an ACT. e.g., speed factor in PROPEL.
PA : Attributes of an object, of the form STATE(VALUE). e.g., COLOR (red).

Conceptual roles
Conceptualization: The basic unit of the conceptual level of understanding.
Actor: The performer of an ACT.
ACT: An action done to an object.
Object: A thing that is acted upon.
Recipient: The receiver of an object as the result of an ACT.
Direction: The location that an ACT is directed toward.
State: The state that an object is in.

Conceptual Syntax Rules (Pictorial representation)


Ques 14. Draw the conceptual graphs for following sentences:


(i) John gave Mary a book.
(ii) John sold his car to Bill
(iii) John annoyed Mary
(iv) John grew the plants with fertilizer.
(v) John threw a ball to Mary.

Ans. (i)John gave Mary a book.


(ii) John sold his car to Bill

(iii)John annoyed Mary

(iv)John grew the plants with fertilizer.


(v) John threw a ball to Mary.

Ques 15. What is a corpus? Explain its types and importance.

Ans. A corpus is a large collection of authentic text (i.e., samples of language produced in genuine
communicative situations), and corpus linguistics is any form of linguistic inquiry based on data derived from
such a corpus.
A corpus can be defined as a systematic collection of naturally occurring texts (of both written and spoken
language). “Systematic” means that the structure and contents of the corpus follows certain extra-linguistic
principles (“sampling principles”, i.e. principles on the basis of which the texts included were chosen).
For example, a corpus is often restricted to certain text types, to one or several varieties of English, and to a
certain time span. If several subcategories (e.g. several text types, varieties etc.) are represented in a corpus,
these are often represented by the same amount of text. “Systematic” also means that information on the exact
composition of the corpus is available to the researcher (including the number of words in each category and
in the whole corpus, how the texts included in the corpus were sampled etc).
The four major points of criticism leveled at the use of corpus data in linguistic research are the
following:
1. Corpora are usage data and thus of no use in studying linguistic knowledge.
2. Corpora and the data derived from them are necessarily incomplete;
3. Corpora contain only linguistic forms (represented as graphemic strings), but no information about the
semantics, pragmatics, etc. of these forms; and
4. Corpora do not contain negative evidence, i.e., they can only tell us what is possible in a given language, but
not what is not possible.


The four main characteristics of the modern corpus:


 Sampling and representativeness
 Finite size
 Machine-readable form
 A standard reference

Corpus Representativeness
Representativeness is a defining feature of corpus design. The following definitions from two researchers,
Leech and Biber, will help us understand corpus representativeness:

 According to Leech (1991), “A corpus is thought to be representative of the language variety it is
supposed to represent if the findings based on its contents can be generalized to the said language
variety”.

 According to Biber (1993), “Representativeness refers to the extent to which a sample includes the full
range of variability in a population”.

Corpus Balance : This is defined as the range of genres included in a corpus. A balanced corpus covers a wide
range of text categories, which are supposed to be representative of the language. We do not have any reliable
scientific measure of balance; the best estimation and intuition work in this concern.

Sampling: Another important element of corpus design is sampling. Corpus representativeness and balance are
very closely associated with sampling. That is why we can say that sampling is inescapable in corpus building.

 According to Biber(1993), “Some of the first considerations in constructing a corpus concern the
overall design: for example, the kinds of texts included, the number of texts, the selection of particular
texts, the selection of text samples from within texts, and the length of text samples. Each of these
involves a sampling decision, either conscious or not.”

 Sampling unit − It refers to the unit which requires a sample. For example, for written text, a sampling
unit may be a newspaper, journal or a book.

 Sampling frame − The list of all sampling units is called a sampling frame.

 Population − It may be referred to as the assembly of all sampling units. It is defined in terms of language
production, language reception or language as a product.

Corpus Size : This defines how large the corpus should be. There is no specific answer to this question. The size
of the corpus depends upon the purpose for which it is intended, as well as on some practical considerations, as
follows :


 Kind of query anticipated from the user.


 The methodology used by the users to study the data.
 Availability of the source of data.

Tree Bank Corpus : It may be defined as a linguistically parsed text corpus that annotates syntactic or semantic
sentence structure. Geoffrey Leech coined the term ‘treebank’, which reflects that the most common way of
representing the grammatical analysis is by means of a tree structure. Generally, treebanks are created on
top of a corpus which has already been annotated with part-of-speech tags.

Types of Tree Bank Corpus

 Semantic Treebanks : These treebanks use a formal representation of a sentence’s semantic structure.
They vary in the depth of their semantic representation.

 Syntactic Treebanks : In contrast to semantic treebanks, syntactic treebanks annotate the syntactic
structure of sentences, most commonly in the form of parse trees built over part-of-speech-tagged text.

Prop Bank Corpus: PropBank, more specifically called the “Proposition Bank”, is a corpus which is
annotated with verbal propositions and their arguments. The corpus is a verb-oriented resource; the
annotations here are more closely related to the syntactic level.

VerbNet (VN): VerbNet is a hierarchical, domain-independent lexical resource, the largest of its kind for
English, that incorporates both semantic and syntactic information about its contents.

WordNet: WordNet, created at Princeton, is a lexical database for the English language. It is part of the NLTK
corpus collection. In WordNet, nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms
called synsets. All the synsets are linked with the help of conceptual-semantic and lexical relations.

Ques 16. How is an intelligent database interface created for an NLP system? Explain with example.

Ans : The importance of an NLIDB system is that it makes it easy for users with no programming experience to
deal with a database. The user can simply use natural language to interact with the database, which is very simple and
easy. Also, the user does not need special training to use such systems (perhaps some training to learn the
interface). The user is not forced to learn any database language.


In general, the existing methods to interact with a database using NLP can be divided into three categories:
(1) Pattern Matching Models or Template Based Approach.
(2) Syntactic Models.
(3) Syntactic and Semantic Models.
In the first model, the entered query is processed by matching it against a predefined set of rules or patterns. The
next step translates it into a logical form according to the pattern it belongs to. From these rules, the
database query is formulated directly.
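A minimal sketch of this pattern-matching (template-based) approach follows. The question shapes, table name, and column names are all invented for illustration; a real NLIDB would have many more templates.

```python
# Template-based NL-to-SQL: each regex pattern maps one fixed question
# shape directly onto a SQL template. Table/column names are hypothetical.
import re

PATTERNS = [
    (re.compile(r"show salary of (\w+)", re.I),
     "SELECT salary FROM employees WHERE name = '{0}'"),
    (re.compile(r"list employees in (\w+)", re.I),
     "SELECT name FROM employees WHERE department = '{0}'"),
]

def to_sql(question: str):
    """Return the SQL for the first matching template, else None."""
    for pattern, template in PATTERNS:
        m = pattern.search(question)
        if m:
            return template.format(*m.groups())
    return None  # query does not fit any predefined pattern

print(to_sql("Show salary of Ravi"))
```

The strength and the weakness are the same thing: anything matching a template translates instantly, and anything else fails, which is why the syntactic and semantic models below exist.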

 The syntactic model in general presents linguistic information based on tokenizers, morphological analyzers and
part-of-speech (POS) tagging. There are eight parts of speech in English grammar: verbs, nouns, pronouns,
adjectives, adverbs, prepositions, conjunctions and interjections.

 POS tags alone are not enough to convert a natural language query into SQL, so we need to add more
information that we can use to understand the query. For this, the Stanford Named Entity Recognizer
(NER) is used to assign the keywords extracted from the query to the predefined category each belongs
to.

In the third model, a constituent syntactic tree produced by a syntactic parser is used, where the leaves are
mapped to a database query based on predefined syntactic grammars. The process starts with syntactic
analysis performed by the Stanford POS tagger. Then the keyword extractor uses the information from the POS
tagger to extract the keywords that are passed to the NER. The named entity recognizer identifies the related
domain concepts, such as person or department. For complex queries we go through dependency semantic parsing.
We then move to node mapping, which maps each keyword node to the corresponding SQL
statement component. Finally, the SQL statement is executed against the relational database.


Ques 17. Explain information retrieval and information extraction in NLP.

Ans. Information Extraction: Extraction means “pulling out” and retrieval means “getting back”.
Information retrieval is about returning the information that is relevant to a specific query or field of interest
of the user. Information extraction is the standard process of taking data and extracting structured information
from it, so that it can be used for various purposes, one of which may be in a search engine.

Information Retrieval: Information retrieval refers to the human-computer interaction that happens when
we use a machine to search for information objects (content) that match our search query. It is all about
retrieving information that is stored in a database or computer and related to the user’s needs. A user’s query is
matched against a set of documents to find the relevant documents. There are various methods and techniques
used in information retrieval. In an information retrieval system, we reduce information overload using an
automated IR system.

 Precision: the number of documents that are retrieved and relevant to the user’s information need, divided
by the total number of documents retrieved.
 Recall: the number of documents that are retrieved and relevant to the user’s information need, divided by
the total number of relevant documents in the whole document set.
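These two definitions translate directly into set arithmetic over document ids. The document ids below are invented purely for illustration.

```python
# Precision and recall as defined above, computed on sets of document ids.

def precision(retrieved: set, relevant: set) -> float:
    """Fraction of retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved: set, relevant: set) -> float:
    """Fraction of relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant)

retrieved = {1, 2, 3, 4}   # documents the system returned
relevant = {2, 4, 5}       # documents the user actually needed

print(precision(retrieved, relevant))  # 2 of the 4 retrieved are relevant
print(recall(retrieved, relevant))     # 2 of the 3 relevant were found
```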
Various techniques used in information retrieval are:

 Vector space retrieval


 Boolean space retrieval
 Term-document matrix
 Blocked sort-based indexing
 Tf-idf indexing
 Various clustering methods
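Of the techniques listed, tf-idf is the easiest to show concretely. The sketch below uses the common tf * log(N/df) weighting over a three-document toy corpus (one of several tf-idf variants; the documents are invented).

```python
# Toy tf-idf over a three-document corpus: a term scores high when it is
# frequent in one document (tf) but rare across documents (idf).
import math

docs = [
    "the cat sat on the mat",
    "the dog barked",
    "the cat chased the dog",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf_idf(term: str, doc_index: int) -> float:
    doc = tokenized[doc_index]
    tf = doc.count(term) / len(doc)                  # term frequency
    df = sum(1 for d in tokenized if term in d)      # document frequency
    if df == 0:
        return 0.0
    return tf * math.log(N / df)

print(tf_idf("cat", 0))  # "cat" occurs in only 2 of 3 docs, so idf > 0
print(tf_idf("the", 0))  # "the" occurs in every doc, so its idf is 0
```

Note how "the", despite being the most frequent token, scores zero: log(N/df) vanishes when df = N, which is exactly the stop-word-damping behaviour that makes tf-idf useful for ranking.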

Ques 18. (i) What are conversational agents? Give few examples of them. Describe architecture of a
general conversational software.
(ii) Mention some issues which may be observed in dialogue based agents.

Ans. Conversational agents, chatbots, dialogue systems and virtual assistants are some of the terms used in the
scientific literature to describe software-based systems which are capable of processing natural language data
to simulate a smart conversational process with humans. These conversational mechanisms are built and driven
by a wide variety of techniques of different complexity, from traditional, pre-coded algorithms to emerging
adaptive machine learning algorithms. Usually deployed as service-oriented systems, they are designed to
assist users to achieve a specific goal based on their personal needs.
The use of natural language interfaces in the field of human-computer interaction is undergoing intense study
through dedicated scientific and industrial research. The latest contributions in the field, including deep learning


approaches like recurrent neural networks, the potential of context-aware strategies and user-centred design
approaches, have brought back the attention of the community to software-based dialogue systems, generally
known as conversational agents or chat-bots.
Five distinct traditions in dialogue systems research involving communities that have largely worked
independently of one another are as mentioned below:
 Text-based and Spoken Dialogue Systems.
 Voice User Interfaces.
 Chat bots.
 Embodied Conversational Agents.
 Social Robots and Situated Agents.

Examples of Dialogue Based Agent Applications


 Dialogue systems that appeared in the 1960s and 1970s were text-based. BASEBALL, SHRDLU and GUS are
some well-known examples. BASEBALL was a question-answering system that could answer questions about
baseball games. The system was able to handle questions with a limited syntactic structure and simply rejected
questions that it was not able to answer. SHRDLU was linguistically more advanced, incorporating a large
grammar of English, semantic knowledge about objects in its domain (a blocks world), and a pragmatic
component that processed nonlinguistic information about the domain. GUS was a system for booking flights
that was able to handle linguistic phenomena such as indirect speech acts and anaphoric reference. For
example, the utterance “I want to go to San Diego on May 28” was interpreted as a request to make a
flight reservation, and the utterance “the next flight” was interpreted with reference to a previously
mentioned flight.
 Around the late 1980s and early 1990s, with the emergence of more powerful and more accurate speech
recognition engines, Spoken Dialogue Systems (SDSs) began to appear, such as: ATIS (Air Travel
Information Service) in the U.S.
 The DARPA Communicator systems were an exception, as they investigated multi-domain dialogues
involving flight information, hotels, and car rentals. These systems often suffered from speech
recognition errors, so a major focus was on avoiding miscommunication, for example by employing
various strategies for error detection and correction.
 Around 2000 the emphasis in spoken dialogue systems research moved from handcrafted systems using
techniques from symbolic and logic-based AI to statistical, data-driven systems using machine learning.


Fig: Present day Dialogue System

The following figure represents the architecture and components of a conversational agent:

Fig: Architecture of a general conversational agent


A straightforward and well-known approach to dialogue system architecture is to build it as a chain of processes, a
pipeline, where the system takes a user utterance as input and generates a system utterance as output. In this chain, the
speech recognizer (ASR) takes the user’s spoken utterance and:


(1) The ASR transforms it into a textual hypothesis of the utterance.
(2) The natural language understanding (NLU) component parses the hypothesis and generates a semantic
representation of the utterance.
(3) This representation is then handled by the dialogue manager (DM), which looks at the discourse and
dialogue context to, for example, resolve anaphora and interpret elliptical utterances, and generates a response
on a semantic level.
(4) The natural language generation (NLG) component then generates a surface representation of the utterance,
often in some textual form.
(5) The surface text is passed to a text-to-speech synthesizer (TTS), which generates the audio
output to the user.
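The five pipeline stages above can be sketched as a chain of functions. This is a minimal, hypothetical illustration: each stage is a stub (the intents, slot names, and canned strings are invented), whereas a real system would plug in ASR/TTS engines and trained NLU/NLG models.

```python
def asr(audio):                      # (1) speech -> textual hypothesis (stubbed)
    return "book a flight to delhi"

def nlu(text):                       # (2) text -> semantic representation
    intent = "book_flight" if "flight" in text else "unknown"
    city = text.split()[-1] if "to" in text else None
    return {"intent": intent, "destination": city}

def dialogue_manager(frame):         # (3) choose a response on the semantic level
    if frame["intent"] == "book_flight":
        return {"act": "confirm", "destination": frame["destination"]}
    return {"act": "clarify"}

def nlg(act):                        # (4) semantic response -> surface text
    if act["act"] == "confirm":
        return f"Booking a flight to {act['destination']}."
    return "Could you rephrase that?"

def tts(text):                       # (5) text -> audio (stubbed as a tagged string)
    return f"<audio:{text}>"

# The pipeline: each component's output feeds the next component's input.
reply = tts(nlg(dialogue_manager(nlu(asr(b"...")))))
```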

(ii) Issues related to dialogue based agents


 Issues such as prosodic analysis, discourse modeling, and deep and surface language generation.
 Instead of passing all data along the pipe, it is possible to have a shared information storage or a
blackboard that all components write and read to, and from which they may subscribe to events.
 The components may send other messages, and not just according to this pipeline. For example, the
ASR may send messages about whether the user is speaking or not directly to the TTS.
 The components might operate asynchronously and incrementally.
 Asynchronicity means that, for example, the ASR may recognize what the user is saying while the DM
is planning the next thing to say.
 Incrementality means that, for example, the ASR recognizes the utterance word by word as the words are
spoken, and that the NLU component simultaneously parses these words.

Ques 19.Write short notes on the following:


(i) Advantages of natural language to data base interface.
(ii) Sub components of NLIDB

Ans . (i) Following are some of the advantages of natural language to data base interface:
 Provide high-level intelligent tools that offer new insights into the contents of the database by
extracting knowledge from data.
 Make information available to larger numbers of people because more people can now utilize the system
due to its ease of use.
 Improve the decision making process involved in using information after it has been retrieved by using
higher level information models.
 Interrelate information from different sources using different media so that the information is more easily
absorbed and utilized by the user.


 No artificial language: One advantage of NLIDBs is that the user is not required to learn an artificial
communication language. Formal query languages like SQL are difficult to learn and master, at least
for non-computer specialists.
 Simple and easy to use: Consider a database with a query language or a form designed for entering
the query. While an NLIDB system requires only a single input, a form-based interface may
contain multiple inputs (fields, scroll boxes, combo boxes, radio buttons, etc.) depending on the
capability of the form.
 Tolerance of minor errors: Most NLIDB systems provide some tolerance of minor grammatical errors,
whereas in a formal computer system the lexicon must usually match exactly what is defined, the
syntax must correctly follow certain rules, and any error causes the input to be rejected
automatically by the system.

(ii) The problem of natural language access to a database is divided into two sub-components:


 Linguistic component
 Database component
Linguistic Component: It is responsible for translating natural language input into a formal query and generating
a natural language response based on the results of the database search.
Database Component: It performs traditional database management functions. A lexicon is a table used
to map the words of the natural language input onto the formal objects (relation names, attribute names, etc.) of the
database. Both the parser and the semantic interpreter make use of the lexicon. A natural language generator takes the
formal response as its input and inspects the parse tree in order to generate an adequate natural language response.
Natural language database systems make use of syntactic knowledge and knowledge about the actual database
in order to properly relate natural language input to the structure and contents of that database.
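The lexicon described above can be sketched as a simple lookup table. This is a hypothetical illustration: the relation and attribute names (EMPLOYEE, EMPLOYEE.SALARY, etc.) are invented for the example.

```python
# A toy lexicon mapping natural language words onto formal database objects.
LEXICON = {
    "employees": ("relation", "EMPLOYEE"),
    "salary":    ("attribute", "EMPLOYEE.SALARY"),
    "name":      ("attribute", "EMPLOYEE.NAME"),
}

def map_tokens(tokens):
    """Look each input word up in the lexicon; unknown words map to None."""
    return {t: LEXICON.get(t) for t in tokens}

mapping = map_tokens(["show", "salary", "of", "employees"])
# 'salary' maps to an attribute, 'employees' to a relation;
# function words like 'show' and 'of' are not in the lexicon.
```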

Ques 20.Explain briefly approaches for the development of NLIDB Systems. Also describe the
architecture of NLIDB system.
Ans. Main approaches to develop a natural language intelligent database systems are as following:
(i) Symbolic Approach (Rule Based Approach): Natural Language Processing appears to be a
strongly symbolic activity. Words are symbols that stand for objects and concepts in real
worlds, and they are put together into sentences that obey well specified grammar rules.
Knowledge about language is explicitly encoded in rules or other forms of representation.
Language is analyzed at various levels to obtain information. On this obtained information
certain rules are applied to achieve linguistic functionality. Since human language capabilities
include rule-based reasoning, they are well supported by symbolic processing. In symbolic
processing, rules are formed for every level of linguistic analysis.


(ii) Empirical Approach (Corpus Based Approach): Empirical approaches are based on
statistical analysis, as well as other data-driven analysis, of raw data in the form of
text corpora. A corpus is a collection of machine-readable text. Corpora are primarily used as
a source of information about language, and a number of techniques have emerged to enable
the analysis of corpus data. Syntactic analysis can be achieved on the basis of statistical
probabilities estimated from a training corpus. Lexical ambiguities can be resolved by
considering the likelihood of one or another interpretation on the basis of context.
(iii) Connectionist Approach (Using Neural Networks): Since human language capabilities are
based on neural networks in the brain, Artificial Neural Networks (also called connectionist
networks) provide an essential starting point for modeling language processing.

Architecture of NLIDB

Most current NLIDBs first transform the natural language question into an intermediate logical query,
expressed in some internal meaning representation language. The intermediate logical query expresses the
meaning of the user’s question in terms of high-level world concepts, which are independent of the database
structure. The logical query is then translated to an expression in the database’s query language and evaluated
against the database. The idea is to map a sentence into a logical query language first, and then further
translate this logical query into a general database query language, such as SQL. In the process
there can be more than one intermediate meaning representation language.

In the intermediate representation language approach, the system can be divided into two parts. One part
runs from the sentence up to the generation of a logical query; the other runs from the logical query to the
generation of a database query. In the first part, the use of logical query languages makes it possible to add
reasoning capabilities to the system by embedding the reasoning inside a logic statement. In addition,


because the logical query language is independent of the database, it can be ported to different database query
languages as well as to other domains, such as expert systems and operating systems.
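The two-stage translation described above can be sketched in miniature. This is a deliberately naive, hypothetical illustration: the pattern matching, the dict-based logical form, and the table/column names are all invented, and a real NLIDB would use a full parser and a richer meaning representation.

```python
def to_logical_query(sentence):
    # Stage 1: sentence -> intermediate logical query.
    # Naive pattern assumed: "list the <attribute> of <relation>"
    words = sentence.lower().split()
    return {"select": words[2], "from": words[4]}

def to_sql(logical):
    # Stage 2: logical query -> database query language (SQL here).
    return f"SELECT {logical['select']} FROM {logical['from']};"

sql = to_sql(to_logical_query("list the salary of employees"))
# -> "SELECT salary FROM employees;"
```

Because the logical form is independent of SQL, a second translator (for another query language or an expert system) could consume the same intermediate representation.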

A semantic grammar system is very similar to a syntax-based system, in that the query result is
obtained by mapping the parse tree of a sentence to a database query. The basic idea of a semantic grammar
system is to simplify the parse tree as much as possible, by removing unnecessary nodes or combining some
nodes together.
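The semantic-grammar idea can be illustrated with a tiny tree walk. This is a hypothetical sketch: the node labels (QUERY, ATTRIBUTE, RELATION) and schema names are invented; the point is that because the simplified parse tree uses domain-specific labels, mapping it to a database query is a direct traversal.

```python
# A simplified semantic parse tree represented as nested tuples:
# (label, child1, child2, ...) for internal nodes, (label, value) for leaves.
tree = ("QUERY",
        ("ATTRIBUTE", "salary"),
        ("RELATION", "employees"))

def tree_to_sql(node):
    label = node[0]
    if label == "QUERY":
        attr = tree_to_sql(node[1])   # the ATTRIBUTE child
        rel = tree_to_sql(node[2])    # the RELATION child
        return f"SELECT {attr} FROM {rel};"
    return node[1]  # leaf node: carries its value directly
```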

****************End of Unit -2*******************
