UNIT - 2
Ans. The terms ‘semantics’ and ‘semantic interpretation’ usually refer to methods of representing the meanings of
natural language expressions, and of computing such meaning representations. A system for semantic analysis
determines the meaning of words in text. Semantic analysis gives a deeper understanding of text from sources such as blog posts, forum comments, documents, group chat applications, chatbots, etc.
Ques 5. What are the conceptual tenses and states proposed by Schank?
Ans. (i) Conceptual Tenses (tense : symbol)
    past : p
    future : f
    negation : /
    start of a transition : ts
    end of a transition : tf
    present : nil
    conditional : c
    continuous : k
    interrogative : ?
    timeless : ∞
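For instance (a standard illustration from Schank's notation), the tense symbol is written over the two-way dependency link of a conceptualization: "John ran" becomes

    John <=>(p) PTRANS

where p marks the conceptualization as past tense.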
Ques 8. How is knowledge represented using semantic nets? Explain with an example. Mention a few advantages and disadvantages as well.
Ans. Semantic networks structure knowledge about a specific part of any information using real-world relations that are easy to understand; for example, they use "is a" and "is a part of" inheritance hierarchies. Besides this, they can be used to represent events and natural language sentences. The semantic network based knowledge representation mechanism is useful where an object or concept is associated with many attributes and where relationships between objects are important. Semantic nets have also been used in natural language research to represent complex sentences expressed in English.
A semantic net represents knowledge as a graph of interconnected nodes, with arcs showing the relationships between objects. It is a widely popular idea in artificial intelligence and natural language processing because it supports reasoning, and it offers a logical alternative to other ways of representing knowledge.
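As a small illustrative sketch (not part of the original notes; the nodes and relations are invented for illustration), a semantic net can be stored as a set of (node, arc-label, node) triples, and inheritance can be performed by following the "is a" arcs:

    # A toy semantic net stored as (subject, relation, object) triples.
    triples = {
        ("Tommy", "is_a", "Dog"),
        ("Dog", "is_a", "Mammal"),
        ("Dog", "has_part", "Tail"),
        ("Mammal", "is_a", "Animal"),
    }

    def isa_chain(node):
        """Follow 'is_a' arcs upward to collect all inherited classes."""
        classes = []
        frontier = [node]
        while frontier:
            current = frontier.pop()
            for s, r, o in triples:
                if s == current and r == "is_a":
                    classes.append(o)
                    frontier.append(o)
        return classes

    print(isa_chain("Tommy"))   # ['Dog', 'Mammal', 'Animal']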
Ques 9. Explain partitioned nets and their properties. How is information deduced through partitioned nets?
Ans. Partitioned Semantic Network: Some complex sentences cannot be represented by simple semantic nets; for these we use the technique of partitioned semantic networks. Partitioned semantic nets allow:
1. Propositions to be made without commitment to truth.
2. Expressions to be quantified (universal or existential quantification).
In a partitioned semantic network, the network is broken into spaces, each consisting of a group of nodes and arcs, and each space is regarded as a single node.
Example: (i) All Seemas are eating apples.
In the above example, GS stands for a general statement about the real world which is true for all instances of a class that hold some properties. GS has the form of another semantic net which shows the relationship between objects with respect to some event that occurred.
The node g is an instance of GS which universally quantifies the event.
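A minimal sketch of how the spaces of a partitioned net might be held in code (the structure, node names and arc labels are hypothetical, chosen to mirror the example above): each space groups its own nodes and arcs, and the inner space SA serves as the form of the general statement GS:

    spaces = {
        # Outer space: g is an instance of GS, and the sub-net SA is
        # treated as a single node giving the form of g.
        "GS": {"nodes": ["g"],
               "arcs": [("g", "isa", "GS"), ("g", "form", "SA")]},
        # Inner space: the event sub-net that GS quantifies over.
        "SA": {"nodes": ["e", "s", "a"],
               "arcs": [("e", "isa", "Eat"), ("s", "isa", "Seema"),
                        ("a", "isa", "Apple"), ("e", "agent", "s"),
                        ("e", "object", "a")]},
    }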
(b) Paraphrasing Task: Given some pairs of texts as input, the task of paraphrasing is to recognize whether each pair of texts captures a paraphrase/semantic-equivalence relationship. A paraphrase is the restatement of the meaning of a text using different words. The output is binary (paraphrase or non-paraphrase).
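As a hedged illustration (not from the original notes), a very crude paraphrase detector can threshold the cosine similarity of TF-IDF vectors; the 0.6 threshold is an arbitrary illustrative choice, and real systems use much richer semantic features:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def is_paraphrase(text_a, text_b, threshold=0.6):
        # Vectorize both texts and compare them; the output is binary,
        # matching the task definition above.
        tfidf = TfidfVectorizer().fit_transform([text_a, text_b])
        return cosine_similarity(tfidf[0], tfidf[1])[0, 0] >= threshold

    print(is_paraphrase("He bought a new car.", "He purchased a new car."))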
(c) Fact-checking: Given some claims as input, the task of fact-checking is to check the factuality of these claims. Factuality indicates the degree to which a claim is actually right or wrong. The predefined output can be binary (e.g., true or false) or multi-class (e.g., true, false, half-true, mostly-true, etc.).
(d) Relation Extraction: Given some texts and entities from the texts as input, the task of relation extraction is to identify the semantic relations between two or more entities. The semantic relations are predefined and directed; the direction of a relation indicates which entity modifies which.
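A minimal sketch of pattern-based relation extraction (the relation name, pattern and example sentences are invented for illustration); the direction is fixed by which capture group fills which argument:

    import re

    # One predefined, directed relation: works_for(PERSON -> ORGANIZATION).
    PATTERN = re.compile(r"(?P<person>[A-Z][a-z]+) works for (?P<org>[A-Z]\w+)")

    def extract_relations(text):
        return [("works_for", m.group("person"), m.group("org"))
                for m in PATTERN.finditer(text)]

    print(extract_relations("Ravi works for Infosys. Meena works for TCS."))
    # [('works_for', 'Ravi', 'Infosys'), ('works_for', 'Meena', 'TCS')]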
Ques 10. How can humans help a machine translation system to produce better quality? What are the various methods of machine translation?
Ans. The following practices and methods can be adopted to produce better quality in machine translation systems.
(a) RULE BASED MACHINE TRANSLATION (RBMT): RBMT, also called Knowledge Based Machine Translation, retrieves rules from bilingual dictionaries and grammars based on linguistic information about the source and target languages. RBMT generates target sentences on the basis of the syntactic, morphological and semantic regularities of each language. It converts source language structures to target language structures, and it is extensible and maintainable.
There are three types of RBMT systems:
(i) Direct Translation Systems (Dictionary Based Machine Translation): The source language text is translated without passing through an intermediary representation. The words are translated word by word, as a dictionary would translate them, usually without much correlation of meaning between them. Dictionary lookups may be done with or without morphological analysis. Anusaaraka, developed by the Indian Institute of Information Technology, Hyderabad, is an example of a system that uses the direct approach.
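A toy sketch of the direct (dictionary based) method (the miniature Hindi-English dictionary is invented for illustration); note how word-by-word lookup preserves the source word order, which is exactly the weakness described above:

    # Illustrative bilingual dictionary entries.
    DICTIONARY = {"main": "I", "ghar": "home", "jaata": "go", "hoon": ""}

    def direct_translate(sentence):
        # Plain word-by-word lookup: no reordering, no sense selection.
        out = [DICTIONARY.get(word, word) for word in sentence.lower().split()]
        return " ".join(w for w in out if w)

    print(direct_translate("main ghar jaata hoon"))   # -> 'I home go'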
(ii) Transfer Rules Based Machine Translation Systems: Morphological and syntactic analysis are the fundamental operations in transfer based systems. Here the source language text is converted into a less language-specific representation, and a representation at the same level of abstraction is generated for the target language with the help of grammar rules and bilingual dictionaries.
In the transfer approach, translation divergences are handled by transfer rules that transform a source language (SL) sentence into the target language (TL) by performing lexical and structural manipulations. Mantra, a transfer based tool, is a project funded by the Government of India.
(iii) Interlingual RBMT Systems (Interlingua): This model is intended to create linguistic homogeneity across the world. In this method, the source language is translated into an intermediary representation which does not depend on any language. The target language is then derived from this auxiliary form of representation.
(b) CORPUS BASED MACHINE TRANSLATION: One of the main methods of machine translation is Corpus Based Machine Translation, because this method achieves a high level of accuracy at translation time. Large volumes of translations become available once a corpus based system has been developed, and they are used in various computer-aided translation applications. The following are the different types of Corpus Based Machine Translation models.
(i) Statistical Machine Translation (SMT): Statistical models are applied in this method to create translated output with the assistance of bilingual corpora. The concept of Statistical Machine Translation comes from information theory. The important feature of this method is that no customization work is required from linguists, because the tool learns translation behaviour through statistical analysis of bilingual corpora.
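The information-theoretic view mentioned above is usually written as the noisy-channel formulation: given a source sentence f, the system chooses the target sentence e that maximizes the product of a translation model estimated from the bilingual corpus and a language model estimated from target-language text:

    \hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e)

where P(f | e) is the translation model and P(e) is the language model.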
(ii) Example Based Machine Translation (EBMT): This method, also called memory based translation, takes a set of sentences in the source language and generates the corresponding translations in the target language with point-to-point mapping. Examples are used to translate similar types of sentences: when a previously translated sentence is repeated, the same translation is likely to be correct again. The main advantages of this model are that it works well with a small data set and that output can be generated quickly once the translation program is trained. The example based method is mainly used to translate between two totally different languages, such as Japanese and English. It is not possible to apply deep linguistic analysis, which is one of the main drawbacks of an example based engine. PanEBMT is an example of an EBMT tool.
(iii) Hybrid Machine Translation (HMT): HMT takes the advantages of both RBMT and Statistical Machine Translation. It uses RBMT as a baseline and refines the rules through statistical models. Rules are used to pre-process data in an attempt to better guide the statistical engine. Hybrid models differ in various ways:
Rules Post-Processed by Statistics: A rule based tool is used for translation first. A statistical model is then applied to adjust the translated output of the rule based tool.
Statistics Guided by Rules: In this method, rules are applied to pre-process the input, which gives better guidance to the statistical tool. Rules are also used to post-process the statistical output, which normalizes it.
This method has more flexibility, power and control at translation time.
Ques 11. How do frames represent knowledge? How do they optimize semantic networks?
Ans. The frame based technique optimizes semantic networks by incorporating reusability into the language system via inheritance. Frame based representation is a development of semantic nets and allows us to express the idea of inheritance. A frame system consists of a set of frames (or nodes) which are connected together by relations. Each
frame describes either an instance or a class. Each frame has one or more slots, which are assigned slot values. This is the
way in which the frame system is built up. Rather than simply having links between frames, each relationship is expressed
by a value being placed in a slot. Example:
When we say "Tommy is a dog", we really mean "Tommy is an instance of the class Dog" or "Tommy is a member of the class Dogs". Why are frames useful? The main advantage of using frame-based systems for expert systems is that all the information about a particular object is stored in one place (this is where the optimization happens).
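A minimal sketch of a frame system in code (the Frame class and slot names are invented for illustration): each frame stores its own slots and falls back to its parent, which is exactly the inheritance described above:

    class Frame:
        """A frame: named slots plus inheritance through a parent link."""
        def __init__(self, name, parent=None, **slots):
            self.name, self.parent, self.slots = name, parent, slots

        def get(self, slot):
            # Look in this frame first, then inherit up the is-a chain.
            if slot in self.slots:
                return self.slots[slot]
            return self.parent.get(slot) if self.parent else None

    dog = Frame("Dog", legs=4, sound="bark")            # a class frame
    tommy = Frame("Tommy", parent=dog, colour="brown")  # an instance frame
    print(tommy.get("sound"))   # 'bark', inherited from the Dog frame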
Ques 12. Explain the goal of conceptual dependency. What are its components and primitive actions?
Ans. Conceptual Dependency (CD) was originally developed to represent knowledge acquired from natural language input; it was proposed by Schank. The goals of this theory are:
(i) To build representations that are independent of the specific words used in the input, so that any two sentences identical in meaning have a single representation.
(ii) To facilitate drawing inferences from sentences and deciding when two sentences mean the same thing.
Ques 13. What are the various conceptual categories, conceptual roles and conceptual syntax rules applied in conceptual dependency?
Ans. Conceptual Roles:
Conceptualization: The basic unit of the conceptual level of understanding.
Actor: The performer of an ACT.
ACT: An action done to an object.
Object: A thing that is acted upon.
Recipient: The receiver of an object as the result of an ACT.
Direction: The location that an ACT is directed toward.
State: The state that an object is in.
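As a standard worked example using these roles, the sentence "John gave Mary a book" is a past-tense ATRANS (the CD primitive for transfer of an abstract relationship, here possession):

    Actor:      John
    ACT:        ATRANS, tense p (past)
    Object:     book
    Recipient:  to Mary, from John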
Ans. A corpus is a large collection of authentic text (i.e., samples of language produced in genuine communicative situations), and corpus linguistics is any form of linguistic inquiry based on data derived from such a corpus.
A corpus can be defined as a systematic collection of naturally occurring texts (of both written and spoken language). "Systematic" means that the structure and contents of the corpus follow certain extra-linguistic principles ("sampling principles", i.e. principles on the basis of which the texts included were chosen).
For example, a corpus is often restricted to certain text types, to one or several varieties of English, and to a
certain time span. If several subcategories (e.g. several text types, varieties etc.) are represented in a corpus,
these are often represented by the same amount of text. “Systematic” also means that information on the exact
composition of the corpus is available to the researcher (including the number of words in each category and
in the whole corpus, how the texts included in the corpus were sampled etc).
The four major points of criticism leveled at the use of corpus data in linguistic research are the
following:
1. Corpora are usage data and thus of no use in studying linguistic knowledge.
2. Corpora and the data derived from them are necessarily incomplete;
3. Corpora contain only linguistic forms (represented as graphemic strings), but no information about the
semantics, pragmatics, etc. of these forms; and
4. Corpora do not contain negative evidence, i.e., they can only tell us what is possible in a given language, but
not what is not possible.
Corpus Representativeness
Representativeness is a defining feature of corpus design. The following definition, from Biber, helps us understand corpus representativeness:
According to Biber (1993), “Representativeness refers to the extent to which a sample includes the full
range of variability in a population”.
Corpus Balance: This is defined as the range of genres included in a corpus. A balanced corpus covers a wide range of text categories which are supposed to be representative of the language. We have no reliable scientific measure of balance; the best estimation and intuition have to serve in this concern.
Sampling: Another important element of corpus design is sampling. Corpus representativeness and balance are very closely associated with sampling, which is why we can say that sampling is inescapable in corpus building.
According to Biber(1993), “Some of the first considerations in constructing a corpus concern the
overall design: for example, the kinds of texts included, the number of texts, the selection of particular
texts, the selection of text samples from within texts, and the length of text samples. Each of these
involves a sampling decision, either conscious or not.”
Sampling unit − The unit from which a sample is drawn. For written text, a sampling unit may be a newspaper, a journal or a book.
Population − The assembly of all sampling units. It is defined in terms of language production, language reception or language as a product.
Corpus Size: How large should the corpus be? There is no specific answer to this question. The size of the corpus depends upon the purpose for which it is intended as well as on some practical considerations.
Tree Bank Corpus: This may be defined as a linguistically parsed text corpus that annotates syntactic or semantic sentence structure. Geoffrey Leech coined the term 'treebank', reflecting that the most common way of representing the grammatical analysis is by means of a tree structure. Generally, treebanks are created on top of a corpus which has already been annotated with part-of-speech tags.
Semantic Treebanks: These treebanks use a formal representation of a sentence's semantic structure. They vary in the depth of their semantic representation.
Syntactic Treebanks: In contrast to semantic treebanks, inputs to syntactic treebank systems are expressions of the formal language obtained from the conversion of parsed treebank data. The outputs of such systems are predicate-logic-based meaning representations.
PropBank Corpus: PropBank, more specifically called the "Proposition Bank", is a corpus annotated with verbal propositions and their arguments. The corpus is a verb-oriented resource, and the annotations here are closely tied to the syntactic level.
VerbNet (VN): VerbNet is a hierarchical, domain-independent lexical resource, the largest of its kind for English, that incorporates both semantic and syntactic information about its contents.
WordNet: WordNet, created at Princeton University, is a lexical database for the English language. It is part of the NLTK corpus collection. In WordNet, nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms called synsets. All the synsets are linked by conceptual-semantic and lexical relations.
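A brief usage sketch with NLTK's WordNet interface (assumes the corpus has been fetched once with nltk.download('wordnet')):

    from nltk.corpus import wordnet as wn

    synset = wn.synsets("dog")[0]      # first synset for 'dog'
    print(synset.name())                # e.g. 'dog.n.01'
    print(synset.definition())          # the synset's gloss
    print(synset.lemma_names())         # the cognitive synonyms it groups
    print(synset.hypernyms())           # conceptual-semantic 'is a' links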
Ques 16. How is intelligent database created for NLP system? Explain with example.
Ans. The importance of an NLIDB system is that it makes it easy for users with no programming experience to deal with a database. The user can simply use natural language to interact with the database, which is simple and easy. The user also needs no special training to use such a system (beyond perhaps some familiarity with the interface) and is not forced to learn any database language.
In general, the existing methods to interact with a database using NLP can be divided into three categories:
(1) Pattern Matching Models or Template Based Approach.
(2) Syntactic Models.
(3) Syntactic and Semantic Models.
In the first model, the entered query is processed by matching it against a predefined set of rules or patterns. The next step translates it into a logical form according to the pattern it belongs to. From these rules, the database query is directly formulated.
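A minimal sketch of the template based approach (the template, table and column names are invented for illustration); matching a pattern directly yields the database query:

    import re

    # One predefined template over an assumed employees(name, salary, ...) table.
    TEMPLATE = re.compile(r"show (?P<col>\w+) of (?P<name>\w+)", re.IGNORECASE)

    def to_sql(question):
        m = TEMPLATE.match(question)
        if not m:
            return None          # no known pattern matched the query
        # NOTE: a real system must validate/escape values to avoid SQL injection.
        return f"SELECT {m.group('col')} FROM employees WHERE name = '{m.group('name')}'"

    print(to_sql("show salary of Ravi"))
    # SELECT salary FROM employees WHERE name = 'Ravi'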
The syntactic model in general derives linguistic information from tokenizers, morphological analyzers and part-of-speech (POS) tagging. There are eight parts of speech in English grammar: verbs, nouns, pronouns, adjectives, adverbs, prepositions, conjunctions and interjections.
POS tags are not enough by themselves to convert a natural language query into SQL, so we need to add further information that we can use to understand the query. For this, the Stanford Named Entity Recognizer (NER) is used to assign the keywords already extracted from the query to the predefined categories they belong to.
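A small sketch of the POS-plus-NER step, shown here with NLTK's tagger and chunker rather than the Stanford tools named above (required NLTK data: punkt, averaged_perceptron_tagger, maxent_ne_chunker, words):

    import nltk

    tokens = nltk.word_tokenize("Show the salary of John in the Sales department")
    tagged = nltk.pos_tag(tokens)     # part-of-speech tags for each token
    print(tagged)
    print(nltk.ne_chunk(tagged))      # typically labels 'John' as PERSON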
In the second model, a constituent syntactic tree built by a syntactic parser is used, and its leaves drive the process of mapping to a database query based on predefined syntactic grammars. The process starts with syntactic analysis performed by the Stanford POS tagger. The keyword extractor then uses the information from the POS tagger to extract the keywords that are passed to the NER. The Named Entity Recognizer identifies the related domain concepts, such as person or department. For complex queries, dependency-based semantic parsing is applied. Node mapping then maps each keyword node to the corresponding SQL statement component, and the resulting SQL statement is executed against the relational database.
Ques 17. Differentiate between information retrieval and information extraction. Define precision and recall.
Ans. Information Extraction: Extraction means "pulling out" and retrieval means "getting back." Information retrieval is about returning the information that is relevant to a specific query or field of interest of the user. Information extraction is the standard process of taking data and extracting structured information from it so that it can be used for various purposes, one of which may be in a search engine.
Information Retrieval: Information retrieval refers to the human-computer interaction that happens when we use a machine to search for information objects (content) that match our query. It is all about retrieving information that is stored in a database or computer and related to the user's needs: a user's query is matched against a set of documents to find the relevant documents. There are various methods and techniques used in information retrieval; an automated IR system helps reduce information overload.
Precision: The number of documents retrieved that are relevant to the user's information need, divided by the total number of documents retrieved.
Recall: The number of documents retrieved that are relevant to the user's information need, divided by the total number of relevant documents in the whole document set.
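The two definitions above amount to the following computation (the document identifiers are illustrative):

    def precision(retrieved, relevant):
        return len(retrieved & relevant) / len(retrieved)

    def recall(retrieved, relevant):
        return len(retrieved & relevant) / len(relevant)

    retrieved = {"d1", "d2", "d3", "d4"}   # documents the system returned
    relevant  = {"d1", "d3", "d5"}         # documents the user actually needed
    print(precision(retrieved, relevant))  # 2/4 = 0.5
    print(recall(retrieved, relevant))     # 2/3 ≈ 0.67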
Various techniques used in information retrieval are:
Ques 18. (i) What are conversational agents? Give a few examples of them. Describe the architecture of general conversational software.
(ii) Mention some issues which may be observed in dialogue based agents.
Ans. Conversational agents, chatbots, dialogue systems and virtual assistants are some of the terms used in the
scientific literature to describe software-based systems which are capable of processing natural language data
to simulate a smart conversational process with humans. These conversational mechanisms are built and driven
by a wide variety of techniques of different complexity, from traditional, pre-coded algorithms to emerging
adaptive machine learning algorithms. Usually deployed as service-oriented systems, they are designed to
assist users to achieve a specific goal based on their personal needs.
The use of natural language interfaces in the field of human-computer interaction is undergoing intense study
through dedicated scientific and industrial research. The latest contributions in the field, including deep learning
approaches like recurrent neural networks, the potential of context-aware strategies and user-centred design
approaches, have brought back the attention of the community to software-based dialogue systems, generally
known as conversational agents or chat-bots.
Five distinct traditions in dialogue systems research involving communities that have largely worked
independently of one another are as mentioned below:
Text-based and Spoken Dialogue Systems.
Voice User Interfaces.
Chatbots.
Embodied Conversational Agents.
Social Robots and Situated Agents.
Ques 19. What are the advantages of natural language interfaces to databases (NLIDBs)?
Ans. (i) The following are some of the advantages of a natural language to database interface:
Provide high-level intelligent tools that provide new insights into the contents of the database by
extracting knowledge from data.
Make information available to larger numbers of people because more people can now utilize the system
due to its ease of use.
Improve the decision making process involved in using information after it has been retrieved by using
higher level information models.
Interrelate information from different sources using different media so that the information is more easily
absorbed and utilized by the user.
No Artificial Language: One advantage of NLIDBs is supposed to be that the user is not required to learn an artificial communication language. Formal query languages like SQL are difficult to learn and master, at least for non-computer-specialists.
Simple, easy to use: Consider a database with a query language or a form designed for entering queries. While an NLIDB system requires only a single input, a form-based interface may contain multiple inputs (fields, scroll boxes, combo boxes, radio buttons, etc.) depending on the capability of the form.
Tolerance of minor errors: Most NLIDB systems provide some tolerance of minor grammatical errors, whereas in a formal computer language the lexicon must usually match exactly what is defined and the syntax must correctly follow certain rules; any error causes the input to be rejected automatically by the system.
Ques 20. Explain briefly the approaches for the development of NLIDB systems. Also describe the architecture of an NLIDB system.
Ans. The main approaches to developing a natural language intelligent database system are as follows:
(i) Symbolic Approach (Rule Based Approach): Natural language processing appears to be a strongly symbolic activity. Words are symbols that stand for objects and concepts in the real world, and they are put together into sentences that obey well-specified grammar rules. Knowledge about language is explicitly encoded in rules or other forms of representation. Language is analyzed at various levels to obtain information, and rules are applied to this information to achieve linguistic functionality. As human language capabilities include rule-based reasoning, they are well supported by symbolic processing, in which rules are formed for every level of linguistic analysis.
(ii) Empirical Approach (Corpus Based Approach): Empirical approaches are based on statistical and other data-driven analysis of raw data in the form of text corpora. A corpus is a collection of machine readable text. Corpora are primarily used as a source of information about language, and a number of techniques have emerged to enable the analysis of corpus data. Syntactic analysis can be achieved on the basis of statistical probabilities estimated from a training corpus, and lexical ambiguities can be resolved by considering the likelihood of one or another interpretation on the basis of context.
(iii) Connectionist Approach (Using Neural Networks): Since human language capabilities are based on neural networks in the brain, artificial neural networks (also called connectionist networks) provide an essential starting point for modeling language processing.
Architecture of NLIDB
Most current NLIDBs first transform the natural language question into an intermediate logical query, expressed in some internal meaning representation language. The intermediate logical query expresses the meaning of the user's question in terms of high-level world concepts which are independent of the database structure. The logical query is then translated into an expression in the database's query language and evaluated against the database. The idea is to map a sentence into a logical query language first, and then further translate this logical query into a general database query language, such as SQL. In the process there can be more than one intermediate meaning representation language.
In the intermediate representation language approach, the system can be divided into two parts. One part goes from the sentence to the generation of a logical query; the other goes from the logical query to the generation of a database query. In part one, the use of logical query languages makes it possible to add reasoning capabilities to the system by embedding the reasoning inside a logic statement. In addition, because the logical query language is independent of the database, the system can be ported to different database query languages as well as to other domains, such as expert systems and operating systems.
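A highly simplified sketch of the two parts (the logical-form shape, relation and schema are invented for illustration): the sentence is first mapped to a database-independent logical query, which a separate stage then renders as SQL:

    # Part one (sketch): sentence -> intermediate logical query.
    def sentence_to_logic(sentence):
        # A real system would parse the sentence; one case is hard-wired here.
        if "capital" in sentence and "France" in sentence:
            return ("answer", "C", [("capital", "france", "C")])
        raise ValueError("unhandled sentence")

    # Part two: logical query -> database query (SQL).
    def logic_to_sql(logic):
        _, var, [(rel, arg, _)] = logic
        return f"SELECT {var.lower()} FROM {rel} WHERE country = '{arg}'"

    logic = sentence_to_logic("What is the capital of France?")
    print(logic_to_sql(logic))   # SELECT c FROM capital WHERE country = 'france'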
A semantic grammar system is very similar to a syntax based system, in that the query result is obtained by mapping the parse tree of a sentence to a database query. The basic idea of a semantic grammar system is to simplify the parse tree as much as possible, by removing unnecessary nodes or combining nodes together.