
Unit # 2

Syntactic Analysis
Concept of Grammar
Phases of Natural Language Processing
[Diagram: Phases of Natural Language Processing — Morphological / POS Analysis (stems, morphemes), Syntactic Analysis (grammar rules), Semantic Analysis (semantic rules), Pragmatic Analysis (contextual information)]
What is Syntactic Analysis
Syntactic analysis, or parsing, is the process of analyzing natural language with the rules of a
formal grammar. Grammatical rules are applied to categories and groups of words, NOT
individual words. Syntactic analysis basically assigns a syntactic structure to text.

• Use of Noun-Verb pair: A sentence consists of a subject and a predicate: every noun
phrase combines with a verb phrase in the sentence.
Example: The dog (noun phrase) went away (verb phrase)
• Adjective before Noun: Adjectives are usually placed before the noun they describe.
Example: The beautiful garden was blooming with flowers.
• Use of Articles: 'A' or 'an' is used before singular, countable nouns that are not specific; 'the' is
used before specific nouns.
Example: A cat sat on the mat. (any cat)
Example: The cat sat on the mat. (a specific cat)
• Proper Placement of Modifiers: Modifiers should be placed next to the word they modify.
Example: She drove almost six hours to get home. (not: "She almost drove six hours")
• Pronoun Antecedent Agreement: Pronouns must agree with their antecedents in number and
gender.
Example: Every student must bring his or her own pencil.
• Subject-Verb Agreement: A singular subject takes a singular verb, while a plural subject
takes a plural verb.
Example: The dog barks. (singular)
Example: The dogs bark. (plural)
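The agreement rules above can be sketched as a toy checker. This is an illustrative sketch only: the tiny lexicon and the function name `agreement_ok` are assumptions made for this example, not a general grammar of English.

```python
# Toy subject-verb agreement checker (illustrative lexicon, not a real grammar).

SINGULAR_NOUNS = {"dog", "cat", "student"}
PLURAL_NOUNS = {"dogs", "cats", "students"}
SINGULAR_VERBS = {"barks", "chases", "runs"}
PLURAL_VERBS = {"bark", "chase", "run"}

def agreement_ok(subject, verb):
    """Return True when subject and verb agree in number."""
    if subject in SINGULAR_NOUNS:
        return verb in SINGULAR_VERBS
    if subject in PLURAL_NOUNS:
        return verb in PLURAL_VERBS
    return False  # unknown word: cannot decide
```

So "The dog barks" passes the check while "The dog bark" fails, mirroring the rule stated above.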
Chomsky Hierarchy of Grammar
• The field of formal language theory (FLT), initiated by Noam Chomsky, sets a minimal
limit on descriptive adequacy.
• Chomsky's approach entirely ignores meaning, usage of expressions, frequency, context
dependence, and processing complexity in natural language.
• Chomsky's theory assumes only that patterns that are productive for short strings apply to
strings of arbitrary length in an unrestricted way.
• An expression in the sense of FLT is simply a finite string of symbols, and a
(formal) language is a set of such strings. Chomsky theory explores the mathematical and
computational properties of such sets.
• The immense success of his framework influenced not only linguistics but also theoretical
computer science and molecular biology.
• In particular, FLT deals with formal languages (= sets of strings) that are defined by a finite
set of rules, i.e. a grammar (𝒢).
• Grammar in FLT is composed of four elements :
(1) a finite vocabulary of symbols (Σ), referred to as terminals , that appear in the
strings of the language
(2) finite vocabulary of extra symbols called non-terminals (NT)
(3) a special designated non-terminal called the start symbol (S)
(4) and a finite set of rules (R).

Thus a grammar 𝒢 can be represented as a quadruple 〈Σ, NT, S, R〉
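The quadruple can be written down directly in code. The toy grammar below (generating aⁿbⁿ) and the helper `well_formed` are illustrative assumptions made for the sketch, not part of the source material.

```python
# A grammar G = (Sigma, NT, S, R) represented directly as a Python tuple.
# The specific symbols and rules are a toy example generating a^n b^n.

Sigma = {"a", "b"}                   # terminals
NT = {"S"}                           # non-terminals
start = "S"                          # start symbol
R = [("S", ["a", "S", "b"]),         # S -> a S b
     ("S", [])]                      # S -> epsilon

G = (Sigma, NT, start, R)

def well_formed(grammar):
    """Check the quadruple is internally consistent: the start symbol is a
    non-terminal and every rule uses only declared symbols."""
    sigma, nt, s, rules = grammar
    if s not in nt:
        return False
    return all(lhs in nt and all(x in sigma | nt for x in rhs)
               for lhs, rhs in rules)
```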


Chomsky Hierarchy of Grammar (contd)
• 𝒢 is said to generate a string of symbols from Σ if and only if the string can be derived
starting from S through some finite sequence of rule applications.
• The set of all strings that 𝒢 can generate is called the language of 𝒢, and is notated L(𝒢) .

Chomsky classified grammar hierarchy into four levels or categories based on their
generative power:
1. Type 0 - Unrestricted Grammars
2. Type 1 - Context-Sensitive Grammars
3. Type 2 - Context-Free Grammars
4. Type 3 - Regular Grammars
Type 0 - Unrestricted Grammar
• The productions can be of the form α → β, where α is a string of terminals and
non-terminals containing at least one non-terminal (α cannot be null), and β is any string
of terminals and non-terminals. Examples:
S → ACaB
Bc → acB
CB → DB
• Type 0 - Unrestricted Grammars are the most powerful in the Chomsky hierarchy,
capable of generating any recursively enumerable language. This level of generative
power allows for the description of highly complex languages and behaviors, and
encompasses all other grammar types within the hierarchy.
• The sky's truly the limit here.
Chomsky Hierarchy of Grammar (contd)
• Type 0 grammars are not typically used in natural language processing (NLP) due to
their computational complexity and lack of constraints.
Example: alongside "The cat chases the mouse.", an unconstrained grammar could equally
generate "Chases mouse the cat." or even "The the mouse cat chases."

Type 1 - Context-Sensitive Grammar


• The productions are of the form α → β, with the condition that len(α) <= len(β)
• Type-1 grammars in the Chomsky hierarchy are more restrictive than Type-0 grammars but
less restrictive than Type-2 (context-free grammars) and Type-3 (regular grammars).
• Rules typically take the form αAβ → αγβ: the context around the non-terminal A
(represented by α and β) dictates how A may be rewritten, making the grammar
context-sensitive.
• In English, this translates to rules such as agreement in number between subjects and
verbs. The agreement rule ensures that a singular subject matches a singular verb form and
a plural subject a plural verb form, which can only be decided within the context of the
surrounding words.
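The defining length condition on Type-1 rules can be stated in a few lines of code. The function name and the list-of-symbols encoding of productions are assumptions made for this sketch.

```python
# Sketch: the non-contracting condition of Type-1 rules alpha -> beta,
# len(alpha) <= len(beta), with productions given as lists of symbols.

def is_context_sensitive_rule(alpha, beta):
    """A production is admissible in a Type-1 grammar only if the right
    side is at least as long as the left side."""
    return len(alpha) <= len(beta)
```

For instance, CB → DB satisfies the condition, while a shrinking rule such as AB → a does not.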
Chomsky Hierarchy of Grammar (contd)
• So, in short, the previous example now has a rule to follow.
Previous Example : The cat chases the mouse.
Singular Subject with Singular Verb:
"The cat chases the mouse."
"Cat" is a singular noun, so the verb "chases" is also in the singular form.
Plural Subject with Plural Verb:
"The cats chase the mouse."
"Cats" is a plural noun, so the verb "chase" is in the plural form, without the 's'
at the end.

This rule must be followed to construct grammatically correct sentences in a Chomsky
Type-1 context-sensitive grammar.

To describe the grammar associated with this example , we have a set of production rules.
These rules explain how sentences in the language are constructed from words and phrases.
Chomsky Hierarchy of Grammar (contd)
Some more rules as illustration – English grammar (for other languages …)
Pronoun Antecedent Agreement:
Rule: Pronouns must agree in number and gender with their antecedents.
Every student must bring his or her pencil.

Use of articles :
Rule: The definite article 'the' is used before a noun that is specific or known to the
listener, while 'a' or 'an' is used for non-specific nouns in the singular form.
She wants an apple from the basket.

Subjunctive Mood:
Rule: The subjunctive mood is used for wishes, hypotheticals, or actions that are
contrary to fact.
If I were you, I would not do that.

• These rules illustrate how the context surrounding words or phrases can dictate the
appropriate grammatical forms to use, which is a hallmark of context-sensitive (Type-
1) grammars.
• Starting from a string β in question, there are only finitely many ways in which rules can
be applied backward to it, so membership in the language can be decided.
Chomsky Hierarchy of Grammar (contd)
Type 2 - Context-free Grammar
Chomsky Type-2 Grammar, also known as context-free grammar (CFG), is a formal
grammar in which every production rule is of the form α → β where α is a single non-
terminal symbol, and β is a string of terminals and/or non-terminals (β can be empty). The
productions need NOT follow the condition that len(α) <= len(β).
- For example, the language of strings with an equal number of a's and b's, in any order, is
context-free; it is generated by productions such as:
S → aSb
S → bSa
S → SS
S → ε
- Further, it follows a hierarchical structure, i.e. it consists of a set of production rules that
can be applied recursively to generate a tree structure.
The hierarchical structure refers to the way sentences can be broken down into smaller parts,
and those parts can be broken down further, following the CFG rules. This leads to the
creation of a parse tree, which visually represents the breakdown of a sentence into its
grammatical parts.
In a parse tree for a context-free grammar:
The root node is typically the start symbol (often S for sentence).
The leaf nodes are terminal symbols, which correspond to the words of the sentence.
The interior nodes are non-terminal symbols, representing the syntactic categories (like noun
phrases, verb phrases, etc.).
Chomsky Hierarchy of Grammar (contd)
For the sentence "The cat chases the mouse.", we define context-free rules as follows:
S → NPsingular VPsingular
NPsingular → Det Nsingular
VPsingular → Vsingular NP

1. Start with the Sentence (S):
The initial rule identifies the sentence structure: S → NP VP

2. Expand the Noun Phrase (NP) for the Subject:
Here, we expand the noun phrase to include a determiner (Det) and a singular noun
(N_singular): NP → Det Nsingular
"The cat": NP → [The][cat]

Parse tree:
S
├── NP
│   ├── Det
│   │   └── The
│   └── N
│       └── cat
└── VP
    ├── V
    │   └── chases
    └── NP
        ├── Det
        │   └── the
        └── N
            └── mouse

The tree shows the hierarchical structure of the sentence. The sentence is divided into a noun
phrase and a verb phrase. The noun phrase NP consists of a determiner Det ("The") and a noun N
("cat"), which together refer to the subject of the sentence. The verb phrase VP consists of a verb
V ("chases") and a noun phrase NP, which is the object of the sentence. This object NP is again
made up of a determiner "The" and a noun "mouse".
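As a sketch, the rules above can be run directly with a few lines of recursive code. The dictionary encoding of rules and the nested-list tree format are illustrative choices for this example, not a standard API.

```python
# Minimal CFG parse of "The cat chases the mouse" by recursive expansion
# of the rules given above. Trees are nested lists: [symbol, child, ...].

RULES = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"]],
    "Det": [["the"], ["The"]],
    "N":   [["cat"], ["mouse"]],
    "V":   [["chases"]],
}

def parse(symbol, words, i):
    """Try to expand `symbol` at position i; return (tree, next_i) or None."""
    for rhs in RULES.get(symbol, []):
        children, j, ok = [], i, True
        for sym in rhs:
            if sym in RULES:                           # non-terminal: recurse
                sub = parse(sym, words, j)
                if sub is None:
                    ok = False
                    break
                tree, j = sub
                children.append(tree)
            elif j < len(words) and words[j] == sym:   # terminal: match word
                children.append(sym)
                j += 1
            else:
                ok = False
                break
        if ok:
            return [symbol] + children, j
    return None

tree, end = parse("S", "The cat chases the mouse".split(), 0)
```

The returned `tree` reproduces the hierarchy shown above: S splits into an NP (Det + N) and a VP (V + object NP).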
Chomsky Hierarchy of Grammar (contd)
Type 3 - Regular Grammar
• Chomsky's Type-3 Grammar, also known as Regular Grammar, is the simplest type of
grammar in the Chomsky hierarchy.

• The production rules in a Type-3 grammar are restricted to a single non-terminal on the
left side and, on the right side, either a single terminal or a terminal followed by a
non-terminal:
α → β or α → βY
where α, Y ∈ N (non-terminals) and β ∈ T (a terminal)
Example strings: "He talks", "She runs"; "Quickly", "Happily"; "Unhappy", "Happiness"

• Type-3 grammars are suitable for describing the simplest syntactic structures, those that
involve direct adjacency and do not require nesting or recursion.
• It does not allow hierarchical structure or much nesting or recursion, unlike context-free
grammars.
• For each regular grammar 𝒢, it is possible to construct an algorithm (a finite-state
automaton, FSA) that reads a string from left to right and outputs ‘yes’ if the string
belongs to L(𝒢), and ‘no’ otherwise.
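The grammar-to-FSA correspondence can be sketched concretely. The regular grammar below (S → aS | bA, A → bA | ε, generating strings of a's followed by at least one b) and its transition table are an assumed example for illustration.

```python
# A finite-state acceptor equivalent to the regular grammar
#   S -> aS | bA,  A -> bA | epsilon
# i.e. the language a* b+ (any a's followed by at least one b).

TRANSITIONS = {
    ("S", "a"): "S",   # S -> aS
    ("S", "b"): "A",   # S -> bA
    ("A", "b"): "A",   # A -> bA
}
ACCEPTING = {"A"}      # A -> epsilon makes A an accepting state

def accepts(string):
    """Read the string left to right; answer 'yes'/'no' for membership."""
    state = "S"
    for ch in string:
        state = TRANSITIONS.get((state, ch))
        if state is None:          # no transition: reject immediately
            return "no"
    return "yes" if state in ACCEPTING else "no"
```

Each non-terminal becomes a state and each rule β Y becomes a transition, which is exactly why regular grammars and finite-state automata have the same power.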
Conclude Chomsky Grammar
• A simple sentence is built up in a hierarchical fashion from smaller parts to the complete
sentence.
• This hierarchical structure is critical for understanding the syntactic function of each word
and phrase within a sentence.
• It allows the analysis and generation of syntactically correct sentences in natural language
processing.

Parse tree for "The quick brown fox jumps over the lazy dog":
S
├── NP
│   ├── Det
│   │   └── The
│   ├── Adj
│   │   └── quick
│   ├── Adj
│   │   └── brown
│   └── N
│       └── fox
└── VP
    ├── V
    │   └── jumps
    ├── P
    │   └── over
    └── NP
        ├── Det
        │   └── the
        ├── Adj
        │   └── lazy
        └── N
            └── dog

3. Where are Natural Languages located?
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3367686/
[Diagram: phases of a compiler for comparison — Lexical Analysis, Syntax Analysis, Semantic Analysis, Code Optimisation, Code Generation]

Concept of Parsing
Parsing in NLP
• Parsing, in basic terms, can be described as breaking down a sentence into its constituent
words in order to find the grammatical type of each word, or alternatively, decomposing
an input into more easily processed components.
• Every natural language has its own grammar rules according to which sentences are
formed. Parsing is used to find the sequence of rules applied to generate a sentence in
that particular language.
• The basic connection between a sentence and the grammar is derived from the parse tree.
Natural language processing provides us with two basic parsing techniques, viz. top-down
and bottom-up. Their names describe the direction in which the parsing process advances.

Top-Down parsing
• The process involves predicting the structure of a sentence from the start symbol of the
grammar down to the terminals, which correspond to the words in the sentence.
• The start symbol S represents the most general concept, typically a sentence in natural
language grammars.
• The algorithm starts from the top of the tree, i.e. S, by looking at the grammar rules with S
on the left-hand side, so that all the possible trees are generated.
Top-Down parsing
• The algorithm proceeds by substituting the start symbol with one of its possible
expansions (productions). This prediction is guided by the grammar rules, which define
how symbols can be replaced or expanded.
• The process is recursive; for each non-terminal symbol encountered, the parser selects a
production rule to expand it further, moving towards the terminal symbols.
• This expansion continues until the parser reaches the terminal symbols, which are the
actual words or tokens of the input sentence.
• If the parser selects a production that doesn't lead to a successful match with the input
sentence, it may need to backtrack. Backtracking involves going back up the parse tree to
a previous decision point and trying a different production rule.
• This can be computationally expensive in cases where many backtracks are necessary.
• The goal of top-down parsing is to construct a parse tree that represents the syntactic
structure of the input sentence according to the grammar. If the entire input sentence is
successfully matched against the productions of the grammar, the sentence is considered
syntactically valid.
• Top-down parsing can be implemented in various forms,
o The simplest being a Recursive Descent Parser
o Predictive Parser
Top-Down parsing (contd)
Recursive Descent Parser
• Recursive descent parsing is one of the most straightforward forms of parsing.
• This parser checks the syntax of the input stream of text by reading it from left to right
(hence, it is also known as a left-to-right parser).
• The parser first reads a token from the input stream and then verifies it by matching it
against the grammar's terminals. If the token matches, it is accepted; else it is rejected.
• Recursive descent parsers are straightforward to implement and can handle a wide range
of grammars, including those that are not context-free.
• Since the grammar in the parser is manually coded, it can include sophisticated error
reporting and recovery mechanisms. Consider the expression grammar:
expression ::= term (('+' | '-') term)*
term ::= factor (('*' | '/') factor)*
factor ::= NUMBER | '(' expression ')'
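A minimal recursive descent parser for this expression grammar can be sketched as below; each grammar rule becomes one method. The tokenizer and the choice to evaluate the expression (rather than build a tree) are simplifications made for this sketch.

```python
# Recursive descent parser for:
#   expression ::= term (('+' | '-') term)*
#   term       ::= factor (('*' | '/') factor)*
#   factor     ::= NUMBER | '(' expression ')'
# For brevity it evaluates integer expressions instead of building a tree.
import re

def tokenize(text):
    return re.findall(r"\d+|[()+\-*/]", text)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, tok):
        assert self.peek() == tok, f"expected {tok}, got {self.peek()}"
        self.pos += 1

    def expression(self):            # expression ::= term (('+'|'-') term)*
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.peek(); self.pos += 1
            value = value + self.term() if op == "+" else value - self.term()
        return value

    def term(self):                  # term ::= factor (('*'|'/') factor)*
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.peek(); self.pos += 1
            value = value * self.factor() if op == "*" else value / self.factor()
        return value

    def factor(self):                # factor ::= NUMBER | '(' expression ')'
        if self.peek() == "(":
            self.eat("(")
            value = self.expression()
            self.eat(")")
            return value
        tok = self.peek(); self.pos += 1
        return int(tok)
```

Note how the grammar's layering of expression over term over factor makes `2+3*4` parse as `2+(3*4)`: operator precedence falls out of the rule structure itself.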
Predictive Parser
• The Predictive Parser is a type of top-down parser that is specifically designed to work
with a class of grammars known as LL grammars, where the first "L" stands for scanning
the input from left to right, and the second "L" for producing a leftmost derivation.
Top-Down parsing (contd)
Grammar Rule:
The basic sentence is understood in terms of a noun phrase NP and a verb phrase VP.
Other rules, let us say, are stated below:

S -> NP VP                # S indicates the entire sentence
VP -> V NP                # VP is the verb phrase
V -> "eats" | "drinks"    # V is a verb
NP -> Det N               # NP is the noun phrase (chunk that has a noun in it)
Det -> "a" | "an" | "the" # Det is a determiner used in the sentences
N -> "president" | "Obama" | "apple" | "coke"   # N: some example nouns

• President eats apple


• Obama drinks coke
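A top-down recognizer for this toy grammar can be sketched as below. Note one assumption: for bare subjects like "President" or "Obama" to parse, NP needs a bare-noun alternative, so NP -> N is added here alongside NP -> Det N; the grammar is also lowercased for simplicity.

```python
# Top-down (recursive descent) recognition over the toy grammar above.
# Assumption: NP -> N is added so bare nouns can head a noun phrase.

GRAMMAR = {
    "S":   [["NP", "VP"]],
    "VP":  [["V", "NP"]],
    "V":   [["eats"], ["drinks"]],
    "NP":  [["Det", "N"], ["N"]],
    "Det": [["a"], ["an"], ["the"]],
    "N":   [["president"], ["obama"], ["apple"], ["coke"]],
}

def derives(symbol, words, i):
    """Return the input position reached if `symbol` derives a prefix of
    words[i:], else None. Tries each production in order (backtracking
    over alternatives, as a top-down parser does)."""
    if symbol not in GRAMMAR:                  # terminal: must match the word
        return i + 1 if i < len(words) and words[i] == symbol else None
    for rhs in GRAMMAR[symbol]:
        j = i
        for sym in rhs:
            j = derives(sym, words, j)
            if j is None:
                break
        if j is not None:
            return j
    return None

def parses(sentence):
    words = sentence.lower().split()
    return derives("S", words, 0) == len(words)
```

With this, both example sentences above are accepted, and a scrambled order such as "eats president apple" is rejected.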
Bottom-Up parsing
• The bottom-up parsing approach follows a leaves-to-root technique: construction of the
parse tree starts from the leaf nodes. First the leaf nodes are formed, then generation
proceeds upward by creating parent nodes, until finally the root node is generated.
• The bottom up parser begins with the input sentence, treating each word as a basic unit or
leaf node in the parse tree.
• It then looks for sequences of nodes that match the right-hand side of a grammar rule.
When it finds such a match, it reduces it, effectively constructing a higher-level node in
the parse tree.
• This process of matching and replacing continues iteratively, building up the tree from the
leaves (input symbols) towards the root (the start symbol).
• The parsing is successful if the entire input can be reduced to the start symbol of the
grammar, indicating that the sentence conforms to the specified grammar.
• The most common types of bottom-up parser are
o Shift-reduce Parser
o LR Parser
Bottom-Up parsing (contd)
Shift-reduce Parser
• A shift-reduce parser is a sort of bottom-up parser that starts with the input and builds a
parse tree by performing a series of shift (transfer data to the stack) and reduction (apply
grammar rules) operations.
S -> NP VP
VP -> V NP
V -> "eats" | "drinks"
NP -> Det N
Det -> "a" | "an" | "the"
N -> "president" |"Obama" |"apple"| "coke"

sentence = Obama eats an apple

• Initially, the parser shifts each word of the sentence onto a stack, one word at a time,
starting from "Obama".
• When the items on the stack match the right side of a grammar rule, the parser reduces
those items into a single item based on the rule. For example, after shifting "Obama", it
matches the rule N -> 'Obama', so "Obama" is reduced to N.
Bottom-Up parsing (contd)
Shift "Obama" onto the stack. (Stack: [Obama])
Reduce "Obama" to N using the rule N -> 'Obama'. (Stack: [N])
Reduce N to NP using a bare-noun rule NP -> N. (Stack: [NP])
Shift "eats" onto the stack. (Stack: [NP, eats])
Reduce "eats" to V using the rule V -> 'eats'. (Stack: [NP, V])
Shift "an" onto the stack. (Stack: [NP, V, an])
Reduce "an" to Det using the rule Det -> 'an'. (Stack: [NP, V, Det])
Shift "apple" onto the stack. (Stack: [NP, V, Det, apple])
Reduce "apple" to N using the rule N -> 'apple'. (Stack: [NP, V, Det, N])
Reduce Det N to NP using the rule NP -> Det N. (Stack: [NP, V, NP])
Reduce V NP to VP using the rule VP -> V NP. (Stack: [NP, VP])
Reduce NP VP to S using the rule S -> NP VP. (Stack: [S])
(Note: parsing "Obama" as a bare noun phrase requires adding NP -> N to the grammar above.)
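A shift-reduce parser of this kind can be sketched in a few lines. Assumptions made for the sketch: NP -> N is added for bare nouns like "Obama", the lexicon maps words straight to their POS category, and the parser greedily reduces whenever the top of the stack matches a rule (trying NP -> Det N before NP -> N, since rule order matters for a greedy strategy).

```python
# Shift-reduce recognition for the toy grammar above (plus NP -> N).
# At each step: reduce the top of the stack if it matches a rule's
# right-hand side; otherwise shift the next word (as its POS category).

LEXICON = {"eats": "V", "drinks": "V", "a": "Det", "an": "Det",
           "the": "Det", "president": "N", "Obama": "N",
           "apple": "N", "coke": "N"}
RULES = [(("Det", "N"), "NP"),    # NP -> Det N  (tried before NP -> N)
         (("N",), "NP"),          # NP -> N      (assumed bare-noun rule)
         (("V", "NP"), "VP"),     # VP -> V NP
         (("NP", "VP"), "S")]     # S -> NP VP

def shift_reduce(sentence):
    """Return 'S' if the sentence reduces to the start symbol, else None."""
    words, stack = sentence.split(), []
    while words or len(stack) > 1 or (stack and stack[0] != "S"):
        for rhs, lhs in RULES:                    # try to reduce first
            if tuple(stack[-len(rhs):]) == rhs:
                stack[-len(rhs):] = [lhs]
                break
        else:                                     # no reduction applied
            if not words:
                return None                       # stuck: no parse found
            stack.append(LEXICON.get(words.pop(0)))  # shift next word's POS
    return stack[0]
```

Running it on "Obama eats an apple" reproduces the shift/reduce trace above step by step; an ill-formed input like "eats Obama" gets stuck and returns None.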
Bottom-Up parsing (contd)
• This process continues with shifting and reducing according to the rules defined in the
grammar until the entire sentence is reduced to the start symbol (S), indicating
successful parsing.
• A shift-reduce parser might not always find a parse for a sentence, especially if the
grammar is ambiguous or does not cover the sentence structure. In such cases, we need to
adjust the grammar.

• Apple eats coke


• President drinks Obama

When it comes to a syntactic parser, there is a chance that a syntactically well-formed
sentence could be meaningless. To get to the meaning, we need a deeper understanding of
the semantic structure of the sentence.
Thanks
