Syntactic Analysis
Concept of Grammar
Phases of Natural Language Processing
(Figure: the phases surrounding Natural Language Processing)
• Morphological Analysis – stems, morphemes, POS tags
• Syntactic Analysis – grammar rules
• Semantic Analysis – semantic rules
• Pragmatic Analysis – contextual information
What is Syntactic Analysis
Syntactic analysis, or parsing, is the process of analyzing natural language using the rules of a
formal grammar. Grammatical rules are applied to categories and groups of words, not to
individual words. Syntactic analysis essentially assigns a syntactic structure to text.
• Use of Noun-Verb pair: A sentence includes a subject and a predicate. We combine every
noun phrase with a verb phrase in the sentence.
Example: The dog (noun phrase) went away (verb phrase)
• Adjective before Noun: Adjectives are usually placed before the noun they describe.
Example: The beautiful garden was blooming with flowers.
• Use of Articles: 'A' or 'an' is used before singular, countable nouns that are not specific; 'the'
is used before specific nouns.
Example: A cat sat on the mat. (any cat)
Example: The cat sat on the mat. (a specific cat)
• Proper Placement of Modifiers: Modifiers should be placed next to the word they modify.
Example: She drove for almost six hours to get home. (Compare "She almost drove six
hours," where the misplaced modifier changes the meaning.)
• Pronoun Antecedent Agreement: Pronouns must agree with their antecedents in number and
gender.
Example: Every student must bring his or her own pencil.
• Subject-Verb Agreement: A singular subject takes a singular verb, while a plural subject
takes a plural verb.
Example: The dog barks. (singular)
Example: The dogs bark. (plural)
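The subject-verb agreement rule above can be mechanized in a crude way. A minimal Python sketch (the suffix heuristics and word handling are my own toy assumptions, not a real morphological analyzer):

```python
# Toy subject-verb agreement checker (illustrative only; real NLP
# uses a POS tagger and morphological analysis, not suffix tricks).
def agrees(subject: str, verb: str) -> bool:
    """Naive number-agreement check for regular English nouns/verbs.

    Assumes a regular plural: subjects ending in 's' are plural, and
    third-person-singular verbs end in 's' ("barks" vs "bark").
    """
    subject_plural = subject.endswith("s")
    verb_singular = verb.endswith("s")
    # singular subject -> singular verb; plural subject -> plural verb
    return subject_plural != verb_singular

print(agrees("dog", "barks"))   # singular subject, singular verb -> True
print(agrees("dogs", "bark"))   # plural subject, plural verb -> True
print(agrees("dog", "bark"))    # mismatch -> False
```

This only handles regular forms; irregular plurals ("mice", "children") immediately break it, which is one reason such rules are stated over grammatical categories rather than surface strings.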
Chomsky Hierarchy of Grammar
• The field of formal language theory (FLT), initiated by Noam Chomsky, sets a minimal
limit on descriptive adequacy.
• Chomsky's approach entirely ignores meaning, usage of expressions, frequency, context
dependence, and processing complexity of natural language.
• The theory assumes only that patterns which are productive for short strings apply to
strings of arbitrary length in an unrestricted way.
• An expression in the sense of FLT is simply a finite string of symbols, and a
(formal) language is a set of such strings. Chomsky's theory explores the mathematical and
computational properties of such sets.
• The immense success of this framework has influenced not only linguistics but also
theoretical computer science and molecular biology.
• In particular, FLT deals with formal languages (= sets of strings) defined by a finite
set of rules – a grammar (𝒢).
• A grammar in FLT is composed of four elements:
(1) a finite vocabulary of symbols (Σ), referred to as terminals, that appear in the
strings of the language;
(2) a finite vocabulary of extra symbols called non-terminals (NT);
(3) a special designated non-terminal called the start symbol (S);
(4) a finite set of rules (R).
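The four elements above can be written down directly as data. A minimal Python sketch (the representation and the toy rules are my own, using the "The cat chases the mouse" example discussed later in these notes):

```python
# A grammar G = (Sigma, NT, S, R) as a plain data structure.
grammar = {
    "terminals": {"The", "the", "cat", "chases", "mouse"},  # Sigma
    "non_terminals": {"S", "NP", "VP", "Det", "N", "V"},    # NT
    "start": "S",                                           # S
    "rules": [                                              # R
        ("S",   ["NP", "VP"]),
        ("NP",  ["Det", "N"]),
        ("VP",  ["V", "NP"]),
        ("Det", ["The"]),
        ("Det", ["the"]),
        ("N",   ["cat"]),
        ("N",   ["mouse"]),
        ("V",   ["chases"]),
    ],
}

# Sanity checks: the start symbol is a non-terminal, and every rule
# rewrites a non-terminal (which makes this grammar context-free).
assert grammar["start"] in grammar["non_terminals"]
assert all(lhs in grammar["non_terminals"] for lhs, _ in grammar["rules"])
```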
Chomsky classified grammar hierarchy into four levels or categories based on their
generative power:
1. Type 0 - Unrestricted Grammars
2. Type 1 - Context-Sensitive Grammars
3. Type 2 - Context-Free Grammars
4. Type 3 - Regular Grammars
Type 0 - Unrestricted Grammar
• The productions can be of the form α → β, where α is a string of terminals and non-
terminals containing at least one non-terminal (α cannot be null), and β is any string of
terminals and non-terminals. Examples:
S → ACaB
Bc → acB
CB → DB
• Type 0 - Unrestricted Grammars are the most powerful in the Chomsky hierarchy and are
capable of generating any recursively enumerable language. This level of generative
power allows for the description of highly complex languages and behaviors, and
encompasses all other grammar types within the hierarchy.
• The sky's truly the limit here.
Chomsky Hierarchy of Grammar (contd)
• Type 0 grammars are not typically used in natural language processing (NLP) due to
their computational complexity and lack of constraints.
Example: The cat chases the mouse.
An unrestricted grammar imposes no constraint that would rule out
"Chases mouse the cat." or even "The the mouse cat chases."
Type 1 - Context-Sensitive Grammar
Thus rules must be imposed to construct grammatically correct sentences, as in Chomsky's
Type-1 Context-Sensitive Grammar.
To describe the grammar associated with this example, we have a set of production rules.
These rules explain how sentences in the language are constructed from words and phrases.
Chomsky Hierarchy of Grammar (contd)
Some more rules as illustration – English grammar (analogous rules exist for other languages):
Pronoun Antecedent Agreement:
Rule: Pronouns must agree in number and gender with their antecedents.
Every student must bring his or her pencil.
Use of articles :
Rule: The definite article 'the' is used before a noun that is specific or known to the
listener, while 'a' or 'an' is used for non-specific nouns in the singular form.
She wants an apple from the basket.
• These rules illustrate how the context surrounding words or phrases can dictate the
appropriate grammatical forms to use, which is a hallmark of context-sensitive (Type-
1) grammars.
• Starting from a string in question, β, there are only finitely many ways in which rules can
be applied backward to it; this is why membership in a context-sensitive language is
decidable.
Chomsky Hierarchy of Grammar (contd)
Type 2 - Context-free Grammar
Chomsky Type-2 Grammar, also known as context-free grammar (CFG), is a formal
grammar in which every production rule is of the form α → β, where α is a single non-
terminal symbol, and β is a string of terminals and/or non-terminals (β can be empty). The
productions need NOT satisfy the condition len(α) <= len(β).
- For example, the language of all strings with an equal number of 'a's and 'b's, in any
order, is context-free; it is generated by rules such as
S → aSbS | bSaS | ε
- Further, a CFG follows a hierarchical structure, i.e. it consists of a set of production rules
that can be applied recursively and can generate a tree structure.
The hierarchical structure refers to the way sentences can be broken down into smaller parts,
and those parts can be broken down further, following the CFG rules. This leads to the
creation of a parse tree, which visually represents the breakdown of a sentence into its
grammatical parts.
In a parse tree for a context-free grammar:
The root node is typically the start symbol (often S for sentence).
The leaf nodes are terminal symbols, which correspond to the words of the sentence.
The interior nodes are non-terminal symbols, representing the syntactic categories (like noun
phrases, verb phrases, etc.).
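Such a parse tree can be represented as a nested structure whose leaves, read left to right, recover the original sentence. A small Python sketch (the tuple representation is my own choice):

```python
# A parse tree as nested tuples: (label, child, child, ...) for
# interior nodes (non-terminals), plain strings for leaves (words).
tree = ("S",
        ("NP", ("Det", "The"), ("N", "cat")),
        ("VP", ("V", "chases"),
               ("NP", ("Det", "the"), ("N", "mouse"))))

def leaves(node):
    """Collect the terminal symbols left to right: the sentence."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:       # node[0] is the label, skip it
        words.extend(leaves(child))
    return words

print(" ".join(leaves(tree)))    # The cat chases the mouse
```

The root is the start symbol S, the interior labels are the syntactic categories, and the leaf walk reproduces the sentence, exactly as described above.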
Chomsky Hierarchy of Grammar (contd)
For the sentence "The cat chases the mouse.", we define context-free rules as follows:
S → NPsingular VPsingular
NPsingular → Det Nsingular
VPsingular → Vsingular NP

1. Start with the Sentence (S):
The initial rule identifies the sentence structure:
S → NP VP

2. Expand the Noun Phrase (NP) for the Subject:
Here, we expand the noun phrase to include a
determiner (Det) and a singular noun (N_singular):
NP → Det Nsingular
"The cat": NP → [The][cat]

The resulting parse tree:

          S
        /   \
      NP     VP
     /  \   /  \
   Det   N V    NP
    |    | |   /  \
  The  cat chases Det   N
                   |    |
                  The  mouse

The tree shows the hierarchical structure of the sentence. The sentence is divided into a noun
phrase and a verb phrase. The noun phrase NP consists of a determiner Det ("The") and a noun N
("cat"), which together refer to the subject of the sentence. The verb phrase VP consists of a verb
V ("chases") and a noun phrase NP, which is the object of the sentence. This object NP is again
made up of a determiner "The" and a noun "mouse".
Chomsky Hierarchy of Grammar (contd)
Type 3 - Regular Grammar
• Chomsky's Type-3 Grammar, also known as Regular Grammar, is the simplest type of
grammar in the Chomsky hierarchy.
• The production rules in a Type-3 grammar are restricted to a single non-terminal on the
left side and on the right side either a single terminal or a terminal followed by a non-
terminal.
α → β or α → βY
where α, Y ∈ NT (non-terminals) and β ∈ T (a terminal)
Examples: "He talks", "She runs"; "Quickly", "Happily"; "Unhappy", "Happiness"
• Type-3 grammars are suitable for describing the simplest syntactic structures, those that
involve direct adjacency and do not require nesting or recursion.
• Unlike context-free grammars, they do not allow hierarchical structure or any significant
nesting or recursion.
• For each regular grammar 𝒢, it is possible to construct an algorithm (a finite-state
automaton, FSA) that reads a string from left to right and outputs 'yes' if the string
belongs to L(𝒢), and 'no' otherwise.
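Such a left-to-right yes/no algorithm is exactly a finite-state automaton. A toy Python sketch for the regular grammar S → aS | b (an example of my own, not from the notes), whose language is a*b:

```python
# FSA for the regular grammar S -> a S | b (the language a*b).
# Each non-terminal acts as a state; 'F' is the accepting state
# reached when the terminal-only rule S -> b fires.
def accepts(string: str) -> bool:
    transitions = {("S", "a"): "S",   # rule S -> a S: stay in S
                   ("S", "b"): "F"}   # rule S -> b: finish
    state = "S"                       # start symbol = start state
    for symbol in string:
        state = transitions.get((state, symbol))
        if state is None:             # no applicable rule: reject
            return False
    return state == "F"               # 'yes' iff we ended on S -> b

print(accepts("aaab"))   # True: matches a*b
print(accepts("abab"))   # False: symbols after the final b
```

The automaton reads the string strictly left to right with no stack, which is why regular grammars cannot express the nested structures that context-free grammars can.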
Conclude Chomsky Grammar
• A simple sentence is built up in a hierarchical fashion, from smaller parts to the
complete sentence.
• This hierarchical structure is critical for understanding the syntactic function of each
word and phrase within a sentence.
• It allows the analysis and generation of syntactically correct sentences in natural
language processing.

Parse tree for "The quick brown fox jumps over the lazy dog":

S
├── NP
│   ├── Det
│   │   └── The
│   ├── Adj
│   │   └── quick
│   ├── Adj
│   │   └── brown
│   └── N
│       └── fox
└── VP
    ├── V
    │   └── jumps
    ├── P
    │   └── over
    └── NP
        ├── Det
        │   └── the
        ├── Adj
        │   └── lazy
        └── N
            └── dog

3. Where are Natural Languages located?
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3367686/
(Figure: compiler phases – Lexical Analysis, Syntax Analysis, Semantic Analysis,
Code Optimisation, Code Generation)
Concept of Parsing
Parsing in NLP
• Parsing, in basic terms, can be described as breaking down a sentence into its constituent
words in order to find the grammatical type of each word, or, alternatively, decomposing
an input into more easily processed components.
• Every natural language has its own grammar rules according to which sentences are
formed. Parsing is used to find the sequence of rules applied to generate a sentence in
that particular language.
• The basic connection between a sentence and the grammar is derived from the parse tree.
Natural language processing provides two basic parsing techniques, viz. Top-Down
and Bottom-Up. Their names describe the direction in which the parsing process advances.
Top-Down parsing
• The process involves predicting the structure of a sentence from the start symbol of the
grammar down to the terminals, which correspond to the words in the sentence.
• The start symbol S represents the most general concept, typically a sentence in natural
language grammars.
• The algorithm starts from the top of the tree, i.e. S, by looking at the grammar rules with
S on the left-hand side, so that all the possible trees are generated.
Top-Down parsing (contd)
• The algorithm proceeds by substituting the start symbol with one of its possible
expansions (productions). This prediction is guided by the grammar rules, which define
how symbols can be replaced or expanded.
• The process is recursive; for each non-terminal symbol encountered, the parser selects a
production rule to expand it further, moving towards the terminal symbols.
• This expansion continues until the parser reaches the terminal symbols, which are the
actual words or tokens of the input sentence.
• If the parser selects a production that doesn't lead to a successful match with the input
sentence, it may need to backtrack. Backtracking involves going back up the parse tree to
a previous decision point and trying a different production rule.
• This can be computationally expensive in cases where many backtracks are necessary.
• The goal of top-down parsing is to construct a parse tree that represents the syntactic
structure of the input sentence according to the grammar. If the entire input sentence is
successfully matched against the productions of the grammar, the sentence is considered
syntactically valid.
• Top-down parsing can be implemented in various forms,
o The simplest being a Recursive Descent Parser
o Predictive Parser
Top-Down parsing (contd)
Recursive Descent Parser
• Recursive descent parsing is one of the most straightforward forms of parsing.
• This parser checks the syntax of the input stream of text by reading it from left to right
(hence it is sometimes called a left-to-right parser).
• The parser first reads a token from the input stream and then verifies (matches) it against
the grammar's terminals. If the token is verified, it is accepted; otherwise it is rejected.
• Recursive descent parsers are straightforward to implement and can handle a wide range
of grammars, including, with hand-coded checks, some that are not context-free.
• Since the grammar in the parser is manually coded, it can include sophisticated error
reporting and recovery mechanisms. Consider:
expression ::= term (('+' | '-') term)*
term       ::= factor (('*' | '/') factor)*
factor     ::= NUMBER | '(' expression ')'
(Recursive descent is also commonly used for markup, e.g. recognizing HTML tags
such as <h1>, <b>, <head>, <html>, <img>.)
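The expression grammar above maps directly onto one method per non-terminal. A minimal recursive descent sketch in Python (it evaluates while parsing and has no error recovery, so malformed input is not handled gracefully):

```python
import re

# Recursive descent parser for:
#   expression ::= term (('+' | '-') term)*
#   term       ::= factor (('*' | '/') factor)*
#   factor     ::= NUMBER | '(' expression ')'
def tokenize(text):
    return re.findall(r"\d+|[+\-*/()]", text)

class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def expression(self):                     # term (('+'|'-') term)*
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.eat()
            value = value + self.term() if op == "+" else value - self.term()
        return value

    def term(self):                           # factor (('*'|'/') factor)*
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.eat()
            value = value * self.factor() if op == "*" else value / self.factor()
        return value

    def factor(self):                         # NUMBER | '(' expression ')'
        if self.peek() == "(":
            self.eat()                        # consume '('
            value = self.expression()
            self.eat()                        # consume ')'
            return value
        return int(self.eat())                # NUMBER

print(Parser("2 + 3 * (4 - 1)").expression())  # 11
```

Note how the call structure mirrors the grammar: each rule becomes a method, and precedence falls out of which rule calls which.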
Predictive Parser
• The Predictive Parser is a type of top-down parser that is specifically designed to work
with a class of grammars known as LL grammars, where the first "L" stands for scanning
the input from left to right, and the second "L" for producing a leftmost derivation.
Bottom-Up parsing
Grammar rules:
The basic sentence is understood in terms of a noun phrase NP and a verb phrase VP.
For the sentence "Obama eats an apple", say the other rules are:
S → NP VP
NP → Det N | N
VP → V NP
N → 'Obama' | 'apple'
V → 'eats'
Det → 'an'
• A bottom-up (shift-reduce) parser shifts each word of the sentence onto a stack, one word
at a time, starting from "Obama".
• When the items on top of the stack match the right side of a grammar rule, the parser
reduces those items into a single item based on the rule. For example, after shifting
"Obama", it matches the rule N → 'Obama', so "Obama" is reduced to N.
Bottom-Up parsing (contd)
Shift "Obama" onto the stack. (Stack: [Obama])
Reduce "Obama" to N using the rule N → 'Obama'. (Stack: [N])
Reduce N to NP using the rule NP → N. (Stack: [NP])
Shift "eats" onto the stack. (Stack: [NP, eats])
Reduce "eats" to V using the rule V → 'eats'. (Stack: [NP, V])
Shift "an" onto the stack. (Stack: [NP, V, an])
Reduce "an" to Det using the rule Det → 'an'. (Stack: [NP, V, Det])
Shift "apple" onto the stack. (Stack: [NP, V, Det, apple])
Reduce "apple" to N using the rule N → 'apple'. (Stack: [NP, V, Det, N])
Reduce Det N to NP using the rule NP → Det N. (Stack: [NP, V, NP])
Reduce V NP to VP using the rule VP → V NP. (Stack: [NP, VP])
Reduce NP VP to S using the rule S → NP VP. (Stack: [S])
Bottom-Up parsing (contd)
• This process continues with shifting and reducing according to the rules defined in the
grammar until the entire sentence is reduced to the start symbol (S), indicating
successful parsing.
• The ShiftReduceParser might not always find a parse for a sentence, especially if the
grammar is ambiguous or doesn't cover the sentence structure. In such cases, we need to
adjust the grammar.
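The shift-reduce procedure traced above can be sketched in plain Python. This is a greedy recognizer of my own (not NLTK's ShiftReduceParser): it reduces whenever any rule's right side matches the top of the stack, with no backtracking, so rule order matters for ties.

```python
# A bare-bones shift-reduce recognizer for "Obama eats an apple".
rules = [                       # (LHS, RHS); order matters for ties
    ("S",   ["NP", "VP"]),
    ("VP",  ["V", "NP"]),
    ("NP",  ["Det", "N"]),      # try the longer NP rule first
    ("NP",  ["N"]),
    ("N",   ["Obama"]),
    ("N",   ["apple"]),
    ("V",   ["eats"]),
    ("Det", ["an"]),
]

def shift_reduce(words):
    stack, trace = [], []

    def reduce_all():
        """Reduce repeatedly while any rule matches the stack top."""
        changed = True
        while changed:
            changed = False
            for lhs, rhs in rules:
                if stack[-len(rhs):] == rhs:
                    del stack[-len(rhs):]
                    stack.append(lhs)
                    trace.append(f"reduce {lhs} -> {' '.join(rhs)}")
                    changed = True
                    break

    for word in words:
        stack.append(word)      # shift the next word
        trace.append(f"shift {word}")
        reduce_all()
    return stack, trace         # stack == ['S'] on success

stack, trace = shift_reduce("Obama eats an apple".split())
print(stack)                    # ['S']
```

Because the strategy is greedy, a different rule ordering (e.g. trying NP → N before NP → Det N) can strand the parser in a dead end, which is precisely the kind of failure the note above describes.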
When it comes to a syntactic parser, there is a chance that a syntactically well-formed sentence
could be meaningless. To get to the meaning, we need a deeper understanding of the semantic
structure of the sentence.
Thanks