
Applications of NLP

• Auto-Correct
• Information Retrieval & Web Search
• Grammar Correction & Question Answering
• Sentiment Analysis
• Text Classification
• Chatbots & Virtual Assistants
• Text Extraction
• Machine Translation
• Text Summarization
• Market Intelligence

Challenges in NLP

1. Ambiguity: Words can have different meanings. Resolving the correct meaning is hard.

2. Language Variations: People use language differently. Understanding different styles and regions is
tough.

3. Context Understanding: Knowing the full meaning of a sentence by considering the surrounding
text is difficult.

4. Lack of Data: NLP needs lots of data to learn, but good data is not always available.

5. Common Sense Reasoning: Teaching computers to understand the world like humans is tricky.

6. Sentiment and Emotion Analysis: Figuring out emotions and feelings in text is hard, especially
sarcasm and irony.

7. Ethical and Bias Concerns: NLP models can be biased and unfair. It's important to address and fix
these issues.

Researchers and engineers are working on solving these challenges to make NLP better, fairer, and
more reliable.

Challenges in Morphological Models


1. Complexity: Some languages have complex word structures with many forms. Models struggle to
handle all the variations.

2. Data Sparsity: Models need lots of data to learn patterns, but sometimes there's not enough data,
making it hard to handle rare word forms.

3. Out-of-Vocabulary Words: Models struggle with words they haven't seen before, as they don't
know how to handle them.

4. Ambiguity and Homonymy: Words can have multiple meanings or sound the same but mean
different things. Models have trouble picking the right meaning or analysis.

5. Rare Languages: Building models for languages with few resources is challenging. Limited data and
tools make it hard to create accurate models.

6. Non-concatenative Morphology: Some languages have unique ways of forming words. Models
struggle with these non-standard patterns.

7. Morphological Change: Languages change over time, and models need to handle historical
variations in morphology accurately.

Addressing these challenges requires developing sophisticated morphological models that can
handle the complexities of different languages, manage data sparsity, deal with ambiguity, and adapt
to evolving linguistic structures. Researchers work on improving these models to enhance their
accuracy and applicability in various language processing tasks.

Importance and Goals of NLP


Importance of NLP:

1. Communication: NLP enables computers to understand and generate human language, improving
communication between humans and machines. This is crucial for developing voice assistants,
chatbots, and other applications that interact with users through natural language.

2. Information Extraction: NLP helps extract relevant information from large volumes of text. It can
summarize articles, extract key facts, and analyze sentiments, allowing users to quickly access and
make sense of vast amounts of information.

3. Language Translation: NLP plays a vital role in language translation by automatically translating
text from one language to another. This helps bridge language barriers, facilitate international
communication, and promote cross-cultural understanding.

4. Sentiment Analysis: NLP can analyze sentiments and emotions expressed in text, such as social
media posts, customer reviews, and news articles. This enables businesses to understand public
opinion, assess customer satisfaction, and make data-driven decisions.

5. Text Classification: NLP allows computers to classify text into different categories or topics. This is
useful in tasks such as spam detection, news categorization, sentiment classification, and content
filtering.

6. Information Retrieval: NLP helps in building search engines that can understand the user's query
and retrieve relevant information from a vast collection of documents. It improves search accuracy
and enables users to find what they're looking for quickly.

7. Question Answering: NLP systems can comprehend questions and provide relevant answers. This is
useful in applications like virtual assistants, customer support chatbots, and educational platforms.

The primary goals of NLP are:

1. Understanding Human Language: The main goal is to enable computers to understand human
language, including its grammar, semantics, and context. This involves tasks like parsing sentences,
extracting meaning, and interpreting nuances.

2. Natural Language Generation: NLP aims to enable computers to generate human-like language.
This includes tasks like text generation, summarization, and language translation.

3. Language Processing Efficiency: NLP strives to develop algorithms and models that process
language efficiently, allowing for fast and accurate analysis of text data.

4. Real-World Applications: NLP aims to create practical applications that can assist users in various
domains, such as healthcare, education, finance, customer service, and more.

Overall, NLP enhances our ability to interact with technology, enables us to process and understand
vast amounts of textual data, and opens up new opportunities for automating language-related tasks
in a wide range of industries and applications.

Explain any one Morphological model


One popular morphological model in Natural Language Processing (NLP) is the Finite-State
Transducer (FST) model. The FST model is a computational framework that represents the
morphology of a language using finite-state machines. It consists of two main components: a finite-
state automaton for the input (called the lexical transducer) and a finite-state automaton for the
output (called the morphological transducer).

The lexical transducer has all the possible word forms in a language, like the different versions of a
word with different endings. It also has information about how the words are made, like if there are
any prefixes or suffixes added to them. It uses these rules to show how words can change.

The lexical transducer represents the set of possible word forms in a language, along with their
corresponding morphological properties. It encodes the rules and patterns of word formation,
including inflections, prefixes, suffixes, and other linguistic variations. Each transition in the lexical
transducer corresponds to a morphological transformation.

The morphological transducer helps analyze or create new words. It takes a word as input and
follows the rules in the lexical transducer to give the right output. It goes step by step, changing the
word according to the rules until it gets the final form.

The morphological transducer, on the other hand, represents the analysis or generation of words. It
takes an input word and applies the rules encoded in the lexical transducer to produce the
corresponding output word form. This process involves traversing the states and transitions in the
transducer, applying the appropriate morphological transformations at each step.

The FST model is great because it's fast and can handle complicated word structures. It can also
understand things like the type of word (noun, verb, etc.) and other language rules.

It can handle complex morphological systems and support various linguistic features, such as part-of-
speech tags, grammatical categories, and morphophonological rules.

Applications of the FST model include part-of-speech tagging, morphological analysis, lemmatization,
spelling correction, and language generation. It has been successfully applied to several languages
with rich morphological structures, such as Finnish, Turkish, and Arabic.

Overall, the FST model provides a powerful framework for modeling and processing the morphology
of languages, enabling accurate analysis and generation of word forms based on their morphological
properties.
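Below is a minimal Python sketch of the analyze/generate mapping that an FST model provides, using a tiny hand-written lexicon and a few suffix rules for English noun plurals. The lexicon, rules, and function names are illustrative assumptions; real systems compile such rules into genuine finite-state transducers with toolkits such as HFST or foma.

```python
# Toy "analyze/generate" mapping in the spirit of an FST morphological model.
# A real FST toolkit would compile these rules into a transducer; this sketch
# only imitates the input/output behaviour.

LEXICON = {"cat", "dog", "box", "city"}          # assumed toy lexicon

def generate(lemma: str, tag: str) -> str:
    """Map a lexical form like ('city', '+PL') to its surface form."""
    if tag == "+SG":
        return lemma
    if tag == "+PL":
        if lemma.endswith(("s", "x", "z", "ch", "sh")):
            return lemma + "es"                  # box -> boxes
        if lemma.endswith("y") and lemma[-2] not in "aeiou":
            return lemma[:-1] + "ies"            # city -> cities
        return lemma + "s"                       # cat -> cats
    raise ValueError(f"unknown tag {tag!r}")

def analyze(surface: str):
    """Return every (lemma, tag) pair that generates the given surface form."""
    return [(lemma, tag)
            for lemma in LEXICON
            for tag in ("+SG", "+PL")
            if generate(lemma, tag) == surface]

print(analyze("cities"))   # [('city', '+PL')]
print(analyze("cat"))      # [('cat', '+SG')]
```

Here `generate` plays the role of the generation direction and `analyze` the analysis direction; in a real FST both directions come for free from the same transducer.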
Differentiate between surface and deep structure

In the field of Natural Language Processing (NLP), surface structure and deep structure are two
concepts that relate to the analysis and representation of language.

1. Surface Structure:

Surface structure refers to the literal or syntactic representation of a sentence. It includes the
arrangement of words, grammar, and punctuation. Surface structure focuses on the surface-level
characteristics of a sentence without considering its underlying meaning. It is the structure that we
perceive when we read or hear a sentence.

Example:

Consider the sentence: "The cat chased the mouse."

In terms of surface structure, we can observe the word order, articles, and verb tenses. The surface
structure of this sentence represents the syntactic arrangement of the words, which helps us
understand the basic grammatical structure without considering the semantics or deeper meaning.

2. Deep Structure:

Deep structure refers to the underlying meaning or semantics of a sentence. It captures the intended
interpretation and represents the relationships between words and phrases in a more abstract and
meaningful way. Deep structure is concerned with the understanding and interpretation of the
sentence beyond its surface-level representation.

Example:

Let's take the same sentence as before: "The cat chased the mouse."

In terms of deep structure, we understand that the cat is performing the action of chasing the
mouse. Deep structure analyzes the relationships between entities (cat and mouse) and their actions
(chased) to capture the meaning or intent behind the sentence. It goes beyond the surface-level
syntax to extract the underlying semantics.

In NLP, deep structure analysis is essential for tasks such as sentiment analysis, semantic parsing, and
machine translation, where understanding the meaning and context of a sentence is crucial. Surface
structure, on the other hand, is often used for tasks like part-of-speech tagging, parsing, and
syntactic analysis.

or

The surface structure is like the way the sentence looks or sounds on the outside. It's the actual
words and the order in which they appear. For example, let's take the sentence "The dog chased the
cat." This is the surface structure of the sentence because it shows us the words and their
arrangement.

On the other hand, the deep structure is the underlying meaning or idea that the sentence is trying
to convey. It's like the hidden message behind the words. Using the same example, the deep
structure of "The dog chased the cat" is the underlying idea that the dog is the one doing the chasing
and the cat is the one being chased, which is the same idea that the passive sentence "The cat was
chased by the dog" expresses.

So, to summarize, the surface structure is the way the sentence is written or spoken, and the deep
structure is the hidden meaning behind the sentence.

In NLP, we often try to understand the deep structure of sentences because it helps us extract
meaning and answer questions. Sometimes, the surface structure can be different, but the deep
structure remains the same. For example, let's take the sentence "The cat was chased by the dog."
Even though the words and order are different in this sentence, the deep structure is still the same as
the previous example.

Understanding the deep structure is important because it allows us to comprehend the true meaning
of a sentence, even if the words or arrangement are slightly different. It helps us teach computers to
understand language more like humans do.
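As a toy illustration of this idea, the sketch below maps both the active and the passive surface forms to a single (agent, action, patient) triple. The two regular-expression patterns are invented for this one example and are not a real parser.

```python
import re

# Two toy surface patterns that share one "deep" representation:
# an (agent, action, patient) triple.
ACTIVE  = re.compile(r"^The (\w+) (\w+)d the (\w+)\.$")          # "The dog chased the cat."
PASSIVE = re.compile(r"^The (\w+) was (\w+)d by the (\w+)\.$")   # "The cat was chased by the dog."

def deep_structure(sentence: str):
    m = ACTIVE.match(sentence)
    if m:
        agent, verb, patient = m.groups()
        return (agent, verb, patient)
    m = PASSIVE.match(sentence)
    if m:
        patient, verb, agent = m.groups()
        return (agent, verb, patient)
    return None

print(deep_structure("The dog chased the cat."))          # ('dog', 'chase', 'cat')
print(deep_structure("The cat was chased by the dog."))   # ('dog', 'chase', 'cat')
```

Both surface forms come out as the same triple, which is exactly the point: different surface structures, one deep structure.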

Examples for early NLP systems


Here’s a list of the top 10 natural language processing examples: 

1. Language Translation
2. Search Engine Results
3. Smart Assistants
4. Customer Service Automation
5. Email Filters
6. Survey Analytics
7. Chatbots
8. Social Media Monitoring
9. Text Analytics
10. Predictive Text 

Here are a few examples of early Natural Language Processing (NLP) systems, explained in simple
terms:

1. ELIZA:
ELIZA was one of the earliest NLP systems developed in the 1960s. It was designed to simulate a
conversation with a psychotherapist. ELIZA used simple pattern matching techniques to respond to user
inputs. It would identify keywords in a sentence and generate pre-programmed responses based on those
keywords. Although ELIZA couldn't truly understand or have a meaningful conversation, it created an
illusion of understanding by repeating and rephrasing what the user said.
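A few lines of Python can imitate this style of keyword matching; the rules and canned responses below are invented for illustration and only hint at how ELIZA's original scripts worked.

```python
import re

# Toy ELIZA-style rules: a keyword pattern and a response template.
RULES = [
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bI feel (.+)", re.IGNORECASE), "How long have you felt {0}?"),
    (re.compile(r"\bmother\b", re.IGNORECASE), "Tell me more about your family."),
]
DEFAULT = "Please go on."

def respond(user_input: str) -> str:
    for pattern, template in RULES:
        match = pattern.search(user_input)
        if match:
            return template.format(*match.groups())
    return DEFAULT

print(respond("I am sad about my exam"))   # Why do you say you are sad about my exam?
print(respond("I talked to my mother"))    # Tell me more about your family.
```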

2. Chatbots:
Chatbots have been around for quite some time. In their early forms, chatbots were rule-based systems
that followed predefined rules and patterns to interact with users. They would analyze user inputs, match
them with predetermined responses, and provide answers accordingly. These early chatbots were limited
in their capabilities and could only handle specific questions or tasks for which they were programmed.

3. Spell Checkers:
Spell checkers are NLP systems that help us identify and correct spelling mistakes in our writing. Early spell
checkers used a dictionary of words to compare and match the words we typed. If a word didn't match any
word in the dictionary, the system would suggest alternative words that were similar in spelling. This helped
us catch and fix spelling errors, making our writing more accurate.
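The sketch below shows the dictionary-lookup idea with Python's standard difflib module; the word list is a toy assumption standing in for a full dictionary.

```python
import difflib

# Toy dictionary; a real spell checker would load a full word list.
DICTIONARY = ["the", "cat", "chased", "mouse", "language", "processing"]

def check(word: str):
    """Return the word if it is known, otherwise up to three similar suggestions."""
    if word.lower() in DICTIONARY:
        return word
    return difflib.get_close_matches(word.lower(), DICTIONARY, n=3)

print(check("langauge"))   # suggests ['language']
print(check("mouse"))      # known word, returned unchanged
```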

4. Machine Translation:
Early machine translation systems aimed to translate text from one language to another. They used simple
word-by-word substitution methods. For example, if you wanted to translate an English sentence into
French, the system would look up each English word in its dictionary and replace it with its corresponding
French word. However, this approach often led to inaccurate translations because it didn't consider the
context or grammar of the sentence.
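A word-by-word substitution translator takes only a few lines of Python. The tiny English-to-French dictionary below is an illustrative assumption, and the output shows how ignoring grammar produces errors such as the wrong article.

```python
# Toy English -> French dictionary for word-by-word substitution.
EN_FR = {"the": "le", "cat": "chat", "chased": "a poursuivi", "mouse": "souris"}

def translate_word_by_word(sentence: str) -> str:
    words = sentence.lower().rstrip(".").split()
    # Each word is replaced independently; unknown words are left unchanged.
    return " ".join(EN_FR.get(w, w) for w in words) + "."

print(translate_word_by_word("The cat chased the mouse."))
# -> "le chat a poursuivi le souris."  ("la souris" would be correct: gender and context are ignored)
```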

These early NLP systems paved the way for advancements in the field, and over time, more sophisticated
techniques and algorithms were developed to enhance the accuracy and understanding of natural
language.

Finding the Structure of Documents:


Complexity of the Approaches
Performance of the Approaches
Finding the structure of documents is a challenging task in natural language processing (NLP) that
involves understanding the organization, hierarchy, and relationships of various elements within a
document. The complexity and performance of approaches for document structure analysis can vary
based on different factors. Here are some key points to consider:

1. Complexity of Approaches:

- Rule-based Approaches: Some approaches rely on predefined rules and patterns to identify
structural elements like headings, paragraphs, lists, or tables. These rule-based approaches tend to
be relatively simpler but may struggle with handling variations and complex document structures.

- Machine Learning Approaches: Other approaches employ machine learning techniques, such as
supervised or unsupervised learning, to automatically learn patterns and features from annotated or
unannotated document data. These approaches can handle more complex structures but require
training data and may be computationally intensive.

- Hybrid Approaches: Hybrid approaches combine rule-based and machine learning techniques to
leverage the strengths of both. They can offer a good balance between complexity and performance.

2. Performance of Approaches:

- Accuracy: The accuracy of document structure analysis approaches is an important performance
measure. Higher accuracy means better identification and understanding of the document structure.

- Scalability: The ability of an approach to handle large documents or large collections of
documents efficiently is crucial. Scalable approaches can process documents quickly and can be
applied to diverse data sets.

- Generalizability: How well an approach performs on different types of documents and domains is
another performance aspect. Robust approaches are capable of handling various document formats
and structures.

- Error Handling: Document structure analysis approaches may encounter errors or ambiguities,
especially in documents with complex layouts or unstructured content. Robust approaches should
handle such cases gracefully and provide meaningful results.

It's worth noting that the performance of document structure analysis approaches can vary
depending on factors like the quality and diversity of training data, the complexity of document
structures, and the specific requirements of the task or application.

Researchers and practitioners continually work on improving the performance of document structure
analysis approaches through advancements in machine learning algorithms, data annotation, and
fine-tuning of models. Evaluating the performance of these approaches often involves using
benchmark datasets and metrics specific to the document structure analysis task.

Important things to consider:

1. Document Structure Analysis

2. Text Segmentation

3. Entity Recognition

4. Topic Extraction

5. Sentiment Analysis

6. Information Retrieval

Overall, document structure analysis is an important task in NLP as it provides a foundation for tasks
such as information extraction, summarization, and document understanding.

Improving the performance of approaches in analyzing document structure helps in extracting
meaningful information and gaining insights from textual data.
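As a concrete illustration of the rule-based approach mentioned above, the sketch below labels each line of a plain-text document as a heading, list item, or paragraph using simple regular-expression rules. These rules are illustrative assumptions; a practical system would need far more robust heuristics or a learned model.

```python
import re

HEADING = re.compile(r"^(?:\d+\.\s+\S|[A-Z][A-Za-z ]{0,60}:?$)")   # crude heading test
BULLET  = re.compile(r"^\s*[-*\u2022]\s+")                          # -, * or bullet markers

def label_lines(text: str):
    """Assign a coarse structural label to each non-empty line."""
    labels = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if BULLET.match(stripped):
            labels.append(("list_item", stripped))
        elif HEADING.match(stripped) and len(stripped.split()) <= 8:
            labels.append(("heading", stripped))
        else:
            labels.append(("paragraph", stripped))
    return labels

sample = "Challenges in NLP\n- Ambiguity\n- Data sparsity\nResolving these issues is an open problem."
for label, line in label_lines(sample):
    print(label, "|", line)
```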

Representation of syntactic structure with the help of a neat diagram

The representation of syntactic structure in Natural Language Processing (NLP) often involves the use of a tree-like structure
called a parse tree or syntax tree. This tree represents the hierarchical relationship between words in a sentence and their
syntactic roles.

Here's a step-by-step explanation of how a syntax tree is constructed:

1. Start with a sentence: Let's take the sentence "The cat chased the mouse."

2. Identify the parts of speech: Determine the grammatical roles of each word. In this case, we have "The" (article), "cat"
(noun), "chased" (verb), "the" (article), and "mouse" (noun).

3. Determine the sentence structure: Identify the relationships between words. In this sentence, "cat" is the subject of the
verb "chased," and "mouse" is the object of the verb.

4. Create the syntax tree: Begin with a root node S representing the whole sentence. S branches into a
noun phrase (NP) and a verb phrase (VP). The NP expands into the determiner "The" and the noun
"cat," while the VP expands into the verb "chased" and a second NP made up of "the" and "mouse."

The resulting syntax tree would look like this:

              S
           /     \
         NP        VP
        /  \      /   \
     Det    N    V      NP
      |     |    |     /  \
     The   cat chased Det   N
                       |    |
                      the  mouse

In this tree, the leaf nodes are the words of the sentence, the internal nodes are phrases (NP, VP) and word
categories (Det, N, V), and the edges represent the syntactic relationships between them. The tree
structure shows how the words are connected and organized within the sentence.

This representation of syntactic structure is useful in various NLP tasks, such as parsing, part-of-speech tagging, and
understanding the grammatical structure of sentences. It helps in analyzing and interpreting the syntactic relationships
between words, enabling computers to understand and process natural language more effectively.
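If the NLTK library is available, the same parse tree can be produced automatically from a small hand-written grammar; the grammar below is an assumption written only for this example sentence.

```python
import nltk

# A tiny context-free grammar covering only the example sentence.
grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'The' | 'the'
    N  -> 'cat' | 'mouse'
    V  -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("The cat chased the mouse".split()):
    tree.pretty_print()   # prints the tree with S at the root and NP/VP below it
```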

Models for ambiguity resolution in Parsing


1. Probabilistic Context-Free Grammars (PCFGs):

PCFGs use a tree-like structure called a parse tree to represent the possible ways a sentence can be
parsed. Each node in the tree represents a phrase or a word, and the edges represent the
relationships between them. Additionally, PCFGs assign probabilities to the rules that govern the
generation of the parse tree.

Structure:

The PCFG structure consists of:


- Non-terminal symbols: Represent phrases or parts of speech (e.g., noun phrase, verb phrase).

- Terminal symbols: Represent words or tokens.

- Production rules: Define how non-terminal symbols can be expanded into other symbols. Each rule
has a probability associated with it.

For example, consider the sentence "The cat chased the mouse." In a PCFG, the parse tree for this
sentence could have a rule like "S -> NP VP," where S represents a sentence, NP represents a noun
phrase, and VP represents a verb phrase. The probabilities associated with these rules determine the
likelihood of each possible parse tree.
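A minimal sketch with NLTK's PCFG support (assuming NLTK is installed) is shown below; the rule probabilities are made-up toy numbers, and the Viterbi parser returns the most probable tree under them.

```python
import nltk

# Toy probabilistic grammar; each non-terminal's rule probabilities sum to 1.
pcfg = nltk.PCFG.fromstring("""
    S  -> NP VP          [1.0]
    NP -> Det N          [1.0]
    VP -> V NP           [1.0]
    Det -> 'The' [0.5] | 'the' [0.5]
    N  -> 'cat' [0.5] | 'mouse' [0.5]
    V  -> 'chased'       [1.0]
""")

parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse("The cat chased the mouse".split()):
    print(tree.prob())    # probability of the most likely parse under the toy grammar
    tree.pretty_print()
```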

2. Generative Models for Parsing:

Generative models aim to generate a sentence by understanding its grammatical structure. One
popular generative model for parsing is the Hidden Markov Model (HMM).

Structure:

In the case of parsing, the HMM structure includes:

- States: Represent different syntactic categories or parts of speech (e.g., noun, verb).

- Initial state distribution: Represents the probability of starting in each state.

- Transition probabilities: Represent the likelihood of transitioning from one state to another.

- Emission probabilities: Represent the likelihood of emitting a word given a state.

Generative models like HMMs analyze the structure of a sentence to generate possible parse trees
based on probabilities. By comparing these parse trees, the model can choose the most likely one.
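The sketch below shows Viterbi decoding for a toy HMM tagger over the example sentence; the states, transition probabilities, and emission probabilities are invented numbers, not estimates from real data.

```python
# Toy HMM: three part-of-speech states with hand-picked probabilities.
states = ["Det", "N", "V"]
start  = {"Det": 0.8, "N": 0.1, "V": 0.1}
trans  = {"Det": {"Det": 0.0, "N": 0.9, "V": 0.1},
          "N":   {"Det": 0.2, "N": 0.1, "V": 0.7},
          "V":   {"Det": 0.6, "N": 0.3, "V": 0.1}}
emit   = {"Det": {"the": 0.9, "cat": 0.0, "chased": 0.0, "mouse": 0.1},
          "N":   {"the": 0.0, "cat": 0.5, "chased": 0.0, "mouse": 0.5},
          "V":   {"the": 0.0, "cat": 0.0, "chased": 1.0, "mouse": 0.0}}

def viterbi(words):
    """Return (probability, tag sequence) for the most probable state path."""
    V = [{s: (start[s] * emit[s].get(words[0], 0.0), [s]) for s in states}]
    for w in words[1:]:
        layer = {}
        for s in states:
            prob, path = max(
                (V[-1][p][0] * trans[p][s] * emit[s].get(w, 0.0), V[-1][p][1] + [s])
                for p in states)
            layer[s] = (prob, path)
        V.append(layer)
    return max(V[-1].values())

prob, tags = viterbi("the cat chased the mouse".split())
print(tags)   # ['Det', 'N', 'V', 'Det', 'N']
```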

3. Discriminative Models for Parsing:

Discriminative models focus on identifying the correct parse tree directly, rather than generating
sentences. One popular discriminative model for parsing is the Maximum Entropy Markov Model
(MEMM).

Structure:

The structure of a MEMM includes:

- Features: Represent linguistic properties or patterns of the sentence.

- Weights: Assign importance to each feature based on how informative it is for parsing.

MEMMs consider multiple features of a sentence, such as the words, part-of-speech tags, and the
structure of neighboring words, to predict the correct parse tree.
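The sketch below shows the kind of feature function such a model might use; the feature names are illustrative assumptions, and a real MEMM would pass these features to a trained maximum-entropy classifier rather than just printing them.

```python
def extract_features(words, pos_tags, i):
    """Features describing the decision to be made at position i."""
    return {
        "word": words[i],
        "pos": pos_tags[i],
        "prev_word": words[i - 1] if i > 0 else "<START>",
        "prev_pos": pos_tags[i - 1] if i > 0 else "<START>",
        "next_word": words[i + 1] if i + 1 < len(words) else "<END>",
        "suffix2": words[i][-2:],
        "is_capitalized": words[i][0].isupper(),
    }

words = ["The", "cat", "chased", "the", "mouse"]
tags  = ["DT", "NN", "VBD", "DT", "NN"]
print(extract_features(words, tags, 2))   # features for the word "chased"
```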

In summary, probabilistic context-free grammars (PCFGs) assign probabilities to parse trees,
generative models (like HMMs) focus on generating sentences and assigning probabilities to different
structures, and discriminative models (like MEMMs) directly predict the correct parse tree based on
features and weights. These models help in understanding the structure and meaning of sentences in
natural language.
or
1. Probabilistic Context-Free Grammars (PCFGs):

Imagine a "grammar" as a set of rules that define how words combine to form sentences. PCFGs
extend this concept by assigning probabilities to each rule. These probabilities represent the
likelihood of a particular rule being used.

For example, consider the sentence "The cat chased the mouse." In a PCFG, there would be rules like
"S -> NP VP" (a sentence consists of a noun phrase followed by a verb phrase) and "NP -> Det N" (a
noun phrase consists of a determiner followed by a noun). Each rule would have a probability
associated with it.

PCFGs help resolve ambiguity by considering the probabilities of different rules when parsing a
sentence. The parser analyzes the sentence and constructs parse trees, evaluating the probabilities
of different tree structures. The most probable parse tree, based on the probabilities assigned by the
PCFG, is considered the correct interpretation of the sentence.

2. Generative Models for Parsing:

Generative models take a different approach. They aim to model how sentences are generated from
underlying syntactic structures. One popular generative model is the Hidden Markov Model (HMM).

Think of an HMM as a system with hidden states and observed outputs. In parsing, the hidden states
represent the underlying syntactic structures, and the observed outputs are the words in the
sentence. The model estimates the probabilities of transitioning between different states and
generating the observed outputs.

Generative models help resolve ambiguity by considering all possible parse trees and assigning
probabilities based on how likely they are to generate the observed sentence. The model searches
for the most probable parse tree that generates the given sentence, considering both the transitions
between syntactic structures and the generation of words.

3. Discriminative Models for Parsing:

Discriminative models take a different approach than generative models. Instead of modeling the
generation of sentences, they directly model the relationship between the input sentence and the
correct parse tree. One popular discriminative model is the Maximum Entropy Markov Model
(MEMM).

MEMMs use features extracted from the input sentence to predict the most likely parse tree. These
features can include part-of-speech tags, word dependencies, or other linguistic characteristics. The
model learns from training data where the correct parse trees are known, and it estimates the
probabilities of different parse trees given the input features.

Discriminative models help resolve ambiguity by learning patterns from the training data and using
them to predict the correct parse tree for a given sentence. By considering the specific features and
their relationships to parse trees, these models can make informed decisions on how to resolve
ambiguity.
In summary, probabilistic context-free grammars assign probabilities to grammar rules, generative
models analyze how sentences are generated, and discriminative models directly model the
relationship between the input sentence and the correct parse tree.

Each approach helps resolve ambiguity in parsing by considering probabilities, generative processes,
or discriminative patterns.

Multilingual Issues:
1. Different languages, different rules: Just like people speak different languages, each language has
its own set of rules for grammar and sentence structure.

2. Word order: In some languages, the order of words in a sentence is different from what we are
used to. For example, in English, we say "I like pizza," but in Spanish, it is "Me gusta la pizza" (which
translates to "To me, pleases the pizza").

3. Word meanings: Words that look or sound alike may have different meanings in different languages. For
example, the word "gift" in English refers to a present, but in German, "Gift" means "poison."

4. Ambiguity: Languages can have ambiguous words or phrases that can be understood in different
ways. Translating these ambiguities accurately can be challenging.

5. Cultural context: Different languages and cultures may express ideas and concepts differently.
Translating these cultural nuances accurately is important to convey the intended meaning.

In simple terms, multilingual issues arise because different languages have different rules and ways
of expressing ideas.

Treebanks and Their Role in Parsing:


1. What are Treebanks? A treebank is like a collection of sentences from a language that are
annotated with their syntactic structure using a parse tree.

2. What is a parse tree? A parse tree is a way to show how words in a sentence are related to each
other. It looks like a tree with branches connecting words and phrases.

3. How are Treebanks created? Linguists and language experts carefully analyze sentences and create a
parse tree for each sentence. These parse trees are then used to build the treebank.

4. Role in parsing: Treebanks are essential for training and evaluating parsing algorithms. They
provide a reference for computers to understand the correct syntactic structure of sentences in a
specific language.

5. Improving parsing accuracy: By using large and diverse treebanks, researchers can improve the
accuracy of parsing algorithms. These treebanks help computers learn the patterns and rules of a
language's syntax.

6. Cross-lingual analysis: Treebanks are also useful for comparing different languages. By studying the
similarities and differences in parse trees across languages, researchers can gain insights into
language universals and linguistic diversity.

In short, treebanks are collections of sentences with annotated structures that help computers
understand the grammar of a language and improve their ability to analyze and understand sentences.
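If NLTK and its Penn Treebank sample are installed (via nltk.download('treebank')), an annotated sentence and its gold-standard parse tree can be inspected directly, as in the sketch below.

```python
import nltk
from nltk.corpus import treebank

# nltk.download('treebank')  # uncomment on first run to fetch the corpus sample

sent = treebank.parsed_sents()[0]   # first annotated sentence in the sample
print(" ".join(sent.leaves()))      # the plain words of the sentence
sent.pretty_print()                 # the gold-standard parse tree from the treebank
```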
