
WID3002: NATURAL LANGUAGE PROCESSING

PART C: ONLINE OPEN BOOK TEST

Question 1:

a)

For Social Media Text:

Data from social media can come in many forms. If the data is scraped from social media, there can be HTML tags and other formatting that need to be stripped. Taking Twitter as an example, some Malaysian tweets mix several languages, which requires a higher level of normalization. In some situations there are newly coined words that do not yet exist in the corpus, which makes processing harder. There are also cases where new slang or analogies introduce semantic ambiguity. However, a set of basic preprocessing techniques is adequate to address most of the data.

Preprocessing:
Remove any special characters, punctuation, or symbols that are not relevant to the text
analysis. This step ensures that only the meaningful words remain.
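A minimal sketch of this cleaning step in Python, assuming scraped tweet-like input (the example text, the URL, and the patterns removed are illustrative, not from any real dataset):

```python
import re

def clean_text(text):
    # Strip HTML tags left over from scraping (e.g. <b>, <a href=...>).
    text = re.sub(r"<[^>]+>", " ", text)
    # Remove URLs and Twitter-style @mentions, which carry little meaning here.
    text = re.sub(r"https?://\S+|@\w+", " ", text)
    # Drop any character that is not a letter, digit, or whitespace.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Best nasi lemak!! <b>RT</b> @kawan https://t.co/xyz :)"))
# Best nasi lemak RT
```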

Tokenization:
Split the text into individual words, also known as tokens. This step helps in identifying the
boundaries between words.
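A simple word tokenizer can be sketched with a regular expression; real pipelines often use a library tokenizer that also handles hashtags, contractions, and emoji:

```python
import re

def tokenize(text):
    # Pull out runs of letters, digits, or apostrophes as tokens.
    return re.findall(r"[A-Za-z0-9']+", text)

print(tokenize("The cat sat on the mat."))
# ['The', 'cat', 'sat', 'on', 'the', 'mat']
```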

Lowercasing:
Convert all words to lowercase. This step ensures that the model treats "cat" and "Cat" as
the same word, reducing the complexity.

Stop Word Removal:


Remove common and insignificant words like "and," "the," "is," etc. These words, called stop
words, don't contribute much to the overall meaning.
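Stop word removal can be sketched as a set lookup. The stop-word list below is a small illustrative sample; real lists (such as NLTK's) contain well over a hundred entries:

```python
# A small illustrative stop-word list, not a complete one.
STOP_WORDS = {"and", "the", "is", "a", "an", "of", "to", "in", "on"}

def remove_stop_words(tokens):
    # Compare in lowercase so "The" is filtered the same way as "the".
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "cat", "is", "on", "the", "mat"]))
# ['cat', 'mat']
```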

Stemming/Lemmatization:
Reduce words to their base or root form to capture the essence. For example, "running" and
"ran" can be transformed to "run."

Building a Vocabulary:
Create a list of unique words from the processed text. Each word in the vocabulary will
become a feature in the feature vector.

Feature Extraction:
Count the occurrences of each word in the vocabulary within the text. This count becomes
the value for each feature in the feature vector.
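The last two steps, building a vocabulary and counting occurrences, together produce a bag-of-words feature vector. A minimal sketch, with made-up example documents:

```python
from collections import Counter

def build_vocabulary(documents):
    # Collect the unique words across all documents, in a stable sorted order.
    return sorted({word for doc in documents for word in doc})

def count_vector(tokens, vocabulary):
    # One count per vocabulary word: a bag-of-words feature vector.
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

docs = [["cat", "sat", "on", "mat"], ["cat", "ran"]]
vocab = build_vocabulary(docs)
print(vocab)                         # ['cat', 'mat', 'on', 'ran', 'sat']
print(count_vector(docs[0], vocab))  # [1, 1, 1, 0, 1]
```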

For Newspaper Text:

Data from newspapers is usually simpler to process because of the nature of news writing. News is mostly written in a single language, which removes the need to check and translate words. However, news from online articles may still need to be cleaned of HTML tags and styling. Emojis and slang are almost absent from news text, making the process easier. For news, the basic preprocessing techniques are adequate to process the data.

Preprocessing:
Similar to social media text, remove special characters, punctuation, and symbols that are
not relevant to the text analysis.

Sentence Segmentation:
Split the text into individual sentences. This step helps in analyzing the structure and context
of each sentence separately.
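Sentence segmentation can be sketched with a punctuation-based split. This is a rough heuristic that misfires on abbreviations like "Dr."; production tools use trained models:

```python
import re

def split_sentences(text):
    # Split after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(split_sentences("The market fell today. Analysts were surprised! What next?"))
# ['The market fell today.', 'Analysts were surprised!', 'What next?']
```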

Tokenization:
Split each sentence into individual words or tokens.

Part-of-Speech Tagging:
Assign a part of speech (noun, verb, adjective, etc.) to each word. This information provides
context and helps in understanding the sentence structure.
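The idea of POS tagging can be sketched with a toy lookup tagger. The lexicon below is invented for illustration; real taggers (e.g. NLTK's or spaCy's) are statistical models trained on annotated corpora:

```python
# A toy, hand-made tag lexicon for illustration only.
TAG_LEXICON = {"the": "DET", "cat": "NOUN", "sat": "VERB", "on": "ADP",
               "mat": "NOUN", "quickly": "ADV"}

def tag(tokens):
    # Fall back to NOUN for unknown words, a common naive default.
    return [(t, TAG_LEXICON.get(t.lower(), "NOUN")) for t in tokens]

print(tag(["The", "cat", "sat", "on", "the", "mat"]))
# [('The', 'DET'), ('cat', 'NOUN'), ('sat', 'VERB'), ('on', 'ADP'), ('the', 'DET'), ('mat', 'NOUN')]
```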

Named Entity Recognition:


Identify named entities like people, organizations, locations, etc., in the text. This step
provides additional information about the content.
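A naive heuristic gives the flavour of NER for news text: runs of capitalised words after the first word of a sentence are often named entities. Real NER uses trained sequence models, not this rule, and the example sentence is made up:

```python
def candidate_entities(sentence):
    # Collect runs of capitalised, non-sentence-initial words as candidates.
    tokens = sentence.split()
    entities, current = [], []
    for i, tok in enumerate(tokens):
        word = tok.strip(".,")
        if word[:1].isupper() and i > 0:  # skip the sentence-initial word
            current.append(word)
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

print(candidate_entities("Yesterday Bank Negara Malaysia met officials in Kuala Lumpur."))
# ['Bank Negara Malaysia', 'Kuala Lumpur']
```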

Lemmatization:
Reduce words to their base or root form, similar to the social media text.

Building a Vocabulary:
Create a list of unique words from the processed text. Each word in the vocabulary will
become a feature in the feature vector.

Feature Extraction:
Count the occurrences of each word in the vocabulary within the text. This count becomes
the value for each feature in the feature vector.
b)

i) Rewrite S1 and S2 after the text normalization process

S1: Natural language processing become important since soon begin talk computer.
S2: If computer understand natural language become much simpler use.

ii) What is the vocabulary, 𝑉?

V = {if, natural, language, processing, become, important, since, soon, begin, talk, computer,
understand, much, simpler, use}

iii) What is the number of bigrams and trigrams in S2? Show the bigrams and
trigrams of the sentence.

Bigrams in S2: (8 bigrams)

1. (If, computer)
2. (computer, understand)
3. (understand, natural)
4. (natural, language)
5. (language, become)
6. (become, much)
7. (much, simpler)
8. (simpler, use)

Trigrams in S2: (7 trigrams)

1. (If, computer, understand)


2. (computer, understand, natural)
3. (understand, natural, language)
4. (natural, language, become)
5. (language, become, much)
6. (become, much, simpler)
7. (much, simpler, use)
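The counts above follow from the general rule that a sentence of n tokens has n-1 bigrams and n-2 trigrams. A short sketch that reproduces them:

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

s2 = "If computer understand natural language become much simpler use".split()
print(len(ngrams(s2, 2)))  # 8 bigrams
print(len(ngrams(s2, 3)))  # 7 trigrams
print(ngrams(s2, 2)[0])    # ('If', 'computer')
```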
Question 2

Discuss the purpose of the language model for Statistical Machine Translation
(SMT). Analyse some challenges of SMT and suggest solutions to address the
challenges.

The language model in Statistical Machine Translation (SMT) estimates how likely a
sequence of words is in the target language. Its purpose is to make the generated
translations fluent and natural-sounding: while the translation model proposes candidate
translations for words and phrases, the language model scores those candidates based on
patterns it has learned from large amounts of target-language text.

For example, imagine you have a sentence in one language and you want to translate it into
another language. The translation model may produce several possible word choices and
orderings; the language model helps pick the combination that reads most naturally,
considering things like word order, common collocations, and how words fit together in a
sentence.

The goal is to create translations that not only preserve the meaning but also sound like
something a native speaker would say. In that sense, the language model acts like a
language expert checking that the output is expressed correctly in the target language.
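The role of the language model can be sketched with a toy bigram model with add-one smoothing. The corpus, sentences, and counts below are all made up for illustration, not real SMT data:

```python
from collections import Counter

# A tiny made-up target-language corpus.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
unigrams = Counter(w for s in corpus for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))

def score(sentence):
    # P(sentence) ~ product of P(w_i | w_{i-1}), with add-one smoothing.
    # Higher scores mean the word sequence looks more fluent to the model.
    p = 1.0
    for i in range(1, len(sentence)):
        prev, cur = sentence[i - 1], sentence[i]
        p *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + len(unigrams))
    return p

# The model prefers word orders it has seen in the target language.
print(score(["the", "cat", "sat"]) > score(["cat", "the", "sat"]))  # True
```

This is the sense in which the language model chooses among candidate translations: of two orderings with the same words, the one with the higher score is kept.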

Challenges of SMT and suggested solutions:

Word ambiguity: Words can have multiple meanings.
Solution: Use context to choose the right meaning.

Structural differences: Different languages have different word orders or sentence structures.
Solution: Rearrange words or phrases to match the target language structure.

Rare or out-of-vocabulary (OOV) words: Words that are not common or were not encountered during training.
Solution: Break words into smaller parts or find similar known words.

Long-range dependencies: The meaning of a word may depend on other words far away in the sentence.
Solution: Capture relationships between distant words to ensure accurate translations.
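The OOV remedy of breaking words into smaller known parts can be sketched as a greedy longest-match subword split. The subword vocabulary below is invented; real systems learn such units from data (e.g. byte-pair encoding):

```python
# An invented subword vocabulary for illustration only.
SUBWORDS = {"un", "break", "able", "translat", "ion", "s"}

def split_oov(word):
    # Greedily take the longest known subword at each position;
    # fall back to single characters when nothing matches.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in SUBWORDS:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])
            i += 1
    return pieces

print(split_oov("unbreakable"))  # ['un', 'break', 'able']
```

Splitting an unseen word into pieces the model has seen lets it translate each piece instead of failing on the whole word.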
