
DISCOURSE BASED OPINION MINING ON ROMAN URDU DATASETS


Top 10 News Stories That Broke on Social Media
Osama Bin Laden's Death

Sohaib Athar, an IT consultant living near
Abbottabad, Pakistan, unwittingly reported the raid
on Osama bin Laden's compound, which ended in bin
Laden's death.
President Obama announced the events in a news
conference the next day.
Whitney Houston's Death
■ On Feb. 11, 2012, news broke that
Whitney Houston had passed away
unexpectedly at the Beverly Hilton Hotel.
■ Many of the superstar’s fans saw the
news on Twitter, and tweeted to express
their grief and share gratitude for her
music.
■ Twitter said her passing generated more
than 10 million tweets or 73,662 tweets
per minute.
Michael Jackson’s Death
The Royal Wedding Announcement
Despite its many ancient traditions, the
British monarchy has not shied away from
sharing important news on social media.
Clarence House's official Twitter account
announced the royal wedding.
The US Presidential Elections 2016

Analysis of social media did a better job at predicting Trump’s win than the polls.
Benefits of Social Media
Roman Urdu Analysis

■ Roman Urdu is a term used for the
Urdu language written in Roman
script.
■ There are almost 30 million
internet users in Pakistan, and
Twitter ranks among the
top 10 websites by
popularity.
■ Roman Urdu is simply a representation of Urdu words written using the
English character set.
■ It lets people who do not know English well enough to express their thoughts
in it still communicate while using interfaces that are in English.
■ For example, a common word like “popular” was found to have the
following representations in the datasets used for analysis: “mashhoor”,
“mashoor”, “mashor”, “mashour”, “mashhur”, “mashhor”.
■ Similarly, the word “beautiful” in English was found to have four forms in
Roman Urdu: “khobsorat”, “khoobsurat”, “khobsurat”, “khubsorat”.
Lexical Normalization of Roman Urdu
Phonetic algorithms work very effectively for dealing with issues arising from words having similar
sounds but spelled differently.
Substring                                                 Replaced by
"ain" (at the end)                                        "ein"
"ai"                                                      "ae"
"ay" (at the end)                                         "e"
"ey" (at the end)                                         "e"
"ie" (at the end)                                         "y"
"es" (at the start)                                       "is"
multiple "a"                                              "a"
multiple "j"                                              "j"
multiple "d"                                              "d"
"u"                                                       "o"
multiple "o"                                              "o"
"ar" (except at the start)                                "r"
"iy" (with multiple y's)                                  "i"
"ih" (with multiple h's)                                  "eh"
multiple "s"                                              "s"
"ry" (except at the end)                                  "ri"
"sy" (except at the end)                                  "si"
"ty" (except at the end)                                  "ti"
"i" (at the end, after one of bcdefghijklmnopqrtuvwxyz)   "y"
"h" (after one of acefghijlmnoqrstuvwxyz)                 (removed)
"k"                                                       "q"
multiple "ee"                                             "i"
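The substitution rules above could be applied as an ordered list of regular-expression rewrites. The following is only a minimal sketch: the function name `normalize`, the order in which the rules fire, and the reading of "multiple" as "two or more consecutive occurrences" are assumptions, since the slides do not state them.

```python
import re

# Ordered (pattern, replacement) pairs transcribed from the table.
# ASSUMPTION: rules fire in table order; the slides do not specify an order.
RULES = [
    (r"ain$", "ein"),          # "ain" at the end
    (r"ai", "ae"),
    (r"ay$", "e"),             # "ay" at the end
    (r"ey$", "e"),             # "ey" at the end
    (r"ie$", "y"),             # "ie" at the end
    (r"^es", "is"),            # "es" at the start
    (r"a{2,}", "a"),           # repeated "a"
    (r"j{2,}", "j"),           # repeated "j"
    (r"d{2,}", "d"),           # repeated "d"
    (r"u", "o"),
    (r"o{2,}", "o"),           # repeated "o"
    (r"(?<=.)ar", "r"),        # "ar" except at the start
    (r"iy+", "i"),             # "i" followed by one or more "y" (assumed reading)
    (r"ih+", "eh"),            # "i" followed by one or more "h" (assumed reading)
    (r"s{2,}", "s"),           # repeated "s"
    (r"ry(?!$)", "ri"),        # "ry" except at the end
    (r"sy(?!$)", "si"),        # "sy" except at the end
    (r"ty(?!$)", "ti"),        # "ty" except at the end
    (r"(?<=[bcdefghijklmnopqrtuvwxyz])i$", "y"),   # final "i" after listed letters
    (r"(?<=[acefghijlmnoqrstuvwxyz])h", ""),       # drop "h" after listed letters
    (r"k", "q"),
    (r"(?:ee)+", "i"),         # repeated "ee"
]


def normalize(word: str) -> str:
    """Map a Roman Urdu spelling variant to a canonical key (sketch only)."""
    key = word.lower()
    for pattern, replacement in RULES:
        key = re.sub(pattern, replacement, key)
    return key


if __name__ == "__main__":
    for w in ["mashhoor", "mashoor", "mashor", "mashour", "mashhur", "mashhor"]:
        print(w, "->", normalize(w))
```

Under these assumptions, all six spellings of “popular” and all four spellings of “beautiful” quoted earlier collapse to a single key each. The key is a canonical form for grouping variants, not a valid Roman Urdu word.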
Discourse Parser
■ An essential phenomenon in natural language processing is the use of discourse relations to establish
a coherent relation, linking phrases and clauses in a text.
■ The presence of linguistic constructs like connectives, modals, conditionals and negation can alter
sentiment at the sentence level as well as the clausal or phrasal level.
■ Consider the example, “@user share 'em! I’m quite excited about Tintin, despite not really liking
original comics. Probably because Joe Cornish had a hand in...”
■ The overall sentiment of this example is positive, although there is an equal number of positive and
negative words.
■ This is due to the connective despite, which gives more weight to the previous discourse segment.
■ Any bag-of-words model would be unable to classify this sentence without considering the discourse
marker.
■ Consider another example, “Z10 Kaafi Intresting Set laga Lekin me bettry timing se thora dar gya hon
overall set acha he Lekin 20hzaar Is set pe kharch karna Kia sahe he ya koi 20hzaar tak ka set jo apki
nazar me ho jis ki ram nd storage healthy ho Ar bettry timng bi achi dy agar ap bta dein to acha hoga”
■ The overall sentiment is neutral due to the connective Lekin (but), which gives more weight to the
following segment of the comment. Thus it is of utmost importance to capture all these phenomena in
a computational model.
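The effect of a connective on segment weighting can be illustrated with a small sketch. Everything here is hypothetical and for illustration only: the toy lexicon, the connective sets, the `BOOST` factor, and the function names are assumptions, not the scoring scheme used in this work.

```python
# Toy polarity lexicon (hypothetical): +1 positive, -1 negative.
LEXICON = {"excited": 1, "acha": 1, "disliking": -1, "kharab": -1}

# Connectives that shift weight to the segment before or after them.
FAVOR_PREVIOUS = {"despite"}
FAVOR_FOLLOWING = {"lekin", "but"}

BOOST = 2.0  # assumed weight for the favored segment


def bag_of_words_score(tokens):
    """Plain lexicon sum that ignores discourse structure."""
    return sum(LEXICON.get(t, 0) for t in tokens)


def discourse_score(text):
    """Split at the first connective and boost the segment it favors."""
    tokens = text.lower().split()
    for i, tok in enumerate(tokens):
        if tok in FAVOR_PREVIOUS or tok in FAVOR_FOLLOWING:
            before = bag_of_words_score(tokens[:i])
            after = bag_of_words_score(tokens[i + 1:])
            if tok in FAVOR_PREVIOUS:
                return BOOST * before + after
            return before + BOOST * after
    # No connective found: fall back to the plain lexicon sum.
    return float(bag_of_words_score(tokens))


if __name__ == "__main__":
    for s in ["phone acha hai lekin battery kharab hai",
              "excited about tintin despite disliking comics"]:
        print(s, bag_of_words_score(s.split()), discourse_score(s))
```

In both toy sentences the bag-of-words sum is zero, because one positive and one negative word cancel out, while the discourse-aware score takes the sign of the segment that the connective favors.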
Speech Components   Short Form
Sentence            S
Noun Phrase         NP
Verb Phrase         VP
Adjective Phrase    ADJP
Determiner          DT
Noun                NN
Verb                VB
Adjective           ADJ
Pronoun             PR
Proper Noun         PRN
Preposition         PP
Conjunction         CC
Discourse           DS
Interjection        IN

Rules
S → NP VP
S → S NP VP
S → S DS S
S → NP ADJP
S → Ø
NP → DT NN
NP → NN CC NN
NP → DT ADJ NN
NP → PR NP
NP → PR
NP → PRN NP
NP → PRN
NP → Ø
VP → VB S
VP → VB ADJP
VP → VB
VP → VB PP
VP → Ø
ADJP → ADJ PP
PP → IN NP
IN → Ø

Example parse (root at top):
S(pasand)
NP(phone)
NP(phone) ADJP(pasand)
PR(Mujhay) PP(ye) NN(phone) ADJ(pasand) PP(hai)
Mujhay ye phone pasand hai


What is opinion mining?
Current Work

■ Opinion mining that accounts for discourse elements through a neural
network model is currently in progress.
