
CORPUS METHODS FOR

SEMANTIC RESEARCH (2)


SENTIMENT ANALYSIS
Lecture 13
Making statistical claims
◦ A corpus can provide reliable quantitative data
◦ Raw frequency (the actual count of a linguistic element in a corpus) and normalized frequency (the frequency recalculated to a common base, often per 1,000 words, so that corpora of different sizes can be compared)
◦ Descriptive and inferential statistics
◦ Central tendency
◦ The mean, the mode, the median (for 4, 5, 6, 6, 7, 7, 7, 9, 9, 10 – the mean is 7 (70/10); the mode is 7; the median is 7 ((7+7)/2))
◦ Ways to measure the dispersion of the dataset:
◦ the range (10 − 4 = 6), the variance (based on the deviations of each value from the mean, e.g. the deviation of 4 is 7 − 4 = 3) and the standard deviation (≈ 1.89 for this sample)
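◦ A minimal sketch (in Python) checking the figures above with the standard statistics module; the dataset is the one from this slide:

```python
import statistics

data = [4, 5, 6, 6, 7, 7, 7, 9, 9, 10]

print(statistics.mean(data))      # 7     (70 / 10)
print(statistics.mode(data))      # 7     (the most frequent value)
print(statistics.median(data))    # 7.0   ((7 + 7) / 2)
print(max(data) - min(data))      # 6     (range: 10 - 4)
print(statistics.variance(data))  # ~3.56 (sample variance)
print(statistics.stdev(data))     # ~1.89 (sample standard deviation)
```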

◦ Tests of statistical significance


◦ Chi-square test (compares the observed values with the expected values)
◦ Log-likelihood test (LL)

◦ Tests for significant collocations


◦ MI (mutual information) – the higher the MI score, the stronger the link between two items (an MI score of 3 or higher is taken to mean that the two items are collocates)
◦ The t-test (a t-score of 2 or higher is statistically significant)
◦ The z-score (a higher z-score indicates a greater degree of collocability of an item with the node word)
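◦ A minimal sketch of how the MI score and the t-score can be computed from corpus counts; all frequencies below are invented for illustration, and the formulas follow the standard definitions rather than any particular tool's implementation:

```python
import math

N = 1_000_000    # corpus size in tokens (hypothetical)
f_node = 5_000   # frequency of the node word, e.g. 'standard' (hypothetical)
f_coll = 2_000   # frequency of the candidate collocate, e.g. 'living' (hypothetical)
f_pair = 150     # observed co-occurrences within the chosen span (hypothetical)

expected = f_node * f_coll / N               # co-occurrences expected by chance

mi = math.log2(f_pair / expected)            # MI of 3 or higher: treated as a collocation
t = (f_pair - expected) / math.sqrt(f_pair)  # t of 2 or higher: statistically significant

print(f"E = {expected:.2f}, MI = {mi:.2f}, t = {t:.2f}")
```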

Keywords/ Cultural keywords
◦ Keyword analysis is essentially based on the notion that recurrent ways of talking about concepts and ideas reveal something about how
we think about the social world.
◦ For R. Williams, ‘key’ in ‘keyword’ indicates that a particular concept is salient across a culture. So, for example, ‘democracy’ and
‘revolution’ are keywords for Williams.
◦ Williams (1983) is a socio-historical, diachronic dictionary of keywords where their semantic development over centuries is traced and interrelationships
explored. For this work, Williams used the complete Oxford English Dictionary (OED), which runs to several volumes.

◦ A. Wierzbicka claims that every language has "key concepts," expressed in "key words," which reflect the core values of a given culture (Wierzbicka 1997, Understanding Cultures through Their Key Words: English, Russian, Polish, German, and Japanese).
◦ Stubbs’ (1996, 2001) investigation of cultural keywords is mainly synchronic and is informed by corpus-based methods.
◦ ‘Standard’ is one of the cultural keywords which Williams (1983) investigates using the OED. Stubbs (2001) uses a 200-million-word corpus of contemporary English, consisting mainly of newspaper and magazine texts, to highlight the most common collocates of the word ‘standard’: ‘living’, ‘high’, etc. Collocates are words which commonly accompany other words over short word spans: that is, they form a collocation such as ‘living standards’ (a toy counting sketch follows this list).
◦ A corpus provides objective quantitative support for the extent to which cultural keywords are being used, and the lexical company they keep. It thus
provides a measure of what meanings are being culturally reproduced.
◦ Keywords often inter-collocate, and ideas gain stability when they fit into a frame.
◦ Many everyday ideas about language fit very firmly into a frame which contains terms such as:
◦ standard, standards, accurate, correct, grammar, proper, precise
◦ For linguists, the same terms mean something quite different because they fit into an entirely different lexical field, which contains terms such as:
◦ dialect, language planning, high prestige language, social variation
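◦ A minimal sketch of collocate counting over a short word span (see the collocation bullet above); the toy 'corpus', the node word 'standard' and the span of four words on either side are assumptions for illustration, not Stubbs' actual data or procedure:

```python
from collections import Counter

corpus = ("rising living standards and a high standard of living remain "
          "key political goals while the standard of education is debated").split()

node, span = "standard", 4          # node word and collocation span (4 words either side)
collocates = Counter()

for i, token in enumerate(corpus):
    if token.startswith(node):      # crude match for 'standard' / 'standards'
        window = corpus[max(0, i - span): i] + corpus[i + 1: i + 1 + span]
        collocates.update(window)

print(collocates.most_common(5))    # e.g. 'living', 'of', 'high', ...
```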
Corpus-comparative statistical keywords
◦ This type of keyword is defined as a word which is statistically more salient in a text or set of texts than in a large reference corpus. A keyword is ‘found to be outstanding in its frequency in the text’ (Scott 1999) by comparison with another corpus.
◦ ‘Keyness’ here is established through statistical measures such as the log-likelihood value: relatively high log-likelihood values indicate keywords (a sketch of the calculation follows at the end of this slide).
◦ A keyword analysis not only indicates the ‘aboutness’ (Scott 1999) of a particular genre; it can also reveal the salient features which are functionally related to that genre.
◦ Keywords are those words whose frequency is unusually high (positive keywords) or low (negative keywords) in comparison with a reference corpus. The reference corpus is typically much larger than the corpora that are contrasted with it.
◦ Programs like WordSmith and AntConc compare two pre-existing word lists, which must have been created with the WordList tool. One of these is assumed to be a large word list which acts as a reference file. The other is the word list based on the text you want to study.
◦ The aim is to find out which words characterise the text you are most interested in, which is automatically assumed to be the smaller of the two files chosen. The larger provides background data for the reference comparison.
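◦ A minimal sketch of the log-likelihood calculation behind such keyword lists; the counts are invented, and the function follows the standard two-corpus log-likelihood formula rather than WordSmith's or AntConc's internal code:

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood of a word's frequency in a study corpus vs a reference corpus."""
    total = size_study + size_ref
    expected_study = size_study * (freq_study + freq_ref) / total
    expected_ref = size_ref * (freq_study + freq_ref) / total
    ll = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:
            ll += observed * math.log(observed / expected)
    return 2 * ll

# Hypothetical counts: a word in Romeo and Juliet vs all the Shakespeare plays
print(round(log_likelihood(150, 26_000, 2_000, 900_000), 2))
```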

KeyWords analysis in WordSmith
◦ It compares the text file (or corpus) you are chiefly interested in with a reference corpus, usually based on a much larger amount of text, or else with a comparable one. In the screenshot below we are interested in the key words of Romeo and Juliet and use all the Shakespeare plays as the reference corpus.
◦ Choose Word Lists. In the dialogue box you choose two files: the text file in the upper box and the reference corpus file in the lower box.
◦ In the keyword list, based on the play Romeo and Juliet in comparison with all the Shakespeare plays, we see names of the main
characters, some pronouns like thou, plus theme words like love and night.

A key word plot where the text is Romeo and Juliet, compared with all of
the Shakespeare plays

The words in the plot shown in their contexts

Key-word: Links
◦ Links are "co-occurrences of key-words within a collocational span".

Corpus linguistics suggests two types of keyness analysis
(Gabrielatos 2018):

◦ explanatory – appropriate for corpus-based discourse studies; deals with concordances showing the context of lexical items;
◦ focused – appropriate for corpus-driven discourse analysis; presupposes comparison of the normalized frequencies of lexical items in two corpora to address particular research issues; attempts to generate the keywords of the texts on the basis of robust statistical measures, without any preconceived hypothesis.

The automatic extraction of keywords with the help of the UAM Corpus Tool
◦ (http://corpustool.com/index.html)
◦ In UAM CT, the keyness of a term is calculated as the relative frequency of the term in the subcorpus of interest divided by the relative frequency of the term in the reference corpus. The relative frequency is the count of the term in the subcorpus divided by the total number of terms in that subcorpus (see the sketch at the end of this slide).
◦ Two types of keyness:
◦ Keyness-D,
◦ Keyness-S (Gabrielatos, 2018)
◦ Keyness can refer to difference (including absence) or similarity; keyness refers to the size of the frequency difference.
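◦ A minimal sketch of the relative-frequency ratio defined above; the term and the counts are invented, and the function follows the definition on this slide rather than UAM CT's actual implementation:

```python
def keyness_ratio(count_study, tokens_study, count_ref, tokens_ref):
    """Relative frequency in the subcorpus of interest divided by
    relative frequency in the reference corpus."""
    rel_study = count_study / tokens_study
    rel_ref = count_ref / tokens_ref
    return rel_study / rel_ref

# Hypothetical counts: a term in a 25,000-token subcorpus vs a 1,000,000-token reference corpus
print(round(keyness_ratio(120, 25_000, 900, 1_000_000), 2))  # > 1: more salient in the subcorpus
```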

Sentiment analysis
◦ SA, also called opinion mining, is the field of study that analyzes people’s opinions, sentiments, evaluations, appraisals, attitudes, and
emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes (Bing Liu 2012);
◦ SA is an NLP and text mining problem which deals with the computational study of opinions, sentiments and emotions expressed in text. SA is a study of the subjectivity (neutral vs emotionally loaded) and polarity (positive vs negative) of a text (Bo Pang and Lillian Lee)
◦ According to E. Hovy, SA is used to detect and retrieve subjective information from the text.
◦ It is the process of algorithmically identifying and categorizing opinions expressed in text, determining the sentiments they convey, classifying their polarity (positive, negative or neutral) and strength/intensity to determine the user’s attitude toward the subject of the document (text). This process relies on a sentiment vocabulary/lexicon, i.e. a large collection of words, each marked with a positive or negative orientation (a minimal sketch of this approach follows below).
◦ Since the early 2000s multiple techniques for SA have been proposed, including lexicon-based approaches (e.g., General Inquirer,
WordNet Affect, QWordNet or SentiWordNet) and supervised machine learning methods (e.g., Naive Bayes, MaxEnt, Support Vector
Machine).
◦ SA can be applied at the document (discourse) level, which presupposes that each document expresses opinions on a single entity. Sentence-level sentiment analysis determines whether a sentence implies positive or negative opinions. Object-oriented sentiment analysis reveals sentiment towards a specific entity mentioned in the text. Aspect-based sentiment analysis focuses on opinions relative to specific properties (or aspects) of an entity.
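◦ A minimal sketch of the lexicon-based approach described above; the tiny lexicon and the scoring rule are invented for illustration, whereas real systems rely on resources such as SentiWordNet:

```python
# Toy sentiment lexicon: word -> polarity (+1 positive, -1 negative); illustrative only
LEXICON = {"great": 1, "love": 1, "excellent": 1, "poor": -1, "terrible": -1, "hate": -1}

def polarity(text):
    """Classify a text as positive, negative or neutral by summing lexicon scores."""
    score = sum(LEXICON.get(word, 0) for word in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("The latest iPhone is great"))  # positive
print(polarity("The camera is terrible"))      # negative
```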

Sentiment analysis

SA challenges
◦ Texts are domain-specific, so a system trained on texts from one particular domain is often not suitable for texts in another domain;
◦ Difficulties of natural language processing due to the very nature of language: language is dynamic and depends on the context;
◦ The battery lasts a long time. (positive sentiment)
◦ The film lasts a long time. (sentiment?)
◦ Negation words or phrases, such as never, not, no, none, nothing, etc., can reverse the polarity of opinion words. Similarly, language patterns such as “stop + V-ing”, “quit + V-ing” and “cease + to-infinitive” can express negation and a negative evaluation, but this depends on the social context of the text (see the sketch after this list).
◦ The latest iPhone is great (positive) – The latest iPhone is not great. (negative)
◦ My iPhone stopped working (negative) – The medicines worked. The tumour stopped growing. (positive)
◦ metaphorical expressions,
◦ sarcasm
◦ lie detection

◦ Correct detection of subjective sentences/ clauses in the text;


◦ Sentiment polarity is language-dependent.
◦ Scarce resources (sentiment-annotated corpora, sentiment lexicons) for many languages, Ukrainian in particular.
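◦ A minimal sketch of how a negation word can reverse polarity in a lexicon-based scorer; the lexicon and the negator list are invented, and the one-token negation window is a simplification of what real systems do:

```python
NEGATORS = {"never", "not", "no", "none", "nothing"}
LEXICON = {"great": 1, "happy": 1, "works": 1, "broken": -1, "terrible": -1}

def score(text):
    """Sum word polarities, flipping the sign when the previous token is a negator."""
    tokens = text.lower().replace(".", "").split()
    total = 0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token, 0)
        if i > 0 and tokens[i - 1] in NEGATORS:
            value = -value              # negation reverses the polarity
        total += value
    return total

print(score("The latest iPhone is great"))      #  1 -> positive
print(score("The latest iPhone is not great"))  # -1 -> negative
```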
1. LIWC: Linguistic Inquiry and Word Count
◦ designed by James W. Pennebaker, Roger J. Booth, and Martha E. Francis;
◦ the LIWC2015 master dictionary is composed of almost 6,400 words, word stems, and selected emoticons;
◦ analyzes over 70 dimensions of language (see the counting sketch at the end of this slide):
◦ 4 general descriptor categories (total word count, words per sentence, percentage of words captured by the dictionary, and percent of words longer than six
letters)
◦ 22 standard linguistic dimensions (e.g., percentage of words in the text that are pronouns, articles, auxiliary verbs, etc.)
◦ 32 word categories tapping psychological constructs (e.g., affect, cognition, biological processes)
◦ 7 personal concern categories (e.g., work, home, leisure activities)
◦ 3 paralinguistic dimensions (assents, fillers, nonfluencies)
◦ 12 punctuation categories (periods, commas, etc.)

◦ http://www.liwc.net/index.php
◦ the text analysis module was created in the Java programming language;
◦ analyzes .txt and .doc(x) files;
◦ output is given in .txt form but can be easily transferred to an Excel file.
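◦ A minimal sketch of the kind of dictionary-based counting LIWC performs, i.e. the percentage of words in a text captured by each category; the categories and word stems below are invented and are not the LIWC2015 dictionary:

```python
# Toy category dictionary: category -> word stems (illustrative, not the LIWC2015 dictionary)
CATEGORIES = {
    "posemo": ("happ", "love", "nice"),
    "negemo": ("sad", "hate", "hurt"),
    "work":   ("job", "work", "project"),
}

def category_percentages(text):
    """Return the percentage of words captured by each category."""
    words = text.lower().split()
    return {
        cat: 100 * sum(any(w.startswith(stem) for stem in stems) for w in words) / len(words)
        for cat, stems in CATEGORIES.items()
    }

print(category_percentages("I love my job but the project makes me sad"))
```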

2. SentiStrength
◦ http://sentistrength.wlv.ac.uk/
◦ SentiStrength estimates the strength of positive and negative sentiment in short texts (or in text segments), even for informal language. It has human-level accuracy for short social web texts in English, except political texts;
◦ reports two sentiment strengths: -1 (not negative) to -5 (extremely negative), 1 (not positive) to 5 (extremely positive);
◦ can also report binary (positive/negative), trinary (positive/negative/neutral) and single scale (-4 to +4) results;
◦ output is a copy of the text file with positive and negative classifications added at the end of each line, preceded by tabs;
◦ the algorithm is based on the information contained in various files, including:
◦ The EmotionLookUpTable is a list of emotion-bearing words, each entry consisting of the word, then a tab, then an integer from 1 to 5 or -1 to -5.
◦ NegatingWordList.txt reverses the polarity of subsequent words, e.g., not happy is negative.
◦ BoosterWordList.txt increases sentiment intensity, e.g., very happy is more positive than happy.
◦ IdiomLookupTable.txt overrides the sentiment strength of the individual words in a phrase.
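◦ A minimal sketch of the dual-scale scoring described above, in which the strongest positive and negative words set the two scores and booster words raise intensity; the word lists and weights are invented, and this is not the actual SentiStrength algorithm:

```python
# Toy emotion lookup table and booster list (illustrative only)
EMOTION = {"happy": 3, "love": 4, "sad": -3, "hate": -4, "awful": -4}
BOOSTERS = {"very": 1, "extremely": 2}

def senti_scores(text):
    """Return (positive, negative) strengths in the style of SentiStrength output."""
    tokens = text.lower().split()
    pos, neg = 1, -1                 # 1 and -1 mean 'no sentiment'
    for i, token in enumerate(tokens):
        value = EMOTION.get(token, 0)
        if value and i > 0 and tokens[i - 1] in BOOSTERS:
            boost = BOOSTERS[tokens[i - 1]]
            value += boost if value > 0 else -boost
        if value > 0:
            pos = max(pos, min(value, 5))
        elif value < 0:
            neg = min(neg, max(value, -5))
    return pos, neg

print(senti_scores("I love my new phone but the battery is very sad"))  # (4, -4)
```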

EXAMPLES OF SENTIMENT DETECTION USING SENTISTRENGTH

3. UAM Corpus tool
◦ http://corpustool.com/index.html
◦ Appraisal framework, designed to explore, describe and
explain evaluative uses of language, including the ways
language is used to adopt stances, to construct
textual personas and to manage interpersonal
positioning and relationships (Martin & White 2005).
◦ Attitude system:
◦ Affect expresses a person's internal emotional state.
◦ Judgment evaluates a person's behavior in a social context.
◦ Appreciation evaluates norms about how products,
performances, and naturally occurring phenomena are
valued, when this evaluation is expressed as being a
property of the object.

EXAMPLES OF MANUAL ANNOTATION USING UAM CT
