
Analysis of Textual Data

1
Outline

1. Introduction to textual analysis


2. Tasks
3. Preprocessing and feature selection
4. Linguistic resources
5. Representation models and weighting schemes
6. Learning and evaluating models
7. Avoiding overfitting
8. Practice with R: Language Variety Identification

2
Outline

1. Introduction to textual analysis


2. Tasks
3. Preprocessing and feature selection
4. Linguistic resources
5. Representation models and weighting schemes
6. Learning and evaluating models
7. Avoiding overfitting
8. Practice with R: Language Variety Identification

3
Introduction to textual analysis

Textual analysis aims at extracting knowledge from collections of unstructured text documents.

Unstructured text documents include:

● Natural language texts

● Source code

● Any other kind of textual information

4
Outline

1. Introduction to textual analysis


2. Tasks
3. Preprocessing and feature selection
4. Linguistic resources
5. Representation models and weighting schemes
6. Learning and evaluating models
7. Avoiding overfitting
8. Practice with R: Language Variety Identification

5
Tasks

● Term extraction

● Text classification / categorisation

● Clustering

● Association (of concepts)

6
Tasks
- Term extraction

7
Tasks
- Classification &
categorisation

Is the following sentence positive, negative, neutral or none?

Risk premium is 120


It depends on the subjectivity of the receiver, e.g.
- President of the government
- Leader of the opposition
- Director of the European bank
- Investors
- People with mortgages

8
Tasks
- Classification &
categorisation

9
Tasks
- Clustering

10
Tasks
- Association (of concepts)

11
Outline

1. Introduction to textual analysis


2. Tasks
3. Preprocessing and feature selection
4. Linguistic resources
5. Representation models and weighting schemes
6. Learning and evaluating models
7. Avoiding overfitting
8. Practice with R: Language Variety Identification

12
Preprocessing and feature selection
- Stemming

consign, consigned, consigning, consignment  →  STEMMING  →  consign
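A minimal sketch in R, assuming the third-party SnowballC package (a Porter stemmer binding) is available; the example words are the ones above:

# install.packages("SnowballC")   # assumed to be installed
library(SnowballC)
words <- c("consign", "consigned", "consigning", "consignment")
wordStem(words, language = "english")
# should yield: "consign" "consign" "consign" "consign"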

13
Preprocessing and feature selection
- Stopwords

● List of words that carry little meaning on their own

● Also named function words

  ○ They have a syntactical function

14
Preprocessing and feature selection
- Information Gain

IG measures the expected reduction in entropy when partitioning the training examples according to a given attribute:
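In a standard formulation (reconstructed, since the slide's formula is an image), where H denotes entropy, p(y) the proportion of examples of class y, and T_v the subset of examples for which attribute a takes value v:

IG(T, a) = H(T) - \sum_{v \in val(a)} \frac{|T_v|}{|T|} H(T_v), \qquad H(T) = -\sum_{y} p(y) \log_2 p(y)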

where:
● T: training examples
● a: attribute
● val(a): values of attribute a
● x: example
● y: class

15
Preprocessing and feature selection
- Principal Components
Analysis
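The slide's figure is not reproduced; as a minimal illustrative sketch, PCA can be applied to a document-term matrix in base R with prcomp (toy data, names are illustrative):

X <- matrix(rpois(200, lambda = 2), nrow = 20)   # toy term-frequency matrix: 20 docs x 10 terms
pca <- prcomp(X, center = TRUE, scale. = FALSE)
summary(pca)                   # proportion of variance explained per component
X_reduced <- pca$x[, 1:2]      # keep the first two principal components as features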

16
Outline

1. Introduction to textual analysis


2. Tasks
3. Preprocessing and feature selection
4. Linguistic resources
5. Representation models and weighting schemes
6. Learning and evaluating models
7. Avoiding overfitting
8. Practice with R: Language Variety Identification

17
Linguistic resources
- Lexicons

WORD EMOTION PFA

abundance joy 0.830

harass anger 0.364

terrorise fear 0.966


18
Linguistic resources
- Gazetteers

19
Linguistic resources
- Ontologies

20
Linguistic resources
- Ontologies

21
Outline

1. Introduction to textual analysis


2. Tasks
3. Preprocessing and feature selection
4. Linguistic resources
5. Representation models and weighting schemes
6. Learning and evaluating models
7. Avoiding overfitting
8. Practice with R: Language Variety Identification

22
Representation models & weighting schemes
- Vector Space Model

23
Representation models & weighting schemes
- Vector Space Model

24
Representation models & weighting schemes
- Bag of Words

● Simplest way to represent text
● A text is represented as the bag (multiset) of its words
● Does not consider the order of the words
● Create a vector with the size of the vocabulary seen in training
● Each sentence is represented by counting the number of times each word appears (frequency)
● Or simply by indicating whether each word appears or not (binary / one-hot representation)
● Frequencies can be normalised
25
Representation models & weighting schemes
- Bag of Words

S1 = “The cat and the dog are friends”

S2 = “The dog hate the cat because they are not friends”

Vocabulary = “the, cat, and, dog, are, friends, hate, because, they, not”

BoW(S1) = “2, 1, 1, 1, 1, 1, 0, 0, 0, 0”

BoW(S2) = “2, 1, 0, 1, 1, 1, 1, 1, 1, 1”
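A minimal base R sketch reproducing the counts above (variable and function names are illustrative):

s1 <- "The cat and the dog are friends"
s2 <- "The dog hate the cat because they are not friends"
tokens1 <- strsplit(tolower(s1), " ")[[1]]
tokens2 <- strsplit(tolower(s2), " ")[[1]]
vocab <- unique(c(tokens1, tokens2))        # vocabulary seen in "training"
bow <- function(tokens, vocab) sapply(vocab, function(w) sum(tokens == w))
bow(tokens1, vocab)                         # 2 1 1 1 1 1 0 0 0 0
bow(tokens2, vocab)                         # 2 1 0 1 1 1 1 1 1 1
# binary weighting:    as.integer(bow(tokens1, vocab) > 0)
# frequency weighting: bow(tokens1, vocab) / length(tokens1)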

26
Representation models & weighting schemes
- Binary weighting

S1 = “The cat and the dog are friends”

S2 = “The dog hate the cat because they are not friends”

Vocabulary = “the, cat, and, dog, are, friends, hate, because, they, not”

BoW(S1) = “1, 1, 1, 1, 1, 1, 0, 0, 0, 0”

BoW(S2) = “1, 1, 0, 1, 1, 1, 1, 1, 1, 1”

27
Representation models & weighting schemes
- Frequency weighting

S1 = “The cat and the dog are friends”

S2 = “The dog hate the cat because they are not friends”

Vocabulary = “the, cat, and, dog, are, friends, hate, because, they, not”

BoW(S1) = “2/7, 1/7, 1/7, 1/7, 1/7, 1/7, 0, 0, 0, 0”

BoW(S2) = “2/10, 1/10, 0, 1/10, 1/10, 1/10, 1/10, 1/10, 1/10, 1/10”

28
Representation models & weighting schemes
- TF-IDF

● Common technique from Information Retrieval (IR)

● This word representation tries to model how important a word is to a document in a corpus

● Two factors are combined (weighted): term frequency and inverse document frequency

29
Representation models & weighting schemes
- TF-IDF

● Term Frequency

  ○ Counts the number of times that term t occurs in a document d

● Inverse Document Frequency

  ○ To avoid giving weight to common terms with low relevance, the idf measures the relative importance of the term across all the documents, based on the number of documents where the term t appears
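A common formulation (one of several variants; the slide's own formula image is not reproduced), with N the number of documents in the corpus and df(t) the number of documents where term t appears:

tf\textrm{-}idf(t, d) = tf(t, d) \cdot idf(t), \qquad idf(t) = \log \frac{N}{df(t)}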
30
Representation models & weighting schemes
- Statistical Language
Model

● A probability distribution over sequences of words

● Able to handle some word ordering

● Several assumptions are taken into account:

○ The probability of a word w depends on the history (previous words) seen so far

○ The Markov assumption: the probability of a word only depends on a fixed number of previous words (0 for unigrams, 1 for bigrams, ...)
31
Representation models & weighting schemes
- Statistical Language
Model

● Given a sequence of length n, the probability that the random variables W_1, ..., W_n take the values of the sequence w_1, ..., w_n can be described using the chain rule:
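Reconstructed in standard notation (the slide's formula is an image):

P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})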

32
Representation models & weighting schemes
- Statistical Language
Model

● Applying the Markov assumption (reconstructed formula after this list):

● Statistical language models can deal with out-of-vocabulary


words applying different smoothing techniques such as:

○ Backoff
○ Linear interpolation
○ Good-Turing
○ ...
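Reconstructed in standard notation, for an n-gram model whose history is the previous n-1 words:

P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})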
33
Representation models & weighting schemes
- n-grams

● The BoW approach can be generalised to use other basic units

● An n-gram is a contiguous sequence of n elements: words or characters

● Most common n-grams:

○ 1-gram (unigrams)

○ 2-grams (bigrams)

○ 3-grams (trigrams)
34
Representation models & weighting schemes
- n-grams

S1 = “The cat and the dog are friends”

S2 = “The dog hate the cat because they are not friends”

2-grams = “the cat, cat and, and the, the dog, dog are, are friends, dog
hate, hate the, cat because, because they, they are, are not, not friends”

2-grams(S1) = “1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0”

2-grams(S2) = “1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1”
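A minimal base R sketch for extracting word bigrams (character n-grams would use substring over the raw text instead; names are illustrative):

word_ngrams <- function(text, n = 2) {
  tokens <- strsplit(tolower(text), " ")[[1]]
  if (length(tokens) < n) return(character(0))
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}
word_ngrams("The cat and the dog are friends")
# "the cat" "cat and" "and the" "the dog" "dog are" "are friends"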

35
Representation models & weighting schemes
- EmoGraph

The objective of EmoGraph is to capture the use of emotions and their role in the textual structure. Rangel et al., IP&M 2016

36
Representation models & weighting schemes
- EmoGraph

Example sentence (Spanish): He estado tomando cursos en línea sobre temas valiosos que disfruto estudiando y que podrían ayudarme a hablar en público.
(English: "I have been taking online courses on valuable topics that I enjoy studying and that could help me speak in public.")

43
Representation models & weighting schemes
- EmoGraph

Given a graph G = {N, E} where

○ N is the set of nodes

○ E is the set of edges

we obtain a set of:

○ structure-based features, from global measures of the graph

○ node-based features, from node-specific measures

73
Representation models & weighting schemes
- EmoGraph

74
Representation models & weighting schemes
- EmoGraph

75
Representation models & weighting schemes
- EmoGraph

Graph-based (8 features)
ENRatio           ratio of nodes to edges
Degree            average degree
WeightedDegree    weighted average degree
Diameter          graph diameter
Density           graph density
Modularity        modularity degree
Clustering        clustering coefficient
PathLength        average path length

Node-based (944 features, 472 nodes)
BTW-xx            betweenness for each node xx
EIGEN-xx          eigenvector centrality for each node xx

76
Representation models & weighting schemes
- EmoGraph

W: number of words in the document


S: number of POS tags
N: number of nodes
E: number of edges
77
Representation models & weighting schemes
- LDR

LDR: Low Dimensionality Representation.

The aim of LDR is to drastically reduce the dimensionality so that textual analysis can be applied in big-data environments.

LDR obtains statistical measures from the word usage distribution.
Rangel et al. CICLing 2016
78
Representation models & weighting schemes
- LDR

Step 1. Term frequency - inverse document frequency (tf-idf) matrix:

- Each column is a vocabulary term t
- Each row is a document d
- w_ij is the tf-idf weight of the term j in the document i
- Each row is also annotated with the class c assigned to the document i

Step 2. Class-dependent term weighting:

Step 3. Class-dependent document representation:

79
Representation models & weighting schemes
- LDR

Meaning of the measures

80
Representation models & weighting schemes
- LDR

Complexity of obtaining the features:
- l: number of varieties
- n: number of terms of the document
- m: number of terms in the document that coincide with some term in the vocabulary
- m ≤ n and l << n

Number of features:

Representation      # Features
LDR                 30
Skip-gram           300
SenVec              300
BOW                 10,000
Char 4-grams        10,000
tf-idf 2-grams      10,000


81
Representation models & weighting schemes
- Word embeddings

● Word2vec is a family of algorithms for learning word embeddings (Mikolov et al. 2013)

● Word2vec algorithms are able to learn word embeddings from raw, unannotated text (unsupervised).

82
Representation models & weighting schemes
- Word embeddings

Two algorithms are implemented:


● Continuous Bag-of-Words model (CBOW):
○ Predicts target words (e.g. ‘mat’) from source context words (‘the cat
sits on the’)
○ CBOW smoothes over a lot of the distributional information
○ Useful for smaller datasets
● Skip-Gram model
○ Predicts source context words (‘the cat sits on the’) from target words
(e.g. ‘mat’)
○ Treats each context-target pair as a new observation
○ Useful for larger datasets
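A minimal sketch assuming the third-party word2vec R package (an interface to the original C implementation); the function and argument names below belong to that package, not to the slides:

# install.packages("word2vec")   # assumed to be installed
library(word2vec)
corpus <- c("the cat sits on the mat", "the dog sits on the rug")   # toy corpus
model <- word2vec(x = corpus, type = "cbow",        # or "skip-gram"
                  dim = 50, window = 5, iter = 20, min_count = 1)
embeddings <- as.matrix(model)                      # one 50-dimensional vector per word
predict(model, "cat", type = "nearest", top_n = 3)  # nearest neighbours of "cat"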
83
Representation models & weighting schemes
- Word embeddings

These algorithms are trained using:

● Negative sampling

● Hierarchical softmax

Maximising the average of the log probability, or using the negative sampling estimator (both reconstructed below):
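Reconstructed from Mikolov et al. (2013), cited in the references: the skip-gram objective over a corpus of T words with context size c, and the negative sampling estimator that replaces log p(w_O | w_I) using k negative samples drawn from a noise distribution P_n(w):

\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)

\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]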

84
Outline

1. Introduction to textual analysis


2. Tasks
3. Preprocessing and feature selection
4. Linguistic resources
5. Representation models and weighting schemes
6. Learning and evaluating models
7. Avoiding overfitting
8. Practice with R: Language Variety Identification

85
Learning and evaluating models

86
Learning and evaluating models
- Machine learning
algorithms

Parametric statistical modelling
● Regression (linear, PC, PLS, logistic, ...)
● Discriminant analysis

Non-parametric statistical modelling
● Non-parametric regression
● Non-parametric discrimination

Association rules
● Association rules (simple, multilevel, sequential)
● Dependency rules

Bayesian methods
● Bayesian networks
● Naïve Bayes
● Search algorithms (K2, B, HC)

Decision trees
● Decision trees (CART, ID3, Random Forest)
● Rule systems

Relational & structural methods
● (Inductive) logic programming
● Others (distance, trees, graphs)
87
Learning and evaluating models
- Machine learning
algorithms

Artificial neural networks
● Perceptron
● Multilayer perceptron
● Convolutional neural networks

Support vector machines
● Support vector machines

Fuzzy logic
● Fuzzy logic
● Evolutionary computing
● Evolutionary fuzzy systems

Neighbourhood-based systems
● Distance-based methods
● K-nearest neighbours
● k-means
● Kohonen self-organising maps
● Learning vector quantisation (LVQ)

88
Learning and evaluating models
- Evaluating classification
tasks

CONFUSION MATRIX              Prediction: Negative      Prediction: Positive
Real: Negative                a (true negative)         b (false positive)
Real: Positive                c (false negative)        d (true positive)

89
Learning and evaluating models
- Evaluating classification
tasks

● Accuracy
● Precision
● Recall
● Sample error
● F-Score
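The slide's formula images are not reproduced; a minimal base R sketch using the standard definitions and the a/b/c/d cells of the confusion matrix above (names are illustrative):

classification_metrics <- function(a, b, c, d) {
  # a = true negatives, b = false positives, c = false negatives, d = true positives
  accuracy  <- (a + d) / (a + b + c + d)
  precision <- d / (b + d)
  recall    <- d / (c + d)
  f_score   <- 2 * precision * recall / (precision + recall)
  c(accuracy = accuracy, precision = precision, recall = recall,
    sample_error = 1 - accuracy, f_score = f_score)
}
classification_metrics(a = 50, b = 10, c = 5, d = 35)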

90
Learning and evaluating models
- Evaluating clustering
tasks

Based on the likelihood concept:

91
Learning and evaluating models
- Evaluating association
rules tasks

Support:
Confidence:

Lift:
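Reconstructed standard definitions (the slide's formulas are images), for a rule X ⇒ Y:

supp(X \Rightarrow Y) = P(X \cup Y), \qquad conf(X \Rightarrow Y) = \frac{supp(X \cup Y)}{supp(X)}, \qquad lift(X \Rightarrow Y) = \frac{conf(X \Rightarrow Y)}{supp(Y)}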

92
Outline

1. Introduction to textual analysis


2. Tasks
3. Preprocessing and feature selection
4. Linguistic resources
5. Representation models and weighting schemes
6. Learning and evaluating models
7. Avoiding overfitting
8. Practice with R: Language Variety Identification

93
Avoiding overfitting
- n / 1-n split

A fraction n of the data is used for training and the remaining 1-n for testing, e.g. 0.7 for training and 0.3 for testing.
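A minimal base R sketch of a 0.7 / 0.3 split (docs is an illustrative data frame of labelled documents):

set.seed(42)                                   # reproducibility
docs <- data.frame(text  = sprintf("document %d", 1:100),
                   class = sample(c("pos", "neg"), 100, replace = TRUE))
train_idx <- sample(seq_len(nrow(docs)), size = round(0.7 * nrow(docs)))
train <- docs[train_idx, ]    # 70% of the documents, used to fit the model
test  <- docs[-train_idx, ]   # remaining 30%, used only for evaluation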
94
Avoiding overfitting
- k-fold cross-validation
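The figure is not reproduced; a minimal base R sketch of k-fold cross-validation indices, reusing the illustrative docs data frame from the previous slide:

k <- 5
folds <- sample(rep(1:k, length.out = nrow(docs)))   # random fold label per document
for (i in 1:k) {
  train <- docs[folds != i, ]   # k-1 folds for training
  test  <- docs[folds == i, ]   # held-out fold for evaluation
  # ... fit the model on `train`, evaluate on `test`, then average the k scores
}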

95
Avoiding overfitting
- Special cases

There are two common mistakes that overfit the models:

● Building the representation from the whole dataset instead of only from the training set (e.g. using the entire vocabulary in BoW / n-gram models)

● When the classification/categorisation is at author level instead of at text level, and there are texts from the same author both in training and test (e.g. language variety identification)

96
Outline

1. Introduction to textual analysis


2. Tasks
3. Preprocessing and feature selection
4. Linguistic resources
5. Representation models and weighting schemes
6. Learning and evaluating models
7. Avoiding overfitting
8. Practice with R: Language Variety Identification

97
Language Variety
Identification

GENDER AND LANGUAGE VARIETY IDENTIFICATION

ENGLISH: Australia, Canada, Great Britain, Ireland, New Zealand, United States
SPANISH: Argentina, Chile, Colombia, Mexico, Peru, Spain, Venezuela
PORTUGUESE: Brazil, Portugal
ARABIC: Egypt, Gulf, Levantine, Maghrebi

- 500 authors per gender and language variety
- 100 tweets per author

http://pan.webis.de/clef17/pan17-web/author-profiling.html


98
References

Rangel F., Rosso P. On the Impact of Emotions on Author Profiling. In: Information Processing & Management 52(1):73-92, 2016.

Rangel F., Rosso P., Franco M. A Low Dimensionality Representation for Language Variety Identification. In: Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing'16), Springer-Verlag, LNCS, 2016.

Mikolov T., Sutskever I., Chen K., Corrado G. S., Dean J. Distributed Representations of Words and Phrases and their Compositionality. In: Advances in Neural Information Processing Systems, pp. 3111-3119, 2013.

99
