
Analysis of Textual Data

1
Outline

1. Introduction to textual analysis


2. Tasks
3. Preprocessing and feature selection
4. Linguistic resources
5. Representation models and weighting schemes
6. Learning and evaluating models
7. Avoiding overfitting
8. Practice with R: Language Variety Identification

2
Outline

1. Introduction to textual analysis


2. Tasks
3. Preprocessing and feature selection
4. Linguistic resources
5. Representation models and weighting schemes
6. Learning and evaluating models
7. Avoiding overfitting
8. Practice with R: Language Variety Identification

3
Introduction to textual analysis

Textual analysis aims at extracting knowledge from collections of unstructured text documents.

Unstructured text documents include:

● Natural language texts

● Source code

● Any other kind of textual information

4
Outline

1. Introduction to textual analysis


2. Tasks
3. Preprocessing and feature selection
4. Linguistic resources
5. Representation models and weighting schemes
6. Learning and evaluating models
7. Avoiding overfitting
8. Practice with R: Language Variety Identification

5
Tasks

● Term extraction

● Text classification / categorisation

● Clustering

● Association (of concepts)

6
Tasks
- Term extraction

7
Tasks
- Classification &
categorisation

Is the following sentence positive, negative, neutral or none?

Risk premium is 120


It depends on the subjectivity of the receiver, e.g.
- President of the government
- Leader of the opposition
- Director of the European bank
- Investors
- People with mortgages

8
Tasks
- Classification &
categorisation

9
Tasks
- Clustering

10
Tasks
- Association (of concepts)

11
Outline

1. Introduction to textual analysis


2. Tasks
3. Preprocessing and feature selection
4. Linguistic resources
5. Representation models and weighting schemes
6. Learning and evaluating models
7. Avoiding overfitting
8. Practice with R: Language Variety Identification

12
Preprocessing and feature selection
- Stemming

consign, consigned, consigning, consignment  →  STEMMING  →  consign
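A minimal sketch in R, assuming the third-party SnowballC package (a Porter stemmer binding) is available; the example words are the ones above:

# install.packages("SnowballC")   # assumed to be installed
library(SnowballC)
words <- c("consign", "consigned", "consigning", "consignment")
wordStem(words, language = "english")
# should yield: "consign" "consign" "consign" "consign"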

13
Preprocessing and feature selection
- Stopwords

● List of words that carry little meaning on their own

● Also named function words

  ○ They have a syntactical function

14
Preprocessing and feature selection
- Information Gain

IG measures the expected reduction in entropy when partitioning the training examples according to a given attribute:
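In a standard formulation (reconstructed, since the slide's formula is an image), where H denotes entropy, p(y) the proportion of examples of class y, and T_v the subset of examples for which attribute a takes value v:

IG(T, a) = H(T) - \sum_{v \in val(a)} \frac{|T_v|}{|T|} H(T_v), \qquad H(T) = -\sum_{y} p(y) \log_2 p(y)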

where:
● T: training examples
● a: attribute
● val(a): values of attribute a
● x: example
● y: class

15
Preprocessing and feature selection
- Principal Components
Analysis
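The slide's figure is not reproduced; as a minimal illustrative sketch, PCA can be applied to a document-term matrix in base R with prcomp (toy data, names are illustrative):

X <- matrix(rpois(200, lambda = 2), nrow = 20)   # toy term-frequency matrix: 20 docs x 10 terms
pca <- prcomp(X, center = TRUE, scale. = FALSE)
summary(pca)                   # proportion of variance explained per component
X_reduced <- pca$x[, 1:2]      # keep the first two principal components as features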

16
Outline

1. Introduction to textual analysis


2. Tasks
3. Preprocessing and feature selection
4. Linguistic resources
5. Representation models and weighting schemes
6. Learning and evaluating models
7. Avoiding overfitting
8. Practice with R: Language Variety Identification

17
Linguistic resources
- Lexicons

WORD EMOTION PFA

abundance joy 0.830

harass anger 0.364

terrorise fear 0.966


18
Linguistic resources
- Gazetteers

19
Linguistic resources
- Ontologies

20
Linguistic resources
- Ontologies

21
Outline

1. Introduction to textual analysis


2. Tasks
3. Preprocessing and feature selection
4. Linguistic resources
5. Representation models and weighting schemes
6. Learning and evaluating models
7. Avoiding overfitting
8. Practice with R: Language Variety Identification

22
Representation models & weighting schemes
- Vector Space Model

23
Representation models & weighting schemes
- Vector Space Model

24
Representation models & weighting schemes
- Bag of Words

● Simplest way to represent text
● A text is represented as the bag (multiset) of its words
● Does not consider the order of the words
● Create a vector with the size of the vocabulary seen in training
● Each sentence is represented by counting the number of times each word appears (frequency)
● Or simply by indicating whether each word appears or not (binary / one-hot representation)
● Frequencies can be normalised
25
Representation models & weighting schemes
- Bag of Words

S1 = “The cat and the dog are friends”

S2 = “The dog hate the cat because they are not friends”

Vocabulary = “the, cat, and, dog, are, friends, hate, because, they, not”

BoW(S1) = “2, 1, 1, 1, 1, 1, 0, 0, 0, 0”

BoW(S2) = “2, 1, 0, 1, 1, 1, 1, 1, 1, 1”
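A minimal base R sketch reproducing the counts above (variable and function names are illustrative):

s1 <- "The cat and the dog are friends"
s2 <- "The dog hate the cat because they are not friends"
tokens1 <- strsplit(tolower(s1), " ")[[1]]
tokens2 <- strsplit(tolower(s2), " ")[[1]]
vocab <- unique(c(tokens1, tokens2))        # vocabulary seen in "training"
bow <- function(tokens, vocab) sapply(vocab, function(w) sum(tokens == w))
bow(tokens1, vocab)                         # 2 1 1 1 1 1 0 0 0 0
bow(tokens2, vocab)                         # 2 1 0 1 1 1 1 1 1 1
# binary weighting:    as.integer(bow(tokens1, vocab) > 0)
# frequency weighting: bow(tokens1, vocab) / length(tokens1)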

26
Representation models & weighting schemes
- Binary weighting

S1 = “The cat and the dog are friends”

S2 = “The dog hate the cat because they are not friends”

Vocabulary = “the, cat, and, dog, are, friends, hate, because, they, not”

BoW(S1) = “1, 1, 1, 1, 1, 1, 0, 0, 0, 0”

BoW(S2) = “1, 1, 0, 1, 1, 1, 1, 1, 1, 1”

27
Representation models & weighting schemes
- Frequency weighting

S1 = “The cat and the dog are friends”

S2 = “The dog hate the cat because they are not friends”

Vocabulary = “the, cat, and, dog, are, friends, hate, because, they, not”

BoW(S1) = “2/7, 1/7, 1/7, 1/7, 1/7, 1/7, 0, 0, 0, 0”

BoW(S2) = “2/10, 1/10, 0, 1/10, 1/10, 1/10, 1/10, 1/10, 1/10, 1/10”

28
Representation models & weighting schemes
- TF-IDF

● Common technique from Information Retrieval (IR)

● This word representation tries to model how important a word is to a document in a corpus

● Two factors are combined (weighted): term frequency and inverse document frequency

29
Representation models & weighting schemes
- TF-IDF

● Term Frequency

  ○ Counts the number of times that term t occurs in a document d

● Inverse Document Frequency

  ○ To avoid giving weight to common terms with low relevance, the idf measures the relative importance of the term across all the documents, based on the number of documents where the term t appears
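A common formulation (one of several variants; the slide's own formula image is not reproduced), with N the number of documents in the corpus and df(t) the number of documents where term t appears:

tf\textrm{-}idf(t, d) = tf(t, d) \cdot idf(t), \qquad idf(t) = \log \frac{N}{df(t)}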
30
Representation models & weighting schemes
- Statistical Language
Model

● A probability distribution over sequences of words

● Able to handle some word ordering

● Several assumptions are taken into account:

○ The probability of a word w depends on the history (previous words) seen so far

○ The Markov assumption: the probability of a word only depends on a fixed number of previous words (0 for unigrams, 1 for bigrams, ...)
31
Representation models & weighting schemes
- Statistical Language
Model

● Given a sequence of length n, the probability that the random variables W_1, ..., W_n take the values of the sequence w_1, ..., w_n can be described using the chain rule:
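Reconstructed in standard notation (the slide's formula is an image):

P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})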

32
Representation models & weighting schemes
- Statistical Language
Model

● Applying the Markov assumption (reconstructed formula after this list):

● Statistical language models can deal with out-of-vocabulary


words applying different smoothing techniques such as:

○ Backoff
○ Linear interpolation
○ Good-Turing
○ ...
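Reconstructed in standard notation, for an n-gram model whose history is the previous n-1 words:

P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})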
33
Representation models & weighting schemes
- n-grams

● The BoW approach can be generalised to use other basic units

● An n-gram is a contiguous sequence of n elements: words or characters

● Most common n-grams:

○ 1-gram (unigrams)

○ 2-grams (bigrams)

○ 3-grams (trigrams)
34
Representation models & weighting schemes
- n-grams

S1 = “The cat and the dog are friends”

S2 = “The dog hate the cat because they are not friends”

2-grams = “the cat, cat and, and the, the dog, dog are, are friends, dog
hate, hate the, cat because, because they, they are, are not, not friends”

2-grams(S1) = “1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0”

2-grams(S2) = “1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1”
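A minimal base R sketch for extracting word bigrams (character n-grams would use substring over the raw text instead; names are illustrative):

word_ngrams <- function(text, n = 2) {
  tokens <- strsplit(tolower(text), " ")[[1]]
  if (length(tokens) < n) return(character(0))
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}
word_ngrams("The cat and the dog are friends")
# "the cat" "cat and" "and the" "the dog" "dog are" "are friends"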

35
Representation models & weighting schemes
- EmoGraph

The objective of EmoGraph is to capture the use of emotions and their role in the textual structure. Rangel et al., IP&M 2016

36
Representation models & weighting schemes
- EmoGraph

Example sentence (Spanish): He estado tomando cursos en línea sobre temas valiosos que disfruto estudiando y que podrían ayudarme a hablar en público.
(English: "I have been taking online courses on valuable topics that I enjoy studying and that could help me speak in public.")

43
Representation models & weighting schemes
- EmoGraph

Given a graph G = {N, E} where

○ N is the set of nodes

○ E is the set of edges

we obtain a set of:

○ structure-based features, from global measures of the graph

○ node-based features, from node-specific measures

73
Representation models & weighting schemes
- EmoGraph

74
Representation models & weighting schemes
- EmoGraph

75
Representation models & weighting schemes
- EmoGraph

Graph-based (8 features)
ENRatio           ratio of nodes to edges
Degree            average degree
WeightedDegree    weighted average degree
Diameter          graph diameter
Density           graph density
Modularity        modularity degree
Clustering        clustering coefficient
PathLength        average path length

Node-based (944 features, 472 nodes)
BTW-xx            betweenness for each node xx
EIGEN-xx          eigenvector centrality for each node xx

76
Representation models & weighting schemes
- EmoGraph

W: number of words in the document


S: number of POS tags
N: number of nodes
E: number of edges
77
Representation models & weighting schemes
- LDR

LDR: Low Dimensionality Representation.

The aim of LDR is to drastically reduce the dimensionality so that textual analysis can be applied in big-data environments.

LDR obtains statistical measures from the word usage distribution.
Rangel et al. CICLing 2016
78
Representation models & weighting schemes
- LDR

Step 1. Term frequency - inverse document frequency (tf-idf) matrix:

- Each column is a vocabulary term t
- Each row is a document d
- w_ij is the tf-idf weight of the term j in the document i
- Each row is also annotated with the class c assigned to the document i

Step 2. Class-dependent term weighting:

Step 3. Class-dependent document representation:

79
Representation models & weighting schemes
- LDR

Meaning of the measures

80
Representation models & weighting schemes
- LDR

Complexity of obtaining the features:
- l: number of varieties
- n: number of terms of the document
- m: number of terms in the document that coincide with some term in the vocabulary
- m ≤ n and l << n

Number of features:

Representation      # Features
LDR                 30
Skip-gram           300
SenVec              300
BOW                 10,000
Char 4-grams        10,000
tf-idf 2-grams      10,000


81
Representation models & weighting schemes
- Word embeddings

● Word2vec is a family of algorithms for learning word embeddings (Mikolov et al. 2013)

● Word2vec algorithms are able to learn word embeddings from raw, unannotated text (unsupervised).

82
Representation models & weighting schemes
- Word embeddings

Two algorithms are implemented:


● Continuous Bag-of-Words model (CBOW):
○ Predicts target words (e.g. ‘mat’) from source context words (‘the cat
sits on the’)
○ CBOW smoothes over a lot of the distributional information
○ Useful for smaller datasets
● Skip-Gram model
○ Predicts source context words (‘the cat sits on the’) from target words
(e.g. ‘mat’)
○ Treats each context-target pair as a new observation
○ Useful for larger datasets
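A minimal sketch assuming the third-party word2vec R package (an interface to the original C implementation); the function and argument names below belong to that package, not to the slides:

# install.packages("word2vec")   # assumed to be installed
library(word2vec)
corpus <- c("the cat sits on the mat", "the dog sits on the rug")   # toy corpus
model <- word2vec(x = corpus, type = "cbow",        # or "skip-gram"
                  dim = 50, window = 5, iter = 20, min_count = 1)
embeddings <- as.matrix(model)                      # one 50-dimensional vector per word
predict(model, "cat", type = "nearest", top_n = 3)  # nearest neighbours of "cat"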
83
Representation models & weighting schemes
- Word embeddings

These algorithms are trained using:

● Negative sampling

● Hierarchical softmax

Maximising the average of the log probability, or using the negative sampling estimator (both reconstructed below):
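Reconstructed from Mikolov et al. (2013), cited in the references: the skip-gram objective over a corpus of T words with context size c, and the negative sampling estimator that replaces log p(w_O | w_I) using k negative samples drawn from a noise distribution P_n(w):

\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)

\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]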

84
Outline

1. Introduction to textual analysis


2. Tasks
3. Preprocessing and feature selection
4. Linguistic resources
5. Representation models and weighting schemes
6. Learning and evaluating models
7. Avoiding overfitting
8. Practice with R: Language Variety Identification

85
Learning and evaluating models

86
Learning and evaluating models
- Machine learning
algorithms

Parametric statistical modelling
● Regression (linear, PC, PLS, logistic, ...)
● Discriminant analysis

Non-parametric statistical modelling
● Non-parametric regression
● Non-parametric discrimination

Association rules
● Association rules (simple, multilevel, sequential)
● Dependency rules

Bayesian methods
● Bayesian networks
● Naïve Bayes
● Search algorithms (K2, B, HC)

Decision trees
● Decision trees (CART, ID3, Random Forest)
● Rule systems

Relational & structural methods
● (Inductive) logic programming
● Others (distance, trees, graphs)
87
Learning and evaluating models
- Machine learning
algorithms

Artificial neural networks
● Perceptron
● Multilayer perceptron
● Convolutional neural networks

Support vector machines
● Support vector machines

Fuzzy logic
● Fuzzy logic
● Evolutionary computing
● Evolutionary fuzzy systems

Neighbourhood-based systems
● Distance-based methods
● K-nearest neighbours
● k-means
● Kohonen self-organising maps
● Learning vector quantisation (LVQ)

88
Learning and evaluating models
- Evaluating classification
tasks

CONFUSION MATRIX              Prediction: Negative      Prediction: Positive
Real: Negative                a (true negative)         b (false positive)
Real: Positive                c (false negative)        d (true positive)

89
Learning and evaluating models
- Evaluating classification
tasks

● Accuracy
● Precision
● Recall
● Sample error
● F-Score
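The slide's formula images are not reproduced; a minimal base R sketch using the standard definitions and the a/b/c/d cells of the confusion matrix above (names are illustrative):

classification_metrics <- function(a, b, c, d) {
  # a = true negatives, b = false positives, c = false negatives, d = true positives
  accuracy  <- (a + d) / (a + b + c + d)
  precision <- d / (b + d)
  recall    <- d / (c + d)
  f_score   <- 2 * precision * recall / (precision + recall)
  c(accuracy = accuracy, precision = precision, recall = recall,
    sample_error = 1 - accuracy, f_score = f_score)
}
classification_metrics(a = 50, b = 10, c = 5, d = 35)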

90
Learning and evaluating models
- Evaluating clustering
tasks

Based on the likelihood concept:

91
Learning and evaluating models
- Evaluating association
rules tasks

Support:
Confidence:

Lift:
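Reconstructed standard definitions (the slide's formulas are images), for a rule X ⇒ Y:

supp(X \Rightarrow Y) = P(X \cup Y), \qquad conf(X \Rightarrow Y) = \frac{supp(X \cup Y)}{supp(X)}, \qquad lift(X \Rightarrow Y) = \frac{conf(X \Rightarrow Y)}{supp(Y)}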

92
Outline

1. Introduction to textual analysis


2. Tasks
3. Preprocessing and feature selection
4. Linguistic resources
5. Representation models and weighting schemes
6. Learning and evaluating models
7. Avoiding overfitting
8. Practice with R: Language Variety Identification

93
Avoiding overfitting
- n / 1-n split

A fraction n of the data is used for training and the remaining 1-n for testing, e.g. 0.7 for training and 0.3 for testing.
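A minimal base R sketch of a 0.7 / 0.3 split (docs is an illustrative data frame of labelled documents):

set.seed(42)                                   # reproducibility
docs <- data.frame(text  = sprintf("document %d", 1:100),
                   class = sample(c("pos", "neg"), 100, replace = TRUE))
train_idx <- sample(seq_len(nrow(docs)), size = round(0.7 * nrow(docs)))
train <- docs[train_idx, ]    # 70% of the documents, used to fit the model
test  <- docs[-train_idx, ]   # remaining 30%, used only for evaluation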
94
Avoiding overfitting
- k-fold cross-validation
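The figure is not reproduced; a minimal base R sketch of k-fold cross-validation indices, reusing the illustrative docs data frame from the previous slide:

k <- 5
folds <- sample(rep(1:k, length.out = nrow(docs)))   # random fold label per document
for (i in 1:k) {
  train <- docs[folds != i, ]   # k-1 folds for training
  test  <- docs[folds == i, ]   # held-out fold for evaluation
  # ... fit the model on `train`, evaluate on `test`, then average the k scores
}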

95
Avoiding overfitting
- Special cases

There are two common mistakes that overfit the models:

● Building the representation from the whole dataset instead of only from the training set (e.g. using the entire vocabulary in BoW / n-gram models)

● When the classification/categorisation is at author level instead of at text level, and there are texts from the same author both in training and test (e.g. language variety identification)

96
Outline

1. Introduction to textual analysis


2. Tasks
3. Preprocessing and feature selection
4. Linguistic resources
5. Representation models and weighting schemes
6. Learning and evaluating models
7. Avoiding overfitting
8. Practice with R: Language Variety Identification

97
Language Variety
Identification

GENDER AND LANGUAGE VARIETY IDENTIFICATION

ENGLISH: Australia, Canada, Great Britain, Ireland, New Zealand, United States
SPANISH: Argentina, Chile, Colombia, Mexico, Peru, Spain, Venezuela
PORTUGUESE: Brazil, Portugal
ARABIC: Egypt, Gulf, Levantine, Maghrebi

- 500 authors per gender and language variety
- 100 tweets per author

http://pan.webis.de/clef17/pan17-web/author-profiling.html


98
References

Rangel F., Rosso P. On the Impact of Emotions on Author Profiling. In: Information Processing & Management 52(1):73-92, 2016.

Rangel F., Rosso P., Franco M. A Low Dimensionality Representation for Language Variety Identification. In: Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing'16), Springer-Verlag, LNCS, 2016.

Mikolov T., Sutskever I., Chen K., Corrado G. S., Dean J. Distributed Representations of Words and Phrases and their Compositionality. In: Advances in Neural Information Processing Systems, pp. 3111-3119, 2013.

99
