Seth

Contents
1 Introduction
2 Regular Expressions, Text Normalization, Edit Distance
2.1 Regular Expressions
2.2 Words
2.3 Corpora
2.4 Text Normalization
2.5 Minimum Edit Distance
2.6 Summary
Bibliographical and Historical Notes
Exercises
3 N-gram Language Models
3.1 N-Grams
3.2 Evaluating Language Models
3.3 Generalization and Zeros
3.4 Smoothing
3.5 Kneser-Ney Smoothing
3.6 The Web and Stupid Backoff
3.7 Advanced: Perplexity’s Relation to Entropy
3.8 Summary
Exercises
4 Naive Bayes and Sentiment Classification
4.1 Naive Bayes Classifiers
4.2 Training the Naive Bayes Classifier
4.3 Worked example
4.4 Optimizing for Sentiment Analysis
4.5 Naive Bayes for other text classification tasks
4.6 Naive Bayes as a Language Model
4.7 Evaluation: Precision, Recall, F-measure
4.8 Test sets and Cross-validation
4.9 Statistical Significance Testing
4.10 Summary
Exercises
5 Logistic Regression
5.1 Classification: the sigmoid
5.2 Learning in Logistic Regression
5.3 The cross-entropy loss function
5.4 Gradient Descent
5.5 Regularization
5.6 Multinomial logistic regression
5.7 Interpreting models
5.8 Advanced: Deriving the Gradient Equation
5.9 Summary
Exercises
6 Vector Semantics and Embeddings
6.1 Lexical Semantics
6.2 Vector Semantics
6.3 Words and Vectors
6.4 Cosine for measuring similarity
6.5 TF-IDF: Weighing terms in the vector
6.6 Applications of the tf-idf vector model
6.7 Optional: Pointwise Mutual Information (PMI)
6.8 Word2vec
6.9 Visualizing Embeddings
6.10 Semantic properties of embeddings
6.11 Bias and Embeddings
6.12 Evaluating Vector Models
6.13 Summary
Exercises
7 Neural Networks and Neural Language Models
7.1 Units
7.2 The XOR problem
7.3 Feed-Forward Neural Networks
7.4 Training Neural Nets
7.5 Neural Language Models
7.6 Summary
8 Part-of-Speech Tagging
8.1 (Mostly) English Word Classes
8.2 The Penn Treebank Part-of-Speech Tagset
8.3 Part-of-Speech Tagging
8.4 HMM Part-of-Speech Tagging
8.5 Maximum Entropy Markov Models
8.6 Bidirectionality
8.7 Part-of-Speech Tagging for Morphologically Rich Languages
8.8 Summary
Exercises
9 Sequence Processing with Recurrent Networks
9.1 Simple Recurrent Neural Networks
9.2 Applications of Recurrent Neural Networks
9.3 Deep Networks: Stacked and Bidirectional RNNs
9.4 Managing Context in RNNs: LSTMs and GRUs
9.5 Words, Subwords and Characters
9.6 Summary
10 Encoder-Decoder Models, Attention and Contextual Embeddings
10.1 Neural Language Models and Generation Revisited
10.2 Encoder-Decoder Networks
10.3 Attention
10.4 Applications of Encoder-Decoder Networks
10.5 Self-Attention and Transformer Networks
