Naive Bayes, Support Vector Machines and Linear Regression are called classical machine
learning algorithms. They perform well when learning from small datasets, but they often require
complex Readers. The task the Readers perform is also called feature engineering: a human
needs to spend time on a deep, exploratory analysis of the dataset.
Features can be the frequency of words or letters, but also syntactic elements such as nouns,
adjectives or verbs. The features that are most significant for the task at hand must be carefully
selected and passed on to the classical machine learning algorithm. This process marks the
difference with Neural Networks. When using a neural network, there is no need for feature
engineering. Humans can pass the data directly to the network and achieve fairly good
performance straightaway. This saves a lot of time, energy and money.
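
To make the contrast concrete, here is a minimal sketch of the classical workflow. The library
(scikit-learn), the toy texts and the labels are our own assumptions for illustration; the text above
names no specific tools. A Reader turns each text into word-frequency features, which are then
passed to a Naive Bayes classifier.

# A minimal sketch of the classical pipeline: hand-chosen features
# (here: word frequencies) feed a Naive Bayes classifier.
# scikit-learn and the toy dataset are assumptions for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy dataset: two tiny classes of text.
texts = [
    "the stock market rises on good news",
    "shares fall as the market panics",
    "the sonnet sings of summer and love",
    "a verse of roses, love and time",
]
labels = ["finance", "finance", "poetry", "poetry"]

# The Reader: turn each text into word-frequency features.
reader = CountVectorizer()
features = reader.fit_transform(texts)

# The classical learner: Naive Bayes trained on those features.
model = MultinomialNB()
model.fit(features, labels)

# Classify a new sentence.
new = reader.transform(["the market sings of love"])
print(model.predict(new))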
The downside of collaborating with Neural Networks is that you need a lot more data to train your
prediction model. Think of 1 GB or more of plain text files. To give you a reference: one A4 page, a
text file of about 5,000 characters, weighs only 5 KB. You would need roughly 200,000 such
pages. More data also means more access to useful datasets and much more processing power.
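
A quick back-of-the-envelope check of that figure (a sketch, using the approximate byte sizes
given above):

# How many 5 KB pages fit in 1 GB of text?
page_bytes = 5_000          # one A4 page of ~5,000 characters, ~5 KB
gigabyte   = 1_000_000_000  # 1 GB in bytes (decimal convention)
print(gigabyte // page_bytes)  # -> 200000 pages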
Contents
1 Character n-gram for authorship recognition
1.1 Reference
2 A history of n-grams
3 God in Google Books
4 Grammatical features taken from Twitter influence the stock market
4.1 Reference
5 Bag of words
Character n-gram for authorship recognition

Reference
Paper: 'On the Robustness of Authorship Attribution Based on Character N-gram Features'
(https://brooklynworks.brooklaw.edu/cgi/viewcontent.cgi?article=1048&context=jlp), Efstathios
Stamatatos, in Journal of Law & Policy, Volume 21, Issue 2, 2013.
News article: https://www.scientificamerican.com/article/how-a-computer-program-helped-show-jk-rowling-write-a-cuckoos-calling/
A history of n-grams
The n-gram algorithm can be traced back to the work of Claude Shannon in information theory. In
his paper 'A Mathematical Theory of Communication'
(http://www.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf), published in
1948, Shannon presented the first instance of an n-gram-based model for natural language. He
posed the question: given a sequence of letters, what is the likelihood of the next letter?
If you read the following excerpt, can you tell whether it was written by Shakespeare or by a
piece of n-gram code?
SEBASTIAN: Do I stand till the break off.
BIRON: Hide thy head.
VENTIDIUS: He purposeth to Athens: whither, with the vow I made to handle you.
FALSTAFF: My good knave.
You may have guessed, considering the topic of this story, that an n-gram algorithm generated
this text. The model was trained on the compiled works of Shakespeare. While more recent
algorithms, such as recurrent neural networks of the char-RNN type, are becoming famous for
their performance, n-grams still power a lot of NLP tasks. They are used in statistical machine
translation, speech recognition, spelling correction, entity detection, information extraction, ...
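
As a sketch of how such a model answers Shannon's question, here is a minimal character
trigram generator in Python. The training text is a stand-in of our own; a real version would be
trained on the complete works of Shakespeare, as above.

# A minimal character trigram model: count which letter follows each
# pair of letters, then sample from those counts to generate text.
# The training text is a placeholder; Algolit trained on Shakespeare.
import random
from collections import defaultdict, Counter

text = "do i stand till the break off. hide thy head."

# Count the successors of every 2-character context.
counts = defaultdict(Counter)
for i in range(len(text) - 2):
    context, nxt = text[i:i+2], text[i+2]
    counts[context][nxt] += 1

# Shannon's question: given a sequence, how likely is each next letter?
context = "th"
total = sum(counts[context].values())
for letter, n in counts[context].items():
    print(repr(letter), n / total)

# Generate text by repeatedly sampling the next letter.
out = "do"
for _ in range(40):
    options = counts[out[-2:]]
    if not options:
        break
    out += random.choices(list(options), weights=options.values())[0]
print(out)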
Grammatical features taken from Twitter influence the stock market

Reference
Paper: 'Grammatical Feature Extraction and Analysis of Tweet Text: An Application towards
Predicting Stock Trends'
(http://courses.cecs.anu.edu.au/courses/CSPROJECTS/18S1/reports/u6013799.pdf), Haikuan Liu,
Research School of Computer Science (RSCS), College of Engineering and Computer Science
(CECS), The Australian National University (ANU).
Bag of words
In Natural Language Processing (NLP), 'bag of words' is considered to be an unsophisticated
model. It strips text of its context and dismantles it into a collection of unique words. These words
are then counted. In the previous sentences, for example, 'words' is mentioned three times, but
this is not necessarily an indicator of the text's focus.
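
As a sketch of the model, the counting step fits in a few lines of Python. The example text is the
paragraph above; the use of collections.Counter is our own choice, not something the text
prescribes.

# Bag of words: strip context, keep only the counts of unique words.
from collections import Counter

text = ("'bag of words' strips text of its context and dismantles it "
        "into a collection of unique words. These words are then counted.")

# Lowercase each token and strip surrounding punctuation, then count.
tokens = [w.strip(".,'\"").lower() for w in text.split()]
bag = Counter(t for t in tokens if t)
print(bag["words"])       # -> 3
print(bag.most_common(5))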
The first appearance of the expression 'bag of words' seems to go back to 1954. Zellig Harris
(https://en.wikipedia.org/wiki/Zellig_Harris), an influential linguist, published a paper called
'Distributional Structure'. In the section called 'Meaning as a function of distribution', he says: 'for
language is not merely a bag of words but a tool with particular properties which have been
fashioned in the course of its use. The linguist's work is precisely to discover these properties,
whether for descriptive analysis or for the synthesis of quasi-linguistic systems.'