
quanteda Cheat Sheet

General syntax
• corpus_*     manage text collections/metadata
• tokens_*     create/modify tokenized texts
• dfm_*        create/modify document-feature matrices
• fcm_*        work with co-occurrence matrices
• textstat_*   calculate text-based statistics
• textmodel_*  fit (un-)supervised models
• textplot_*   create text-based visualizations

Extensions
quanteda works well with these companion packages:
• readtext: an easy way to read text data
• spacyr: NLP using the spaCy library
• quanteda.corpora: additional text corpora
• stopwords: multilingual stopword lists in R

Consistent grammar:
• object()       constructor for the object type
• object_verb()  inputs & returns the object type
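The consistent grammar in practice, as a minimal sketch on the built-in inaugural corpus (assumes quanteda is attached):

library(quanteda)
x <- corpus_subset(data_corpus_inaugural, Year > 2000)  # corpus in, corpus out
toks <- tokens(x, remove_punct = TRUE)                  # constructor: corpus -> tokens
toks <- tokens_tolower(toks)                            # tokens in, tokens out
mat <- dfm(toks)                                        # constructor: tokens -> dfm
dfm_sort(mat)                                           # dfm in, dfm out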
Create a corpus from texts (corpus_*)

Read texts (txt, pdf, csv, doc, docx, json, xml)
my_texts <- readtext::readtext("~/link/to/path/*")

Construct a corpus from a character vector
x <- corpus(data_char_ukimmig2010, text_field = "text")

Explore a corpus
summary(data_corpus_inaugural, n = 2)
# Corpus consisting of 58 documents, showing 2 documents:
#            Text Types Tokens Sentences Year  President FirstName
# 1789-Washington   625   1538        23 1789 Washington    George
# 1793-Washington    96    147         4 1793 Washington    George
#
# Source:  Gerhard Peters and John T. Woolley. The American Presidency Project.
# Created: Tue Jun 13 14:51:47 2017
# Notes:   http://www.presidency.ucsb.edu/inaugurals.php
Extract or add document-level variables
party <- docvars(data_corpus_inaugural, "Party")
docvars(x, "serial_number") <- 1:ndoc(x)

Bind or subset corpora
corpus(x[1:5]) + corpus(x[7:9])
corpus_subset(x, Year > 1990)

Change units of a corpus
corpus_reshape(x, to = c("sentences", "paragraphs"))

Segment texts on a pattern match
corpus_segment(x, pattern, valuetype, extract_pattern = TRUE)

Take a random sample of corpus texts
corpus_sample(x, size = 10, replace = FALSE)
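A minimal sketch chaining the corpus verbs above on the built-in inaugural corpus:

recent <- corpus_subset(data_corpus_inaugural, Year > 1990)  # keep recent speeches
sentences <- corpus_reshape(recent, to = "sentences")        # one document per sentence
ndoc(recent); ndoc(sentences)                                # document counts before/after
docvars(recent, "President")                                 # a document-level variable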
Extract features (dfm_*; fcm_*)

Create a document-feature matrix (dfm) from a corpus
x <- dfm(data_corpus_inaugural,
         tolower = TRUE, stem = FALSE, remove_punct = TRUE,
         remove = stopwords("english"))
head(x, n = 2, nf = 4)
## Document-feature matrix of: 2 documents, 4 features (41.7% sparse).
##                  features
## docs              fellow-citizens senate house representatives
##   1789-Washington               1      1     2               2
##   1793-Washington               0      0     0               0

Create a dictionary
dictionary(list(negative = c("bad", "awful", "sad"),
                positive = c("good", "wonderful", "happy")))

Apply a dictionary
dfm_lookup(x, dictionary = data_dictionary_LSD2015)

Select features
dfm_select(x, dictionary = data_dictionary_LSD2015)

Randomly sample documents or features
dfm_sample(x, what = c("documents", "features"))

Weight or smooth the feature frequencies
dfm_weight(x, type = "prop") | dfm_smooth(x, smoothing = 0.5)

Sort or group a dfm
dfm_sort(x, margin = c("features", "documents", "both"))
dfm_group(x, groups = "President")

Combine identical dimension elements of a dfm
dfm_compress(x, margin = c("both", "documents", "features"))

Create a feature co-occurrence matrix (fcm)
x <- fcm(data_corpus_inaugural, context = "window", size = 5)
fcm_compress/remove/select/toupper/tolower are also available

Useful additional functions

Locate keywords-in-context
kwic(data_corpus_inaugural, "america*")

Utility functions
texts(corpus)             Show texts of a corpus
ndoc(corpus/dfm/tokens)   Count documents
nfeat(corpus/dfm/tokens)  Count features
summary(corpus/dfm)       Print summary
head(corpus/dfm)          Return first part
tail(corpus/dfm)          Return last part
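A quick sketch of these helpers on the built-in inaugural corpus (the document count comes from the summary output above):

kwic(data_corpus_inaugural, "america*", window = 3)  # concordance view of matches
ndoc(data_corpus_inaugural)                          # 58 documents
nfeat(dfm(data_corpus_inaugural))                    # vocabulary size of the dfm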
Tokenize a set of texts (tokens_*)

Tokenize texts from a character vector or corpus
x <- tokens("Powerful tool for text analysis.",
            remove_punct = TRUE)

Convert sequences into compound tokens
myseqs <- phrase(c("powerful", "tool", "text analysis"))
tokens_compound(x, myseqs)

Select tokens
tokens_select(x, c("powerful", "text"), selection = "keep")

Create ngrams and skipgrams from tokens
tokens_ngrams(x, n = 1:3)
tokens_skipgrams(x, n = 2, skip = 0:1)

Convert case of tokens
tokens_tolower(x) | tokens_toupper(x)
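A short sketch chaining several tokens_* verbs; the sentence is illustrative, and tokens_remove() is shorthand for tokens_select(..., selection = "remove"):

toks <- tokens("Quanteda makes powerful text analysis simple.",
               remove_punct = TRUE)
toks <- tokens_tolower(toks)                       # tokens in, tokens out
toks <- tokens_remove(toks, stopwords("english"))  # drop stopwords
tokens_ngrams(toks, n = 2)                         # bigrams such as "quanteda_makes"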
Stem the terms in an object
tokens_wordstem(x)

Fit text models based on a dfm (textmodel_*)

Correspondence Analysis (CA)
textmodel_ca(x, threads = 2, sparse = TRUE, residual_floor = 0.1)

Naïve Bayes classifier for texts
textmodel_nb(x, y = training_labels, distribution = "multinomial")

Wordscores text model
refscores <- c(seq(-1.5, 1.5, .75), NA)
textmodel_wordscores(data_dfm_lbgexample, refscores)

Wordfish Poisson scaling model
textmodel_wordfish(dfm(data_corpus_irishbudget2010), dir = c(6, 5))

Textmodel methods: predict(), coef(), summary(), print()
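A minimal end-to-end sketch for the Wordfish model above; theta holds the estimated document positions:

wf_dfm <- dfm(data_corpus_irishbudget2010, remove_punct = TRUE)
wf <- textmodel_wordfish(wf_dfm, dir = c(6, 5))  # anchor direction with docs 6 and 5
summary(wf)   # document positions and word parameters
wf$theta      # estimated position of each document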
Calculate text statistics (textstat_*)

Tabulate feature frequencies from a dfm
textstat_frequency(x) | topfeatures(x)

Identify and score collocations from a tokenized text
toks <- tokens(c("quanteda is a pkg for quant text analysis",
                 "quant text analysis is a growing field"))
textstat_collocations(toks, size = 3, min_count = 2)

Calculate readability of a corpus
textstat_readability(data_corpus_inaugural, measure = "Flesch")

Calculate lexical diversity of a dfm
textstat_lexdiv(x, measure = "TTR")

Measure distance or similarity from a dfm
textstat_simil(x, "2017-Trump", method = "cosine")
textstat_dist(x, "2017-Trump", margin = "features")

Calculate keyness statistics
textstat_keyness(x, target = "2017-Trump")

Plot features or models (textplot_*)

Plot features as a wordcloud
data_corpus_inaugural %>%
    corpus_subset(President == "Obama") %>%
    dfm(remove = stopwords("english")) %>%
    textplot_wordcloud()
[Figure: word cloud of the most frequent terms in Obama's inaugural addresses]

Plot the dispersion of key word(s)
data_corpus_inaugural %>%
    corpus_subset(Year > 1945) %>%
    kwic("american") %>%
    textplot_xray()
[Figure: lexical dispersion plot of "american" per document; x-axis: relative token index]

Plot word keyness
data_corpus_inaugural %>%
    corpus_subset(President %in% c("Obama", "Trump")) %>%
    dfm(groups = "President", remove = stopwords("english")) %>%
    textstat_keyness(target = "Trump") %>%
    textplot_keyness()
[Figure: chi2 keyness of features, Trump vs. Obama]

Plot Wordfish, Wordscores or CA models
textplot_scale1d(scaling_model,
                 groups = party,
                 margin = "documents")
[Figure: estimated document positions of Irish budget speakers, grouped by party]
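A sketch connecting the model and plot layers: fit Wordfish as in the textmodel_* section, then pass the fitted object to textplot_scale1d(); party here is a document variable of the Irish budget corpus:

wf <- textmodel_wordfish(dfm(data_corpus_irishbudget2010), dir = c(6, 5))
party <- docvars(data_corpus_irishbudget2010, "party")
textplot_scale1d(wf, groups = party, margin = "documents")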

Convert dfm to a non-quanteda format
convert(x, to = c("lda", "tm", "stm", "austin", "topicmodels",
                  "lsa", "matrix", "data.frame"))
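For example, handing a dfm to the topicmodels package (a sketch; assumes topicmodels is installed):

dtm <- convert(x, to = "topicmodels")  # a DocumentTermMatrix
# topicmodels::LDA(dtm, k = 10)        # then fit a topic model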

by Stefan Müller and Kenneth Benoit • mullers@tcd.ie, kbenoit@lse.ac.uk
Learn more at: http://quanteda.io • updated: 11/18
License: https://creativecommons.org/licenses/by/4.0/