quanteda function families:
• corpus_* manage text collections/metadata
• tokens_* create/modify tokenized texts
• dfm_* create/modify doc-feature matrices
• fcm_* work with co-occurrence matrices
• textstat_* calculate text-based statistics
• textmodel_* fit (un-)supervised models
• textplot_* create text-based visualizations

Consistent grammar:
• object() constructor for the object type
• object_verb() inputs & returns object type

quanteda works well with these companion packages:
• readtext: an easy way to read text data
• spacyr: NLP using the spaCy library
• quanteda.corpora: additional text corpora
• stopwords: multilingual stopword lists in R

Inspect a dfm
head(x, n = 2, nf = 4)
## Document-feature matrix of: 2 documents, 4 features (41.7% sparse).
##                  features
## docs              fellow-citizens senate house representatives
##   1789-Washington               1      1     2               2
##   1793-Washington               0      0     0               0

Create a dictionary
dictionary(list(negative = c("bad", "awful", "sad"),
                positive = c("good", "wonderful", "happy")))

Apply a dictionary
dfm_lookup(x, dictionary = data_dictionary_LSD2015)
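The two dictionary steps above can be run end to end. A minimal sketch, assuming quanteda is installed; the toy texts `txt` and the dfm `x` below are illustrative, not from the sheet:

```r
library(quanteda)

# Toy texts (illustrative) with document names
txt <- c(d1 = "good wonderful day", d2 = "bad awful sad news")

# Dictionary with two keys, as in the example above
dict <- dictionary(list(negative = c("bad", "awful", "sad"),
                        positive = c("good", "wonderful", "happy")))

x <- dfm(tokens(txt))                     # document-feature matrix
scores <- dfm_lookup(x, dictionary = dict)
scores    # one column per dictionary key, match counts per document
```

`dfm_lookup()` collapses the features into one column per dictionary key, so `scores` has a `negative` and a `positive` count for each document.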
Select features
dfm_select(x, dictionary = data_dictionary_LSD2015)

Randomly sample documents or features
dfm_sample(x, what = c("documents", "features"))

Weight or smooth the feature frequencies
dfm_weight(x, type = "prop") | dfm_smooth(x, smoothing = 0.5)

Sort or group a dfm
dfm_sort(x, margin = c("features", "documents", "both"))
dfm_group(x, groups = "President")

Create a corpus from texts (corpus_*)

Read texts (txt, pdf, csv, doc, docx, json, xml)
my_texts <- readtext::readtext("~/link/to/path/*")

Construct a corpus from a character vector
x <- corpus(data_char_ukimmig2010, text_field = "text")

Explore a corpus
summary(data_corpus_inaugural, n = 2)
# Corpus consisting of 58 documents, showing 2 documents:
#            Text Types Tokens Sentences Year  President FirstName
# 1789-Washington   625   1538        23 1789 Washington    George
# 1793-Washington    96    147         4 1793 Washington    George
#
# Source: Gerhard Peters and John T. Woolley. The American Presidency Project.
# Created: Tue Jun 13 14:51:47 2017
# Notes: http://www.presidency.ucsb.edu/inaugurals.php

Combine identical dimension elements of a dfm
dfm_compress(x, margin = c("both", "documents", "features"))

Create a feature co-occurrence matrix (fcm)
x <- fcm(data_corpus_inaugural, context = "window", size = 5)
fcm_compress/remove/select/toupper/tolower are also available
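A minimal `fcm()` sketch on toy tokens. Note this assumes a recent quanteda, where the window-size argument is named `window` (the sheet's `size =` reflects an older signature):

```r
library(quanteda)

# Toy tokens object (illustrative)
toks <- tokens("a b c a b")

# Count co-occurrences of token pairs within a 2-token window
f <- fcm(toks, context = "window", window = 2)
f    # a feature-by-feature co-occurrence matrix
```

The result is a square features-by-features matrix, so the usual `fcm_select()`/`fcm_compress()` verbs above can prune or merge it before, say, embedding training.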
Extract or add document-level variables
party <- docvars(data_corpus_inaugural, "Party")
docvars(x, "serial_number") <- 1:ndoc(x)

Bind or subset corpora
corpus(x[1:5]) + corpus(x[7:9])
corpus_subset(x, Year > 1990)

Change units of a corpus
corpus_reshape(x, to = c("sentences", "paragraphs"))

Segment texts on a pattern match
corpus_segment(x, pattern, valuetype, extract_pattern = TRUE)

Take a random sample of corpus texts
corpus_sample(x, size = 10, replace = FALSE)

Useful additional functions

Locate keywords-in-context
kwic(data_corpus_inaugural, "america*")

Utility functions
texts(corpus)             Show texts of a corpus
ndoc(corpus/dfm/tokens)   Count documents
nfeat(corpus/dfm/tokens)  Count features
summary(corpus/dfm)       Print summary
head(corpus/dfm)          Return first part
tail(corpus/dfm)          Return last part
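The corpus verbs above compose into a short workflow. A sketch with toy texts (the documents and the `year` docvar are illustrative):

```r
library(quanteda)

# Build a corpus from a named character vector
txt <- c(doc1 = "First sentence. Second sentence.",
         doc2 = "Only one here.")
crp <- corpus(txt)

# Attach a document-level variable, then subset on it
docvars(crp, "year") <- c(2009, 2017)
recent <- corpus_subset(crp, year > 2010)

# Reshape so each sentence becomes its own document
sents <- corpus_reshape(crp, to = "sentences")
ndoc(sents)    # one document per sentence
```

`corpus_reshape(sents, to = "documents")` reverses the reshape, which is handy for sentence-level processing followed by document-level aggregation.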
https://creativecommons.org/licenses/by/4.0/
Tokenize a set of texts (tokens_*)

Tokenize texts from a character vector or corpus
x <- tokens("Powerful tool for text analysis.", remove_punct = TRUE)

Convert sequences into compound tokens
myseqs <- phrase(c("powerful", "tool", "text analysis"))
tokens_compound(x, myseqs)

Select tokens
tokens_select(x, c("powerful", "text"), selection = "keep")

Create ngrams and skipgrams from tokens
tokens_ngrams(x, n = 1:3)
tokens_skipgrams(toks, n = 2, skip = 0:1)

Convert case of tokens
tokens_tolower(x) | tokens_toupper(x)

Stem the terms in an object
tokens_wordstem(x)

Fit text models based on a dfm (textmodel_*)

Correspondence Analysis (CA)
textmodel_ca(x, threads = 2, sparse = TRUE, residual_floor = 0.1)

Naïve Bayes classifier for texts
textmodel_nb(x, y = training_labels, distribution = "multinomial")

Wordscores text model
refscores <- c(seq(-1.5, 1.5, .75), NA)
textmodel_wordscores(data_dfm_lbgexample, refscores)

Wordfish Poisson scaling model
textmodel_wordfish(dfm(data_corpus_irishbudget2010), dir = c(6, 5))

Textmodel methods: predict(), coef(), summary(), print()

Plot features or models (textplot_*)
corpus_subset(President == "Obama") %>%
    kwic("american") %>% ...

toks <- tokens(c("quanteda is a pkg for quant text analysis", ...

dfm(groups = "President", ...

textplot_scale1d(scaling_model, ...

[Figures: a word cloud of inaugural-address terms (power, americans, people,
nation, citizens, believe, ...); a kwic dispersion ("x-ray") plot for
"american" with panels 2009-Obama and 2017-Trump; a keyness (chi2) plot; and
a textplot_scale1d() dot plot of Irish budget speakers grouped by party
(FG, LAB, Green, SF, FF).]
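The model-then-plot workflow behind the scaling figure can be sketched end to end. This assumes a recent quanteda, where the textmodel_* and textplot_* functions live in the companion packages quanteda.textmodels and quanteda.textplots (in quanteda v1 they were in the main package):

```r
library(quanteda)
library(quanteda.textmodels)   # textmodel_*, data_corpus_irishbudget2010
library(quanteda.textplots)    # textplot_*

# Fit a Wordfish scaling model on the Irish budget debates;
# dir = c(6, 5) fixes the direction of the latent scale
x  <- dfm(tokens(data_corpus_irishbudget2010))
wf <- textmodel_wordfish(x, dir = c(6, 5))

# Dot plot of the estimated document positions (theta), as in the figure
textplot_scale1d(wf)
```

`wf$theta` holds one estimated position per document, which is what `textplot_scale1d()` draws; the same call accepts a Wordscores or CA fit.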