Abstract

Computational methods often produce large amounts of data about texts, which create theoretical and practical challenges for textual interpretation. How can we make claims about texts, when we cannot read every text or analyze every piece of data produced? This article draws on rhetorical and literary theories of textual interpretation to develop a hermeneutical theory for gaining insight about texts with large amounts of computational data. It proposes that computational data about texts can be thought of as analytical lenses that make certain textual features salient. Analysts can read texts with these lenses, and argue for interpretations by arguing for how the analyses of many pieces of data support a particular understanding of text(s). By focusing on validating an understanding of the corpus rather than explaining every piece of data, we allow space for close reading by the human reader, focus our contributions on the humanistic insight we can gain from our corpora, and make it possible to glean insight in a way that is feasible for the limited human reader while still having strategies to argue for (or against) certain interpretations. This theory is demonstrated with an analysis of academic writing using stylometry methods, by offering a view of knowledge-making processes in the disciplines through a close analysis of function words.

Correspondence: Hannah Ringler, Carnegie Mellon University, Pittsburgh, PA 15213, USA. E-mail: hringler@andrew.cmu.edu
critic in the unfolding of interpretive possibilities’ (Ramsay, 2011, p. 10), but when faced with tables upon tables of numerical data about a corpus or text, the realities of how to actually make that happen are often unclear. Stylometry is one example of a method that has been especially tricky to tackle in terms of interpretation because of the large amount of data produced.

methodological tensions in working with texts in particular. In this article, I use stylometry methods as a case study for investigating hermeneutic questions in computational humanities work: when faced with large amounts of numerical data, how do we make sense of it in a way that helps us to interpret artifacts and engage in humanistic inquiry, even though we may
also a fruitful, if difficult, space. Many scholars have argued for the value of combining the two approaches, arguing, for example, that ‘the macroscale perspective should inform our close readings of the individual texts’ (Jockers, 2013, p. 28), that artificial views of a text ‘encourage a reader to read it differently’ (Rockwell and Sinclair, 2016, p. 189), or that macro approaches can help us model and theorize about the

With very large textual corpora though, serial reading for a particular feature quickly becomes an unmanageable task. An analyst might serial read through a handful of texts and develop ideas about why certain textual patterns occur, but how do we know that the reasons scale? As corpora get larger, statistical differences become more stable, but the analyst’s ability to read each text closely wanes.
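The observation that statistical differences stabilize as corpora grow can be made concrete. Treating a word's occurrence as a binomial proportion, the standard error of its observed relative frequency shrinks with the square root of the token count; the sketch below is a generic statistical illustration, not tied to the article's corpus:

```python
import math

def freq_standard_error(p: float, n: int) -> float:
    """Standard error of an observed relative frequency, modeled as a
    binomial proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

# A word occurring in 5% of tokens: the uncertainty of its observed
# frequency at two corpus sizes.
p = 0.05
for n in (10_000, 1_000_000):
    print(f"n={n:>9,}  standard error={freq_standard_error(p, n):.5f}")
```

A hundredfold increase in tokens shrinks the standard error tenfold, which is why frequency differences between subcorpora grow more stable even as close reading of every text becomes impossible.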
and context (Ricoeur, 1981). Second, the reader experiences pre-understanding, which is the state of mind that the reader encounters a text with (Ricoeur, 1981). For any text, a reader might have some knowledge or preconceived notions of the topic, author, and so forth, which shape their impression of the text before they ever read it. Next, the reader experiences naïve understanding, or their initial, even visceral and im-

many pieces of data support a particular understanding of a text or corpus. Indeed, we can even conduct additional computational analyses and create more data so as to cross-validate our understanding with new forms of data analysis that we can test our understandings (rather than specific analyses) against. This approach to hermeneutics has resemblances to Piper’s (2015) model of computational hermeneutics wherein a reader

they might reveal. After we conduct computational analyses, we are then in a space to start making guesses about what those numbers mean. But at this point, we are only in Ricoeur’s ‘naïve understanding’. It is not until we return to reading individual texts that we can start the process of analysis/explanation and making sense of bits and pieces of data (e.g. making sense of why the is one of many pieces of data to be analyzed that all point to one larger understanding).

In the remainder of this section, I first describe the corpus used for analysis and then detail the stylometry analysis and analysis of separate pieces. I then present this analysis and show how many pieces of data can be analyzed to argue for broader understandings of the corpus.
rather than distinguishing disciplines in terms of the specific content matter that they engage with, writing theorists like Carter (2007) have found it more productive to conceive of disciplines as ‘ways of knowing and doing’ rather than static repositories of knowledge. For example, Carter categorizes disciplines like animal science, accounting, and engineering as ‘problem-solving’ disciplines, and other disciplines

Educational Researcher, and Teaching and Teacher Education (five years × five journals × thirteen disciplines = 1,625 articles). In total, the corpus contains 13,604,023 words.

Each text was formatted as a .txt file and cleaned by hand to remove footnotes, endnotes, appendices, headers and footers, bibliographies, page numbers, and other unreadable characters. Image captions and
of disciplines on the left (sciences) and right (humanities). These groupings indicate similar frequencies of function words between groups of similar disciplines, or metadisciplines (Carter, 2007).

The split between the humanities and sciences becomes even more pronounced when the distances are visualized on a traditional dendrogram using cluster analysis, as in Fig. 2. Figure 2 is created using the same Burrows’ Delta measurements used for Fig. 1. The dendrogram in Fig. 2 is created by drawing connections between whichever two texts have the smallest distance between them, and then working up this way to build the whole tree. In this sense, it privileges the first-nearest neighbor when mapping how close texts are. The diagram in Fig. 1 is created using a technique described by Eder (2017), and is especially informative because the connections depicted take into account not only the first-nearest neighbor, but also the second and third. In that sense, Fig. 1 gives a more nuanced view as to the many relationships between texts, while Fig. 2 allows for an easier visualization of groups.

An analysis could focus on any level of grouping, from comparing two subcorpora to comparing every individual text. I choose a high-level, two-subcorpora
split (suggested by the groupings in Figs 1 and especially 2) to prioritize a broad view of the corpus and manageability of analysis. To analyze this split, I focus on the words that drive the difference between the humanities and sciences. These words are identified by calculating the distance contributed by each function word to the distance between every possible pair of humanities and science texts, and then summing up the total distances by word. To isolate the distance contributed by each word throughout the whole corpus, I slightly modify the traditional Burrows’ Delta formula to the following:

D(i) = \sum_{c=1}^{P} \left| \frac{A_i - \mu_i}{\sigma_i} - \frac{B_i - \mu_i}{\sigma_i} \right|

where D = the total distance contributed by a function word i summed over every pair of humanities and science texts; i = the function word being analyzed; c = the combination of humanities and science texts being compared; P = the total number of unique pairs of humanities and sciences texts (for this particular corpus, that is 625,000 possible unique pairs); A, B = texts being compared; A_i = frequency of i in text A; B_i = frequency of i in text B; μ_i = mean frequency of i in the corpus; σ_i = standard deviation of frequencies of i in the corpus.
Table 1. The top eleven function words that contribute most to the distance between the sciences and humanities

Word   D           Subcorpus    Log-likelihood   Effect size   N
the    794,889.6   Sciences       286.41         0.053         864,090
with   786,419.2   Sciences     1,367.87         0.347          98,942
for    784,973.8   Sciences     1,599.56         0.319         137,746
but    778,957.4   Humanities   5,001.60         1.372          32,883
by     759,898.5   Sciences       385.97         0.205          81,460
not    756,177.8   Humanities   5,471.20         1.045          60,362

The D column shows the sum of the Delta distances between each science and humanities text for that particular function word. The Subcorpus column shows which subcorpus (sciences or humanities) the function word is most common in, based on which subcorpus has the highest average z-score for that word. The table also shows the log-likelihood measure of the word for the subcorpus it is most common in (using the less-common subcorpus as a reference corpus), the effect size of the word for the subcorpus it is most common in using log-ratio (Hardie, 2014), and the total number of instances in the corpus (N).
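The log-likelihood and effect-size columns in Table 1 follow standard corpus-statistics definitions: the two-corpus log-likelihood statistic and Hardie's (2014) log-ratio. A sketch of both, assuming raw occurrence counts and corpus token totals (illustrative, not the article's code, and zero-frequency adjustments are omitted):

```python
import math

def log_likelihood(a: int, c: int, b: int, d: int) -> float:
    """Two-corpus log-likelihood for a word occurring a times in a corpus
    of c tokens and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected count in the first corpus
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

def log_ratio(a: int, c: int, b: int, d: int) -> float:
    """Hardie-style log-ratio effect size: binary log of the ratio of the
    word's relative frequencies in the two corpora."""
    return math.log2((a / c) / (b / d))
```

Log-likelihood measures how surprising the frequency difference is given the corpus sizes, while log-ratio measures how large the difference is: an effect size of 1.0 means the word is twice as frequent in one subcorpus as in the other.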
Table 1 demonstrates which words contribute the most to the total distance between sciences and humanities texts in the corpus. This table forms the basis of the textual analysis, methodologically detailed in Section 3.1.3.

3.1.3 Textual analysis
For the top four function words in each subcorpus, several individual texts were identified that had the highest average z-scores in each subcorpus on each word. I serial read through each of these texts, taking note on the first read of the general argument and structure of the texts. Using a concordance table, I then re-read through the texts, paying attention to where the function words tended to be used and in what kinds of constructions. While function words are indeed often used in a variety of different ways (for example, to can be used as part of an infinitive and as a preposition), automatically tagging for these patterns is difficult to do accurately, and other approaches can be used to pull out more interpretively useful patterns. In particular, as patterns of usage emerged and suggested understandings of the corpora, I then tested those understandings in several ways: by comparing them to how other function words were being used; by conducting rhetorical analyses with Docuscope3 (Ishizaki and Kaufer, 2012), which helped to identify rhetorical strategies that support different understandings of the corpus; and by using the descriptions of function words in grammar resources like the Longman Grammar (Biber et al., 1999) to identify certain rhetorical patterns that the words were used in. While it was not feasible in most cases to verify the precise frequency of usage patterns (outside of massive hand-coding efforts), the various strategies listed here allowed for a refinement and testing of the understandings of the corpora that the function words suggested. Ultimately, the function word analysis allowed for new insights into knowledge-making processes as they occur textually across disciplines, and expansions of existing insights that other types of analyses have suggested on their own, by drawing attention to very specific grammatical patterns and rhetorical moves that regularly occurred throughout the texts.

3.2 Analysis: A theory of knowledge production in the humanities and sciences
When we look at the function words in Table 1, then, what do these words reveal about the work of these disciplines? What kinds of theories of knowledge production across the sciences and humanities do they support? Beginning with the sciences, the analysis suggests that much of scientific writing is heavily engaged in description of the physical and technological worlds. On average, at least 5.4% of science papers are made up of descriptive language, as opposed to
only 3.6% in the humanities.4 This description ultimately forms the basis of analysis that supports building precise models and theories about these worlds that are more generalizable and useful for future applications. The process of creating knowledge in this way is represented by the prevalence of the, with, and for in the sciences, per Table 1.
In the sciences broadly, the primarily supports

interpretations or meanings of those specified objects, and create only one true, possible object.
This function of the might be seen more clearly when contrasted to article usage in the humanities. While the of course exists in humanities writing, it is slightly less frequent because objects are often framed as plural and many. For example, this sociology paper describes models created to represent how separated
highly distinctive of the sciences point to a process of knowledge creation itself that focuses on describing the localized, specific physical and technological worlds to build models and theories of them.
The humanities appear to operate quite differently, in that they are fundamentally interested in grappling with and understanding the human experience. They might make theoretical claims about what humans do

order to isolate them or add relevant detail that is useful for analysis. For example, in a history paper analyzing the collection and exhibition of photographs during the Second World War liberation of Paris, the author writes the following about German soldiers:

    But while public photography was thus restricted for the French, the German soldiers who visited
allows both the movement of characters from one statement of action to another, and the intimate connecting of those two statements of action. In this case, it is clear that the memory being depicted in the second scene is related to the evacuation of Yu-Chiung and her grandmother, primarily because of the connecting of the two sentences through the relative pronoun: if the clauses were split into two sentences, Hu

This analysis produced humanistic insight into a large corpus that was not only feasible, but useful for understanding the broad knowledge-making processes that occur across disciplines and the rhetorical moves that undergird them. Moreover, though, this analysis illustrated both what it looks like to develop a defensible interpretation of a computational model of a corpus when we simply
within the overall argument of the text, while the distant reading can complement the close by drawing attention to patterns to be noticed at scale and giving additional evidence for possible understandings.
Refocusing hermeneutics for computational text analysis with a goal of gathering evidence for an understanding is a strategy to be developed more fully for different methods. Stylometry models

up distant reads to allow more clearly for close reading that can lead towards stronger humanistic insight of corpora.

Notes
1. Computer science was the only discipline gathered from conference papers as opposed to journal
Bastian, M., Heymann, S., and Jacomy, M. (2009). Gephi: an open source software for exploring and manipulating networks. International AAAI Conference on Weblogs and Social Media, 8: 361–62.

Baumann, A., Bazzi, S., Rompotis, D., et al. (2017). Weak-field few-femtosecond VUV photodissociation dynamics of water isotopologues. Physical Review A, 96(1): 1–7.

Hardie, A. (2014). Log ratio – an informal introduction. http://cass.lancs.ac.uk/log-ratio-an-informal-introduction/ (accessed 19 February 2019).

Heuser, R. and Le-Khac, L. (2011). Learning to read data: bringing out the humanistic in the digital humanities. Victorian Studies, 54(1): 79–86.

Hockey, S. (2000). Electronic Texts in the Humanities: Principles and Practice. Oxford: Oxford University Press.

Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. Reading: Addison-Wesley.

National Center for Education Statistics (2010). Classification of Instructional Programs – 2010. https://nces.ed.gov/ipeds/cipcode/resources.aspx?y=55 (accessed 29 September 2020).

Noecker, J., Jr., Ryan, M., and Juola, P. (2013). Psychological profiling through textual analysis.

Ricoeur, P. (1991). From Text to Action: Essays in Hermeneutics, II. Blamey, K. and Thompson, J. B. (eds and trans). Evanston: Northwestern University Press.

Rockwell, G. and Sinclair, S. (2016). Hermeneutica: Computer-Assisted Interpretation in the Humanities. Cambridge: MIT Press.

SCImago. (n.d.). SJR – SCImago Journal & Country Rank [Portal]. http://www.scimagojr.com (accessed 7 August