Barbara McGillivray
Gábor Mihály Tóth
Applying Language Technology in Humanities
Research
Design, Application, and the Underlying Logic
Barbara McGillivray
Faculty of Modern and Medieval Languages
University of Cambridge
Cambridge, UK
The Alan Turing Institute
London, UK

Gábor Mihály Tóth
Viterbi School of Engineering, Signal Analysis Lab (SAIL)
University of Southern California
Los Angeles, CA, USA
This Palgrave Macmillan imprint is published by the registered company Springer Nature
Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
The idea of this book goes back to the HiCor Research Network, founded
and led by us together with Gard Jenset and Kerry Russell. HiCor was
a research group of historians and corpus linguists at the University of
Oxford active between 2012 and 2014. It was generously supported by
TORCH (The Oxford Research Centre in the Humanities). In addition
to organizing lectures and a workshop, HiCor also aimed to disseminate
language technology among historians and, more generally, humanists.
For instance, we organized several courses on Language Technology and
Humanities at the Oxford DH Summer School, which inspired this book.
We are grateful to Gard Jenset who helped to shape the initial ideas
underlying this book. We also thank our employers and funders for providing
us with time and funding to accomplish the project.¹
We have contributed equally to the design of the book. We have joint
responsibility for Chapter 1. Barbara McGillivray has primary responsi-
bility for Chapters 2 and 5. Gábor Tóth has primary responsibility for
Chapters 3, 4, 6, and 7.
Engineering. Barbara McGillivray was supported by The Alan Turing Institute under
EPSRC grant EP/N510129/1.
Contents
3 Frequency
3.1 Concept of Frequency
3.2 Application: The “Characteristic Vocabulary” of the Moonstone by Wilkie Collins
3.3 Application: Terms with ‘Turbulent History’ in the Early English Books Online
3.4 Conclusion
References
4 Collocation
4.1 The Concept of Collocation
4.2 Probability of a Bigram
4.3 Observed and Expected Probability of a Bigram
4.4 Strength of Association: Pointwise Mutual Information (PMI)
4.5 Strength of Association: Log Likelihood Ratio
4.6 Application: What Residents of Modern London Complained About
4.7 Conclusion
References
Index
CHAPTER 1
1 https://www.dhoxss.net/.
2 https://dhsi.org.
3 http://esu.culintec.de/.
1 INTRODUCING LANGUAGE TECHNOLOGY AND HUMANITIES 3
4 The Python implementation can be found in the following GitHub repository:
https://github.com/toth12/language-technology-humanities.
also mean that many details concerning the topics covered were omit-
ted. However, we aimed to provide basic information to further explore
themes that are of particular interest to readers.
References
Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with
Python. Sebastopol, CA: O’Reilly.
Gries, S. T. (2009). Quantitative Corpus Linguistics with R. New York, NY and
Abingdon: Routledge.
Hockey, S. (2000). Electronic Texts in the Humanities: Principles and Practice.
Oxford: Oxford University Press.
Jockers, M. L. (2013). Macroanalysis: Digital Methods and Literary History.
Champaign, IL: University of Illinois Press.
Jockers, M. L. (2014). Text Analysis with R for Students of Literature. New York,
NY: Springer.
Moretti, F. (2015). Distant Reading. London: Verso.
Piotrowski, M. (2012). Natural Language Processing for Historical Texts.
San Rafael, CA: Morgan and Claypool.
CHAPTER 2
Abstract This chapter guides the reader through the key stages of
creating language resources. After explaining the difference between lin-
guistic corpora and other text collections, the authors briefly introduce
the typology of corpora created by corpus linguists and the concept of
corpus annotation. Basic terminology from natural language processing
(NLP) and corpus linguistics is introduced, alongside an explanation of the
main components of an NLP pipeline and tools, including pre-processing,
part-of-speech tagging, lemmatization, and entity extraction.
1 https://digital.nls.uk/encyclopaedia-britannica/archive/144133900.
2 DESIGN OF TEXT RESOURCES AND TOOLS 9
2 https://www.darwinproject.ac.uk.
3 https://www.dhi.ac.uk/hartlib/context.
4 We will follow the Oxford Dictionaries in using metadata as a mass noun and data as a
plural noun.
over time? Did Hartlib use a different style when addressing certain
personalities? Does the length of the letters tend to change over time?
Metadata can be of different types, depending on the kind of infor-
mation it provides. We follow the categorization in Burnard (2005)
and distinguish between descriptive, administrative, editorial, and ana-
lytic metadata. The scope of the first two categories is the collection as
a whole, while the latter two apply to smaller text units. Descriptive
metadata gives external information about the context of the text,
such as its source, date of publication, and the sociodemographics of the
authors. Administrative metadata contains information about the collec-
tion itself, for example its title, its version, encoding, and so on. Editorial
metadata, on the other hand, provides information about the editorial
choices that the creators of the digital collection made with respect to
the original text, for example regarding additions, omissions, or correc-
tions. Finally, analytic metadata focuses on the structure of the text, for
example by marking the beginning and end of sections or paragraphs.
Metadata can be encoded into text resources in various ways, either
in external documentation or as part of the collections themselves. The
Text Encoding Initiative (TEI) has developed detailed guidelines for the
encoding of texts in digital format and it has become a widely accepted
standard in the digital humanities. The TEI guidelines specify, among
other things, how the metadata of a text should be displayed in what is
known as the TEI header (for details see TEI Consortium 2019).
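To make the idea of a TEI header more concrete, the following is a minimal sketch in Python that reads a few descriptive fields from an invented header. The element names (teiHeader, fileDesc, titleStmt, publicationStmt, sourceDesc) and the namespace follow the TEI guidelines; the title, author, and source values are placeholders made up for this illustration, and the function name read_header is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# A minimal, invented TEI document; all values are placeholders.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Sample letter (hypothetical)</title>
        <author>Unknown correspondent</author>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished teaching example.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Transcribed from an invented manuscript.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><p>Body of the letter.</p></body></text>
</TEI>"""


def read_header(xml_string):
    """Return the title, author, and source description found in the TEI header."""
    root = ET.fromstring(xml_string)
    file_desc = root.find("tei:teiHeader/tei:fileDesc", TEI_NS)
    return {
        "title": file_desc.findtext("tei:titleStmt/tei:title", namespaces=TEI_NS),
        "author": file_desc.findtext("tei:titleStmt/tei:author", namespaces=TEI_NS),
        "source": file_desc.findtext("tei:sourceDesc/tei:p", namespaces=TEI_NS),
    }


header = read_header(SAMPLE)
print(header["title"])
```

Real headers are usually much richer than this; the point is only that metadata stored in the header can be retrieved separately from the text body and then combined with it in an analysis.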
As we have said earlier, metadata combined with text data offers the
widest scope for insightful ways to explore texts. Moreover, the texts
themselves can be enriched via annotation to optimize the implicit lin-
guistic information they contain and make it usable for large-scale anal-
yses. Let us imagine that we have access to a large collection of digitized
newspapers and we are interested in analysing the level of international
relations exemplified in this collection. Knowing the geographical ori-
gin of each newspaper is of primary importance, but it is not sufficient
because a newspaper article may talk about a location which is differ-
ent from its place of publication. Hence, we would want to conduct
an in-depth search of the texts to find, for example, instances of place
names. This can be a very time-consuming (or sometimes impossible)
process if we need to read all the articles. Without good disambigua-
tion, we may have to ignore many instances of potentially irrelevant hits
while at the same time missing a high number of relevant hits. For exam-
ple, Paris is the name of the French capital but is also the name of a
city in Texas, and being able to distinguish the two means that we can
know whether a particular mention refers to international relationships
with France or the United States. Moreover, Paris can also be a per-
son’s name, and at the same time the city can be referred to in different
ways (e.g., ‘the City of Lights’), so again being able to disambiguate the
usages of this name in context is very useful.
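The Paris example can be sketched with a deliberately naive rule: look at the words surrounding the name and count cues for each candidate place. This is a toy illustration in Python, not the method of any particular tool; the cue lists and the function name disambiguate_paris are assumptions made up for this sketch, and a realistic system would rely on a gazetteer and a trained model.

```python
import re

# Hypothetical cue words for each candidate referent.
CUES = {
    "Paris, France": {"france", "french", "seine", "louvre"},
    "Paris, Texas": {"texas", "texan", "dallas", "lamar"},
}


def disambiguate_paris(sentence):
    """Guess which Paris a sentence mentions by counting cue words in context."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    if "paris" not in tokens:
        return None
    scores = {place: len(tokens & cues) for place, cues in CUES.items()}
    best = max(scores, key=scores.get)
    # With no cue words at all, leave the mention unresolved.
    return best if scores[best] > 0 else "unresolved"


print(disambiguate_paris("The French delegation left Paris for London."))
print(disambiguate_paris("A fire in Paris, Texas destroyed the courthouse."))
print(disambiguate_paris("Paris wrote to his brother."))
```

The third sentence stays unresolved because nothing in its context points to either city, which is exactly the personal-name case mentioned above; it also shows why such simple rules miss indirect references like ‘the City of Lights’.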
As noted by McEnery and Wilson (2001, p. 32), annotation makes
the linguistic information in a text computationally retrievable, thus enabling
a wide range of searches. Annotation can be performed in a manual, automatic,
semi-automatic, or crowd-sourced way, depending on whether
humans, computers, a combination of humans and computers, or groups
of humans are responsible for it. For a detailed overview of linguistic
annotation, see Jenset and McGillivray (2017, pp. 99 ff.). In Sect. 2.4
we will see different types of linguistic annotations and how they can be
relevant to humanities research.
• By medium: does the corpus contain only text, speech, video mate-
rial, or is it mixed?
• By size: does the corpus contain a static snapshot of a language vari-
ety (static corpus) or is it continually updated to monitor the evolu-
tion of language (monitor corpus)?
• By language: is the corpus monolingual or multilingual? If it is mul-
tilingual, have its parts been aligned (parallel corpus)?
• By time: does the corpus cover a language variety in a specific period
without considering its time evolution (synchronic corpus) or does
it focus on the change of a language variety over time (diachronic
corpus)?
• By purpose: was the corpus built to describe the general language
(like contemporary spoken English) or a special aspect of it (like the
language of medical emergency reports)?
5 https://www.sketchengine.eu/jozef-stefan-institute-newsfeed-corpus/.
built from news articles gained from their RSS feeds; it is updated daily
and contains 37 billion words. Such an unrestricted approach to cor-
pus building, however, is not always applicable to the text resources
employed in humanities scholarship, where a potentially complex inter-
action of research questions and availability of texts affects the size and
shape of the resources we can create. For example, sometimes only a few
texts or fragments have survived historical accidents and have found
their way into the collection, meaning that creating a balanced corpus is
simply not a viable option.
Three important considerations to keep in mind when building a
corpus in humanities research are access, digitization, and encoding.
Gaining access to a group of texts can often be anything but straight-
forward, requiring potentially complex issues to be negotiated such as
legal questions with third parties (who might have been responsible for
the digitization, for example), and privacy or human data protection
concerns. Even when we gain access to the texts, these may need to be
digitized, as any subsequent computational processing of the type we talk
about in this volume requires them to be in digital form. Once the texts
have been digitized, or even better during the digitization step itself, the
texts should be presented in such a way as to enable their effective use in
research. In Sect. 2.1.2 we touched on the TEI guidelines, which pro-
vide a great basis for ensuring that digital texts are equipped with all
the metadata needed to place them in their historical context. Although
these topics are not the focus of this volume and therefore will not be
covered in depth, we acknowledge that access, digitization, and encoding
can have a significant impact on the decisions that follow in the research
process. In particular, the quality of the digitization can radically affect
the outcomes of quantitative analyses carried out on the texts, as shown,
for example, by Hill and Hengchen (2019).
Another challenge concerns historical texts, which are often the object
of study in the humanities and which require especially careful consid-
eration. One primary reason for this is that tools and methods devel-
oped in language technology research are still mainly concerned with
modern and well-established languages like English, but require special
adaptation when applied to historical languages (cf. Piotrowski 2012;
McGillivray 2014). Philological and interpretative issues are often of
major importance and need to be accurately incorporated in the corpus
design phase (cf. Meyer 2015). Furthermore, the lack of native speak-
ers of extinct languages or old varieties of living languages means that
we cannot rely on native speaker intuition for the annotation, and extra
layers of checks and explicit guidelines are needed to achieve good qual-
ity results. The next section will describe a concrete use case involving a
historical language, Ancient Greek.
The project aimed to map the change in the meaning of words in the
history of Ancient Greek from the seventh century BCE to the fifth cen-
tury CE, an extremely ambitious goal. For this purpose, we had to build
the largest corpus possible. In Sect. 2.1.1 we stressed the aspiration to
representativeness. One of the important factors to keep in mind is the
role of genre in Ancient Greek semantics, so in the corpus design phase
we aimed at finding the best possible representation of Ancient Greek
genres. While scoping the genre distribution of the texts, we devised a
categorization into genre classes (such as Poetry, Narrative, or Technical)
and subclasses (such as Bucolic, Biography, or Geography).
The categorization aimed at the best possible representation of
Ancient Greek genres. The emphasis on “possible” is critical in this con-
text, as we were constrained by three main factors. First, the texts that
have survived historical accidents and have reached us are all we can hope
to obtain for Ancient Greek. Second, as new digitization was not within
the scope of the project, the number of available digital resources consti-
tuted the upper limit of what we were able to include. Third, even when
digitized editions exist, they may not be free to use and distribute, so we
sourced the texts from three openly available digital libraries (for details
see Vatri and McGillivray 2018). The corpus consists of 820 texts and it
counts 10,206,421 word tokens, making it the largest corpus of its kind
available today.
As is often the case in digital humanities projects, the texts came in
different formats, ranging from TEI XML, to non-TEI XML, HTML,
and Microsoft Word files.⁶ Therefore we had to allow for an initial phase
of cleaning and standardization of these formats into TEI-compliant
XML to allow further processing and analysis. Another important con-
sideration was character encoding. Greek characters can pose additional
challenges when it comes to encoding, and we found a range of options
in the sources, from Beta Code⁷ to UTF-8 Unicode, to HTML hexadecimal
references. Taking the example from Vatri and McGillivray (2018),
for the Greek character ᾆ, the Beta Code is A)=|, the Unicode UTF-8
encoding is ᾆ, and the hexadecimal reference is &#x1F86;. We converted
all Greek characters to Beta Code for standardization purposes, choosing
this encoding because it makes automatic processing and retrieval easier.
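The three representations of ᾆ can be checked with a few lines of Python. The sketch below is only an illustration and not the conversion code used in the project: the mapping tables cover just the letter and diacritics needed for this one character, and the function name to_beta_code is hypothetical. It relies on Unicode canonical decomposition (NFD), which splits the precomposed character into a base letter followed by its combining marks.

```python
import html
import unicodedata

# Toy mapping covering only what this example needs (an assumption, not a
# complete Beta Code table).
BASE_LETTERS = {"\u03b1": "A"}  # Greek small letter alpha
DIACRITICS = {
    "\u0313": ")",  # smooth breathing (psili)
    "\u0342": "=",  # circumflex (perispomeni)
    "\u0345": "|",  # iota subscript (ypogegrammeni)
}


def to_beta_code(text):
    """Convert Greek text to Beta Code for the few characters in the toy tables."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        BASE_LETTERS.get(ch) or DIACRITICS.get(ch, ch) for ch in decomposed
    )


char = "\u1f86"  # the character discussed in the text
print(unicodedata.name(char))             # official Unicode character name
print(char.encode("utf-8").hex())         # UTF-8 bytes in hexadecimal
print(html.unescape("&#x1F86;") == char)  # HTML hexadecimal reference
print(to_beta_code(char))                 # Beta Code via the toy tables
```

Because Beta Code uses only plain ASCII characters, a converted text can be searched and processed with tools that make no assumptions about Greek script, which is one way to read the remark about easier automatic processing.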
8 In the example we can see that the XML tag <sentence> shows the beginning of the sen-
tence, and has the attributes id (which assigns a unique identifier to the sentence) and loca-
tion (which gives information about the passage to which the sentence belongs). Nested
inside the <sentence> tag we find a series of <word> tags, each corresponding to a word in the
sentence.
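The structure described in this note can be read with Python's standard library, as in the sketch below. The fragment is invented for illustration: it keeps the <sentence> element with its id and location attributes and the nested <word> elements mentioned above, while the attribute values and the form attribute on <word> are assumptions rather than the actual corpus markup.

```python
import xml.etree.ElementTree as ET

# Invented fragment following the structure described above; the attribute
# values and the "form" attribute on <word> are assumptions for illustration.
SAMPLE = """<sentence id="1" location="1.1">
  <word form="mh=nin"/>
  <word form="a)/eide"/>
  <word form="qea/"/>
</sentence>"""

sentence = ET.fromstring(SAMPLE)

# Collect the word forms in document order.
forms = [word.get("form") for word in sentence.iter("word")]

print(sentence.get("id"), sentence.get("location"))
print(forms)
```

Keeping one element per word means that further annotation layers, such as lemmas or part-of-speech tags, can be attached to each word without changing how sentences are located.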