
Applying Language
Technology in
Humanities Research
Design, Application, and
the Underlying Logic

Barbara McGillivray
Gábor Mihály Tóth
Barbara McGillivray
Faculty of Modern and Medieval Languages
University of Cambridge
Cambridge, UK
The Alan Turing Institute
London, UK

Gábor Mihály Tóth
Viterbi School of Engineering, Signal Analysis Lab (SAIL)
University of Southern California
Los Angeles, CA, USA

ISBN 978-3-030-46492-9 ISBN 978-3-030-46493-6 (eBook)


https://doi.org/10.1007/978-3-030-46493-6

© The Editor(s) (if applicable) and The Author(s) 2020


This work is subject to copyright. All rights are solely and exclusively licensed by the
Publisher, whether the whole or part of the material is concerned, specifically the rights
of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction
on microfilms or in any other physical way, and transmission or information storage and
retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are
exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and
information in this book are believed to be true and accurate at the date of publication.
Neither the publisher nor the authors or the editors give a warranty, express or implied,
with respect to the material contained herein or for any errors or omissions that may have
been made. The publisher remains neutral with regard to jurisdictional claims in published
maps and institutional affiliations.

Cover illustration: © Melisa Hasan

This Palgrave Macmillan imprint is published by the registered company Springer Nature
Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

The idea of this book goes back to the HiCor Research Network, founded
and led by us together with Gard Jenset and Kerry Russell. HiCor was
a research group of historians and corpus linguists at the University of
Oxford active between 2012 and 2014. It was generously supported by
TORCH (The Oxford Research Center for the Humanities). In addition
to organizing lectures and a workshop, HiCor also aimed to disseminate
language technology among historians and, more generally, humanists.
For instance, we organized several courses on Language Technology and
Humanities at the Oxford DH Summer School, which inspired this book.
We are grateful to Gard Jenset who helped to shape the initial ideas
underlying this book. We also thank our employers and funders for pro-
viding us with time and funding to accomplish the project.1
We have contributed equally to the design of the book. We have joint
responsibility for Chapter 1. Barbara McGillivray has primary responsi-
bility for Chapters 2 and 5. Gábor Tóth has primary responsibility for
Chapters 3, 4, 6, and 7.

Cambridge, UK Barbara McGillivray


London, UK
Los Angeles, USA Gábor Mihály Tóth
1 Gabor Toth thanks the USC Shoah Foundation and the USC Viterbi School of
Engineering. Barbara McGillivray was supported by The Alan Turing Institute under
EPSRC grant EP/N510129/1.

Contents

1 Introducing Language Technology and Humanities
1.1 Why Language Technology for the Humanities?
1.2 Structure of the Book
References

2 Design of Text Resources and Tools
2.1 Text Resources in the Humanities
2.1.1 Text Resources and Corpora
2.1.2 Data and Metadata
2.2 Corpus Design and Creation
2.2.1 Designing a Text Resource
2.2.2 Humanities Corpora
2.3 Use Case: The Diorisis Ancient Greek Corpus
2.4 Corpus and Natural Language Processing Tools
2.4.1 Text-Processing Pipeline
2.4.2 Pre-processing and Tokenization
2.4.3 Stemming, Lemmatization, and Morphological Annotation
2.4.4 Part-of-Speech Tagging
2.4.5 Chunking and Syntactic Parsing
2.4.6 Named Entities
2.4.7 Other Annotation
2.5 Conclusion
References

3 Frequency
3.1 Concept of Frequency
3.2 Application: The “Characteristic Vocabulary” of the Moonstone by Wilkie Collins
3.3 Application: Terms with ‘Turbulent History’ in the Early English Books Online
3.4 Conclusion
References

4 Collocation
4.1 The Concept of Collocation
4.2 Probability of a Bigram
4.3 Observed and Expected Probability of a Bigram
4.4 Strength of Association: Pointwise Mutual Information (PMI)
4.5 Strength of Association: Log Likelihood Ratio
4.6 Application: What Residents of Modern London Complained About
4.7 Conclusion
References

5 Word Meaning in Texts
5.1 The Study of Word Meaning
5.2 Distributional Approaches to Word Meaning
5.3 Word Space Models
5.3.1 Words in Space
5.3.2 Word Embeddings
5.4 Use Case: Exploring Smell in Historical Health Reports
5.4.1 Visualizing Words in the Semantic Space
5.4.2 Measuring Distances in the Semantic Space
5.5 Use Case: Finding Semantic Change in a Web Archive
5.6 Conclusion
References

6 Mining Textual Collections
6.1 Textual Similarity, an Old Problem
6.2 How to Construct a Feature Space
6.2.1 Feature Selection
6.2.2 Feature Scoring
6.2.3 Representation as a Geometric Space
6.2.4 The Document–Term Matrix
6.2.5 Representation as a Vector Space
6.2.6 Summary
6.3 Application: Discovery of Similarity in the Anglo-Saxon Chronicle
6.3.1 Transformation of the Anglo-Saxon Chronicle into a Document Collection
6.3.2 Feature Extraction and Feature Selection
6.3.3 Construction of the Document–Term Matrix
6.3.4 Feature Scoring
6.3.5 Rendering a Feature Space Through Projection to a Lower-Dimensional Space
6.3.6 Measuring the Cosine Similarity Between Annals
6.3.7 Clustering
6.3.8 Topic Modelling
6.3.9 Topic as a Hidden Layer
6.3.10 Hierarchical Topic Modelling
6.3.11 Summary of Topic Modelling
6.4 Conclusion
References

7 The Innovative Potential of Language Technology for the Humanities
7.1 Bridging Concepts Between Humanities and Language Technology
7.2 A Critical View of Language Technology

Index
List of Figures

Fig. 3.1 Relative document frequency of lemma forsake in the EEBO subcorpus
Fig. 4.1 Changes of log likelihood ratio (window: 5 words; direction: left) between complain/complaint and dust, mouse, noise, rat, smell, smoke in the London Health Reports dataset
Fig. 4.2 Changes of log likelihood ratio (window: 5 words; direction: right) between complain/complaint and dust, mouse, noise, rat, smell, smoke in the London Health Reports dataset
Fig. 5.1 Bi-dimensional representation of the words film, movie, and quote using the coordinates from Table 5.2
Fig. 5.2 Bi-dimensional representation of the semantic space from the London MOH reports. We have displayed the points corresponding to the top 40,000 most frequent words, and the labels of the words smell, stink, odour, perfume, table, and house
Fig. 5.3 Simplified visualization of the semantic change of the noun tweet in three semantic spaces
Fig. 6.1 Simplified representation of some car models in terms of common features
Fig. 6.2 Some novels by Wilkie Collins and their representation using library catalogue subject headings as common features (*Source The on-line catalogue of the Bodleian Library, Oxford, http://solo.bodleian.ox.ac.uk, accessed 1 January 2020)
Fig. 6.3 A simplified document collection of three English proverbs and their representation through bag of words
Fig. 6.4 Representation of three English proverbs in a feature space rendered as geometric space
Fig. 6.5 Representation of three English proverbs as document vectors
Fig. 6.6 Representation of the annals of the Anglo-Saxon Chronicle in a projected space
Fig. 6.7 Similarity matrix of annals highlighted (Group 2) in Fig. 6.6
Fig. 6.8 Representation of the annals of the Anglo-Saxon Chronicle in a projected space with some clusters highlighted
Fig. 6.9 Hierarchical topic modelling in the Anglo-Saxon Chronicle
List of Tables

Table 2.1 Top frequency word types in Shakespeare’s Hamlet
Table 2.2 Top frequency word types in Shakespeare’s Hamlet after removing stop words
Table 5.1 Example of co-occurrence frequencies in a toy example consisting of four sentences containing the nouns dog and cat
Table 5.2 Example of co-occurrence frequencies for the lemmas film, movie, and quote from the British National Corpus 2014 Spoken
Table 5.3 Example of co-occurrence frequencies for the lemmas film, movie, and quote from the British National Corpus 2014 Spoken
Table 5.4 Cosine similarity measures between the word embeddings for blackberry and phone, and blackberry and raspberry, 2000 and 2013. The embeddings are from https://zenodo.org/record/3383660#.XfylShf7Sbc
Table 6.1 Summary of annals highlighted in Fig. 6.6
Table 6.2 Key topics extracted from the Anglo-Saxon Chronicle
CHAPTER 1

Introducing Language Technology and Humanities

Abstract This chapter outlines the relevance of language technology
for the exploration and study of big textual data sets in the humanities.
We also discuss the importance of understanding the logic underlying
the use of language technology to resolve research problems in the
humanities. Finally, we outline the three pillars of the approach we follow
throughout the book: focus on application through both simplified and
more complex use-case examples; discussion of both the potential and
the limitations of language technology; and explanation of how to
translate humanities research questions into research problems using
language technology.

Keywords Big data · Distant reading · Textual resource · Language technology · Humanities research

1.1   Why Language Technology for the Humanities?


In the last two decades, the humanities have seen an unprecedented
change that opens up new directions for inquiry into human cultures and
their histories: the availability, still not fully explored, of digitized
humanistic texts. Thanks to the mass digitization of analogue resources
preserved in libraries and archives, large textual collections, such as
Google Books, Early English Books Online, and Project Gutenberg, have
become available on the World Wide Web. The rise of digital humanities

© The Author(s) 2020. B. McGillivray and G. M. Tóth, Applying Language Technology
in Humanities Research, https://doi.org/10.1007/978-3-030-46493-6_1

as a new academic field has contributed to the proliferation of research
infrastructures and centres dedicated to the study and distribution of
textual resources in the humanities. The mission of digital humanities
projects such as CLARIN European Research Infrastructure, DARIAH and
the ESRC Centre for Corpus Approaches to Social Science is to make
textual resources not only available but also investigable for scholars.
Digital humanists have proposed the method of distant reading or macro
analysis for learning from large textual resources (Jockers 2013; Moretti
2015). Alongside a growing interest in large textual resources, there is
an increasing demand from (digital) humanities researchers for quanti-
tative and computational skills. The current offering in this space is rich,
with a range of training options (including dedicated summer schools
like the digital humanities training events at Oxford,1 DHSI at Victoria,2
or the European Summer School in Digital Humanities in Leipzig3) and
publications (examples include Bird et al. 2009; Gries 2009; Hockey
2000; Jockers 2014; Piotrowski 2012). Nonetheless, textual resources in
the humanities and beyond raise a key challenge: they are too big to be
read by humans interested in analysing them. The potential of exploring
large textual collections has not yet been fully realized; realizing it
remains a key task for the current and next generations of humanities
scholars.
To explore tens of thousands of books or millions of historical docu-
ments, humanities scholars inevitably need the power of computing tech-
nologies. Among these technologies, there is one that has had and will
definitely continue to have a pivotal role in the exploration of big textual
resources. Language technology, which can help unlock and investigate
large amounts of textual data, is a truly interdisciplinary enterprise. It is
not an academic field per se; it is rather a collection of methods that deal
with textual data. Language technology sits at the crossroads between
corpus and computational linguistics, natural language processing and
text mining, data science and data visualization. As we will demonstrate
throughout this book, language technology can be used to address a
great variety of research problems involved in the investigation of textual
data in the humanities and beyond.

1 https://www.dhoxss.net/.

2 https://dhsi.org.

3 http://esu.culintec.de/.

1.2  Structure of the Book


This book examines research problems that are relevant for humanities
and can be addressed with the help of language technology. Chapter 2
demonstrates how language technology can help structure
raw textual data and represent them as a resource meaningful for both
humans and computers. For instance, the lyrics of thousands of popular
songs are now available in plain text on the World Wide Web. But lyrics
in plain text format do not distinguish the title and the refrain of a song.
This is an example of unstructured data because various components of
a song are not marked in a way that computers can automatically extract
them. Language technology can help detect structural components
within a text such as the refrain of a song; it can also help represent a
song in digital form so that different structural components are distin-
guished and readily available for further computational investigations.
Language technology also supports word-level investigations of textual-
ity. The lyrics of a song consist of not only structural units, but also dif-
ferent types of words such as nouns, verbs, and names of people. In plain
text format, word-level information about lyrics is not readily usable
by computing tools; for instance, it is not possible to extract all proper
names from a collection of lyrics in plain text. As Chapter 2 explains,
language technology helps attach different types of information to each
word of a text; it also offers ways to record this information in well-es-
tablished data formats.
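
The song example above can be made concrete with a minimal sketch in Python. It is only an illustration under stated assumptions: the lyrics are invented, the first line is assumed to be the title, stanzas are assumed to be separated by blank lines, and a stanza that occurs more than once is treated as the refrain. Real lyrics would need more robust heuristics.

```python
from collections import Counter

# Hypothetical plain-text lyrics (invented for illustration).
# Assumptions: the first block is the title; blank lines separate stanzas.
RAW_LYRICS = """Song of the River

The river runs down to the sea
Carrying leaves from the tree

Flow on, flow on
Until the day is gone

The mountain sleeps beneath the snow
Waiting for the winds to blow

Flow on, flow on
Until the day is gone
"""


def structure_lyrics(raw):
    """Turn plain-text lyrics into a dictionary with labelled components."""
    blocks = [b.strip() for b in raw.strip().split("\n\n") if b.strip()]
    title, stanzas = blocks[0], blocks[1:]
    counts = Counter(stanzas)
    # Toy heuristic: a stanza that occurs more than once is the refrain.
    refrain = next((s for s, n in counts.most_common() if n > 1), None)
    verses = [s for s in stanzas if s != refrain]
    return {"title": title, "refrain": refrain, "verses": verses}


song = structure_lyrics(RAW_LYRICS)
print(song["title"])
print(song["refrain"])
print(len(song["verses"]))
```

Once the components are labelled in this way, a program can retrieve all refrains or all titles from a collection without a human reading each song.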
Language technology also facilitates the bottom-up exploration
of textual resources and textuality. For instance, finding terms that are
significant elements of a text is an important component of bottom-up
explorations. We will discuss how the investigation of word frequency
can support this in Chapter 3. Language technology methods can map
terms closely related to a given concept in thousands of texts. This form
of bottom-up exploration is discussed in Chapter 4. Language tech-
nology methods can also help in bottom-up studies of word meaning.
For instance, the meaning of a concept can be investigated by draw-
ing on a dictionary definition, but it can also be inferred from the way
authors used that concept in their works. Chapter 5 examines how lan-
guage technology enables this type of exploration of meaning. Finally,
language technology has tools to detect patterns recurring over thou-
sands of texts. As the proverb says, there is nothing new under the sun.
Similar themes and ideas recur over texts from different historical times.

However, detecting them in large textual resources is a tedious (or
sometimes impossible) task for human readers. As Chapter 6 illustrates,
language technology supports humans in their efforts to detect recurrence
and similarity in texts.
To realize the rich potential that language technology offers, human-
ists need to bridge two interrelated gaps. The first is the conceptual gap
between humanities research problems and language technology meth-
ods. As a simple example, language technology can detect how many
times a given term is used in a given set of historical sources. In more
technical terms, with language technology we can study word frequency.
But rarely do historians ask how many times a term occurs in their source
texts. Rather, they inquire about the prevailing social concepts in a given
historical time. There is a conceptual gap between word frequency and
the prevailing social concepts. This simple example also sheds light
upon the second gap, which lies between qualitative and quantitative
approaches. The insights that language technology can deliver are very
often quantitative and difficult to interpret with a qualitative framework.
Bridging these gaps is a daunting task for scholars, and this publication
seeks to assist them in this task. We believe that the potential of lan-
guage technology can be realized if there is a clear understanding of the
logic underlying it. The overall goal of this book is therefore to apply
the logic of language technology to the resolution of humanistic research
problems. We will attempt to convey this logic by following a didactic
approach with three pillars.
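
To illustrate the quantitative side of the first gap, the following toy example counts how often a term occurs in a small set of sources and reports its relative frequency. The two sentences and the term liberty are invented for illustration, and the tokenizer is a deliberately crude assumption (lower-cased runs of the letters a to z). The numbers it yields say nothing by themselves about prevailing social concepts; interpreting them remains the scholar's task.

```python
import re
from collections import Counter

# Hypothetical source texts (invented for illustration).
sources = [
    "The liberty of the people is the liberty of the realm.",
    "No liberty without order, and no order without law.",
]


def tokenize(text):
    """Crude tokenizer: lower-case the text and keep alphabetic runs."""
    return re.findall(r"[a-z]+", text.lower())


tokens = [token for source in sources for token in tokenize(source)]
freq = Counter(tokens)

absolute = freq["liberty"]          # raw number of occurrences
relative = absolute / len(tokens)   # share of all tokens

print(absolute, len(tokens), relative)
```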
First, we guide you through various research procedures involved in
the application of language technology. Chapter 2 looks at the design
of language resources, the first step in the application of language
technology. The following chapters study specific humanities-related
research problems and show how to design quantitative research
procedures to address them. We believe that an understanding of how
to design a research process in language technology is one of the key
steps to understanding its overall logic. We do not, however, explain
the technical implementation of the research procedures discussed
throughout the book.4 Thanks to the development of computing
tools in popular programming languages, such as Python and R, many

4 The Python implementation can be found in the following GitHub repository:
https://github.com/toth12/language-technology-humanities.

of the technological procedures presented here have been (at least
partially) automated, and their implementation can be learnt by
following excellent on-line tutorials and manuals. But what is difficult to learn
from on-line resources is how language technology is related to existing
research goals and practices in the humanities. We draw on both complex
and simple examples to illustrate this. Sometimes these illustrative exam-
ples will be simplistic; we call these ‘toy examples’. Despite their simplic-
ity, we believe that readers can grasp otherwise highly complex research
problems and procedures through them.
Second, we highlight both the potential and the limitations of lan-
guage technology. We believe that the logic of a technology can be
understood if one is aware of what that technology can and cannot
resolve. A thorough critical understanding is crucial to use technology in
an innovative way.
Third, we will return again and again to the two gaps described
above. Resolving the conceptual gap involves a process that is similar
to translation. In order to make use of language technology, humanities
researchers have to express, or more technically speaking operationalize,
their research questions and problems in ways that are meaningful
from the language technology perspective. This translation process will
be demonstrated through various applied examples. Similarly, address-
ing the gap between quantitative and qualitative views needs a transla-
tion process. Highly complex mathematical procedures need humanistic
analogies so that their results can inform qualitative research prob-
lems. Throughout the book we will attempt to establish such analogies.
Although these might sound simplistic to readers trained in mathematics,
we believe that our simple and accessible explanations will enable readers
to build a more solid understanding of language technology.
With language technology playing a pivotal role in the discovery
and analysis of textual data, this book offers an accessible overview of
the main topics that can be considered under the umbrella term of lan-
guage technology: corpus linguistics, computational linguistics, natu-
ral language processing, and text mining. Our aim is to focus on those
aspects that are relevant to a readership of humanists. To keep this vol-
ume agile and easy to handle, some topics have been removed from its
scope. For example, sentiment analysis is only briefly touched on, and we
have not been able to cover many other important areas, including stylo-
metrics, geospatial analysis, and authorship attribution. Space constraints

also mean that many details concerning the topics covered have been
omitted. However, we have aimed to provide enough basic information for
readers to explore further the themes that particularly interest them.

References
Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with
Python. Sebastopol, CA: O’Reilly.
Gries, S. T. (2009). Quantitative Corpus Linguistics with R. New York, NY and
Abingdon: Routledge.
Hockey, S. (2000). Electronic Texts in the Humanities: Principles and Practice.
Oxford: Oxford University Press.
Jockers, M. L. (2013). Macroanalysis: Digital Methods and Literary History.
Champaign, IL: University of Illinois Press.
Jockers, M. L. (2014). Text Analysis with R for Students of Literature. New York,
NY: Springer.
Moretti, F. (2015). Distant Reading. London: Verso.
Piotrowski, M. (2012). Natural Language Processing for Historical Texts.
San Rafael, CA: Morgan and Claypool.
CHAPTER 2

Design of Text Resources and Tools

Abstract This chapter guides the reader through the key stages of
creating language resources. After explaining the difference between
linguistic corpora and other text collections, the authors briefly introduce
the typology of corpora created by corpus linguists and the concept of
corpus annotation. Basic terminology from natural language processing
(NLP) and corpus linguistics is introduced, alongside an explanation of the
main components of an NLP pipeline and tools, including pre-processing,
part-of-speech tagging, lemmatization, and entity extraction.

Keywords Corpus · Text collection · Metadata · Annotation · Natural Language Processing (NLP) · Pipeline · Tool · Part-of-speech tagging · Lemmatization

2.1  Text Resources in the Humanities


This chapter will guide you through the key stages of creating and using
text resources and tools. We use the term text resource to refer to col-
lections of texts of various kinds, as well as other types of text-based
resources encountered in the humanities. The key difference here is that
collections typically contain running text, often organized into sections,
books, and so on, while other text-based resources display text content
which is not necessarily a connected piece of work. For example, the first

© The Author(s) 2020. B. McGillivray and G. M. Tóth, Applying Language Technology
in Humanities Research, https://doi.org/10.1007/978-3-030-46493-6_2

edition of the Encyclopaedia Britannica1 was digitized by the National
Library of Scotland and is organized into volumes and pages, hence
we call it a text collection. On the other hand, resources like historical
maps can contain textual parts which are not in the form of running text.
Some of the techniques that we cover here, such as lemmatization or
semantic analysis, will also be suitable for exploring such resources.
Many text collections encountered in the humanities are not in digital
form, and a large proportion of digital humanities research focuses on the
process of digitizing such texts and providing them to the scholarly com-
munity. The focus of this book is on texts in digital form, as this is a pre-
condition for the type of computer processing that we are concerned with.
A special category of text collections is that of linguistic corpora,
which by definition are designed with the specific purpose of studying
human language. But text analysis can reveal important patterns about
society and culture, well beyond questions of linguistic interest. For
example, finding occurrences of place names, personal names, or men-
tion of events in texts, and counting those instances both independently
and in relation to other entities can help us access aspects of the content
at a scale that is not reachable by close reading. Purely linguistic units
like patterns of use of modal verbs (like must, should, or ought to in
English) can be used to study abstract concepts such as obligation in a
historical period. Therefore, the experience of corpus research, including
approaches to corpus design and creation, is helpful as a way to draw rich
information from texts.
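
As a toy illustration of the modal-verb idea, the sketch below counts must, should, and ought to in two short snippets. The snippets and the labels period_a and period_b are hypothetical stand-ins for texts from two historical periods; a real study would work on a properly designed corpus and would normalize the counts by text length.

```python
import re
from collections import Counter

# Modal expressions named in the text; "ought to" is a two-word pattern.
MODALS = ["must", "should", "ought to"]

# Hypothetical snippets standing in for texts from two periods.
TEXTS = {
    "period_a": "Every man must obey. A servant ought to be diligent and must not be idle.",
    "period_b": "One should be kind. You should read more, though you must rest.",
}


def count_modals(text):
    """Count whole-word occurrences of each modal expression in a text."""
    lowered = text.lower()
    counts = Counter()
    for modal in MODALS:
        pattern = r"\b" + re.escape(modal) + r"\b"
        counts[modal] = len(re.findall(pattern, lowered))
    return counts


modal_counts = {label: count_modals(text) for label, text in TEXTS.items()}
for label, counts in modal_counts.items():
    print(label, dict(counts))
```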
Over the past decades, linguists have dedicated major efforts to cre-
ating corpora and developing ways of enriching them with annotation.
These efforts have the potential to benefit other humanities disciplines.
For example, if positive and negative adjectives are annotated in a novel,
then we can analyse the sentiment of the text, and relate it to different
characters to answer questions such as “Is character X associated with
negative sentiment and how does this change throughout the novel?” In
this chapter we describe a typology of linguistic corpora and some of their
most useful features. We focus on how corpus design questions such as
balance and representativeness may impact research outcomes. We also
explain the workings of annotation, stressing the benefits it can bring to

1 https://digital.nls.uk/encyclopaedia-britannica/archive/144133900.

humanities scholarship, and discuss the challenges that text resources of
interest to humanists pose for the design and creation phases.
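
The question about character X and sentiment can be sketched with a few lines of Python. Everything in the example is an assumption made for illustration: the polarity lexicon, the characters X and Y, and the annotated records (chapter, character, adjective) are invented, and summing +1 and -1 scores is only one very simple way of operationalizing sentiment.

```python
from collections import defaultdict

# Hypothetical polarity lexicon: +1 for positive, -1 for negative adjectives.
POLARITY = {"kind": 1, "gentle": 1, "cruel": -1, "bitter": -1, "cold": -1}

# Hypothetical annotation: (chapter, character, adjective describing them).
ANNOTATED = [
    (1, "X", "kind"), (1, "X", "gentle"), (1, "Y", "cold"),
    (2, "X", "bitter"), (2, "X", "kind"),
    (3, "X", "cruel"), (3, "X", "cold"), (3, "Y", "gentle"),
]


def sentiment_by_chapter(records, character):
    """Sum the polarity of adjectives linked to one character, per chapter."""
    scores = defaultdict(int)
    for chapter, who, adjective in records:
        if who == character:
            scores[chapter] += POLARITY.get(adjective, 0)
    return dict(sorted(scores.items()))


scores_x = sentiment_by_chapter(ANNOTATED, "X")
print(scores_x)
```

In this invented data set the score for X falls from chapter to chapter, which is the kind of pattern a scholar would then check against a close reading of the novel.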
Regarding tools to manipulate and enrich texts, we will borrow some
basic terminology from the fields of natural language processing (NLP)
and corpus linguistics, and explain the main concepts behind processes
such as tokenization, part-of-speech tagging, lemmatization, and so on.
The main focus is on how these tools are needed in humanities research
and can guide the exploration and close reading of texts.

2.1.1   Text Resources and Corpora


Here we use the term text resource broadly to refer to any resource
containing some text. Corpora (plural of corpus) are a special type of text
resource because they are collections typically created by linguists to
answer specifically linguistic questions. In fact, a branch of linguistics
called corpus linguistics emerged in the second half of the twentieth cen-
tury to define the characteristics of corpora, and to create and use them
in linguistic studies. For an overview of the history of corpus linguistics,
see McEnery and Hardie (2013).
There are several definitions of what a corpus actually is. One of the
most famous definitions is by Sinclair (2005): a corpus is “a collection of
pieces of language text in electronic form, selected according to external
criteria to represent, as far as possible, a language or language variety as
a source of data for linguistic research”. Along similar lines, Xiao et al.
(2006) say that “a corpus is a collection of machine readable authentic
texts (including transcripts of spoken data) which is sampled to be rep-
resentative of a particular language or language variety”. Both these defi-
nitions name the main features of corpora: electronic format, the fact
that they contain naturally occurring language, and the fact that they
are meant to represent a language or a part of it (technically called “vari-
ety”). For example, the corpus of the frequency lexicon of spoken Italian
(Bellini and Schneider 2003–18) contains 469 texts that are transcriptions
of lectures, TV and radio programmes, and spoken interactions between
people. This corpus was used to create the first frequency dictionary of
spoken Italian, which gives information about how frequently words are
used in this language. The aim was to support scholars studying the lan-
guage variety of spoken Italian. To realize this goal the corpus compil-
ers decided to focus on four cities (Milan, Florence, Rome, and Naples)
deemed representative of the broad range of features of spoken Italian.

One important difference between humanities text collections and
linguistic corpora is that the former are typically not created for the purpose
of linguistic studies and therefore do not aim at being representative of a
language variety. For example, the Darwin Correspondence Project2 has
published the full texts of more than 9000 of Charles Darwin’s letters.
In this project the letters have been collected into a resource available to
the community of historians of science and other scholars interested in
analysing Charles Darwin’s views and entourage, and more broadly the
creation of his scientific network, his impact on the scientific discourse of
his time, and his legacy. However, the Darwin Letters are not meant to
be representative of the English language or the scientific language of the
time. Many of the text technology tools and terminology developed by
corpus linguists are very useful when analysing and processing the texts
of a non-linguistic project like the Darwin Correspondence Project, and
we will describe such tools and terminology in the next sections.

2.1.2   Data and Metadata


We have seen that text resources can contain text in various forms. Let
us take the example of the Hartlib Papers,3 which contain the full-text
transcription (as well as facsimile images) of the manuscripts of the corre-
spondence of Samuel Hartlib (c.1600–62), a seventeenth-century ‘intel-
ligencer’ and man of science. The texts of these letters reveal interesting
insights into the topics talked about in Hartlib’s circle. However, for
many research questions it is of critical importance to scholars to know
a range of other attributes in addition to the texts of the letters, such as
the library subject header, the year in which the letter was written, who
wrote it, its addressee, the correspondents’ gender and location, and so on. This allows us to
investigate, for instance, how many women corresponded with Samuel
Hartlib, and whether this number changed over time. Together, all
these features about the context of the text are referred to as metadata.4
Combining the text data with the metadata makes it possible to answer
even more questions, such as: how did the topics of the letters change

2 https://www.darwinproject.ac.uk.

3 https://www.dhi.ac.uk/hartlib/context.

4 We will follow the Oxford Dictionaries in using metadata as a mass noun and data as a
plural noun.

over time? Did Hartlib use a different style when addressing certain
personalities? Does the length of the letters tend to change over time?
Metadata can be of different types, depending on the kind of infor-
mation it provides. We follow the categorization in Burnard (2005)
and distinguish between descriptive, administrative, editorial, and analytic
metadata. The scope of the first two categories is the collection as a whole,
while the latter two apply to smaller text units. Descriptive metadata
records external information about the context of the text,
such as its source, date of publication, and the sociodemographics of the
authors. Administrative metadata contains information about the collec-
tion itself, for example its title, its version, encoding, and so on. Editorial
metadata, on the other hand, provides information about the editorial
choices that the creators of the digital collection made with respect to
the original text, for example regarding additions, omissions, or correc-
tions. Finally, analytic metadata focuses on the structure of the text, for
example by marking the beginning and end of sections or paragraphs.
Metadata can be encoded into text resources in various ways, either
in external documentation or as part of the collections themselves. The
Text Encoding Initiative (TEI) has developed detailed guidelines for the
encoding of texts in digital format and it has become a widely accepted
standard in the digital humanities. The TEI guidelines specify, among
other things, how the metadata of a text should be recorded in what is
known as the TEI header (for details see TEI Consortium 2019).
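As an illustration of what a TEI header looks like and how its metadata can be read programmatically, the sketch below parses a deliberately minimal TEI document with Python's standard library. The document content is invented, and the selection of fields returned by the hypothetical `read_header` function is our own assumption; real headers are usually much richer.

```python
import xml.etree.ElementTree as ET

# The TEI namespace; elements in a TEI document live in this namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# An invented, minimal TEI document: a header with metadata, then the text.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Sample letter (invented)</title>
        <author>Unknown correspondent</author>
      </titleStmt>
      <publicationStmt><p>Unpublished example.</p></publicationStmt>
      <sourceDesc><p>Invented for illustration.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><p>Sir, I have received your papers.</p></body></text>
</TEI>"""


def read_header(xml_string):
    """Return a few header fields and the first paragraph of the text."""
    root = ET.fromstring(xml_string)
    file_desc = "tei:teiHeader/tei:fileDesc/"
    return {
        "title": root.findtext(file_desc + "tei:titleStmt/tei:title", namespaces=TEI_NS),
        "author": root.findtext(file_desc + "tei:titleStmt/tei:author", namespaces=TEI_NS),
        "source": root.findtext(file_desc + "tei:sourceDesc/tei:p", namespaces=TEI_NS),
        "first_paragraph": root.findtext("tei:text/tei:body/tei:p", namespaces=TEI_NS),
    }


print(read_header(sample))
```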
As we have said earlier, metadata combined with text data offers the
widest scope for insightful ways to explore texts. Moreover, the texts
themselves can be enriched via annotation to optimize the implicit lin-
guistic information they contain and make it usable for large-scale anal-
yses. Let us imagine that we have access to a large collection of digitized
newspapers and we are interested in analysing the level of international
relations exemplified in this collection. Knowing the geographical ori-
gin of each newspaper is of primary importance, but it is not sufficient
because a newspaper article may talk about a location which is differ-
ent from its place of publication. Hence, we would want to conduct
an in-depth search of the texts to find, for example, instances of place
names. This can be a very time-consuming (or sometimes impossible)
process if we need to read all the articles. Without good disambigua-
tion, we may have to ignore many instances of potentially irrelevant hits
while at the same time missing a high number of relevant hits. For exam-
ple, Paris is the name of the French capital but is also the name of a
12 B. McGILLIVRAY AND G. M. TÓTH

city in Texas, and being able to distinguish the two means that we can
know whether a particular mention refers to international relationships
with France or the United States. Moreover, Paris can also be a per-
son’s name, and at the same time the city can be referred to in different
ways (e.g., ‘the City of Lights’), so again being able to disambiguate the
usages of this name in context is very useful.
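The following toy sketch illustrates why a plain string search is not enough. It finds every occurrence of Paris and then tries to label each one from nearby cue words. The cue lists and the window size are invented for illustration, and the approach is far cruder than the named entity recognition and linking tools normally used for this task.

```python
import re

# Invented cue words, for illustration only.
CUES = {
    "Paris, France": {"france", "french", "seine"},
    "Paris, Texas": {"texas"},
}


def label_mentions(text, window=5):
    """Label each 'Paris' token using cue words within `window` tokens."""
    tokens = re.findall(r"\w+", text)
    labels = []
    for i, token in enumerate(tokens):
        if token != "Paris":
            continue
        context = {t.lower() for t in tokens[max(0, i - window): i + window + 1]}
        matches = [label for label, cues in CUES.items() if context & cues]
        # Zero or several matching labels: the mention stays unresolved.
        labels.append(matches[0] if len(matches) == 1 else "unresolved")
    return labels


print(label_mentions("The French envoy left Paris on Monday."))
print(label_mentions("Cotton prices in Paris rose across Texas."))
print(label_mentions("Paris spoke at the meeting."))
```

The third sentence, where Paris is a person's name, remains unresolved, which is exactly the kind of case that annotation is meant to handle.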
As noted by McEnery and Wilson (2001, p. 32), annotation makes
the linguistic information in a text computationally retrievable, thus
enabling a wide range of searches. Annotation can be performed in a manual,
automatic, semi-automatic, or crowd-sourced way, depending on whether
humans, computers, a combination of humans and computers, or groups
of humans are responsible for it. For a detailed overview of linguistic
annotation, see Jenset and McGillivray (2017, pp. 99 ff.). In Sect. 2.4
we will see different types of linguistic annotations and how they can be
relevant to humanities research.

2.2  Corpus Design and Creation


This section will guide you through the decisions involved in designing
and creating a corpus for humanities research. We will borrow most of
the terminology from corpus linguistics, but will also discuss what should
be adapted to the specific needs of humanities research. We will cover
issues of availability of the data, representativeness, and their impact on
the research outcomes. Finally, we will walk you through a use case, an
Ancient Greek corpus built for the purpose of studying how Ancient
Greek words change their meaning over time.

2.2.1   Designing a Text Resource


In humanities research it is very common to start from a question and
then search for the best evidence to answer it. In other cases, we may
have already identified an existing resource that we are interested in,
for example an archive, and we want to use it for research. To take an
example, the National Library of Scotland has recently made available
the first ten editions of the Encyclopaedia Britannica in digital form.
This is an impressive resource which allows us to explore a range of ques-
tions regarding the composition of the text, the differences between
editions, the various topics covered in it, the relationship between text
and images, and many more. But if we wanted to investigate themes that
relate to the wider historical context of this work or comparisons with
other encyclopaedias from different eras or geographical regions, for
example, we would need to expand our evidence base to other resources.
So we would find ourselves in the position of assembling a suitable cor-
pus. In this section we will explore the steps involved in designing a text
resource in a way that lets us answer the questions we care about.
Let us imagine, for example, that we are interested in analysing the
spread of diseases in France in the nineteenth century. What are the best
sources to address this topic? Within the limits of our time and resources,
we should choose the available texts that allow us to best tackle the task
in question. How can we do this in practice? A useful starting point is to
consider the features of the texts we would need to collect. Some helpful
criteria for designing the resource can be borrowed from corpus linguis-
tics, where corpora can be categorized in the following ways:

• By medium: does the corpus contain only text, speech, video mate-
rial, or is it mixed?
• By size: does the corpus contain a static snapshot of a language vari-
ety (static corpus) or is it continually updated to monitor the evolu-
tion of language (monitor corpus)?
• By language: is the corpus monolingual or multilingual? If it is mul-
tilingual, have its parts been aligned (parallel corpus)?
• By time: does the corpus cover a language variety in a specific period
without considering its time evolution (synchronic corpus) or does
it focus on the change of a language variety over time (diachronic
corpus)?
• By purpose: was the corpus built to describe the general language
(like contemporary spoken English) or a special aspect of it (like the
language of medical emergency reports)?

Of course, some of these criteria may co-exist, so that, for instance, we
may create a collection of health reports (text) in French (monolingual)
covering the nineteenth century (diachronic). But how can we make sure
that our resource is good enough to investigate our question, i.e., the
spread of diseases in nineteenth-century France?
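One way to keep track of these design decisions is to record them explicitly alongside the corpus. The sketch below does this with a small Python data class. The class name `CorpusProfile` and its field names are our own invention, and the values for `updating` and `purpose` are assumptions, since the example above only fixes the medium, the language, and the time dimension.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CorpusProfile:
    """Design criteria for a corpus, following the categories listed above."""
    medium: str                  # "text", "speech", "video", or "mixed"
    updating: str                # "static" or "monitor"
    languages: Tuple[str, ...]   # one entry means monolingual
    time_axis: str               # "synchronic" or "diachronic"
    purpose: str                 # "general" or "special"

    @property
    def is_multilingual(self):
        return len(self.languages) > 1


# The health-report example; `updating` and `purpose` are assumed values.
health_reports = CorpusProfile(
    medium="text",
    updating="static",
    languages=("French",),
    time_axis="diachronic",
    purpose="special",
)

print(health_reports)
print(health_reports.is_multilingual)
```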
In the course of its history, the field of corpus linguistics has witnessed
a heated debate around the topic of representativeness, and corpus linguists
have developed methodologies for building balanced corpora that aim at
being representative of the language under study. These methodologies
typically involve drawing up a prioritized inventory of the relevant features,
for example register, region, time period, and so on; then estimating a
target size for each feature, and then assembling the corpus according
to these proportions. In the example about the spread of diseases in
nineteenth-century France, we should make sure that the texts are sampled
from different regions in such a way as to reflect the diversity of
France. For instance, given the prominent role of Paris, we would expect
a high proportion of reports to be from this city, but at the same time we
would want to ensure that other regions are suitably represented as well.
In addition to geographical provenance, we should account for other fea-
tures such as text type, and make sure that the texts represent a range
of different health-related texts by different roles (medical professionals,
general public, scientists, etc.).
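The procedure of fixing target proportions and then assembling the corpus accordingly can be sketched as a simple quota sample. In the Python example below the regions, the proportions, and the pool of documents are all invented, and the function name `sample_by_quota` is hypothetical. When fewer texts are available than the quota asks for, the sketch simply takes all of them, which mirrors the situation in which perfect balance cannot be reached.

```python
import random
from collections import defaultdict


def sample_by_quota(documents, proportions, total, key="region", seed=0):
    """Draw up to round(total * share) documents from each stratum."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_stratum = defaultdict(list)
    for doc in documents:
        by_stratum[doc[key]].append(doc)
    sample = []
    for stratum, share in proportions.items():
        quota = round(total * share)
        pool = by_stratum.get(stratum, [])
        # If the pool is smaller than the quota, take everything available.
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample


# Invented pool: many reports from Paris, few from Brittany.
pool = (
    [{"id": f"paris-{i}", "region": "Paris"} for i in range(50)]
    + [{"id": f"lyon-{i}", "region": "Lyon"} for i in range(30)]
    + [{"id": f"brittany-{i}", "region": "Brittany"} for i in range(3)]
)
targets = {"Paris": 0.5, "Lyon": 0.3, "Brittany": 0.2}  # invented proportions

corpus = sample_by_quota(pool, targets, total=20)
counts = defaultdict(int)
for doc in corpus:
    counts[doc["region"]] += 1
print(dict(counts))
```

The same function can be applied to other features, such as text type or author role, by changing the `key` argument.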
As McGillivray (2014, pp. 11–13) discusses, in spite of the efforts
to build balanced corpora, representativeness remains an ideal limit and
more recently a different approach aimed at gathering the most inclusive
set of texts has gained popularity. This has led to the creation of very
large corpora containing as many texts as it is feasible to collect. This
topic has been discussed at length outside linguistics (see, for example,
Underwood 2019; Bode 2020). While not taking part in this debate
here, we believe that engaging with the issue of representativeness in a
critical way is useful when designing a text resource for humanities research.
In the next section we will dive deeper into the features of the text
resources typically considered in humanities research, and stress some key
differences from linguistic corpora.

2.2.2   Humanities Corpora


In this chapter we have made a distinction between humanities text
resources in general on the one hand and linguistic corpora on the
other. In Sect. 2.1.1 we saw that today’s corpora tend to be assem-
bled with the aim of including as many relevant texts as possible, even
if this means compromising on the balance between different features.
This is true especially in the case of contemporary languages, for which
the Internet can provide a huge source of born-digital texts, leading
to very large corpora whose size can be measured in billions of words.
One example is the JSI Timestamped English corpus,5 an English corpus

5 https://www.sketchengine.eu/jozef-stefan-institute-newsfeed-corpus/.
built from news articles obtained from their RSS feeds; it is updated daily
and contains 37 billion words. Such an unrestricted approach to cor-
pus building, however, is not always applicable to the text resources
employed in humanities scholarship, where a potentially complex inter-
action of research questions and availability of texts affects the size and
shape of the resources we can create. For example, sometimes only a few
texts or fragments have survived historical accidents and have found
their way into the collection, meaning that creating a balanced corpus is
simply not a viable option.
Three important considerations to keep in mind when building a
corpus in humanities research are access, digitization, and encoding.
Gaining access to a group of texts can often be anything but straight-
forward, requiring potentially complex issues to be negotiated such as
legal questions with third parties (who might have been responsible for
the digitization, for example), and privacy or human data protection
concerns. Even when we gain access to the texts, these may need to be
digitized, as any subsequent computational processing of the type we talk
about in this volume requires them to be in digital form. Once the texts
have been digitized, or even better during the digitization step itself, the
texts should be presented in such a way as to enable their effective use in
research. In Sect. 2.1.2 we touched on the TEI guidelines, which pro-
vide a great basis for ensuring that digital texts are equipped with all
the metadata needed to place them in their historical context. Although
these topics are not the focus of this volume and therefore will not be
covered in depth, we acknowledge that access, digitization, and encoding
can have a significant impact on the decisions that follow in the research
process. In particular, the quality of the digitization can radically affect
the outcomes of quantitative analyses carried out on the texts, as shown,
for example, by Hill and Hengchen (2019).
Another challenge concerns historical texts, which are often the object
of study in the humanities and which require especially careful consid-
eration. One primary reason for this is that tools and methods devel-
oped in language technology research are still mainly concerned with
modern and well-established languages like English, but require special
adaptation when applied to historical languages (cf. Piotrowski 2012;
McGillivray 2014). Philological and interpretative issues are often of
major importance and need to be accurately incorporated in the corpus
design phase (cf. Meyer 2015). Furthermore, the lack of native speak-
ers of extinct languages or old varieties of living languages means that
we cannot rely on native speaker intuition for the annotation, and extra
layers of checks and explicit guidelines are needed to achieve good quality
results. The next section will describe a concrete use case involving a
historical language, Ancient Greek.

2.3  Use Case: The Diorisis Ancient Greek Corpus


In Sect. 2.2.2 we stressed some of the features of humanities corpora,
including the specific challenges posed by historical texts. This section is
dedicated to a case study on the Diorisis Ancient Greek corpus, which is
described in detail in Vatri and McGillivray (2018) and which will give us
the opportunity to explore the process of corpus design starting from the
original research questions through a concrete example.
The Diorisis corpus was built in the context of the “Computational
models of meaning change in natural language texts” project funded by
the Alan Turing Institute (McGillivray et al. 2019; Perrone et al. 2019).
The interdisciplinary project team included scholars in classics, NLP, sta-
tistics, and digital humanities, who worked together for six months to
begin to explore the following question: how can we identify the change
in meaning of Ancient Greek words over the history of this language?
The question of meaning change (or semantic change, more precisely)
is relevant to a range of humanistic disciplines. In fact, a large part of
humanities research involves interpreting meaning in textual sources, for
example to find instances of entities or concepts based on which we can
analyse historical, cultural, and social trends, or explore the connection
between language and stylistic and geographical factors.
Words can have many meanings, and these change over time and
across registers, geography, style, etc. Let us take the example of the
Ancient Greek word mus, which can mean ‘mussel’, ‘muscle’, ‘whale’,
or ‘mouse’. If we are interested in medical terminology, how
can we find only those texts that display the medical meaning of mus?
Knowing the genre of a text will obviously help, as the medical mean-
ing is more likely to be found in medical texts, but occurrences of the
medical meaning can also be found in other texts. Historical dictionaries
usually give some examples of usage of each word meaning, but do not
attempt to give a full account of the literature, and using close reading
methods, we would need to read and record the meaning of all words
in every single text ever written, which clearly does not scale up to very
large text collections. So, having access to an annotated corpus can make
all the difference (McGillivray et al. 2019).
The project aimed to map the change in the meaning of words in the
history of Ancient Greek from the seventh century BCE to the fifth cen-
tury CE, an extremely ambitious goal. For this purpose, we had to build
the largest corpus possible. In Sect. 2.1.1 we stressed the aspiration to
representativeness. One of the important factors to keep in mind is the
role of genre in Ancient Greek semantics, so in the corpus design phase
we aimed at finding the best possible representation of Ancient Greek
genres. While scoping the genre distribution of the texts, we devised a
categorization into genre classes (such as Poetry, Narrative, or Technical)
and subclasses (such as Bucolic, Biography, or Geography).
The categorization aimed at the best possible representation of
Ancient Greek genres. The emphasis on “possible” is critical in this con-
text, as we were constrained by three main factors. First, the texts that
have survived historical accidents and have reached us are all we can hope
to obtain for Ancient Greek. Second, as new digitization was not within
the scope of the project, the number of available digital resources consti-
tuted the upper limit of what we were able to include. Third, even when
digitized editions exist, they may not be free to use and distribute, so we
sourced the texts from three openly available digital libraries (for details
see Vatri and McGillivray 2018). The corpus consists of 820 texts and it
counts 10,206,421 word tokens, making it the largest corpus of its kind
available today.
As is often the case in digital humanities projects, the texts came in
different formats, ranging from TEI XML, to non-TEI XML, HTML,
and Microsoft Word files.6 Therefore we had to allow for an initial phase
of cleaning and standardization of these formats into TEI-compliant
XML to allow further processing and analysis. Another important con-
sideration was character encoding. Greek characters can pose additional
challenges when it comes to encoding, and we found a range of options
in the sources, from Beta Code7 to UTF-8 Unicode, to HTML hexadecimal
references. Taking the example from Vatri and McGillivray (2018),
for the Greek character ᾆ, the Beta Code is A)=|, the Unicode UTF-8
encoding is ᾆ, and the hexadecimal reference is &#x1F86;. We converted
all Greek characters to Beta Code for standardization purposes, choosing
this encoding because it makes automatic processing and retrieval easier.

6 See http://teibyexample.org/modules/TBED00v00.htm?target=markuplanguages for
an explanation of these terms.


7 https://www.tlg.uci.edu/encoding.
For example, if we want to find occurrences of the same word
starting with or without a capital letter and match them to a digital
dictionary, we can easily do that with Beta Code. This is because Beta Code
encodes capitalization by adding an asterisk (*) to the letter character,
so we can look up the capitalized and non-capitalized forms of the
same word by adding or removing the asterisk.
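The points about character encoding and capitalization can be illustrated with a short Python sketch. The first part checks that the different representations of ᾆ refer to the same character, with the hexadecimal reference written in the standard `&#x...;` form. The second part shows the asterisk convention at work. The function names and the one-entry lexicon are hypothetical, and the sketch assumes the asterisk is the only capitalization marker; it does not handle any special placement of diacritics on capital letters.

```python
import html

# The Greek character discussed above, given by its Unicode code point.
char = "\u1f86"  # ᾆ

# A hexadecimal character reference resolves to the same character.
assert html.unescape("&#x1F86;") == char
assert ord(char) == 0x1F86

# UTF-8 stores this character as three bytes.
utf8_bytes = char.encode("utf-8")
print(utf8_bytes.hex())


def is_capitalized(beta_word):
    """In this sketch a Beta Code word is capitalized if it starts with '*'."""
    return beta_word.startswith("*")


def fold_case(beta_word):
    """Drop the asterisk so both forms share one lookup key."""
    return beta_word.replace("*", "")


# Hypothetical one-entry lexicon keyed by non-capitalized Beta Code forms.
lexicon = {"po/lis": "city"}

for form in ["po/lis", "*po/lis"]:
    print(form, is_capitalized(form), lexicon.get(fold_case(form)))
```

Because capitalization is a separate marker rather than a different character, both forms reach the same dictionary entry with a single string operation.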
The format of the corpus was determined by further processing,
aimed at identifying semantic change in a computational way. This means
that, instead of one single file of running text, the corpus is organized in
several text files to enable faster programmatic access to it. Moreover, the
text is split into sentences, as a sentence is the unit of input for the com-
putational model. We marked sentence boundaries as analytic metadata
in the text.
Below is an excerpt from the text file for the work Leucippe and
Clitophon by Achilles Tatius, where we have removed the linguistic anno-
tation on lemma and morphological analysis (see Sect. 2.4 for more
details) and only included the first four words of the first sentence (hence
the ellipsis).8

<sentence id="1" location="1.1.1">
<word form="*sidw\n" id="1"></word>
<word form="e)pi\" id="2"></word>
<word form="qala/tth|" id="3"></word>
<word form="po/lis" id="4"></word>

</sentence>
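A fragment in this format can be read with any standard XML parser. The minimal Python sketch below parses the four-word excerpt and returns the location of the sentence together with its word forms. The function name `word_forms` is hypothetical, and the fragment is embedded as a string so that the example is self-contained.

```python
import xml.etree.ElementTree as ET

# The excerpt as a raw string, so that backslashes in the Beta Code
# forms are kept literally.
fragment = r"""<sentence id="1" location="1.1.1">
<word form="*sidw\n" id="1"></word>
<word form="e)pi\" id="2"></word>
<word form="qala/tth|" id="3"></word>
<word form="po/lis" id="4"></word>
</sentence>"""


def word_forms(sentence_xml):
    """Return the sentence location and the list of Beta Code word forms."""
    sentence = ET.fromstring(sentence_xml)
    forms = [word.get("form") for word in sentence.iter("word")]
    return sentence.get("location"), forms


location, forms = word_forms(fragment)
print(location)
print(forms)
```

Since each sentence is a self-contained element, a program can process the corpus one sentence at a time without loading a whole work as running text.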

We retained analytic metadata information regarding the line, book,
chapter, or section of each sentence and whether a text chunk was a quotation
or not. We also encoded modern additions to fragmentary texts as
editorial metadata, so as to allow for their easy retrieval in case they were
relevant for subsequent analysis, but we excluded elements that were not
needed for the analysis, such as footnotes and critical apparatuses. Finally,
we encoded the text-level metadata in the TEI header. In our case, the

8 In the example we can see that the XML tag <sentence> marks the beginning of the
sentence, and has the attributes id (which assigns a unique identifier to the sentence) and
location (which gives information about the passage to which the sentence belongs). Nested
inside the <sentence> tag we find a series of <word> tags, each corresponding to a word in the
sentence.
Fig. 155.

The cutting edge of the hole is at the smaller diameter; place that
side of the plate up. Never use a hammer as it would split the top of
the peg and would ruin the cutting edge of the dowel plate should it
strike it. Use a mallet, and when the peg is nearly thru finish by
striking a second peg placed upon the head of the first.
86. Directions for Doweling.—(1) Place the boards to be
doweled side by side in the vise, the
face sides out, and even the jointed edges. (2) Square lines across
the two edges with knife and trysquare at points where it is desired
to locate dowels. (3) Set the gage for about half the thickness of the
finished board and gage from the face side across the knife lines. (4)
At the resulting crosses bore holes of the same diameter as that of
the dowel.
Fig. 156.

These holes should be bored to a uniform depth. Count the turns
of the brace. One inch is a good depth for ordinary work. (5)
Countersink the holes slightly, just enough to remove the sharp
arrises. This removes any bur and allows a little space into which the
surplus glue may run. (6) Cut the sharp arrises off the dowel, just
enough to allow it to be started into the hole. (7) With a stick slightly
smaller than the hole, place glue upon the sides of the hole, and
drive the dowel in. A small V-shaped groove previously cut along the
side of the dowel will allow the surplus glue to escape and thus
prevent any danger of splitting the board. (8) Clean off the surplus
glue, unless the members can be placed together before it has had
time to set. (9) Saw off the dowels to a length slightly less than the
depth of the holes in the second piece. (10) Trim off the sharp
arrises. Fig. 156. (11) Glue the holes and the edge of the second
board. (12) Put the two members in the clamps and set away until
the glue has had time to harden.
Fig. 157.

87. Keyed Tenon-and-Mortise.—Fig. 157 shows the tenon, the
mortise in the second member into
which the tenon fits, the mortise in the tenon and its key or wedge.

Fig. 158.

88. Directions for Key:—Keys are made in quite a variety of
shapes. Some of the simple forms are
shown in Fig. 158. Where two or more keys of the same size are to
be made, it is customary to plane all in one piece. (1) Plane a face
side, a face edge, gage and plane to thickness. If there is more than
one key, saw each to length. (2) Shape the remaining edge as
desired. The lines AB and CD, Fig. 158, indicate the points at which
measurements are to be made to determine the length of mortise in
the tenon which is to receive the key. These lines should be laid off
at a distance apart equal to the thickness of the tenon.
89. Directions for Tenon.—(1) Measure from the end of the piece
the length of the tenon, and mark with a
knife point. Where tenons are to be cut on both ends of a piece,
measurement is frequently made from the middle of the piece each
way to locate the shoulders. Should there be any variation in the
length of the piece from what it should be, this difference will then be
equally divided at the ends. This is done when it is more important to
have the distance between the shoulders of a definite length than
that the tenons be of correct length. (2) Square knife lines entirely
around the piece at the knife point mark. (3) Set the gage equal to
the distance required from the face edge to the nearest edge of the
tenon and mark on both sides, as far as the shoulder marks, and on
the end. (4) Repeat, setting the gage from the face edge to the
farther edge of the tenon. If the two members are of the same width
and the tenon and mortise are to be equally distant from the face
edge, both tenon and mortise should be gaged with the same
settings. Frequently, the gage settings are obtained from the rule
indirectly. The rule is laid across the piece and the width or thickness
of mortise or tenon marked with the point of a knife blade, Fig. 159.
The spur of the gage is then set in one of these points, the block
being pushed firmly against the face; the thumb-screw is then
fastened, Fig. 160. The second setting is obtained in a similar
manner from the same edge or side. All the pieces are marked for
the first width before resetting. (5) After having laid out the mortise in
the tenon, rip to the gage lines and cross-cut to the shoulder lines,
paring if necessary. (6) Slightly bevel the ends of the tenon.
Fig. 159. Fig. 160.

90. Directions for Mortise.—(1) From one end of the piece
measure and mark with the knife point
the respective distances to the two edges of the mortise. (2) Square
lines across the face edge and the two broad surfaces at these
points. (3) Set the gage equal to the required distance from the face
edge to the nearer edge of the mortise and mark between the lines.
(4) Set the gage equal to the required distance from the face edge to
the farther edge of the mortise and mark between the lines. Make
both gage lines on face side and side opposite as well. (5) Cut the
mortises. First, bore a series of holes thru the mortise, using a bit
somewhat smaller than the width of the mortise. Bore these holes so
that they connect one with another. (6) Place the piece on a chiseling
board and, taking thin cuts about half way thru, work from the middle
of the mortise out to within one thirty-second of an inch of the knife
and gage lines. (7) Reverse and chisel from the other side, finishing
it; then chisel the first side out to the lines. Test the sides of the
mortise with a straight edge—the blade of the chisel makes a good
one—to see that they are cut straight. Fig. 161.
Fig. 161.

91. Directions for Mortise in the Tenon.—(1) Lay out the sides
of the mortise for the key
before the sides and shoulders of the tenons are cut. From the
shoulder line of the tenon, measure toward the end a distance
slightly less—about one thirty-second of an inch—than the thickness
of the member thru which the tenon is to pass. This is to insure the
key’s wedging against the second member. (2) Square this line
across the face edge and on to the side opposite the face side. (3)
On the top surface measure from the line just squared around the
piece a distance equal to the width the key is to have at this point
when in place. Fig. 158, A B. (4) Square a pencil line across the
surface at this point. (5) In a similar manner, measure and locate a
line on the opposite side, C D, Fig. 158. (6) Set the gage and mark
the side of the mortise nearer the face edge on face side and side
opposite. (7) Reset, and from the face edge gage the farther side of
the mortise, marking both sides. (8) This mortise may be bored and
chiseled like the one preceding. As one side of the mortise is to be
cut sloping, a little more care will be needed.

Fig. 162.

92. Blind Mortise-and-Tenon.—Probably no joint has a greater
variety of applications than the
blind mortise-and-tenon, Fig. 162. It is of equal importance to
carpentry, joinery and cabinet-making. The tenon shown has four
shoulders; it is often made with but three or two.
93. Directions for Tenon.—(1) Measure from the end of the piece
the length of tenon, (see also directions
for tenon, Section 89) and mark with the point of a knife. (2) Square
knife lines entirely around the four sides at this point to locate the
shoulders. (3) Lay the rule across the face edge near the end of the
piece and mark points with the end of the knife to indicate the
thickness of the tenon, Fig. 159. (4) With the head of the gage
against the face side, set the spur of the gage in one of these marks,
then fasten the set screw, Fig. 160. Gage on the end and the two
edges as far back as the knife lines. When there are several tenons
remember to mark all of them before resetting. (5) Set the gage in
the other mark, the head of the gage being placed against the face
side; then gage as before. (6) In a similar manner, place the rule
across the face side, mark points with the knife for the width of
tenon, set the gage to these points, and gage on the face and side
opposite as far as the shoulder lines and across the end. The head
of the gage must be held against the face edge for both settings. (7)
Rip to all of the gage lines first, then crosscut to the shoulder lines,
using back-saw. (8) The end of the tenon may be slightly beveled
that it may be started into the mortise without tearing off the arrises
of the opening.
94. Directions for Laying out Mortise.—(1) From one end of the
piece measure the
required distance to the nearer and the farther ends of the mortise.
Mark points with the knife. (2) Square lines across at these points.
(3) Lay the rule across the face into which the mortise is to be cut
and mark points with the knife for the sides of the mortise. (4) Set the
gage as was done for the tenon, the spur being placed in the knife
point mark and the head of the gage being pushed up against the
face. Gage between the cross lines. (5) Reset from the same face
for the other side of the mortise, and then gage.
If a mortise or tenon is to be placed in the middle of a piece, find
the middle of the piece, Fig. 3, Chapter I, Section 1, and with the
knife, place points to each side of the center mark at a distance
equal to one half the thickness or width of the tenon or mortise.
When several mortises or tenons of the same size are to be laid out
and are to be equally distant from a face, the gage needs to be set
but twice for all—once to mark the nearer edges and once for the
farther edges of the tenon or mortise. Should there be several like
members with like joints, the gage settings obtained from the first
piece will suffice for all.
The importance of working from face sides or face edges only,
cannot be overestimated. To work from either of the other two sides
of a piece would make the joints subject to any variation in the
widths or thicknesses of the pieces. To gage from the faces only,
insures mortises and tenons of exact size no matter how much the
pieces may vary in widths or thicknesses.
95. Directions for Cutting Mortise.—Two methods of cutting
mortises are in common use,
(a) boring and chiseling, and (b) chiseling alone. First method: (1)
Fasten the piece in the vise in a horizontal position. (2) Bore a series
of connecting holes to the required depth, Chapter IV, Section 45,
with a bit slightly smaller than the width of the mortise. (3) The sides
of the mortise are next pared to the gage and knife lines, beginning
at the auger holes and working with thin slices toward the lines. This
method requires care and patience in order to get the sides of the
mortise cut square to the surface. It is especially well adapted to
large mortises from which much wood is to be removed.

Fig. 163. Fig. 164.


96. Directions for Cutting Mortise.—Second Method: (1) Clamp
the piece which is to be
mortised firmly to the bench top, using a hand clamp. Fig. 163 shows
a little device called a mortise grip. Tighten the vise screw and tap
the grip with the mallet until it holds the piece solidly. (2) Select a
chisel of a width equal to that desired for the mortise. Stand well
back of the mortise at one end or the other so as to be able to sight
the chisel plumb with reference to the sides of the mortise. (3) Begin
the cutting in the center of the mortise. Make the first cut with the
bevel of the chisel toward you; reverse the bevel and cut out the
wedge-shaped piece, w, Fig. 164. (4) Continue cutting in this manner
until the proper depth has been attained, making the opening no
larger at the surface than is necessary. (5) Set the chisel in a vertical
position, bevel towards you, begin at the center and, taking thin
slices, cut toward the farther end. Drive the chisel the full depth of
the mortise each time, then pull the handle towards you to break the
chip from the sides of the mortise. Cut to within one-eighth of an inch
of the end of the mortise. (6) Reverse the piece, or your position, and
cut in a similar manner to within one-eighth of an inch of the second
end. (7) With the bevel side of the chisel next to the end of the mortise,
pry out the chips once or twice as the cutting proceeds. (8) Chisel
the ends to the knife lines, carefully sighting the chisel for the two
directions. Fig. 165 suggests the order.

Fig. 165.
97. Miter Joint.—The miter joint is subject to various
modifications. In the plain miter, Fig. 166, the ends
or edges abut. They are usually fastened with glue or nails or both.
The most common form of the plain miter is that in which the slope is
at an angle of forty-five degrees to the edge or side.
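
The forty-five degree slope follows from the fact that a plain miter bisects the corner it forms. A minimal sketch of that arithmetic is given here, assuming both pieces are of equal width and that the angle is measured from the edge of the piece; the function names are illustrative only and are not taken from the text.

```python
def miter_angle(corner_angle_degrees):
    """Slope of a plain miter, measured from the edge of the piece.

    Assumes both pieces are of equal width, so the joint line
    bisects the corner angle.
    """
    if not 0 < corner_angle_degrees < 180:
        raise ValueError("corner angle must lie between 0 and 180 degrees")
    return corner_angle_degrees / 2


def regular_frame_miter(sides):
    """Miter slope for a frame shaped as a regular polygon."""
    if sides < 3:
        raise ValueError("a frame needs at least three sides")
    # Interior angle of a regular polygon with the given number of sides.
    interior = 180 * (sides - 2) / sides
    return miter_angle(interior)


# A four-sided picture frame has square corners.
print(regular_frame_miter(4))
```

For a square-cornered picture frame this gives the forty-five degrees named above; other frame shapes are an extension beyond what the text covers.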

Fig. 166. Fig. 167.

98. Directions for Miter Joint.—(1) Lay off the slopes (see
Chapter I, Section 4). (2) Cut and
fit the parts. To fit and fasten four miter joints, such as are found in a
picture frame, is no easy task. Special miter boxes are made for this
purpose which make such work comparatively easy. (3) Fig. 167
shows the manner of applying the hand clamps to a simple miter
joint. When a joint is to be nailed, drive the nail thru one piece until
its point projects slightly. Place the second piece in the vise to hold it
firmly. Hold the first piece so that its end projects somewhat over and
beyond that of the second; the nailing will tend to bring it to its proper
position, Fig. 168. If a nail is driven thru from the other direction, care
must be taken to place it so that it will not strike the first, or a split
joint will result.
Fig. 168.

99. Dovetail Joint.—Dovetailed joints are so named from the
shape of the pieces which make the joint. Fig.
169 shows a thru multiple dovetail commonly used in fastening the
corners of tool boxes. In hand made dovetails, the tenons are very
narrow and the mortises wide, while in machine made dovetails,
tenons and mortises are of equal width. Mechanics lay out the
tenons without measurement, depending upon the eye unaided to
give the proper size and shape. Sometimes dovetails are laid out to
exact shape and size, the tenons being marked on both sides and
ends. The mortises are marked with trysquare and bevel after one
side of each has been marked by superimposing the tenons. In
some kinds of dovetailing, such as the half-blind dovetail, the
mortises are made first and the tenons marked out from them by
superposition.
Fig. 169. Fig. 170.

100. Directions for Dovetail Joint.—(1) Square lines around
each end to locate the inner
ends of the mortises and tenons. These lines will be at a distance
from the ends equal to the respective thicknesses of the pieces. (2)
Determine the number of tenons wanted and square center lines
across the end of the member which is to have the tenons. Place
these center lines so that the intervening spaces shall be equal. (3)
Measure along an arris and mark on either side of these center lines
one-half of the desired width of the tenon. In fine hand made
dovetails, the usual width for the narrow edge of tenon is scarcely
more than one-sixteenth of an inch—the width of a narrow saw kerf.
(4) Set the bevel for the amount of flare desired. Fig. 170 shows
measurements which may be used in setting the bevel. A flare stick
may be made of thin wood and used instead of a bevel if desired,
Fig. 170. (5) Mark the flares on either side of the center lines. Place
the bevel so that the wide side of the tenon shall be formed on the
face side of the piece. (6) Carry these lines back on each side of the
piece as far as the lines previously drawn across these sides. (7)
With a fine tenon saw rip accurately to the lines. Cut the kerfs out of
the mortises, not out of the tenons. (8) Chisel out the mortises
formed between the tenons and trim up any irregularities in the
tenons. (9) Set the tenons on end on the face side of the second
member, with the face side just touching the cross line placed on the
second member, Fig. 171, and mark along the sides of the tenons.
(10) Square lines across the end to correspond with the lines just
drawn. (11) Saw accurately to the lines, cutting the kerfs out of the
mortises, not the tails. Chisel out the mortises for the tenons, Fig.
172. (12) Fit the parts together.
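
The spacing in steps (2) and (3) can be worked out on paper before any lines are squared. The sketch below is one reading of those steps, not the author's own procedure: it assumes that the center lines divide the width of the piece into equal spaces, counting the space at each edge, and the function name and the four-inch example board are hypothetical.

```python
from fractions import Fraction


def tenon_layout(board_width, tenon_count, narrow_width):
    """Return (center, near mark, far mark) for each tenon.

    All distances are in inches, measured along the arris from one
    edge. Assumes the center lines split the width into
    tenon_count + 1 equal spaces.
    """
    if tenon_count < 1:
        raise ValueError("need at least one tenon")
    board_width = Fraction(board_width)
    half = Fraction(narrow_width) / 2
    spacing = board_width / (tenon_count + 1)
    marks = []
    for i in range(1, tenon_count + 1):
        center = spacing * i
        # Step (3): one-half the tenon width on either side of the center line.
        marks.append((center, center - half, center + half))
    return marks


# Hypothetical example: a 4 inch wide piece, three tenons,
# narrow edge one-sixteenth of an inch.
layout = tenon_layout(4, 3, Fraction(1, 16))
for center, near, far in layout:
    print(f"center line at {center} in., marks at {near} and {far} in.")
```

Fractions are used rather than decimals so that the results read directly against a rule graduated in sixteenths and thirty-seconds.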

Fig. 171. Fig. 172.


CHAPTER IX.
Elementary Cabinet Work.

101. Combination Plane.—The most elementary of cabinet work
necessitates considerable groove
cutting, rabbeting, etc. Rabbets and grooves can be formed by
means of the chisel, the sides first being gaged. A better way, by far,
is to plane them. In earlier practice, joiners were obliged to have a
great variety of special planes—one for each kind of work, and
frequently different planes for different sizes of the same kind of
work. There were rabbeting, dado, plow, filletster, beading, matching
planes, etc., etc.

Fig. 173.

Fig. 173 illustrates a modern combination plane which, by an
exchange of cutters, can be made to do the work of a (1) beader,
center beader, (2) rabbet and filletster, (3) dado, (4) plow, (5)
matching plane, and (6) slitting plane, different sized cutters for each
kind of work permitting of a great variety of uses. By means of a
guide or fence, the plane can be set to cut to a required distance
from the edge of the board. A stop or depth gage can be set so as to
keep the plane from cutting any deeper than is desired. When cutting
across the grain, as in cutting dadoes, adjustable cutting spurs
precede and score or cut the fibers of the wood on either side of the
cutter.
102. Drawer Construction.—The front of a drawer is usually
made of thicker stock than the other
parts. Fig. 174. For example, if the front were to be made of three-
quarter inch stock the sides, back and bottom would probably be
made of three-eighths inch material. Drawer fronts are always made
of the same material as the rest of the cabinet or desk while the
sides, back and bottom are usually made of some soft wood such as
yellow poplar.

Fig. 174. Fig. 175.

Fig. 175 A illustrates a very common method of fastening the
drawer sides to the front. This form is used mainly upon cheap or
rough construction. It is commonly known as a rabbeted joint. The
half-blind dovetail, Fig. 175 B, is a better fastening, by far, and is
used almost exclusively on fine drawer construction.
103. Directions for Rabbeted Corner.—The rabbeted joint, Fig.
175 A, sometimes called a
rebate or ledge joint is made as follows: (1) Line across the face side
of the drawer front at a distance from the end equal to the thickness
of the drawer sides; also, across the edges to the approximate depth
of rabbet. (2) Set the gage and gage on ends and edges as far as
the lines just placed, for the depth of rabbet. (3) Cut the sides of
rabbet, paring across the grain as in cutting the dado. Fasten by
nailing thru the drawer sides into the front, not thru the front into the
sides.
104. Directions for Dovetail Corner.—The front of the drawer
should be laid out and cut
first. (1) Gage on the end the distance the drawer side is to lap over
the front. (2) Without changing the setting of the gage, hold the head
of the gage against the end of the drawer side and gage on both
broad surfaces. Ordinarily, one should not gage across the grain of
the wood nor should the head of the gage be held against other than
a face. A little thought will show why exception has been made in this
case. (3) Square a line across the face side—the inside surface—of
the drawer front at a distance from the end equal to the thickness of
the drawer side. This line gives the depth of mortise for the tails. (4)
The groove for the drawer bottom having been cut, or its position
marked on the end of the front, lay out on the end the half tenons at
both edges so that the groove shall come wholly within a tail mortise.
The amount of flare at which to set the bevel is given in Chapter VIII,
Section 100. (5) Determine the number of tenons wanted and divide
the space between the flares just drawn into the required number of
equal parts and draw center lines for the tenons, Fig. 176. (6) With
the bevel lay off to either side of these center lines the sides of the
tenons. (7) Carry these lines down the face side to meet the line
previously drawn to indicate mortise depth. (8) Saw exactly to the
knife lines, cutting, Fig. 177, the kerfs out of the mortises, not the
tenons. (9) Chisel out the mortises. Fig. 178.
Fig. 176. Fig. 177.

Fig. 178. Fig. 179.

The corresponding mortises and tails may now be laid out on the
drawer side and worked. (10) By superposition, Fig. 179, mark out
the shape of the mortises to be cut in the sides. (11) Saw and chisel
these mortises. Fig. 172.
105. Directions for Drawer.—(1) Square the different members to
size. (2) Groove the front and sides of
the drawer to receive the drawer bottom. These grooves should be
made somewhat narrower than the bottom is thick to insure a good
fit. The under side of the bottom, later, may be gaged and beveled
on the two ends and the front edge, Fig. 180. (3) Lay out and cut in
the drawer sides the dadoes into which the ends of the back are to
be fitted, Fig. 181. (4) Lay out and cut the joints on the front of the
drawer. (5) Get the bottom ready; that is, plane the bevels on the
under side as suggested in 2, above. (6) Assemble the members dry
to see that all fit properly. (7) Take apart; glue the joints by which the
sides are fastened to the front and the joints by which the back is
fastened to the sides. Glue the bottom to the front of the drawer but
not to the sides or back.

Fig. 180. Fig. 181.

Sometimes on large or rough work nails are used instead of glue
to fasten the members together. In this case the front, sides and
back are put together, the back being kept just above the grooves in
the sides. The bottom is then slipped in place under the back. It is
fastened to the front of the drawer only. Especial care should be
taken in squaring the bottom, for the squareness of the drawer is
dependent upon this.
106. Paneling.—Often it is desired to fill in a rather wide space
with wood. To offset the effects of shrinkage,
winding and warpage, a panel rather than a single solid piece is
used. By increasing the number of panels a space of any size may
be filled. Fig. 182.

Fig. 182.

107. Cutting Grooves.—Grooves for panels are best cut by
means of the panel plow or combination
plane. It is not necessary to gage for the sides of the groove; the
adjustments of the plane are such as to give the proper depth and
location, when once set, and a cutter of the width equal to that of the
desired groove inserted. The fence of the plane must be held against
one or the other of the faces. Fig. 173.
108. Haunched Mortise-and-Tenon.
