
Distant Listening to Gertrude Stein’s ‘Melanctha’: Using Similarity Analysis in a Discovery Paradigm to Analyze Prosody and Author Influence

Downloaded from http://llc.oxfordjournals.org/ at University of Pittsburgh on January 14, 2015


Tanya Clement
School of Information, University of Texas at Austin, Austin
David Tcheng, Loretta Auvil and Boris Capitanu
Illinois Informatics Institute, University of Illinois at Urbana-Champaign, Urbana
Joao Barbosa
Texas Advanced Computing Center, University of Texas at Austin, Austin
Abstract
Used here to describe the investigation of significant sound or prosodic patterns within the context of a system that can translate these patterns into comparative visualizations across texts, the term ‘distant listening’ is used provocatively to suggest that readers might interpret prosodic patterns as ‘noise’ (or seemingly unintelligible information) with close reading practices. In this study, we show that these same patterns appear coherent and discoverable within ProseVis, a visualization tool that supports these hermeneutics within a discovery-based paradigm that allows for new ways of making meaning. Charles Bernstein discusses ‘close listening’ as possibly contradictory to ‘“readings” of poems that are based exclusively on the printed text and that ignore the poet’s own performances, the “total” sound of the work, and the relation of sound to semantics’ (Bernstein, 1998, p. 4). Likewise, this study considers the efficacy of using prosodic textual elements as features for similarity metrics instead of or alongside words and n-gram frequencies. In particular, this discussion describes the continued development of this work as a contribution to and within the context of authorship attribution and stylometric studies that consider the interpretability of prosodic features. To that end, in the first part of this discussion, we place the study within the theoretical and practical context of author attribution studies. In the second part of this discussion, we consider how changing similarity metric calculations through the inclusion and exclusion of certain prosodic features (such as tone and stress) and algorithmic parameters (such as the window size of sounds and weighting power) can facilitate the discovery of previously unidentifiable author-similarity patterns. Finally, in the third part of this study, we explore questions of identity construction within this framework of author attribution analysis by comparing ‘Melanctha’, the longest story in Gertrude Stein’s Three Lives (1909), with 150 different narrative voices from the First Person Narratives of the Documenting the American South collection.

Correspondence: Tanya Clement, School of Information, University of Texas at Austin, Austin. Email: tclement@ischool.utexas.edu

Literary and Linguistic Computing, Vol. 28, No. 4, 2013. © The Author 2013. Published by Oxford University Press on behalf of ALLC. All rights reserved. For Permissions, please email: journals.permissions@oup.com
doi:10.1093/llc/fqt040 Advance Access published on 23 July 2013

1 Using Prosody Features in Similarity Metric and Author Attribution Studies

Traditional author attribution studies are often performed to determine the singular author of a given text. Some of the most often cited examples include determining the author or authors of the Federalist Papers (Forsyth and Holmes, 1996; Jockers and Witten, 2010; Mosteller and Wallace, 1964) and John Burrows’ work using function words to determine Jane Austen’s particular style (1987) and to determine the authorship of English Restoration-era poets (2002). Often these studies are profile- or instance-based, ‘closed games’ (Burrows, 2002, p. 267) in which the texts are of equal length, the corpora include only a handful of mostly known authors, the language of the texts is normally ‘in the most similar register’ (Grieve, 2007, p. 255) or of equivalent dialects, and the texts pertain to ‘a similar range of topics’ (p. 256). For the most part, these studies have successfully examined the extent to which a variety of measures, different features of study, and different parameters and algorithms achieve a variety of better or worse results.

Most of these author attribution studies focus primarily on the use of high-frequency words or n-grams to establish similarities among authorial styles (Burrows, 2002; Diederich et al., 2000; Grieve, 2007; Hoover, 2003a,b; Juola et al., 2006; Koppel et al., 2007; Martindale and McKenzie, 1995; Uzuner and Katz, 2005; Yu, 2008; Zhao and Zobel, 2005). High-frequency function words are ordinarily chosen because classification algorithms are sensitive to similarities among low-frequency content (Grieve, 2007, p. 260) or ‘context-specific’ words (Jockers and Witten, 2010, p. 217). Although using any word-based metrics might seem to cinch an analysis of style to a study of content or topic, the authors cite function words to be context free. Indeed, Efstathios Stamatatos calls the use of topic-independent words such as function words, the ‘pure stylistic choices of the authors across different topics’ (2009, p. 540). Burrows describes author attribution studies that use ‘weak discriminators’ such as these high-frequency function words to determine an author’s ‘stylistic signature’ as the study of ‘tiny strokes’ (2002, p. 268).

Studies use high-frequency function words as features for studying style and attributing authorship for quite valid reasons concerning efficiency and interpretable results, but it is misleading to suggest that high-frequency words have proven to be the most productive or accurate textual features for studying style. Jockers and Witten (2010), who focus on performing a benchmarking study on determining the best classifier for authorship attribution problems, use high-frequency words and n-grams with little regard for studies that use other features even as they cite feature selection as one of the most significant factors in machine learning classification techniques (2010, p. 215). To the contrary, Grieve’s study, among others (Forsyth and Holmes, 1996; Sanderson and Guenter, 2006; Zhang and Lee, 2006), proves that using graphemes as a feature yields very favorable results, commenting that ‘they are the most frequent potential indicator of authorship in any English text, and as such any patterns in their usage will have a better chance to emerge’ (2007, p. 261). Sanderson and Guenter (2006) also test using character n-grams of variable length as features with short English texts and produce the best results when examining character sequences of up to 4-grams. Forsyth and Holmes (1996) likewise find that using character n-grams yields better results than lexical features in many text-classification tasks, including authorship attribution. So too, in Juola’s competition to prove the best performing author attribution algorithm, one of the best performers


uses character n-grams as features (Juola, 2004; Juola et al., 2006).

Most of these studies do not use syntactical features such as part of speech or sentence and phrase structure, primarily because syntactic parsers produce many errors and therefore noise. On the other hand, Stamatatos cites several studies (Baayen et al., 1996; Gamon, 2004; Stamatatos et al., 2000, 2001) in which ‘results have shown that this type of measure performs better than do vocabulary richness and lexical measures’ (2009, p. 542). At the same time, Baayen et al. note that the use of function words is an ‘economical’ way to measure syntactic features because these words are tied to syntactical patterns, but they posit that syntax-based methods are more robust and lead to better results than these word-based methods.

One striking commonality to these studies is the extent to which they are limited by the researcher’s ability to interpret study results. Given the previously cited indications that other features than the word can be used as productive features in attribution studies, it seems that the choice to use high-frequency function words as a feature for determining similarity might be based on the fact that words are themselves interpretable results. For instance, John Burrows suggests that our collective understanding of function words can provide a kind of ground truth for measuring algorithmic accuracies: ‘The advantage of working with whole words’, he writes, ‘rests on their accessibility and their meaningfulness. They help us, in particular, to form close and fruitful inferences about the outcome of an inquiry’ (Burrows 2002, p. 268). Burrows’ ideas reflect those in the data mining community in which not being able to understand results is a serious limitation for the use of some algorithms. The following is a description of typical data mining procedures for developers who generally ‘have only a superficial understanding’ (Weiss et al., 2005, p. 51) of what is usually numerical data:

    They accept what they are given by the domain experts and do not have a deep understanding of the measurements or their relationship with each other. Results are analyzed primarily by empirical analysis. When something goes awry, we may have difficulty in attributing this to problems with the collection process or the specification of the features (Weiss et al., 2005, p. 51).

On the other hand, Weiss et al. maintain (like Burrows and others) that the results of text mining procedures are easier for developers than more quantitative data mining, because the results include whole words. ‘For text mining’, they write, ‘we are much closer to understanding the data, and we all have some expertise. The document is text. We can read and comprehend it, and we analyze a result by going directly to the documents of interest’ (emphasis added; Weiss et al., 2005, pp. 51–2). Certainly, using textual features other than the word, such as graphemes and syntactic attributes, for similarity analysis is only productive to the researcher in so far as the results produced are interpretable. Digital tools can provide new comprehensible interfaces that allow researchers the capacity to interpret the results based on features beyond the word.

Several studies have identified how we make such results more easily accessible for interpretation. Juola (2006) contends it is essential that tools be developed, such as his JGAAP (Java Graphical Authorship Attribution Program) prototype, which allows the uninitiated to try their hand at authorship attribution study. He cites accuracy and usability as major concerns for such tools, claiming that the only way to have better tools is to attract more users, who in turn are attracted by better and more usable tools (Juola et al., 2006, p. 170). Better and more usable tools means tools that allow users to learn about the study of author attribution, their data sets, and the tool at the same time. In other words, better and more usable (or useful) tools are tools that allow for an iterative interaction with the data. As we have seen, choosing texts and features helps to fine-tune analysis, but studies also show that choosing parameters such as weighted combinations of the best algorithms on the same corpus (Grieve, 2007; Sutton et al., 2005) could advance research in author attribution. Further, for instance-based similarity metrics, Koppel et al. (2007) argue that the ability to slice or ‘unmask’ the results is productive. In the Koppel


study, ‘unmasking’ refers to systematically diminishing the number of features for study to ‘gauge the speed with which cross-validation accuracy degrades as more features are removed’ to determine the depth of difference between texts (p. 1264). In other words, by letting researchers slice or ‘unmask’ results in different ways, studies are strengthened. Authorship attribution studies and stylometric analysis in general point to the fact that it is essential when working with advanced computational similarity metrics to support the user’s ability to interpret the process and the results and to ask iterative questions about which texts to choose for analysis, which features to measure, which parameters for analysis are the most productive, and how the results might differ when each of these aspects is calibrated differently.

2 Changing the Software Environment for the Advancement of Scholarly Research Flow and Updating ProseVis with Discovery Similarity Metrics

This section addresses two areas of development in this study that advance research on prosody analysis on documents written by different authors. Using an analysis service (Meandre) and a visualization tool (ProseVis), this study demonstrates the aforementioned iterative discovery processes for similarity analysis. First, we discuss our approach, which includes discovery techniques that bring the role of the researcher and what counts as ‘interpretable’ results into focus. Second, we discuss our development of Meandre and ProseVis, a coupling of data flow and interactive visualization interface that allows users to choose texts, a variety of prosodic features, the size of phrase windows,1 the weighting power, and smoothing parameters for iterative testing and comparing results.

2.1 Supervised learning versus discovery techniques

Before discussing current developments in this study, it is useful to consider comparative results using a supervised learning paradigm. In the supervised learning paradigm, the goal is to maximize predictive accuracy. For instance, a researcher wants to determine if Shakespeare wrote a given text. This can be modeled as a two-class prediction problem based on labeled examples. One class would be all Shakespeare documents and the other class would be documents Shakespeare did not write from the same period and location. The performance of the system would be measured by predictive accuracy, meaning how reliably the machine learning system can predict whether a new unseen text was written by Shakespeare. Based on the set of training examples, the machine learning system would create a mathematical model and discover system parameters for achieving highest accuracy to predict the author of unseen books. Once the system has been trained, the resulting model can be viewed as a ‘Shakespeare’ text detector. Given a new text, it would predict what sections are more likely to have been written by Shakespeare. The process of discovering the best system parameters for algorithms is called bias optimization. For supervised learning, bias optimization can be automated based on known examples.

For this study, to extract the textual features we need, we incorporate OpenMary, a text-to-speech application tool for extracting aural features, into the ‘flow’ we coordinated in Meandre, a data flow environment developed by the SEASR (Software Environment for the Advancement of Scholarly Research) team at the University of Illinois at Urbana-Champaign. We developed ProseVis as a reader interface that would allow readers to compare the prosodic patterns that resulted from the predictive modeling procedures in and across texts.

The supervised learning method we were using in a previous study (Clement et al., 2013) quickly became unwieldy and computationally expensive when we sought to scale up our number of texts of study to compare ‘Melanctha’, the longest story in Gertrude Stein’s Three Lives (1909), with 150 different narrative voices from the First Person Narratives (FPN) of the Documenting the American South digital publishing initiative . . . In the previous study, if a user initiated a prediction problem, the results changed every time the user


added a new document. By definition, prediction modeling is asking a closed set question: This phrase came from which of these particular texts? In addition, our initial prosody research used supervised learning with bias optimization to determine the best system parameters, so it was computationally intensive. If the collection of documents changed, then the whole analysis would need to be run again. As a result, it also took days to discover the best system parameters. Besides the fact that scaling up was computationally expensive, this methodology did not facilitate the kind of iterative user interaction we were seeking. For an interactive prosody analysis tool, it would have been unacceptable to wait so long each time the collection changed.

On the other hand, within the discovery paradigm, judging the system performance is subjective . . . An example problem of this paradigm would be to understand the similarities between various authors’ writing styles. The researcher first selects representative texts for each author and then uses the system to measure the similarity based on different features. Each set of different parameters would produce different results and visualizations, because the similarity metric could be based on many different system control parameters, including phrase window size, feature selection, distance weighting power, and smoothing factors. Some experiments with parameter settings may not produce informative results, but because the researcher can explore the effects of different similarity metrics and learn how authors compare along these various dimensions, she has a better chance of discovering ones that reveal meaningful patterns.

For this study, we implemented a similarity-based discovery paradigm. We performed an initial pass at comparing Three Lives with the 150 FPN documents using sampling to identify a subset of documents to work with in more detail. Because our goal was to examine prosodic features, we did not include the word or the sound in our initial pass. We included part of speech, accent, stress, tone, and the break index.2 In this study, each feature is represented by itself and is not combined into a symbol. Correspondingly, each example is represented by the phrase window size times the number of attributes (e.g., if the user sets the window size equal to 14 and uses accent, stress, and tone, then 14 × 3 is the number of features for each example). An example is a feature vector describing the window in terms of the chosen features. Distance is computed as the sum of the absolute value of the differences between all features in the feature vector.

First, we randomly chose 10,000 samples for each of the 150 FPN documents and for each Stein text,3 with each sample comprising the five-feature set described previously. Next, this random sample of 10,000 from each work was compared with the samples from each of the other documents. The similarity metric we are currently using to compare phrase windows is a form of inverse distance weighting.

\[
\text{Training Example Weight} = \frac{1}{(\text{Distance To Testing Example})^{\text{Distance Weighting Power}}}
\]

The results of the comparison with FPN are pictured in the confusion matrix in Fig. 1. This confusion matrix represents a summary of the number of samples that were like each of the documents. Each row and column shows data about the text from which we drew the samples. Each sample was determined to be like some other sample and those counts increase in color darkness according to that scale in the boxes. For all the works, the highest prediction is for the work itself (indicated by the dark diagonal line). Each time a new text is added to the collection, the similarity between feature windows in the new text and windows in all other texts can be computed and added to the matrix.

Fig. 1 This confusion matrix shows the number of 10,000 samples that were alike in the Gertrude Stein/FPN corpus. Each row and column shows data about the text from which we drew the samples. The highest predictions are for the work itself (indicated by the dark diagonal line)

The tables show lists of FPN documents and counts based on two different perspectives of similarity between FPN and Three Lives. Table 1 includes counts of FPN samples that are higher when FPN samples are compared with Three Lives samples. Table 2 includes counts of Three Lives samples when Three Lives samples are compared with FPN samples. Because we wanted to verify the similarity metric and to establish the style of Three Lives in comparison with other Stein texts, we included a few of her other works. As a result, the counts are lower in the first table because some of the FPN works are more similar to each other than the Stein works. In the second table, one can see that Three Lives is most similar to two texts by Stein and is then interwoven between the FPN corpus and the other Stein works.

The tables represent a range of documents written by a variety of men and women from different racial backgrounds. Specifically, in the FPN collection, there are 154 authors4: 49 are female, of whom 45 are white and 4 are former slaves; 105 are male, of whom 77 are white and 28 are former slaves. When we compared Three Lives with samples from FPN, the top 10 matches (listed in Table 1) included eight women and two slave authors. When we compared samples from Three Lives with the FPN texts, two female authors and five slave authors appear in the top 10 matches. The system picked two of the four slave narratives written by women for this top list. This initial study provided an indication of which texts to look at more closely for comparison within the ProseVis environment.

2.2 Using the Meandre/ProseVis discovery system

In the ProseVis webform5, the researcher is given the opportunity to upload a selection of texts and control the features to use for the analysis.6 The following are the parameters researchers can use to control the experiment:

Comparison Range: This is a comma-separated list of indices of the documents to be compared. For example, the user can choose to compare just the first document with the remaining documents in a set by using ‘1’.


Table 1 This list includes counts of FPN samples that are higher when FPN samples are compared with Three Lives samples (a)

| Author | #FPN like TL |
|---|---|
| Grimball, Margaret Ann Meta Morris, 1810–1881. Journal of Meta Morris Grimball: South Carolina, December 1860-February 186. | 73 |
| Pringle, Elizabeth Waties Allston. A Woman Rice Planter. | 71 |
| Avary, Myrta Lockett. A Virginia Girl in the Civil War, 1861-1865. | 69 |
| Battle, Laura Elizabeth Lee. Forget-me-nots of the Civil War; A Romance, Containing Reminiscences and Original Letters of Two Confederate Soldiers | 69 |
| LeConte, Joseph. The Autobiography of Joseph LeConte. | 64 |
| Dawson, Sarah Morgan. A Confederate Girl’s Diary. | 63 |
| Veney, Bethany. The Narrative of Bethany Veney: A Slave Woman. | 62 |
| Edmondson, Belle. Diary of Belle Edmondson, January - November, 1864 | 57 |
| Zettler, B. M. (Berrien McPherson). War Stories and School-Day Incidents for the Children | 51 |
| Jacobs, Harriet A. (Harriet Ann). Incidents in the Life of a Slave Girl, Written by Herself | 50 |
| Other Stein works | |
| Stein, Three Lives | 849 |
| Stein, ‘Miss Furr and Miss Skeene’ | 108 |
| Stein, Making of Americans | 71 |

(a) There are only 5407 samples from other works that are like Three Lives (including the 849 that are from Three Lives and 108 from ‘Miss Furr’).

Table 2 This list includes counts of Three Lives samples when Three Lives samples are compared with FPN samples

| Author and title | #TL like FPN |
|---|---|
| Malone, Bartlett Yancey. The Diary of Bartlett Yancey Malone | 598 |
| Horton, George. The Poetical Works of George M. Horton: The Colored Bard of North Carolina: To Which is Prefixed the Life of the Author, Written by Himself. | 390 |
| Ward, Dallas T. The Last Flag of Truce. | 296 |
| Patton, James. Biography of James Patton. | 249 |
| McLeary, A. C. Humorous Incidents of the Civil War. | 220 |
| A Georgia Negro Peon. The New Slavery in the South–An Autobiography. | 215 |
| Jones, Thomas H. The Experience of Rev. Thomas H. Jones, Who Was a Slave for Forty-Three Years. Written by a Friend, as Related to Him by Brother Jones | 206 |
| Horton, George. The Life of George M. Horton. The Colored Bard of North Carolina. | 180 |
| Mitchel, Cora. Reminiscences of the Civil War. | 179 |
| Roper, Moses. A Narrative of the Adventures and Escape of Moses Roper, from American Slavery. | 157 |
| Other Stein works | |
| Stein, Three Lives | 849 |
| Stein, Four Saints | 719 |
| Stein, ‘Matisse’ | 395 |
| Stein, ‘Picasso’ | 371 |
| Stein, ‘Miss Furr and Miss Skeene’ | 305 |
| Stein, Making of Americans | 178 |

Using ‘1, 3, 7’ means that the first, third, and seventh documents will be compared against each other and all of the other documents. Using ‘all’ means that all documents will be compared with each other.

Window Size in Sounds: This is the number of phonemes to be considered a phrase for analysis. Because we are working on prosodic patterns that are affected by phrasal patterns (Clement et al., 2013), it makes sense for this


value to represent the average number of sounds in a phrase. If texts use shorter phrases, then a smaller window serves as a better representation of the average phrase size for a given text. If texts have longer phrases, then a larger window might yield more productive results.

Sound Features to Use: This refers to the attributes of the sounds, which are determined from the features extracted by OpenMary, a pre-processing module in Meandre. As described in Clement et al. (2013), this module uses the OpenMary text-to-speech software. Attributes are explained in detail in the OpenMary Documentation, specifically at http://mary.dfki.de/documentation/module-architecture/. The features can be used one at a time or in combination. The sound features that we are using include (1) part of speech; (2) accent, which indicates the pitch of the sound; (3) stress, which indicates the presence of a primary or secondary lexical stress; (4) break index, which indicates when sounds precede phrase breaks, sentence breaks, and paragraph final breaks; and (5) tone, which indicates the location of prosodic boundaries and pitch accents by assigning sentence type (e.g., declarative, interrogative-W, interrogative-Yes-No, and exclamatory).

Weighting Power: This number radically controls the behavior of the instance-based learner. Valid values are in the range 0 to 100. When weighting power is set to the highest value, it heavily weights close matches when computing similarity. When set to the lowest value, it equally weights all matches. Higher weighting power values caused our instance-based learning system to use a nearest-neighbor strategy. With lower values, more weight is given to distant examples, which effectively increases the neighborhood size. When set to zero, all training examples are equally weighted, resulting in a constant prediction reflecting the baseline class probabilities.7

The last parameter the researcher may control is the smoothing factor. Smoothing is a technique used in image processing and statistics to ‘blur’ out more detailed features and emphasize the larger scale features. In ProseVis, smoothing is used to find longer patterns by averaging the similarity values over a neighborhood. Using the data produced through Meandre to compute document similarity based on prosody features, ProseVis allows researchers to explore these results mapped back to the original text with colors. By default, Meandre returns a collection of raw similarity values on a per-syllable-per-document basis that is often too small to display without some form of normalization. The ProseVis tool ensures that all similarity values are scaled according to their relative weights in the interval [0, 1] by determining the global maximum and minimum similarity values and scaling each similarity value ‘v’ by the function (v - min)/(max - min). Within ProseVis we determine the relative weight of each document per syllable and choose the highest value for coloring. The value only takes into account comparison values among documents that are selected. Smoothing is necessary to take into account the context of the syllable with respect to its neighbors. The smoothing function averages the similarity values of the N neighbors to the left and to the right of the syllable with a window size of 2 × N + 1, thus incorporating the notion of context during this coloring step. Whereas in a supervised learning system the algorithm can adjust the smoothing factor to achieve maximum predictive accuracy, in a discovery system the researcher can adjust the smoothing factor to maximally reveal the features of interest.

The following figures show examples of results based on changing all of the aforementioned parameters. Figure 2 shows the ProseVis interface. In this image, each sound in ‘Melanctha’ is colored according to the document to which the system has given that sound the highest similarity value. What is significant in the findings is that particular FPN documents seem to map in surprisingly regular patterns to certain parts of ‘Melanctha’. Specifically, the green blocks shown in Fig. 4 indicate sections of the text that the system has determined sound most similar to The Diary of Bartlett Yancey Malone (1919), written by a Confederate farmer turned soldier and sergeant from North Carolina. The blue blocks indicate parts of ‘Melanctha’ that the system has determined sound most similar to The Poetical Works of George M. Horton: The Colored Bard of North Carolina: To Which is Prefixed the Life of the Author, written by himself (1845). The colors range in intensity based on the similarity value. Figure 3 shows the tool panel where a user can see which colors correspond to the texts in this similarity study and the check boxes that allow a user to deselect a text and remove it from the comparison. For instance, if the Horton document (labeled here as ‘hortonpoem’) were deselected, the blocks that are blue would change to reflect the color of the text with the next highest similarity metric. Examples of this ‘unmasking’ are included in the third section of this article.

Fig. 2 ‘Melanctha’ colored in ProseVis according to similarity with FPN documents listed in Fig. 3. Color versions of all figures are available in the online version of the paper

First, we tested various parameters such as weighting. Figure 4 below shows three panels, each showing ‘Melanctha’ from Three Lives. These results are based on using a phrase window of 14 sounds with similarity analysis using the features discussed earlier: part of speech, accent, tone, stress, and break index. Each panel shows a different weighting power; from left to right, these are 16, 32, and 64. Although the blocks of colors (primarily blue and green) are the same in all three panels, the left panel shows larger blocks of color than the panel on the far right, where the colors are more varied. This might indicate that to examine the texts in this sample for longer textual patterns (i.e., multi-phrasal blocks, sentences, or paragraphs) that make sense to readers, the lower weighting power will yield more productive visualizations.

Figure 5 is also a comparison of three versions of results on Three Lives. In this view, the researcher has chosen to vary which features are included. Each panel includes results produced using the 14-sound window and a weighting power of 16. The difference here is that the first panel includes all the features used previously, the second includes all but break index, and the third contains all but

590 Literary and Linguistic Computing, Vol. 28, No. 4, 2013


Distant listening to Gertrude Stein’s ‘Melanctha’

part of speech. These results show that including


both parts of speech and break index is important
to produce more productive results.
Figure 6 shows two panels that contain the
same excerpt from ‘Melanctha’ based on results
from similarity analysis using a 14-sound phrase
window and a 16 weighting power. The first panel
has a smoothing value of ‘1’, whereas the panel on
the right has a smoothing value of ‘15’. In this case,
the smoothing value of ‘15’ emphasizes the differ-
ently colored blocks and helps to facilitate the user’s
interpretation of the results.
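
The two parameters varied in Figs. 4 and 6 can be illustrated with a minimal Python sketch. It is not the ProseVis implementation: the inverse-distance weighting formula, the moving-average form of smoothing, and the function names `weighted_class_probs` and `smooth` are all assumptions made here for illustration, and the distances and labels are invented. The sketch only reproduces the behavior described in the text, namely that a weighting power of zero weights every training example equally and so returns the baseline class proportions, while higher powers concentrate the vote on the nearest examples.

```python
from collections import defaultdict


def weighted_class_probs(distances, labels, power):
    """Distance-weighted vote over ALL training examples.

    Assumption: each example is weighted by 1 / distance**power.
    The article does not state the exact weighting formula.
    """
    eps = 1e-9  # guards against division by zero for identical examples
    totals = defaultdict(float)
    for dist, label in zip(distances, labels):
        totals[label] += 1.0 / (dist + eps) ** power
    norm = sum(totals.values())
    return {label: weight / norm for label, weight in totals.items()}


def smooth(scores, window):
    """Centered moving average over per-sound scores.

    Assumption: smoothing is modeled as a moving average; a window of 1
    returns the scores unchanged.
    """
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


if __name__ == "__main__":
    # Invented distances from one 'Melanctha' window to four training windows.
    distances = [1.0, 2.0, 4.0, 4.0]
    labels = ["malone", "hortonpoem", "hortonpoem", "hortonpoem"]
    for power in (0, 16, 32, 64):
        print(power, weighted_class_probs(distances, labels, power))
    print(smooth([0, 0, 3, 0, 0], 1))
    print(smooth([0, 0, 3, 0, 0], 3))
```

Under these assumptions, a window of 1 leaves the scores unchanged, and wider windows spread each score over its neighbors, which is one way larger blocks of a single color could arise.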

These examples show the different kinds of results that researchers can produce within the iterative discovery-based paradigm that the ProseVis project facilitates. In the later discussion, using the 14-sound phrase window with a weighting power of 16 and a smoothing factor of 15, we demonstrate how ‘unmasking’ the data in the ProseVis interface is productive for analyzing identity construction in Gertrude Stein’s short story ‘Melanctha’.

Fig. 3 From the ProseVis control panel, the list of FPN authors being compared with Three Lives and their associated colors

Fig. 4 Three ProseVis panels, each with an excerpt from ‘Melanctha’ colored based on a 14-sound window and different weighting powers, from left to right: 16, 32, and 64



Fig. 5 Excerpt from ‘Melanctha’ colored in three ProseVis panels based on a 14-sound phrase window and 16 for a
weighting power; the first panel includes all the features used previously, the second includes all but break index, and
the third contains all but part of speech

3 Distant Listening to Gertrude Stein’s ‘Melanctha’

Gertrude Stein scholars have written extensively about the influence that African American speech patterns may have had on Stein’s style of writing during the period in which she wrote ‘Melanctha’, the longest ‘life’ or short story in her book Three Lives. Some scholars find her treatment of race in ‘Melanctha’ to be ‘pernicious’ (Fullbrook, 1990, p. 69) or ‘vicious’ (Saldívar-Hull, 1989, p. 190). Milton Cohen (1984) creates a chart that organizes the characters ‘into a racial hierarchy that is [. . .] ominously schematic’ (p. 119). Other scholars such as Richard Bridgman (1970) and writer Claude McKay (quoted in Brinnon, 1959, p. 121) find the characteristics of Melanctha’s friends and family to be stereotypes and caricatures that have little to do with the story. Still others, such as Carla Peterson (1996) and Lorna Smedman (1995), complicate these readings by suggesting that the relationships between form and content, style and philosophy, and aesthetics and politics in ‘Melanctha’ create a more variegated look at identity construction (including race and gender) in this text.

Peterson and Smedman’s readings of ‘Melanctha’ are of particular interest in this study because they indicate that the prosodic elements of Stein’s ‘Melanctha’ point to ‘shared’ racialized and gendered identities that cannot be easily classified. Peterson, who contends that Stein’s ‘inspiration derive[d] quite specifically from Baltimore and, in the case of [the story of] “Melanctha,” from African American Baltimore’ (1996, p. 141), identifies repetitive phrasal patterns that evoke syncopated rhythms much like the ragtime music that was popular in Baltimore at the time of its writing. Of significance to this study are Peterson’s claims that ‘Melanctha’ captures the blurred racial identities, the ‘complex racial borderland’ of ‘early twentieth century America where
Fig. 6 Excerpt from ‘Melanctha’ colored in two ProseVis panels based on a 14-sound window and 16 for a weighting power, with smoothing values of 1 (left) and 15 (right)

blood lines are often blurred and cultural traditions merged’ (p. 144). Specifically, Peterson argues, Stein captures this perspective by appropriating with her prosody African American musical traditions (coon songs, early folkblues, and ragtime music) that historically have been ‘inextricably bound’ to a variety of American ethnicities and cultural backgrounds. In addition, Peterson and Smedman point to Stein’s ‘double identity as a Jew and a lesbian’ (Peterson, 1996, p. 155) as a thematic element in the text that is also inscribed in its racial discourse, specifically in its work to investigate racialized signifiers (Smedman, 1995, p. 570). ‘Since Stein’s linguistic tampering involved an erotics and experience outside of the normative heterosexual boundaries’, Smedman writes, ‘it is not surprising that she makes the link between “improper” racialized language and “taboo” sexuality so often in these texts’ (p. 571). In other words, these critics are arguing that Stein uses stylistic features, specifically prosodic elements, to work with thematic and narrative elements to create an interplay between sounds and syntax that gestures toward a story of shared cultures.

Continuing with the vein of inquiry suggested by Peterson and Smedman, we compare Gertrude Stein’s ‘Melanctha’ with 150 narratives from the FPN of the American South collection to interrogate how the system measures the extent to which Stein’s ‘Melanctha’ sounds like or contains prosodic elements similar to those found in these narratives. Self-described, the FPN ‘is a collection of diaries, autobiographies, memoirs, travel accounts, and ex-slave narratives written by Southerners. The majority of materials in this collection are written by those Southerners whose voices were less prominent in their time, including African Americans, women, enlisted men, laborers, and Native Americans’.8 Even though ‘Melanctha’ is written from the third-person perspective, it is written in the free indirect style. In the free indirect style, a character’s way of speaking, either out loud or in
his or her thoughts, dictates the style of narration, making the narrative much like a first-person narrative. Using ProseVis to distant-listen to ‘Melanctha’ by comparing its prosodic elements with those in the FPN documents allows for new readings of the text’s portrayal of identity construction as it corresponds to the sound of the text.

3.1 Discussion: Unmasking the sound of identity construction in ‘Melanctha’

This study’s driving question is not ‘Does “Melanctha” sound like narratives written by African Americans and women in approximately the same time period?’ Such a question is rendered moot not only by questions concerning the authenticity of these narratives (who wrote them) but also by the fact that first-person narratives or autobiographies are politicized and embodied documents and, as such, make slippery signifiers. As Sidonie Smith writes, an autobiography ‘historicizes identity implicitly, if not explicitly, insists on the temporalities and spatialities of identity and, in doing so, brings the everyday practices of identity directly into the floodlights of conscious display’ (1993, p. 160). In other words, the FPN collection, filled with romanticized stories of soldiers at war, widows and wives and daughters at home, and slaves who were abused and demoralized and escaped, is not representative of writing that signifies race or a gender as much as it is a collection that shows a set of writers expressing or practicing identity (which was often racialized and gendered) at a certain time and in a certain geography, specifically in the South at the end of the 19th century. Similarly, this study shows that the system, which is not aware of gender and race, can be sensitive to the ‘masks’ that gendered and racialized language can assume.

The next set of images shows the results of our study visualized in ProseVis as well as the similarity patterns that a researcher can ‘unmask’ on key paragraphs in ‘Melanctha’. In the paragraphs shown in the following figures, Melanctha, who is described as ‘a graceful, pale yellow, intelligent, attractive negress . . . half made with real white blood’ (Stein, 2004, p. 58), and her boyfriend, Dr. Jeff Campbell, who is described as ‘an intelligent good mullato’ (p. 86), are facing the denouement of their relationship, the building of which has formed the central narrative of the story. After this point, their relationship begins to unravel. The break up has been foreshadowed: from the beginning of the story, Jeff Campbell ‘did not like Melanctha’s ways,’ and ‘he did not think that she would ever come to any good’ (Stein, 2004, p. 77). Melanctha’s ‘way’ through the text is to ‘wander’ both sexually and intellectually, what the narrator calls ‘wandering after wisdom’. At the point of the text shown in Fig. 7, Jeff has heard more rumors about Melanctha’s past from Melanctha’s former lover Jane Harden and yet he and Melanctha have had ‘much joy between them, more than they ever yet had had with their new feeling. All the day they had lost themselves in warm wandering. Now they were lying there and resting’ (Stein, 2004, p. 105). After this encounter, Jeff suddenly ‘threw Melanctha from him’ (p. 106). Feeling guilty, Jeff explains to Melanctha that when he met her, he only knew ‘two kinds of way of loving, one way the way it is good to be in families and the other kind of way, like animals are all the time just with each other, and how I didn’t ever like that last kind of way much for any of the colored people’ (p. 107). Melanctha, he explains, has shown him a third way of living that is ‘what really loving is like’ (p. 107), but Melanctha, he determines here after a day of intense ‘wandering’, is ‘a bad one’. Melanctha’s feelings are hurt and they talk and continue their relationship, but this cycle of sex, retribution, and unraveling trust continues until they part for good at the end of the story.

What is significant about this narrative pattern that seems to correspond with Jeff’s ‘two kinds of way’ is the extent to which the prosodic patterns we can see in ProseVis in Fig. 2 and Fig. 4 mirror the same kind of vacillation between two kinds of narration in the story, one that comprises short, clipped, and simple sentences on the one hand, which usually map in similarity to the Malone document (in green), and long, multi-phrasal, and complex sentences on the other, which usually map to the Horton document (in blue). Often but not always, the green clipped text maps to moments in the story that correspond to Jeff’s actions when he is not consciously thinking about Melanctha or when he is feeling negatively towards her and her
Fig. 7 ‘Melanctha’ excerpt colored in ProseVis according to similarity with FPN documents listed in Fig. 3

wandering ways. The following is an example of a green-colored section:

She began to tell everything she ever knew about you. She didn’t know how well now I know you. I didn’t tell her not to go on talking. I listened while she told me everything about you (Stein, 2004, p. 102).

The blue maps to the more loose and multi-phrasal text that corresponds to either a description of Melanctha’s actions and thoughts or Jeff’s when he is feeling positive or affected by Melanctha. The following is an example of a blue-colored section:

I see that now, sometimes, the way you certainly been teaching me, Melanctha, really, and then I love you those times, Melanctha, like a real religion, and then it comes over me all sudden, I don’t know anything real about you Melanctha, dear one, and then it comes over me sudden, perhaps I certainly am wrong now, thinking all this way so lovely, and not thinking now any more the old way I always before was always thinking, about what was the right way for me, to live regular . . . (Stein, 2004, p. 108)

Further, with a wider view of the entire story as visualized in Fig. 2, the researcher can see that the beginning of the story, which corresponds to the narrative of Melanctha’s upbringing and the maturing and solidifying of her ‘wandering’ ways, is predominately like Horton’s document, whereas the end of the text, which corresponds with Melanctha’s decline into despondency and ultimately sickness, the ceasing of her wandering, shows more similarity with Malone. At first glance at these patterns, it would seem that we could make a simple assertion that the system found the slave narrative (Horton’s document) to sound more like Melanctha’s wandering narrative, whereas it found the narrative corresponding to Jeff’s way of thinking to sound more like that told by the soldier (Malone’s document).

As discussed earlier, however, interacting with the data is an important advancement in similarity testing. By ‘unmasking’ (Koppel et al., 2007) or deselecting texts in ProseVis, the researcher can quickly see how these similarity patterns are more complex.
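
The deselection step just described can be sketched in a few lines of Python. This is an illustrative assumption about the mechanism, not ProseVis code: the function name `color_sounds`, the dictionary layout, the short labels `veney` and `malone`, and every score below are hypothetical (only the label `hortonpoem` appears in the article). Each sound is assigned to the selected document with the highest similarity score, so removing a document from the selection lets the next highest one surface.

```python
def color_sounds(similarity, selected):
    """Assign each sound to the selected document with the highest score.

    similarity: one {document: score} dict per sound (hypothetical layout).
    selected:   documents still checked in the control panel.
    Returns a list of (document, score) pairs; the score could drive
    color intensity. A sound with no selected document gets (None, 0.0).
    """
    colored = []
    for scores in similarity:
        candidates = {doc: s for doc, s in scores.items() if doc in selected}
        if not candidates:
            colored.append((None, 0.0))
            continue
        best = max(candidates, key=candidates.get)
        colored.append((best, candidates[best]))
    return colored


# Invented scores for two sounds; real values would come from the
# similarity analysis.
similarity = [
    {"hortonpoem": 0.41, "veney": 0.33, "malone": 0.10},
    {"hortonpoem": 0.12, "veney": 0.09, "malone": 0.52},
]
all_docs = {"hortonpoem", "veney", "malone"}

if __name__ == "__main__":
    print(color_sounds(similarity, all_docs))                   # everything selected
    print(color_sounds(similarity, all_docs - {"hortonpoem"}))  # Horton deselected
```

In this toy data, deselecting `hortonpoem` changes only the first sound, which falls back to its next highest document, while the second sound keeps its assignment.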


For example, if the researcher starts with selecting all of the texts as shown in Fig. 4, the green and blue blocks are evident.

In Fig. 8, the researcher has deselected the blue Horton text to reveal a purple pattern that indicates similarity with The Narrative of Bethany Veney: A Slave Woman (1889). This is because for the same section of the text, the next highest value for these phrases corresponds to the Veney document. The green block remains mostly unchanged.

In Fig. 9, both Horton and Veney are deselected and a pale purple is revealed, indicating that the document with the next highest number of likeness corresponds with Sarah Morgan Dawson’s A Confederate Girl’s Diary (1913).

Unchanged by these unmaskings of the blue blocks, the green blocks indicate a different relationship of similarities. In Fig. 10, the researcher has deselected the green Malone document and finds a pale red pattern that indicates that the next highest numbers of similarity for this part of the text correspond to A.C. McLeary’s Humorous Incidents of the Civil War (1902).

In Fig. 11, the researcher has deselected McLeary to unmask a pale green that indicates a similarity to Dallas T. Ward’s The Last Flag of Truce, the story of a railroad conductor or merchant (the history is unclear) who was asked to make the Confederates’ truce flag of surrender.

To summarize, by unmasking the blue Horton document (written by a male slave), the researcher reveals the purple Veney document (written by a female slave) and then the pale purple Dawson document (written by a female Confederate). By unmasking the green Malone document (written by a male Confederate), the researcher reveals the pink McLeary document (written by a male Confederate) and the pale green Ward document (written by a male Confederate).

These initial discoveries into the similarities between ‘Melanctha’ and the FPN narratives invite the researcher to consider two new research questions about identity construction in Gertrude Stein’s ‘Melanctha’. Smith writes that autobiographical narratives ‘carry with them through these negotiations the specificities of their material circumstances, their degrees of self-consciousness about cultural determination, the temporalities of their bodies’ (Smith, 1993, p. 22). A first research question that is provoked by this discovery experience might consider

Fig. 8 The Horton (blue) document comparison has been deselected to reveal the Veney (purple) document similarity



Fig. 9 Both the Horton and Veney document comparisons are deselected and the Dawson similarity (pale purple) is
revealed

Fig. 10 The Malone document comparison has been deselected and the McLeary similarity (pink) is revealed



Fig. 11 Both the Malone and McLeary document comparisons have been deselected and the Ward similarity (pale
green) is revealed

the ‘temporalities of identity’ on ‘conscious display’ (Smith, 1993, p. 160) in the FPN and the ‘Melanctha’ documents. For instance, the blue group authors all taught themselves to write. Only Dawson had 10 months of formal schooling. The blue group documents are also self-proclaimed ‘literary’ documents. Horton, who was the first African American to publish a book in the American South, wrote poetry. His document is primarily a book of poetry with a long personal narrative as an introduction. Veney’s narrative, a tract that is meant to illustrate Veney’s Christian character to a Reconstruction-era readership still reeling from the war, is introduced by Rev. Bishop Mallalieu, and includes ‘Commendatory Notices from Rev. V. A. Cooper, Superintendent of Home for Little Wanderers, Boston, Mass., and Rev. Erastus Spaulding, Millbury, Mass’. The Rev. Bishop Mallalieu writes:

It is greatly to be regretted that the language and personal characteristics of Bethany cannot be transcribed. The little particulars that give coloring and point, tone and expression, are largely lost. Only the outline can be given. As it is, possessing only the merit of a “plain, unvarnished tale,” it asks for generous consideration and extended sale.

Taken at face value, this comment would seem to indicate that Veney’s document was written in a plain style, but the intent of the text is to convince the audience that ‘the biographies of saintly, enduring spirits like that of Betty Veney will be read, and will serve to inspire the discouraged and down-trodden to put their trust in the almighty arm of Jehovah’, and it is clear that the bishop’s introduction is meant to quell concerns that such a tale, written in a style to inspire empathy and religious zeal, might be untrue. Finally, the last of the blue group documents by Dawson is introduced by a long introduction from her son, who describes the narrative’s ‘flowing sentences’, its ‘certain uses of words to which the twentieth century purist will take exception’, and its likeness to Victorian literature as a ‘remarkable feat of style’ (p. xii). The authors’ backgrounds and the assumed audiences for the green group of documents are remarkably different. Two of the writers (Malone and Ward) are
soldiers. Both Malone and McLeary’s diaries are reports of daily happenings. Malone’s is described as ‘Reported in a simple and matter-of-fact manner, include notations on his diet, his regiment’s marches, and biblical texts referred to in the sermons he hears’. Ward’s tale is also matter-of-fact and dedicated ‘To the Soldiers’. It is introduced with letters from a businessman and a judge who attest to its veracity. The similarities that tie the ‘blue group’ documents (Horton, Veney, and Dawson) together and the ‘green group’ documents (Malone, McLeary, Ward) together to the same spots in ‘Melanctha’ are also based on stylistic differences that are the direct result of the authors’ educational backgrounds and their intended audiences, which are evidenced by the introductions and introductory letters that accompany most of these writings.

A second question the researcher is provoked to consider concerns the presence of three FPN documents that appear regularly in intermittent patterns across ‘Melanctha’. These are Harriet Jacobs’ text Incidents in the Life of a Slave Girl, written by herself (1861) in orange; Margaret Ann Morris Grimball’s Journal of Meta Morris Grimball: South Carolina, December 1860-February 1866 in red; and Elizabeth Waties Allston Pringle’s A Woman Rice Planter (1914) in gray. All three have a comparable similarity metric based on 10,000 samples (Table 1) but never appear as solid blocks in the text as the green group and blue group patterns do. Instead, they have a sporadic but constant appearance in the text across the other blocks of color. It is these constant underlying patterns in Stein’s texts that often tell much of the story. The presence of these patterns would seem to suggest that further investigation into the similarities between these documents and the essentially mixed nature of identity and textual construction in Stein’s Three Lives would be productive.

4 Conclusion

A third research question this study of ‘Melanctha’ provokes for consideration is the extent to which this work furthers similarity analysis in author attribution studies. A form of attribution that continues to elude researchers who work in attribution studies corresponds to the mixed borderline of racialized and gendered identity construction to which Peterson and Smedman refer. Like voices and identities constructed with mixed histories and mixed influences, texts are often the result of collaborative authoring. The lack of studies that consider this aspect of texts in authorship attribution has been described as ‘a pitfall’ common to attribution studies (Eder, 2012). Eder contends that the future of such study rests in using ‘stylometric techniques to trace stylistic imitations or unconscious inspirations between different authors’ (Eder, 2012). Collins et al. (2004) push the boundaries of these limits by attempting to characteriz[e] the Federalist Papers ‘according to the representational language choices of the authors, similar to a way we believe close human readers come to know a text and distinguish its rhetorical purpose’ (Collins et al., 2004, p. 15). Specifically, Collins uses the frequencies of ‘representational language strings’ he has identified that indicate ‘subtle rhetorical impressions’ (Collins et al., 2004, p. 15). He uses these strings to differentiate and attribute parts of the Papers to Hamilton and Madison. Collins writes that these quantitative measurements help to facilitate nonquantifiable aspects of collaborative writing and influence.

Our study reflects the development of a system for detecting similarity that also allows the researcher to assert her understanding of the text as she reads or listens to it, from a distance and up close, in an iterative fashion. Stein writes of her synesthetic relationship with texts this way: ‘I feel with my eyes and it does not make any difference to me what language I hear, I don’t hear a language, I hear tones of voices and rhythms, but with my eyes I see words and sentences’ (Stein, 1990, p. 70); and, in another piece, she wonders, ‘Did one see sound, and what was the relation between color and sound, did it make itself by description by a word that meant it or did it make itself by a word in itself’ (Stein, 1988, p. 191). Likewise, the ProseVis project creates a space for distant listening by establishing relationships between sound and color, color and text, and text and sound. It at once quantifies and unifies modes of signification, allowing for a new perspective on how we make meaning. Charles Bernstein writes
that close listening reminds us that ‘individual impulses need substantiality before unifying them can generate much dynamism’ and ‘the near language-like qualities of the musics of writing . . . gives them an outwardly blinking and scanning and surfing involvement with a body politic or political economy of sense’ (Bernstein, 1998, p. 83). Likewise, tools that provide for readerly interactions such as the kind of distant (and close) listening (and reading) we have outlined here can advance the sensitivity of systems that use algorithms such as similarity metrics, but more importantly, we advance researchers’ understandings of the how and the what and the whole those blinking metrics represent.

Funding

This work was supported by the Andrew W. Mellon Foundation [grant number 31000682].

References

Baayen, R., van Halteren, H., and Tweedie, F. (1996). Outside the cave of shadows: Using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing, 11: 121–31.
Bernstein, C. (1998). Close Listening: Poetry and the Performed Word. New York: Oxford University Press.
Bridgman, R. (1970). Gertrude Stein in Pieces. New York: Oxford University Press.
Burrows, J. F. (1987). Computation into Criticism: A Study of Jane Austen’s Novels and an Experiment in Method. Oxford: Clarendon Press.
Burrows, J. (2002). ‘Delta’: A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17: 267–87.
Brinnon, J. M. (1959). The Third Rose: Gertrude Stein and Her World. New York: Atlantic-Little Brown.
Cohen, M. A. (1984). “Black Brutes and Mulatto Saints: The Racial Hierarchy of Stein’s ‘Melanctha’”. Black American Literature Forum, 18(3): 119–21. Autumn.
Collins, J., Kaufer, D., Vlachos, P., Butler, B., and Ishizaki, S. (2004). Detecting collaborations in text comparing the authors’ rhetorical language choices in the federalist papers. Computers and the Humanities, 38(1): 15–36.
Clement, T., Tcheng, D., Auvil, L., Capitanu, B., and Monroe, M. (2013). Sounding for Meaning: Using Theories of Knowledge Representation to Analyze Aural Patterns in Texts. Digital Humanities Quarterly 7.1.
Dawson, S. M. (1913). A Confederate Girl’s Diary. Cambridge, MA: The Riverside Press. Documenting the American South. http://docsouth.unc.edu/fpn/dawson/menu.html (accessed 11 November 2012).
Diederich, J., Kindermann, J., Leopold, E., and Paass, G. (2000). Authorship attribution with support vector machines. Applied Intelligence, 19(1–2): 109–23.
Eder, M. (2012). Mind your corpus: systematic errors in authorship attribution. In Digital Humanities Book of Abstracts. Hamburg: Annual Digital Humanities Conference. http://www.dh2012.uni-hamburg.de/conference/programme/abstracts/mind-your-corpus-systematic-errors-in-authorship-attribution/ (accessed 11 November 2012).
Forsyth, R. and Holmes, D. (1996). Feature-finding for text classification. Literary and Linguistic Computing, 11(4): 163–74.
Fullbrook, K. (1990). Free Women: Ethics and Aesthetics in Twentieth-Century Women’s Fiction, 1st edn. Philadelphia, PA: Temple University Press.
Gamon, M. (2004). Linguistic correlates of style: Authorship classification with deep linguistic analysis features. Proceedings of the 20th International Conference on Computational Linguistics. Morristown, NJ: Association for Computational Linguistics, pp. 611–17.
Grieve, J. (2007). Quantitative authorship attribution: an evaluation of techniques. Literary and Linguistic Computing, 22(3): 251–70.
Hoover, D. L. (2003a). Another perspective on vocabulary richness. Computers and the Humanities, 37(2): 151–78.
Hoover, D. L. (2003b). Multivariate analysis and the study of style variation. Literary and Linguistic Computing, 18(4): 341–60.
Horton, G. M. (1845). The Poetical Works of George M. Horton: The Colored Bard of North Carolina: To Which is Prefixed the Life of the Author, Written by Himself. Hillsborough, NC: Documenting the American South. Printed by D. Heartt. http://docsouth.unc.edu/fpn/hortonpoem/menu.html (accessed 11 November 2012).
Jockers, M. L. and Daniela, M. W. (2010). A comparative study of machine learning methods for authorship attribution. Literary and Linguistic Computing, 25.2: 215–24.
Juola, P. (2004). Ad-hoc authorship attribution competition. Proceedings of the Joint Conference of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing. Goteborg, Sweden, pp. 175–6.
Juola, P., Sofko, J., and Brennan, P. (2006). A prototype for authorship attribution studies. Literary and Linguistic Computing, 21(2): 169–78.
Koppel, M., Schler, J., and Bonchek-Dokow, E. (2007). Measuring differentiability: unmasking pseudonymous authors. Journal of Machine Learning Research, 8: 1261–76.
Malone, B. Y. (1919). The Diary of Bartlett Yancey Malone. Chapel Hill: University of North Carolina. Documenting the American South. http://docsouth.unc.edu/fpn/malone/menu.html (accessed 11 November 2012).
Margaret Ann Meta Morris Grimball. (1818–1881). Journal of Meta Morris Grimball: South Carolina, December 1860-February 1866. TS. UNC-Chapel Hill, Southern Historical Collection.
Martindale, C. and Mckenzie, D. (1995). On the utility of content analysis in author attribution: The Federalist. Computers and the Humanities, 29(4): 259–70.
McLeary, A. C. (1902). Humorous Incidents of the Civil War [n.p.]. Documenting the American South. http://docsouth.unc.edu/fpn/mcleary/menu.html (accessed 11 November 2012).
Moretti, F. (2000). Conjectures on World Literature. New Left Review, (1): 54–68.
Mosteller, F. and Wallace, D. (1964). Inference and Disputed Authorship: The Case of the Federalist Papers. Reading, MA: Addison-Wesley.
Peterson, C. L. (1996). The Remaking of Americans: Gertrude Stein’s ‘Melanctha’ and African-American Musical Traditions. In Wonham, H. B. (ed.), Criticism and the Color Line: Desegregating American Literary Studies. New Brunswick, NJ: Rutgers UP, pp. 140–57.
Pringle, E. W. A. (1914). A Woman Rice Planter. C. 1913. New York: The Macmillan Company. Documenting the American South. http://docsouth.unc.edu/fpn/pringle/menu.html (accessed 11 November 2012).
Sanderson, C. and Guenter, S. (2006). Short text author- [. . .] Morristown, NJ: Association for Computational Linguistics, pp. 482–491.
Saldívar-Hull, S. (1989). ‘Wrestling Your Ally: Stein, Racism, and Feminist Critical Practice’. In Lynn, M. B. and Ingram, A. (eds), Women’s Writing in Exile. Chapel Hill, NC: University of North Carolina, pp. 181–98.
Smedman, L. (1995). “Cousin to Cooning”: Relation, Difference, and Racialized Language in Stein’s Nonrepresentational Texts. MFS Modern Fiction Studies, 42: 569–588.
Smith, S. (1993). Subjectivity, Identity, and the Body: Women’s Autobiographical Practices in the Twentieth Century. Bloomington: Indiana University Press.
Stamatatos, E., Fakotakis, N., and Kokkinakis, G. (2000). Automatic text categorization in terms of genre and author. Computational Linguistics, 26(4): 471–495.
Stamatatos, E., Fakotakis, N., and Kokkinakis, G. (2001). Computer-based authorship attribution without lexical measures. Computers and the Humanities, 35(2): 193–214.
Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3): 538–56.
Stein, G. (1990). The Autobiography of Alice B. Toklas. New York: Vintage Books.
Stein, G. (1988). “Portraits and Repetition”. Lectures in America. London: Virago, pp. 165–206.
Stein, G. (2004). Three Lives. Whitefish, Montana: Kessinger Publishing.
Sutton, C., Sindelar, M., and McCallum, A. (2005). Feature bagging: Preventing weight undertraining in structure discriminative learning. In Structured Discriminative Learning. CIIR Technical Report. Amherst, MA: University of Massachusetts.
Uzuner, O. and Katz, B. (2005). A comparative study of language models for book and author recognition. Lecture Notes in Computer Science. Berlin: Springer.
Veney, B. (1889). The Narrative of Bethany Veney: A Slave Woman. Boston: Press of Geo. H. Ellis. Documenting the American South. http://docsouth.unc.edu/fpn/veney/menu.html (accessed 11 November 2012).
ship attribution via sequence kernels, Markov chains Ward, D. T. (2012). The Last Flag of Truce. Franklinton,
and author unmasking: An investigation. In NC: D.T. Ward. Documenting the American South.
Proceedings of the International Conference on http://docsouth.unc.edu/fpn/ward/menu.html (ac-
Empirical Methods in Natural Language Engineering. cessed 11 November 2012).

Literary and Linguistic Computing, Vol. 28, No. 4, 2013 601



Weiss, S. M., Indurkhya, N., Zhang, T., and Damerau, F. (2005). Text Mining: Predictive Methods for Analyzing Unstructured Information. New York: Springer.
Yu, B. (2008). An evaluation of text classification methods for literary study. Literary and Linguistic Computing, 23: 327–43.
Zhang, D. and Lee, W. S. (2006). Extracting key-substring-group features for text classification. In Proceedings of the 12th Annual SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM Press, pp. 474–83.
Zhao, Y. and Zobel, J. (2005). Effective and scalable authorship attribution using function words. Lecture Notes in Computer Science. Berlin: Springer.

Notes
1 In this study, we use the term ‘phrase window’ to mean a window of sounds that have been produced in the pre-processing stage. The size of the window, which in our system defaults to ‘8’, can be set by the user.
2 The break index, which marks the boundaries of syntactic units such as an intermediate phrase break, an intra-sentential phrase break, a sentence-final boundary, and a paragraph-final boundary, is particularly important because phrasal boundaries determine the rise and fall or emphases of particular words based on their context within the phrase.
3 The texts by Gertrude Stein include Four Saints in Three Acts, ‘Matisse’, The Making of Americans, ‘Miss Furr and Miss Skeene’, ‘Picasso’, Three Lives, and Tender Buttons. All of these texts are freely available online from Project Gutenberg. The Making of Americans edition was published by Dalkey Archive Press (1995).
4 Some of the FPN documents have multiple authors. We do not include illustrators in this count.
5 The webform can be found at http://tclement.ischool.utexas.edu/ProseVis/data/.
6 We have created a Meandre service for this backend analysis. Once the analysis is complete, the researcher receives an email with URLs to download the results.
7 An important consideration is the relationship between the number of texts a user is exploring and the window size in sounds. We have determined that as the number of examples increases, so too should the weighting power. Viewed from a k-nearest-neighbor perspective, this amounts to keeping the neighborhood size constant.
8 http://docsouth.unc.edu/fpn/index.html
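
As an illustration of the ‘phrase window’ described in Note 1, the following minimal Python sketch groups a sequence of sound labels into windows of eight. It is not the ProseVis implementation: the function name `phrase_windows`, the overlapping (sliding) stride, and the sample sound labels are all assumptions made for the example. Only the default size of 8 and the fact that the user can change it come from the note.

```python
from typing import List, Sequence


def phrase_windows(sounds: Sequence[str], size: int = 8) -> List[List[str]]:
    """Return every run of `size` consecutive sounds (hypothetical helper).

    Assumption: windows overlap and advance one sound at a time. The note
    only states that the window size defaults to 8 and is user-settable.
    """
    if size < 1:
        raise ValueError("size must be a positive integer")
    # A sequence shorter than the window yields no windows.
    return [list(sounds[i:i + size]) for i in range(len(sounds) - size + 1)]


# Made-up sound labels, for illustration only.
sample = ["DH", "AH", "K", "AE", "T", "S", "AE", "T", "AA", "N"]
windows = phrase_windows(sample)
```

With ten sounds and the default size, the sketch yields three overlapping windows. A non-overlapping scheme would be an equally plausible reading of the note and would only change the step of the range.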

