
Visualization in stylometry: Cluster analysis using networks

Downloaded from https://academic.oup.com/dsh/article/32/1/50/2957386 by Uniwersytet Jagiellonsky w Krakowie user on 12 January 2022


Maciej Eder
Pedagogical University of Kraków, Poland
Institute of Polish Language, PAS
Abstract
The aim of this article is to discuss reliability issues of a few visual techniques used in stylometry, and to introduce a new method that enhances the explanatory power of visualization with a procedure of validation inspired by advanced statistical methods. A promising way of extending cluster analysis dendrograms with a self-validating procedure involves producing numerous particular ‘snapshots’, or dendrograms produced using different input parameters, and combining them all into the form of a consensus tree. Significantly better results, however, can be obtained using a new visualization technique, which combines the idea of nearest neighborhood derived from cluster analysis and the idea of hammering out a clustering consensus from bootstrap consensus trees with the idea of mapping textual similarities onto a form of a network. Additionally, network analysis seems to be a good solution for large data sets.

Correspondence: Maciej Eder, Institute of Polish Studies, Pedagogical University of Kraków, ul. Podchorążych 2, 30-084 Kraków, Poland. E-mail: maciejeder@gmail.com

1 Introduction

Most of the computational methods used in stylometry were originally introduced to solve authorship attribution problems. This fact had an immense influence on the further development of the whole discipline. The seminal study by Mosteller and Wallace (2007 [1964]) showed in a very convincing way that authorship attribution based on statistical analysis of style is ultimately a problem of classification. In its standard form, attribution is aimed at extracting a unique authorial profile from a disputed text and from texts written by possible ‘candidates’; the goal is to compare the profiles and to single out the matching ‘candidate’. Even if one deals with an open-set attribution case, where the list of possible candidates cannot be reliably established, the general idea does not differ substantially from other classification problems.

Exact science has developed a number of well-performing, sophisticated machine-learning algorithms, suitable for classification tasks, derived mostly from the fields of biometrics, nuclear physics, or software engineering, that could be easily adapted to authorship attribution. They include naïve Bayes classification, support vector machines, nearest shrunken centroids, or random forests, to name but a few (Mosteller and Wallace, 2007 [1964]; Jockers et al., 2008; Koppel et al., 2009; Tabata, 2012).

Independently, a ground-breaking monograph on Jane Austen published by Burrows (1987) ushered stylometry into literary criticism. It turned out that, from a literary perspective, matching profiles of ‘candidates’ is not as important as obtaining a broader picture of relations between different novels, types of narration, main characters’ voices, and so forth. The methods adopted or introduced by Burrows, Hoover, Craig, and others (Burrows, 1987, 2002, 2007; Hoover, 2003a, b; Craig and Kinney, 2009) were very intuitive and easily applicable to literary studies. These include principal components

Digital Scholarship in the Humanities, Vol. 32, No. 1, 2017. © The Author 2015. Published by Oxford University Press on behalf of EADH. All rights reserved. For Permissions, please email: journals.permissions@oup.com
doi:10.1093/llc/fqv061 Advance Access published on 2 December 2015

analysis, multidimensional scaling, cluster analysis, Delta, Zeta, and Iota. Despite their limitations (the lack of validation of the obtained results being the most obvious), they are still widely used. The reason for their popularity is that they meet the needs of literary scholars, also because they offer convincing visualizations.

Needless to say, visualization has an undeniable explanatory power. Scatterplots, maps, trees, and diagrams provide an insight into the whole corpus at one glance. Moreover, they allow one to draw conclusions about literature from a distant-reading perspective, through a visual interpretation of groupings and separations of several samples. Certainly, this is particularly desired in stylometry beyond authorship attribution. The attractiveness of visualization in computational literary criticism is confirmed not only by the aforementioned studies by Burrows or Hoover, but also by the immense popularity of beautiful yet relatively simple plots presented by Moretti, Jockers, Posavec, and others (Moretti, 2005; Posavec, 2007; Jockers, 2013; Sinclair and Rockwell, 2014). The aim of this article is to discuss reliability issues of a few visual techniques, and to enhance the explanatory power of visualization with a procedure of validation inspired by advanced statistical methods.

2 Reliability in Computational Stylistics¹

The question of reliability in non-traditional authorship attribution has been extensively discussed by Rudman (1998a,b, 2003), who formulated a number of caveats concerning corpus preparation, sampling, selection of style-markers, interpreting the results, etc. Rudman’s fundamental remarks, however, have not been preceded by empirical investigation. Experimental approaches to the problem of reliability include an application of recall/precision rates as a way of assessing the level of (un)certainty (Koppel et al., 2009), a study on different scalability issues in stylometry (Luyckx, 2010), a paper discussing the short sample effect and its impact on authorship attribution reliability (Eder, 2015), an experiment using intensive corpus re-composition to test whether the attribution accuracy depends on the particular constellation of texts used in the analysis (Eder and Rybicki, 2013), a study aimed at examining the performance of untidily prepared corpora (Eder, 2013a), and so forth.

Sophisticated machine-learning methods of classification routinely try to estimate the amount of potential error that may be due to inconsistencies in the analyzed corpus. A standard solution here is a 10-fold cross-validation, or 10 random swaps between two parts of a corpus: a subset of training texts and a subset of texts used in the testing procedure. Most unsupervised methods used in stylometry, such as principal components analysis, multidimensional scaling, or cluster analysis, lack this important feature. On the other hand, however, the results obtained using these techniques ‘speak for themselves’, which gives a practitioner an opportunity to notice with the naked eye any peculiarities or unexpected behavior in the analyzed corpus. Also, given a tree-like graphical representation of similarities between particular samples, one can easily interpret the results in terms of finding out the group of texts to which a disputed sample belongs.

Hierarchical cluster analysis, as discussed in the present study, is a technique which tries to find the most similar samples (e.g. literary texts) and builds a hierarchy of clusters, using a ‘bottom-up’ approach. What makes this method attractive is the very intuitive way of graphical representation of the obtained results: contrary to the scatterplots produced by multidimensional scaling or principal components analysis, where the goal is to interpret relative positions of several points settled on a rectangular plot, cluster analysis produces explicit links between neighboring items (see Figs 1–4). However, despite obvious advantages, some problems still remain unresolved. The final shape of a dendrogram highly depends on many factors, the most important being (1) the particular distance measure applied to the data, (2) the algorithm of grouping the samples into clusters, and (3) the number of variables (e.g. the most frequent words) to be analyzed. These factors will be briefly discussed below.

(1) In a study of multivariate text analysis using dendrograms, Burrows concludes, ‘my many


trials suggest that, for such data as we are examining, complete linkages, squared Euclidean distances, and standardized variables yield the most accurate results’ (Burrows, 2004, p. 326). The distance used by Burrows is a widely accepted solution in the field of computational stylistics; there are no studies, however, that would satisfactorily explain the principles of using this particular measure. Presumably, ‘standardized variables’ mean, in this context, relying on z-scores (i.e. scaled values) rather than on relative word frequencies. If this is true, the distance used here is in fact equivalent to the Linear Delta measure introduced by Argamon (2009, p. 134), a slightly modified version of the classic Delta measure as developed by Burrows (2002). There is no denying that Delta, and ipso facto the distance measure embedded in it, proved to be very effective, a fact confirmed by numerous stylometric studies; thus, it should also be applicable to the hierarchical cluster analysis procedure. Even if convincing at first glance, however, the choice of this particular measure needs to be theoretically justified and confirmed by empirical comparisons with other distances.

(2) Another factor affecting the final shape of a dendrogram is the method of linkage used. In the above-cited statement, Burrows favors the complete linkage algorithm as the most effective one. We do not know, however, which were the other algorithms considered by Burrows, and we do not know what method of comparison was used to test their effectiveness. In a similar study, Hoover argues that the best performance is provided by Ward’s linkage (Hoover, 2003b); his claim is confirmed by a concise comparison of Ward’s, complete, and average linkages. Good performance of Ward’s method has also been proven in many other applications within the field of quantitative linguistics, corpus linguistics, and related disciplines. Although it seems to be accurate indeed, there is no awareness, however, that this method has been designed for large-scale tests of more than 100 samples: for the sake of speed, the optimal clustering was not a priority (Ward, 1963, p. 236).

(3) Even if some issues still remain unresolved, scholars roughly agree that Euclidean (normalized) distance and Ward’s linking algorithm provide acceptable results. However, the same cannot be said about the third factor cluster analysis depends on, which is the number of features (e.g. frequent words) to be analyzed, and the type of countable features (e.g. words, word n-grams).

The question of how many features should be used for stylometric tests has been approached in many studies, but no consensus has been achieved: some scholars suggest using a small number of carefully selected words (often, function words), others prefer long vectors of words, and so on. Although all these solutions are reasonable and theoretically justified, the final choice of the number of features to analyze is a priori arbitrary. This problem is sometimes referred to as ‘cherry-picking’ (Rudman, 2003). Awareness of this issue, followed by a partial solution, can be observed in the studies by Hoover (2003a, b), who assesses a given corpus with a few discrete cluster analyses for different most frequent word (MFW) values. Even if still subject to arbitrary choices, this approach gives a fairly good insight into variability of the input data. This way of dealing with uncertainty will be discussed below in detail, with its possible extension to other visualization techniques.

3 Multilayer Model of Written Text

As will shortly be demonstrated, even the slightest change in the experiment setup might cause a severe reshaping of the final dendrogram. Without deciding which of the three factors discussed in the previous section (linkage algorithm, distance measure, and the number of words analyzed) is more likely to affect the final shape of a dendrogram, one must admit that the first two are related to the method of clustering, while the third factor is inherently linked to certain linguistic features of analyzed texts.
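The three factors named above can be made explicit as parameters of a single routine. The following is only a minimal sketch in Python with NumPy and SciPy, not the implementation used in this study: the function names `classic_delta` and `cluster_texts` and the random toy table are hypothetical, and classic Delta is assumed to be the mean absolute difference of z-scored frequencies of the most frequent words.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform


def classic_delta(freqs, n_mfw):
    """Pairwise classic Delta between texts (rows) over the n_mfw top words.

    freqs: relative word frequencies, one row per text, one column per word.
    Assumption: Delta = mean absolute difference of per-word z-scores.
    """
    # Factor 3: keep only the n_mfw words with the highest mean frequency.
    top = np.argsort(freqs.mean(axis=0))[::-1][:n_mfw]
    sub = freqs[:, top]
    # Standardize each word (column) to z-scores across the corpus.
    sd = sub.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0  # guard against words with constant frequency
    z = (sub - sub.mean(axis=0)) / sd
    # Factor 1: Manhattan distance divided by the number of words used.
    return squareform(pdist(z, metric="cityblock") / z.shape[1])


def cluster_texts(freqs, n_mfw=100, method="ward"):
    """Factor 2: agglomerative clustering with a chosen linkage method.

    Note: SciPy documents Ward's method as strictly defined for Euclidean
    distances only; it still runs on a precomputed Delta matrix.
    """
    condensed = squareform(classic_delta(freqs, n_mfw), checks=False)
    return linkage(condensed, method=method)


# Hypothetical toy corpus: 6 texts, 50 word types (random, not real data).
rng = np.random.default_rng(0)
toy = rng.dirichlet(np.ones(50), size=6)

# "weighted" is SciPy's name for McQuitty's (WPGMA) linkage.
for method in ("ward", "complete", "weighted"):
    for n_mfw in (10, 30, 50):
        tree = cluster_texts(toy, n_mfw, method)
        print(method, n_mfw, "height of final merge:", round(tree[-1, 2], 3))
```

Each call returns a SciPy linkage matrix that can be passed to `scipy.cluster.hierarchy.dendrogram` for plotting, so that the effect of changing any one of the three parameters can be inspected separately.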


Endless discussions of how many frequent words or n-grams should be taken into account (e.g. Mosteller and Wallace, 1964; Hoover, 2003a; Burrows, 2007; Koppel et al., 2009; Eder, 2013b; Schöch, 2013) show rather clearly that there is no universal frequency stratum where the authorial fingerprint is hidden. Just the opposite, it seems that the authorial signal is spread throughout the whole spectrum of frequent and not-so-frequent words, but at the same time it may become obscured by additional and unpredictable signals, which are considered noise in classical approaches to attribution. In stylometry beyond attribution, however, this ‘noise’ is worth a closer look. Why are some authors misclassified? Which texts are wrongly attributed to a given author, and why are they linked to this very author and not to others? These and similar questions are probably much more interesting than the never-ending fine-tuning of the parameters of this or that classification algorithm in order to neutralize the impact of the ‘noise’.

Obviously, the problem is not new. Cross-genre authorship attribution, for one, has always been a major challenge (Kestemont et al., 2012; Schöch, 2013). Also, there have been a few attempts to extract particular signals hidden in texts: author’s nationality (Jockers, 2013), psychological profile (Noecker et al., 2013), gender (Pennebaker, 2011), genre (Koppel et al., 2009), and translator’s fingerprint (Rybicki and Heydel, 2013). On theoretical grounds, function words should be responsible for authorial recognition, while content words should be more topic- and genre-related. The abovementioned empirical studies, however, do not really confirm this assumption. There is no clear rule here, and the same words are sometimes claimed to reveal different signals. For instance, the definite article ‘the’ is considered to discriminate British versus American flavors of English in one study (Jockers, 2013, p. 105), and female versus male language in another (Pennebaker, 2011, p. 42).

The difficulties with separating one specific signal suggest that a text (written or spoken) is a multilayer phenomenon, in which particular layers are correlated. These layers include authorship, chronology, personality, gender, topic, education, literary quality, translation (if applicable), intertextuality, literary tradition (e.g. sources of inspiration), and probably many more. Arguably, literary quality somehow depends on education, genre depends on topic, authorial voice is affected by chronology, gender affects personality, and so on. Some layers might be barely noticeable, and some others might become surprisingly strong. In authorship attribution, this complex system of uncontrollable layers is a problem of unwanted noise, and in literary-oriented computational stylistics, an opportunity to see more.

4 Dendrogram, or One Snapshot at a Time

Since particular frequency strata are responsible, to some extent, for different signals hidden in a literary text, the dendrograms generated using longer or shorter MFW vectors presumably will also be heterogeneous. And they actually are (Figs 1–4); the only problem is that their variability is much bigger than one could expect and, what is worse, the changes in dendrograms’ shapes are unpredictable. When different combinations of linkage algorithms, numbers of MFWs, and distance measures are applied, one obtains a convincing example of how unstable the final results might be.

It is worth noticing, however, that the authorial ‘leaves’ on the dendrograms are usually correctly clustered regardless of the parameters used. In Fig. 1 (Ward’s linkage, 100 MFWs), most of the authors are recognized to be stylistically homogeneous; the exceptions include Charles Dickens and Henry James. When the number of features increases to 300 MFWs, the ‘leaves’ of the dendrogram are matched with no misattributions (Fig. 2). In any attempts to visualize larger groupings of texts, however, one needs to admit that the ‘branches’ are significantly less predictable than the ‘leaves’: is Galsworthy stylometrically similar to George Eliot or to Joseph Conrad? Is Thackeray linked to Walter Scott or to Charles Dickens? What does the main division into two large clusters mean? Figures 1–4 might support many contradictory hypotheses.


Fig. 1. Cluster analysis of 66 English novels, 100 MFWs, classic Delta distance, Ward’s linkage. Color versions of all
figures are available online.

Fig. 2. Cluster analysis of 66 English novels, 300 MFWs, classic Delta distance, Ward’s linkage.

The problems do not end here: a detailed inspection of multiple dendrograms generated for a gradually increasing number of features (MFWs) shows that substantial rearrangements might occur quite suddenly. An example of this behavior is shown in Figs 3 and 4. Cluster analysis using McQuitty’s linkage and 136 MFWs (not shown) reveals a perfect authorial recognition, but when 137 MFWs are used, the cluster for Joseph Conrad is split into two parts and remains detached (along with many other substantial rearrangements of the corpus) until the same corpus is assessed at 969 MFWs (Fig. 3). Almayer’s Folly jumps back from Kipling’s branch to Conrad’s cluster exactly between the words 969 and 970 on the frequency list (Fig. 4). The knowledge that this 970th word is ‘wine’ does not help much, however, since multivariate analyses take into consideration a great number of features at a time. The word ‘wine’, not very discriminative itself, was the factor to tip the scale in favor of Conrad. What is more important here is the side-effect: apart from the local Kipling/Conrad change, the whole dendrogram has been severely affected and, in consequence, significantly reshaped. Such


Fig. 3. Cluster analysis of 66 English novels, 969 MFWs, classic Delta distance, McQuitty’s linkage.

Fig. 4. Cluster analysis of 66 English novels, 970 MFWs, classic Delta distance, McQuitty’s linkage.

abrupt changes seem to be a rule rather than the exception, at least for textual data sets.

The decision which of the dendrograms presented above reveal the actual separation of the samples and which show fake similarities is not trivial at all. Generating hundreds of dendrograms covering the whole spectrum of MFWs, a variety of linkage algorithms, and a number of distance measures would make this choice even more difficult. At this point, a stylometrist inescapably faces the abovementioned cherry-picking problem (Rudman, 2003). When it comes to choosing the plot that is the most likely to be ‘true’, scholars are often in danger of more or less unconsciously picking the one that looks more reliable than others, or that simply confirms their hypotheses. If common sense is used to evaluate the obtained plots, any counter-intuitive results will probably be dropped simply because they do not fit the scholars’ expectations. An interesting variant of cherry-picking is discussed by Vickers, who writes about the ‘visual rhetoric’ of different lines, arrows, colors, and so forth added to a graph; while helpful, at the same time they suggest apparent separations of samples (Vickers, 2011, p. 127).

4 Consensus Tree, or Many Dendrograms Combined

A partial solution of the cherry-picking problem involves combining the information revealed by numerous dendrograms into a single consensus plot. This technique has been developed in phylogenetics (Paradis et al., 2004) and later used to assess

differences between Papuan languages (Dunn et al., 2005). It has also been introduced into stylometry (Eder, 2013b) and applied in a number of stylometric studies (Rybicki, 2012; Rybicki and Heydel, 2013; van Dalen-Oskam, 2014). This approach assumes that, in a large number of ‘snapshots’ (e.g. for 100, 200, 300, 400, ..., 1,000 MFWs), actual groupings tend to reappear, and apparent similarities are likely to remain accidental. The goal, then, is to capture the robust patterns across a set of generated snapshots. The procedure is aimed at producing a number of virtual dendrograms, and then at evaluating robustness of groupings across these dendrograms. If a given link (say, between Richardson’s Pamela and Fielding’s Tom Jones) turns out to appear frequently enough, it is reproduced on a consensus plot. In other words, several regular (yet virtual) dendrograms ‘vote’ for the most robust links: the procedure summarizes the information on clustering from particular plots.

In Fig. 5, a consensus tree of the corpus of 66 English novels is shown (the ‘snapshots’ were computed for 100, 200, 300, etc. up to 1,000 MFWs). Some text groupings can be easily identified, including, among others, an expected cluster of the three Brontë sisters, and a branch of Kipling/Conrad, clearly subdivided into two distinct authorial voices. Unlike typical dendrograms, however, the established links do not represent stylometric distances between samples. Instead, they indicate the strength of the consensus, or the repetitiveness across a number of virtual ‘snapshot’ dendrograms.

Upgrading the procedure from a cherry-picked cluster analysis into a consensus tree is a significant step toward reliable stylometry. Such a tree captures the average behavior of a corpus for a given frequency range (in this case, 100–1,000 MFWs). More importantly, it filters out local disturbances (artifacts) that could otherwise be considered as valid results. Some arbitrary decisions cannot be avoided, though. They include the number of features to be assessed, the number of iterations (‘snapshots’) to produce a consensus tree, and, last but not least, the linkage algorithm embedded in the whole procedure. A relatively simple way to neutralize these issues is to reproduce a given experiment using different settings. Sooner or later, however, other limits of consensus tree approaches become painful, especially when the number of analyzed texts increases. The technique introduced below is aimed at overcoming these limits.

5 Consensus Network, or Importance of Runners-Up

Although the problem of unstable results can be partially by-passed using consensus techniques, two other issues remain unresolved. Firstly, when the number of analyzed samples exceeds a few dozen, the plot becomes cluttered and thus illegible. Secondly, the procedure of hammering out the consensus is aimed at identifying nearest neighbors only, which means extracting the strongest patterns (usually, the authorial signal) and filtering out weaker textual similarities. Consequently, samples on a consensus tree are very likely to be grouped into many discrete authorial clusters rather than into a few larger branches. When the number of analyzed texts is relatively small, the granulation of clusters is barely noticeable (Fig. 5); in large corpora, however, numerous little branches are linked directly to the root of the dendrogram. Useful in explanatory authorship attribution, such a plot will not support stylometric interpretations of similarities between texts, authors, genres, styles or literary epochs. Arguably, large-scale stylometry will be interested in deeper textual relations rather than in mere nearest neighborhood.

To overcome the two aforementioned issues, it seems reasonable to leverage the idea of consensus, in terms of embedding it into a flexible way of visualization. Techniques of network analysis seem to be particularly promising.

The concept of network has already been used to assess linguistic data: the applications included an analysis of syntactic structures in English (Cancho i Ferrer, 2005), syntactic structures in Czech, German and Romanian (Cancho i Ferrer et al., 2004), commonly occurring English adjectives and nouns (Newman, 2006, p. 14), and word associations (Lai et al., 2004; Lancichinetti, 2011, p. 17). Network analysis has also been used to compare differences between several texts in a corpus, namely, to


Fig. 5. Consensus tree of 66 English novels, 100–1,000 MFWs, classic Delta distance, Ward’s linkage.

investigate the process of word network growth given a number of n sequences (Caldeira et al., 2006), and recently to visualize relations in a corpus of a few hundred English novels (Jockers, 2013). The method introduced below is somewhat inspired by these studies. It relies on the assumption that particular texts can be represented as nodes of a network, and their explicit relations as links between these nodes. The most significant difference, however, between the approaches applied so far and the present study is the way in which the nodes are linked. This new procedure of linking is two-fold: one of the involved algorithms computes the distances between analyzed texts, the other is responsible for establishing a consensus of links.

A typical approach to authorship attribution involves a comparison of a disputed (anonymous) sample against a reference corpus, in order to identify the nearest neighbor of the disputed sample. To do this, stylometric distance between each pair of samples is estimated, and then the texts are ordered from the most to the least similar. To give an example: in the case of The Jungle Book by Kipling, the ranking begins with Kim (the nearest neighbor), the next is Captains Courageous, then Lord Jim by Conrad, and so on, and the last place in this procession is given to Gulliver’s Travels by Swift. Each text in the corpus is associated with its own ranking of neighbors, from the nearest to the farthest one.

Now, these rankings can be reused to produce a stylometric network. In a simple variant, the links would be established between nearest neighbors only: Kipling’s The Jungle Book connected to Kim, Hardy’s Far from the Madding Crowd connected to Jude the Obscure, and so forth. However, since in literature-oriented studies, weaker or hidden textual


Fig. 6. Two algorithms of mapping textual relations: establishing weighted links to a nearest neighbor and two runners-
up (top); producing a consensus network (bottom).

relations are potentially more interesting than explicit similarities, it makes sense to use the rankings more extensively. In stylometric terms, it means that runners-up (i.e. a few texts that have been ranked immediately after the nearest neighbor) should not be excluded from the analysis, even if, in typical approaches to classification, these runners-up are considered as unwanted noise and routinely filtered out.

Let the algorithm establish, then, for every single node, a strong connection to its nearest neighbor (i.e. the most similar text), and two weaker connections to the 1st and the 2nd runner-up. The outline of the algorithm is represented in Fig. 6 (top). Consequently, the final network will contain a number of weighted links, some of them being thicker (close similarities), some others revealing weaker connections between samples. Arguably, in most literary analyses, the thick connections will betray authorial similarities (usually the strongest stylometric signal), while thin links will reflect hidden layers of subtle intertextual correlations. In this article, it is assumed that three neighbors (a nearest one and its two runners-up) provide enough information about weaker similarities. However, one can set any number of neighbors to be connected. An empirical comparison of different ways of connecting the nodes will be discussed in a separate study.

The second algorithm (Fig. 6, bottom) is aimed at overcoming the problem of unstable results. It is an implementation of the idea of consensus dendrograms as discussed above into network analysis. The goal is to perform a large number of tests for similarity with different numbers of features analyzed (e.g. 100, 200, 300, ..., 1,000 MFWs). Finally, all the connections produced in particular ‘snapshots’ are added, resulting in a consensus network. Weights of these final connections tend to differ significantly: the strongest ones mean robust nearest neighbors, while weak links stand for secondary and/or accidental similarities. Validation of the results, or rather self-validation, is provided by the fact that consensus of many single approaches to the same corpus sanitizes robust textual similarities and filters out apparent clusterings.

With the two algorithms combined, one is presented with a robust picture of actual (strong) clusterings, emerging from an ethereal web of weaker stylistic similarities in the background. The above two-fold procedure of linking is implemented in the package ‘stylo’, an open-source stylometric library written in the R programming language (R Core Team, 2013) and available at CRAN repository (http://cran.r-project.org).²

The next crucial step in network analysis is to arrange the nodes on a plane in such a way that they reveal as much information about linkage as possible. Apart from very small networks that can be arranged manually, usually an algorithmic layout is applied. In the present study, one of the force-directed layouts was chosen, namely the algorithm


Fig. 7. Consensus network of 66 English novels: classic Delta distance, 100–1,000 MFWs, modularity 0.5.

ForceAtlas2 embedded in Gephi, an open-source tool for network manipulation and visualization (Bastian et al., 2009). Force-directed layouts perform a gravity-like simulation and pull the most-connected nodes (i.e. the ones that have several links and/or whose links are very strong) to the center of the network, while the least connected nodes are pushed outside.

A network produced using the above procedure is fairly informative per se: it usually reveals some clusterings discoverable with the naked eye, some centrally located nodes as well as peripheries, some denser and sparser areas, and so forth. At the same time, however, such a network can be subjected to a variety of standard measures used in network analysis, which make the interpretation of the results more complete. These include measures of network size, its density, centrality of the nodes (closeness, betweenness, degree), and others. The measure of modularity, used as a community detection tool, might be particularly helpful to interpret clusters of stylistically similar texts.

In Fig. 7, a network of 66 English novels produced using the above procedure is shown. Spatial arrangement of the nodes was established by the said force-directed layout, and the nodes’ colors were assigned according to the modularity measure. The network is clearly split into a few groups that obviously confirm the predominance of authorial signal in the data set. What is more interesting, however, are the relations between particular authorial clusters, and this is one notable advantage of networks over consensus trees. The outliers include Austen, Trollope, James, and Conrad, while the central parts are occupied mostly by the works of Dickens and Sterne. A circle of immediate satellites formed by Hardy, Galsworthy, the Brontës, Richardson, Fielding, and Thackeray is also noteworthy. Moreover, modularity-based color assignment sheds new light onto the already-interesting picture: while different works of a given author are usually recognized to form a distinct group, notable exceptions include a common cluster for Richardson, Fielding, Swift, and Scott; another common cluster is formed by the Brontë sisters, and the Dickensian oeuvre is split into two discrete groups (quite well connected with each other, though). Last but definitely not least, the network clearly shows a chronological pattern undiscoverable using consensus trees: a diagonal timeline beginning at the left side of the network, i.e. the 18th-century area occupied by Fielding, Richardson, and Swift, through the Victorians (roughly in the middle), all the way to the early modernist Joseph Conrad.

Modularity is not the only way in which stylistic properties of particular texts/nodes can be assessed.
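Some of the standard measures named on this page (density, degree, and modularity) can be sketched in a few lines. The example below is an illustration under stated assumptions, not the procedure used in Gephi: the toy edge list and the helper names are hypothetical, and the `modularity` function only scores a partition that is already given (Newman's Q for a weighted undirected network) rather than searching for the best one, as a community detection tool would.

```python
def density(nodes, edges):
    """Share of possible undirected links that are actually present."""
    n = len(nodes)
    return 2 * len(edges) / (n * (n - 1))


def weighted_degree(nodes, edges):
    """Sum of link weights attached to each node."""
    strength = {node: 0 for node in nodes}
    for (a, b), weight in edges.items():
        strength[a] += weight
        strength[b] += weight
    return strength


def modularity(nodes, edges, community):
    """Newman's Q of a given partition of a weighted undirected network."""
    total = sum(edges.values())
    strength = weighted_degree(nodes, edges)
    q = 0.0
    for label in set(community.values()):
        inside = sum(w for (a, b), w in edges.items()
                     if community[a] == label and community[b] == label)
        attached = sum(strength[n] for n in nodes if community[n] == label)
        q += inside / total - (attached / (2 * total)) ** 2
    return q


# Hypothetical consensus network: two strong authorial pairs, two thin links
nodes = ["A1", "A2", "B1", "B2"]
edges = {("A1", "A2"): 18, ("B1", "B2"): 18, ("A1", "B1"): 2, ("A2", "B2"): 2}

by_author = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}
mixed = {"A1": "x", "B1": "x", "A2": "y", "B2": "y"}

network_density = density(nodes, edges)
strength = weighted_degree(nodes, edges)
q_author = modularity(nodes, edges, by_author)
q_mixed = modularity(nodes, edges, mixed)
print(network_density, strength, q_author, q_mixed)
```

A partition that follows the strong links scores a higher Q than one that cuts across them, which is why modularity-based coloring tends to recover authorial groups when the authorial signal dominates.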


Another useful yet extremely simple measure is the degree, or the number of connections that a particular node has. The real potential of this measure, however, comes to the fore when the nodes are re-linked to form a directed consensus network.

6 Directed Network, or Seeking Stylistic Hubs

In the variant of a network discussed so far, all the connections of particular ‘snapshots’ were simply added, regardless of their direction. It means that any two nodes are connected no matter whether node A points to B as its neighbor or is pointed to by B. It is true that in most cases the relation between the nodes is mutual. However, since the rankings of candidates are calculated independently for every single text in a corpus, some non-symmetrical relations might occur as well. This is particularly the case when atypical texts are analyzed: such a text will point to its nearest neighbors anyway, but it would hardly be pointed to by other texts. Arguably, a directed network will discover such situations.

The procedure of establishing the connections does not differ from the undirected variant as introduced above, except that the direction of the links is recorded. Also, any mutual relations are not summed into one connection, but kept as two independent links: A → B and A ← B. Consequently, every single node will have, by definition, at least three outcoming links pointing to the nearest neighbor and to two runners-up. It is possible, however, that a minority of well-defined nodes might send numerous links in different directions, while others would constantly point to but three neighbors. And the other way around: it is possible that some nodes receive a vast majority of links from the entire network, while other nodes are never pointed to. In other words, measuring the number of connections of particular nodes should lead to identifying ‘hubs’, or texts that are stylistically followed (high incoming degree), and the stylistic followers (high outcoming degree).

In Fig. 8, a directed consensus network with node coloring according to outdegree is shown. One can easily identify a few hotspots: they represent the ‘radiating’ hubs, or the texts from which the number of outcoming links is the highest. These are: Dorian Gray by Wilde (12 links), Sentimental Journey by Sterne (10), Kim by Kipling (10), Tom Jones by Fielding (9), and Agnes Grey by Anne Brontë (9).

It is easy to explain the behavior of Dorian Gray and Tom Jones, one might say, since these are the only novels by Wilde and Fielding, respectively, included in the corpus. In the absence of natural nearest neighbors (i.e. other texts written by the same author) the analyzed novels blindly seek any similarities around. On the other hand, however, this does not apply to Wuthering Heights, the only novel of Emily Brontë: she turns out to be surprisingly introverted, with her mere five outcoming links, while her younger sister Anne sends links to nine novels by Austen, Eliot, Trollope, Dickens, and Charlotte Brontë. It is also surprising to see the extroversion of Sterne’s Sentimental Journey, especially when compared with the very modest behavior of Tristram Shandy.

Since the procedure of linking the nodes is based on classification principles, the existence of radiating hubs betrays the texts likely to be misclassified in a real-case authorship attribution study. A provisional interpretation of this phenomenon is that a given text turns into a radiating hub whenever it lacks a strong authorial signal, or when its authorial voice is overshadowed by other signals: genre, gender, chronology, and so forth. Needless to say, the ability to detect radiating hubs makes this technique a potentially useful addition to the authorship attribution toolbox, as a straightforward way to identify unstable samples.

From a literary point of view, however, the incoming links are potentially much more interesting, especially when they happen to form any ‘absorbing’ hubs. Such a hub represents a text pointed out as the nearest neighbor by several other texts from the corpus. The measure of incoming links, or indegree, applied to the corpus of 66 English novels is represented in Fig. 9. Two major absorbing hubs can immediately be spotted; they focus on two novels by Dickens, David Copperfield and Little Dorrit. Two other hotspots are also fairly noticeable, namely



Fig. 8. Consensus network of 66 English novels (directed): the degree of outcoming links marked in color.

Fig. 9. Consensus network of 66 English novels (directed): the degree of incoming links marked in color.
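The outdegree and indegree shown in Figs 8 and 9 can be illustrated with a minimal sketch of a directed consensus network. The snapshot rankings below are hypothetical, as are the function names and the assumed link weights (3, 2, 1 for the nearest neighbor and the two runners-up); degree is counted here as the number of distinct outgoing or incoming links, which is one possible reading of the link counts reported in the text.

```python
from collections import defaultdict


def directed_consensus(snapshots, k=3):
    """Add up directed links; A -> B and B -> A are kept as separate links."""
    links = defaultdict(int)
    for ranking in snapshots:
        for source, neighbours in ranking.items():
            for rank, target in enumerate(neighbours[:k]):
                links[(source, target)] += k - rank   # assumed weights 3, 2, 1
    return dict(links)


def degrees(links):
    """Count distinct incoming and outgoing links of every node."""
    nodes = {node for link in links for node in link}
    indegree = {node: 0 for node in nodes}
    outdegree = {node: 0 for node in nodes}
    for source, target in links:
        outdegree[source] += 1
        indegree[target] += 1
    return indegree, outdegree


# Two hypothetical snapshots: each text lists its neighbors, nearest first.
# "W" stands for a text with no same-author companion in the corpus.
snapshots = [
    {"D1": ["D2", "E1", "E2"], "D2": ["D1", "E1", "E2"],
     "E1": ["E2", "D1", "D2"], "E2": ["E1", "D1", "D2"],
     "W": ["D1", "E1", "D2"]},
    {"D1": ["D2", "E1", "E2"], "D2": ["D1", "E2", "E1"],
     "E1": ["E2", "D1", "D2"], "E2": ["E1", "D1", "D2"],
     "W": ["E2", "D1", "E1"]},
]
links = directed_consensus(snapshots)
indegree, outdegree = degrees(links)
print("outdegree:", outdegree)
print("indegree:", indegree)
```

In this toy example the unaccompanied text "W" changes its neighbors between snapshots and so collects more outgoing links than any other node, while receiving none: the profile of a radiating hub.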


Middlemarch by Eliot and Nicholas Nickleby, again by Dickens. Poorly connected novels found their place on the other pole of the indegree measure: Dorian Gray by Wilde (no incoming links at all), Sterne’s Sentimental Journey (a single yet very strong incoming link from Tristram Shandy), and Swift’s Gulliver (a single strong link from A Tale of a Tub).

Unlike radiating hubs, the absorbing ones are harder to interpret. In social sciences, physics, etc., the hubs are usually considered to betray the most important events/agents/phenomena. In stylistics, however, what they really mean remains largely open to dispute. Jockers’s approach to the question of literary influence seems to assume that the hubs indicate the most influential works (Jockers, 2013, pp. 154–168). Arguably, however, the picture is far more complex here.

The most striking observation is that according to the incoming links, Dickens would have had to live much earlier to have influenced Richardson, Sterne, or Swift. Is it the method, then, that is wrong, or the interpretation? In the aforementioned study on literary imitation, Jockers filters out all the textual similarities that could not have happened due to chronological reasons, before undertaking actual analysis (Jockers, 2013, p. 163). However, discarding the backward time links cannot deny the fact that they do appear in the corpus.

It seems reasonable to assume that the absorbing hubs should be interpreted as sources of stylistic influence in a very broad sense, for instance as witnesses of the stylistic mood of an entire literary epoch. It is true that these hubs might indeed indicate the most influential texts (copied, paraphrased, sequelled, consciously/unconsciously imitated, and so forth). At the same time, however, they might also reflect texts stylistically ‘average’, typical for their times rather than exceptional. In any case, the absorbing hubs betray texts lacking in a single, distinct stylistic signal.

A slightly oversimplified interpretation of both types of hubs might be as follows. The absorbing hubs stand for receivers of stylistic appreciation (regardless of their actual stylistic quality), while radiating hubs represent emitters of stylistic appreciation (not mere followers, though, since they do not follow a single author).

7 Conclusions

In the present study, a few reliability issues of explanatory methods used in stylometry were discussed. They include unstable output (because final results depend heavily on the setup of the experiment) as well as lack of validation. A promising way of extending cluster analysis dendrograms with a self-validating procedure involved producing numerous particular ‘snapshots’, or dendrograms produced using different input parameters, and combining them all into the form of a consensus tree. This approach, however, inherits some drawbacks of cluster analysis (dependence on a chosen linkage algorithm being the most painful) and introduces a few new pitfalls: granulation of clusters, and cluttered visualization when a corpus becomes large.

Significantly better results were obtained using a new visualization technique, which combines the idea of nearest neighborhood derived from cluster analysis, the idea of hammering out a clustering consensus from bootstrap consensus trees, and the idea of mapping textual similarities onto a network. Additionally, network analysis seems to be a good solution for large data sets.

The added value of consensus trees over standard dendrograms is the reliability of the results represented in a plot, and the added value of stylometric consensus networks is at least three-fold: the reliability inherited from consensus trees, insight into a more complete picture of textual relations beyond mere nearest neighborhood, and, last but not least, the capability of handling dozens, or even hundreds, of text samples in a single plot. The only limitation here seems to be the paper size one wants to use for drawing a literary network. Regardless of the printing issues, however, the aim of this study was to encourage stylometrists to produce a reliable map of literature in its entirety, and to propose a methodological background for such a map.

Acknowledgements

The idea of enhancing clustering procedures with network analysis was developed during my visiting fellowship at the eHumanities Group (Royal Netherlands Academy of Arts and Sciences, The Netherlands): I am grateful to Sally Wyatt, Andrea Scharnhorst, and Karina van Dalen-Oskam for the many inspiring discussions we had during my stay in Amsterdam. I am also grateful to the anonymous reviewers of this article for their valuable suggestions.

References

Argamon, S. (2008). Interpreting Burrows’s delta: Geometric and probabilistic foundations. Literary and Linguistic Computing, 23(2): 131–47.
Bastian, M., Heymann, S. and Jacomy, M. (2009). Gephi: An open source software for exploring and manipulating networks. In: International AAAI Conference on Weblogs and Social Media. http://www.aaai.org/ocs/index.php/ICWSM/09/paper/view/154 (accessed 30 June 2014).
Burrows, J. (1987). Computation into Criticism: A Study of Jane Austen’s Novels and an Experiment in Method. Oxford: Clarendon Press.
Burrows, J. (2002). ‘Delta’: A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3): 267–87.
Burrows, J. (2004). Textual analysis. In: Schreibman, S., Siemens, R. and Unsworth, J. (eds), A Companion to Digital Humanities. Oxford: Blackwell, pp. 323–47.
Burrows, J. (2007). All the way through: Testing for authorship in different frequency strata. Literary and Linguistic Computing, 22(1): 27–48.
Caldeira, S. M. G., Petit Lobão, T. C., Andrade, R. F. S., Neme, A. and Miranda, J. G. V. (2006). The network of concepts in written texts. European Physical Journal B, 49: 523–29.
Cancho i Ferrer, R. (2005). The structure of syntactic dependency networks: Insights from recent advances in network theory. In: Levickij, V. and Altman, G. (eds), Problems of Quantitative Linguistics. Chernivtsi: Ruta, pp. 60–75.
Cancho i Ferrer, R., Solé, R. V. and Köhler, R. (2004). Patterns in syntactic dependency networks. Physical Review E, 69: 1–8.
Craig, H. and Kinney, A. F. (2009). Shakespeare, Computers, and the Mystery of Authorship. Cambridge: Cambridge University Press.
Dalen-Oskam, K. van (2014). Epistolary voices: The case of Elisabeth Wolff and Agatha Deken. Literary and Linguistic Computing, 29(3): 443–51.
Dunn, M., Terrill, A., Reesink, G., Foley, R. and Levinson, S. (2005). Structural phylogenetics and the reconstruction of ancient language history. Science, 309: 2072–75.
Eder, M. (2013a). Mind your corpus: Systematic errors in authorship attribution. Literary and Linguistic Computing, 28(4): 604–14.
Eder, M. (2013b). Computational stylistics and Biblical translation: How reliable can a dendrogram be? In: Piotrowski, T. and Grabowski, L. (eds), The Translator and the Computer. Wrocław: WSF Press, pp. 155–70.
Eder, M. (2015). Does size matter? Authorship attribution, small samples, big problem. Digital Scholarship in the Humanities, 30(2): 167–82.
Eder, M. and Rybicki, J. (2013). Do birds of a feather really flock together, or how to choose test samples for authorship attribution. Literary and Linguistic Computing, 28(2): 229–36.
Hoover, D. (2003a). Multivariate analysis and the study of style variation. Literary and Linguistic Computing, 18(4): 341–60.
Hoover, D. (2003b). Frequent collocations and authorial style. Literary and Linguistic Computing, 18(3): 261–86.
Jockers, M. (2013). Macroanalysis: Digital Methods and Literary History. Champaign: University of Illinois Press.
Jockers, M., Witten, D. and Criddle, C. (2008). Reassessing authorship of the ‘Book of Mormon’ using delta and nearest shrunken centroid classification. Literary and Linguistic Computing, 23(4): 465–91.
Kestemont, M., Luyckx, K., Daelemans, W. and Crombez, T. (2012). Cross-genre authorship verification using unmasking. English Studies, 93(3): 340–56.
Koppel, M., Schler, J. and Argamon, S. (2009). Computational methods in authorship attribution. Journal of the American Society for Information Science and Technology, 60: 9–26.
Lai, Y.-C., Motter, A. E. and Nishikawa, T. (2004). Attacks and cascades in complex networks. In: Ben-Naim, E., Frauenfelder, H. and Toroczkai, Z. (eds), Complex Networks. Berlin–Heidelberg: Springer, pp. 299–310.
Lancichinetti, A., Radicchi, F., Ramasco, J. J. and Fortunato, S. (2011). Finding statistically significant communities in networks. PLoS One, 6(4): 1–18.
Luyckx, K. (2010). Scalability Issues in Authorship Attribution. Diss. University. Antwerpen.
Moretti, F. (2005). Graphs, Maps, Trees: Abstract Models for a Literary History. London–New York: Verso.
Mosteller, F. and Wallace, D. (1964). Inference and Disputed Authorship: The Federalist. Reprinted with a new introduction by John Nerbonne. Stanford: CSLI Publications, 2007.
Newman, M. E. J. (2006). Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74: 1–19.
Noecker, J., Ryan, M. and Juola, P. (2013). Psychological profiling through textual analysis. Literary and Linguistic Computing, 28(3): 382–87.
Paradis, E., Claude, J. and Strimmer, K. (2004). APE: Analyses of phylogenetics and evolution in R language. Bioinformatics, 20: 289–90.
Pennebaker, J. W. (2011). The Secret Life of Pronouns: What our Words Say About Us. New York: Bloomsbury Press.
Posavec, S. (2007). Literary organism. http://www.stefanieposavec.co.uk (accessed 30 June 2014).
R Core Team (2013). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.r-project.org/ (accessed 30 June 2014).
Rudman, J. (1998a). Non-traditional authorship attribution studies in the ‘Historia Augusta’: Some caveats. Literary and Linguistic Computing, 13(3): 151–57.
Rudman, J. (1998b). The state of authorship attribution studies: Some problems and solutions. Computers and the Humanities, 31: 351–65.
Rudman, J. (2003). Cherry picking in nontraditional authorship attribution studies. Chance, 16: 26–32.
Rybicki, J. (2012). The great mystery of the (almost) invisible translator: Stylometry in translation. In: Oakes, M. and Ji, M. (eds), Quantitative Methods in Corpus-Based Translation Studies. Amsterdam: John Benjamins, pp. 231–50.
Rybicki, J. and Heydel, M. (2013). The stylistics and stylometry of collaborative translation: Woolf’s ‘Night and Day’. Literary and Linguistic Computing, 28(4): 708–17.
Schöch, C. (2013). Fine-tuning our stylometric tools: Investigating authorship, genre, and form in French classical theater. In: Digital Humanities 2013: Conference Abstracts. University of Nebraska–Lincoln, pp. 383–86.
Sinclair, S. and Rockwell, G. (2014). Voyant tools. http://voyant-tools.org (accessed 30 June 2014).
Tabata, T. (2012). Approaching Dickens’ style through random forests. In: Digital Humanities 2012: Conference Abstracts. Hamburg: University of Hamburg, pp. 388–91.
Vickers, B. (2011). Shakespeare and authorship studies in the twenty-first century. Shakespeare Quarterly, 62: 106–42.
Ward, J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58: 236–44.

Notes

1 An earlier version of Section 2 was published in a paper discussing relations between the Greek New Testament and its Latin translation (Eder, 2013b).
2 The newest versions of the package ‘stylo’ are posted at the Computational Stylistics Group webpage (https://sites.google.com/site/computationalstylistics/), with a concise manual, installation instructions, and other supplementary materials.
