Digital Scholarship in the Humanities, Vol. 32, No. 1, 2017. © The Author 2015. Published by Oxford University
Press on behalf of EADH. All rights reserved. For Permissions, please email: journals.permissions@oup.com
doi:10.1093/llc/fqv061 Advance Access published on 2 December 2015
Visualization in stylometry
analysis, multidimensional scaling, cluster analysis, Delta, Zeta, and Iota. Despite their limitations (the lack of validation of the obtained results being the most obvious), they are still widely used.

re-composition to test whether the attribution accuracy depends on particular constellation of texts used in the analysis (Eder and Rybicki, 2013), a study aimed to examine the performance of untidily
trials suggest that, for such data as we are examining, complete linkages, squared Euclidean distances, and standardized variables yield the most accurate results’

than 100 samples: for the sake of speed, the optimal clustering was not a priority (Ward, 1963, p. 236).

(3) Even if some issues still remain unresolved,
Endless discussions of how many frequent words or n-grams should be taken into account (e.g. Mosteller and Wallace, 1964; Hoover, 2003a; Burrows, 2007; Koppel et al., 2009; Eder, 2013b;

literary tradition (e.g. sources of inspiration), and probably many more. Arguably, literary quality somehow depends on education, genre depends on topic, authorial voice is affected by chronology,
Fig. 2. Cluster analysis of 66 English novels, 300 MFWs, classic Delta distance, Ward’s linkage.
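The 'classic Delta distance' named in the caption is commonly computed as the mean absolute difference between z-scored word frequencies. The following is a minimal sketch of that calculation only, not the code behind the figure; the function name, the toy frequency table, and the use of the sample standard deviation are assumptions made for illustration.

```python
from statistics import mean, stdev

def classic_delta(freqs, n_mfw):
    """Pairwise classic Delta over the first n_mfw words.

    freqs: dict mapping a text label to a list of relative word
           frequencies, ordered from the most to the least frequent word.
    """
    labels = sorted(freqs)
    zscores = {label: [] for label in labels}
    for col in range(n_mfw):
        # standardize each word (column) across the whole corpus
        column = [freqs[label][col] for label in labels]
        mu, sigma = mean(column), stdev(column)
        for label in labels:
            # a word with no variation contributes nothing
            z = (freqs[label][col] - mu) / sigma if sigma > 0 else 0.0
            zscores[label].append(z)
    # Delta: mean absolute difference of z-scores for each pair of texts
    return {
        (a, b): mean(abs(x - y) for x, y in zip(zscores[a], zscores[b]))
        for i, a in enumerate(labels)
        for b in labels[i + 1:]
    }

# Toy frequencies (invented): two texts by "A", one by "B"
toy = {
    "A1": [0.060, 0.031, 0.020],
    "A2": [0.061, 0.030, 0.021],
    "B1": [0.050, 0.040, 0.010],
}
distances = classic_delta(toy, n_mfw=3)
```

The resulting pairwise distances are what a linkage algorithm would then turn into a dendrogram; changing `n_mfw` changes the distances, which is the source of the instability discussed in the text.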
The problems do not end here: a detailed inspection of multiple dendrograms generated for a gradually increasing number of features (MFWs) shows that substantial rearrangements might occur quite suddenly. An example of this behavior is shown in Figs 3 and 4. Cluster analysis using McQuitty’s linkage and 136 MFWs (not shown) reveals a perfect authorial recognition, but when 137 MFWs are used, the cluster for Joseph Conrad is split into two parts and remains detached (along with many other substantial rearrangements of the corpus) until the same corpus is assessed at 969 MFWs (Fig. 3). Almayer’s Folly jumps back from Kipling’s branch to Conrad’s cluster exactly between the words 969 and 970 on the frequency list (Fig. 4). The knowledge that this 970th word is ‘wine’ does not help much, however, since multivariate analyses take into consideration a great number of features at a time. The word ‘wine’, not very discriminative itself, was the factor to tip the scale in favor of Conrad. What is more important here is the side-effect: apart from the local Kipling/Conrad change, the whole dendrogram has been severely affected and, in consequence, significantly reshaped. Such abrupt changes seem to be a rule rather than the exception, at least for textual data sets.

Fig. 4. Cluster analysis of 66 English novels, 970 MFWs, classic Delta distance, McQuitty’s linkage.

Deciding which of the dendrograms presented above reveal the actual separation of the samples and which show fake similarities is not trivial at all. Generating hundreds of dendrograms covering the whole spectrum of MFWs, a variety of linkage algorithms, and a number of distance measures, would make this choice even more difficult. At this point, a stylometrist inescapably faces the abovementioned cherry-picking problem (Rudman, 2003). When it comes to choosing the plot that is the most likely to be ‘true’, scholars are often in danger of more or less unconsciously picking the one that looks more reliable than others, or that simply confirms their hypotheses. If common sense is used to evaluate the obtained plots, any counter-intuitive results will probably be dropped simply because they do not fit the scholars’ expectations. An interesting variant of cherry-picking is discussed by Vickers, who writes about the ‘visual rhetoric’ of different lines, arrows, colors, and so forth added to a graph; while helpful, at the same time they suggest apparent separations of samples (Vickers, 2011, p. 127).

4 Consensus Tree, or Many Dendrograms Combined

A partial solution of the cherry-picking problem involves combining the information revealed by numerous dendrograms into a single consensus plot. This technique has been developed in phylogenetics (Paradis et al., 2004) and later used to assess differences between Papuan languages (Dunn et al., 2005). It has also been introduced into stylometry (Eder, 2013b) and applied in a number of stylometric studies (Rybicki, 2012; Rybicki and Heydel, 2013;

however, other limits of consensus tree approaches become painful, especially when the number of analyzed texts increases. The technique introduced below is aimed at overcoming these limits.
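As a rough illustration of the consensus-tree idea in this section (and not of the network technique that follows), one can count how often each group of texts appears as a branch across many dendrograms and keep only the groups that recur. The sketch below assumes a majority-rule criterion, which the text does not specify; the nested-tuple tree format, the function names, and the toy dendrograms are likewise invented for illustration.

```python
from collections import Counter

def clades(tree):
    """Return (leaves, clades) for a dendrogram given as nested tuples.

    A leaf is a string label; an internal node is a tuple of subtrees.
    Each clade is the frozenset of leaves below one internal node.
    """
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found

def majority_clades(trees, threshold=0.5):
    """Keep the clades present in more than `threshold` of the trees."""
    counts = Counter()
    for tree in trees:
        all_leaves, found = clades(tree)
        found.discard(all_leaves)  # the root groups everything: trivial
        counts.update(found)
    return {c for c, n in counts.items() if n / len(trees) > threshold}

# Three toy dendrograms of four texts (K1, K2, C1, C2), e.g. produced
# with three different numbers of MFWs; in the third one C1 has drifted
# toward the K branch.
toy_trees = [
    (("K1", "K2"), ("C1", "C2")),
    (("K1", "K2"), ("C1", "C2")),
    ((("K1", "K2"), "C1"), "C2"),
]
stable = majority_clades(toy_trees)
```

Here the grouping of K1 with K2 and of C1 with C2 survives, while the one-off branch containing K1, K2, and C1 is filtered out; a full consensus tree would additionally assemble the surviving groups into a single plot.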
investigate the process of word network growth given a number of n sequences (Caldeira et al., 2006), and recently to visualize relations in a corpus of a few hundred English novels (Jockers, 2013). The method introduced below is somewhat inspired by these studies. It relies on the assumption that particular texts can be represented as nodes of a network, and their explicit relations as links between these nodes. The most significant difference, however, between the approaches applied so far and the present study is the way in which the nodes are linked. This new procedure of linking is two-fold: one of the involved algorithms computes the distances between analyzed texts, the other is responsible for establishing a consensus of links.

A typical approach to authorship attribution involves a comparison of a disputed (anonymous) sample against a reference corpus, in order to identify the nearest neighbor of the disputed sample. To do this, stylometric distance between each pair of samples is estimated, and then the texts are ordered from the most to the least similar. To give an example: in the case of The Jungle Book by Kipling, the ranking begins with Kim (the nearest neighbor), the next is Captains Courageous, then Lord Jim by Conrad, and so on, and the last place in this procession is given to Gulliver’s Travels by Swift. Each text in the corpus is associated with its own ranking of neighbors, from the nearest to the farthest one.

Now, these rankings can be reused to produce a stylometric network. In a simple variant, the links would be established between nearest neighbors only: Kipling’s The Jungle Book connected to Kim, Hardy’s Far from the Madding Crowd connected to Jude the Obscure, and so forth. However, since in literature-oriented studies, weaker or hidden textual
relations are potentially more interesting than explicit similarities, it makes sense to use the rankings more extensively. In stylometric terms, it means that runners-up (i.e. a few texts that have been ranked immediately after the nearest neighbor) should not be excluded from the analysis, even if, in typical approaches to classification, these runners-up are considered as unwanted noise and routinely filtered out.

Let the algorithm establish, then, for every single node, a strong connection to its nearest neighbor (i.e. the most similar text), and two weaker connections to the 1st and the 2nd runner-up. The outline of the algorithm is represented in Fig. 6 (top). Consequently, the final network will contain a number of weighted links, some of them being thicker (close similarities), others revealing weaker connections between samples. Arguably, in most literary analyses, the thick connections will betray authorial similarities (usually the strongest stylometric signal), while thin links will reflect hidden layers of subtle intertextual correlations. In this article, it is assumed that three neighbors (a nearest one and its two runners-up) provide enough information about weaker similarities. However, one can set any number of neighbors to be connected. An empirical comparison of different ways of connecting the nodes will be discussed in a separate study.

The second algorithm (Fig. 6, bottom) is aimed at overcoming the problem of unstable results. It is an implementation of the idea of consensus dendrograms as discussed above into network analysis. The goal is to perform a large number of tests for similarity with different numbers of features analyzed (e.g. 100, 200, 300, . . . , 1,000 MFWs). Finally, all the connections produced in particular ‘snapshots’ are added, resulting in a consensus network. Weights of these final connections tend to differ significantly: the strongest ones mean robust nearest neighbors, while weak links stand for secondary and/or accidental similarities. Validation of the results, or rather self-validation, is provided by the fact that consensus of many single approaches to the same corpus sanitizes robust textual similarities and filters out apparent clusterings.

With the two algorithms combined, one is presented with a robust picture of actual (strong) clusterings, emerging from an ethereal web of weaker stylistic similarities in the background. The above two-fold procedure of linking is implemented in the package ‘stylo’, an open-source stylometric library written in the R programming language (R Core Team, 2013) and available at CRAN repository (http://cran.r-project.org).2

The next crucial step in network analysis is to arrange the nodes on a plane in such a way that they reveal as much information about linkage as possible. Apart from very small networks that can be arranged manually, usually an algorithmic layout is applied. In the present study, one of the force-directed layouts was chosen, namely the algorithm
ForceAtlas2 embedded in GEPHI, an open-source tool for network manipulation and visualization (Bastian et al., 2009). Force-directed layouts perform gravity-like simulation and pull the most-connected nodes (i.e. the ones that have several links and/or their links are very strong) to the center of the network, while the least connected nodes are pushed outside.

A network produced using the above procedure is fairly informative per se: it usually reveals some clusterings discoverable with the naked eye, some centrally located nodes as well as peripheries, some denser and sparser areas, and so forth. At the same time, however, such a network can be subjected to a variety of standard measures used in network analysis, which make the interpretation of the results more complete. These include measures of network size, its density, centrality of the nodes (closeness, betweenness, degree), and others. The measure of modularity, used as a community detection tool, might be particularly helpful to interpret clusters of stylistically similar texts.

In Fig. 7, a network of 66 English novels produced using the above procedure is shown. Spatial arrangement of the nodes was established by the said force-directed layout, and the nodes’ colors were assigned according to the modularity measure. The network is clearly split into a few groups that obviously confirm the predominance of authorial signal in the data set. What is more interesting, however, are the relations between particular authorial clusters, and this is one notable advantage of networks over consensus trees. The outliers include Austen, Trollope, James, and Conrad, while the central parts are occupied mostly by the works of Dickens and Sterne. A circle of immediate satellites formed by Hardy, Galsworthy, the Brontës, Richardson, Fielding, and Thackeray is also noteworthy. Moreover, modularity-based color assignment sheds new light onto the already-interesting picture: while different works of a given author are usually recognized to form a distinct group, notable exceptions include a common cluster for Richardson, Fielding, Swift, and Scott; another common cluster is formed by the Brontë sisters, and the Dickensian oeuvre is split into two discrete groups (quite well connected with each other, though). Last but definitely not least, the network clearly shows a chronological pattern undiscoverable using consensus trees: a diagonal timeline beginning at the left side of the network, i.e. the late 18th-century area occupied by Fielding, Richardson, and Swift, through the Victorians (roughly in the middle), all the way to the early modernist Joseph Conrad.

Modularity is not the only way in which stylistic properties of particular texts/nodes can be assessed.
Another useful yet extremely simple measure is the degree or the number of connections that a particular node has. The real potential of this measure, however, comes on stage when the nodes are

easily identify a few hotspots: they represent the ‘radiating’ hubs, or the texts from which the number of outgoing links is the highest. These are: Dorian Gray by Wilde (12 links), Sentimental
Fig. 9. Consensus network of 66 English novels (directed): the degree of incoming links marked in color.
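A directed consensus network of the kind shown in Fig. 9 could be assembled along the following lines. This is a minimal sketch of the two-fold linking procedure described in the text, not the implementation in the ‘stylo’ package: the link weights 3, 2, and 1, the function names, and the toy distance tables are assumptions made for illustration, and each snapshot is taken to be a ready-made table of pairwise distances for one number of MFWs.

```python
from collections import defaultdict

def neighbor_links(dist, weights=(3, 2, 1)):
    """One snapshot: link each text to its nearest neighbor and runners-up.

    dist: dict text -> dict other_text -> distance.
    weights: assumed weights for the nearest neighbor, the 1st and the
             2nd runner-up (illustrative values only).
    """
    links = {}
    for text, row in dist.items():
        ranking = sorted((d, other) for other, d in row.items() if other != text)
        for weight, (_, other) in zip(weights, ranking):
            links[(text, other)] = weight  # directed: text -> neighbor
    return links

def consensus_network(snapshots, weights=(3, 2, 1)):
    """Add up the link weights obtained in all snapshots."""
    total = defaultdict(int)
    for dist in snapshots:
        for edge, weight in neighbor_links(dist, weights).items():
            total[edge] += weight
    return dict(total)

def degrees(edges):
    """Number of distinct outgoing and incoming links per node."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for source, target in edges:
        out_deg[source] += 1
        in_deg[target] += 1
    return dict(out_deg), dict(in_deg)

def table(pairs):
    """Expand {(a, b): d} into a symmetric nested dict."""
    dist = defaultdict(dict)
    for (a, b), d in pairs.items():
        dist[a][b] = d
        dist[b][a] = d
    return dict(dist)

# Two toy snapshots (invented distances for five texts A..E)
snap_1 = table({("A", "B"): 0.20, ("A", "C"): 0.50, ("A", "D"): 0.80,
                ("A", "E"): 0.90, ("B", "C"): 0.40, ("B", "D"): 0.70,
                ("B", "E"): 0.95, ("C", "D"): 0.60, ("C", "E"): 0.85,
                ("D", "E"): 0.30})
snap_2 = table({("A", "B"): 0.25, ("A", "C"): 0.45, ("A", "D"): 0.90,
                ("A", "E"): 0.80, ("B", "C"): 0.50, ("B", "D"): 0.75,
                ("B", "E"): 0.90, ("C", "D"): 0.55, ("C", "E"): 0.85,
                ("D", "E"): 0.35})

network = consensus_network([snap_1, snap_2])
out_deg, in_deg = degrees(network)
```

In this toy example the link from A to B is reinforced in both snapshots and ends up strong, whereas the links from A to D and from A to E each appear only once and stay weak. A node's out-degree grows when its runners-up change between snapshots, which is one way a text can become a ‘radiating’ hub; in-degree counts how many other texts point to it.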
Netherlands): I am grateful to Sally Wyatt, Andrea Scharnhorst, and Karina van Dalen-Oskam for the many inspiring discussions we had during my stay in Amsterdam. I am also grateful to the anonymous

Dunn, M., Terrill, A., Reesink, G., Foley, R. and Levinson, S. (2005). Structural phylogenetics and the reconstruction of ancient language history. Science, 309: 2072–75.
Mosteller, F. and Wallace, D. (1964). Inference and Disputed Authorship: The Federalist. Reprinted with a new introduction by John Nerbonne. Stanford: CSLI Publications, 2007.

Based Translation Studies. Amsterdam: John Benjamins, pp. 231–50.

Rybicki, J. and Heydel, M. (2013). The stylistics and stylometry of collaborative translation: Woolf’s ‘Night and