Kauhanen - Walkden - 2017 - Deriving The Constant Rate Effect

Nat Lang Linguist Theory
DOI 10.1007/s11049-017-9380-1
Deriving the Constant Rate Effect
Henri Kauhanen1 George Walkden2
Received: 10 December 2015 / Accepted: 24 April 2017

The Author(s) 2017. This article is published with open access at Springerlink.com
Abstract The Constant Rate Hypothesis (Kroch 1989) states that when grammar
competition leads to language change, the rate of replacement is the same in all con-
texts affected by the change (the Constant Rate Effect, or CRE). Despite nearly three
decades of empirical work into this hypothesis, the theoretical foundations of the
CRE remain problematic: it can be shown that the standard way of operationalizing
the CRE via sets of independent logistic curves is neither sufficient nor necessary for
assuming that a single change has occurred. To address this problem, we introduce
a mathematical model of the CRE by augmenting Yangs (2000) variational learner
with production biases over an arbitrary number of linguistic contexts. We show that
this model naturally gives rise to the CRE and prove that under our model the time
separation possible between any two reflexes of a single underlying change neces-
sarily has a finite upper bound, inversely proportional to the rate of the underlying
change. Testing the predictions of this time separation theorem against three case
studies, we find that our model gives fits which are no worse than regressions con-
ducted using the standard operationalization of CREs. However, unlike the standard
operationalization, our more constrained model can correctly differentiate between
actual CREs and pseudo-CREspatterns in usage data which are superficially con-
nected by similar rates of change yet clearly not unified by a single underlying cause.
More generally, we probe the effects of introducing context-specific production bi-
ases by conducting a full bifurcation analysis of the proposed model. In particular,
this analysis implies that a difference in the weak generative capacity of two com-
B H. Kauhanen
henri.kauhanen@manchester.ac.uk
G. Walkden
george.walkden@uni-konstanz.de
1 Linguistics and English Language, School of Arts, Languages and Cultures, The University of
Manchester, Oxford Road, Manchester, M13 9PL, United Kingdom
2 Department of Linguistics, University of Konstanz, Universittsstrae 10, 78464 Konstanz,
Germany
H. Kauhanen, G. Walkden
peting grammars is neither a sufficient nor a necessary condition of language change

when contextual effects are present.
Keywords Constant Rate Effect Language change Dynamical systems

Mathematical models Nonlinear regression
1 Introduction
1.1 The Constant Rate Effect
In a seminal paper in historical syntax, Kroch (1989) proposed the Constant Rate
Hypothesis:
[W]hen one grammatical option replaces another with which it is in competi-
tion across a set of linguistic contexts, the rate of replacement, properly mea-
sured, is the same in all of them. (Kroch 1989:200)
Initially (and still logically) a hypothesis, the notion of a constant rate has accu-
mulated enough support over the last three decades for this to be referred to as the
Constant Rate Effect, or CRE (see e.g. Pintzuk 2003:511).
The logic behind CREs is as follows: if a variant replaces another variant in two or
more different contexts and the rate of change is the same in each of these contexts,
then we should assume that only a single change has occurred. CREs have therefore
been deployed to argue that two or more apparently unrelated surface changes are
in fact manifestations of a single underlying change (Fig. 1). Unifying changes in
this way provides strong support for approaches to language in which syntactic vari-
ation consists not primarily in lexical or contextual idiosyncrasies but in the values
of a finite number of universal parameters, as in the classical Principles & Param-
eters approach (Chomsky 1981; Chomsky and Lasnik 1993). There is no necessary
link between Principles & Parameters and CREs, as Pintzuk (2003:511) emphasizes;
the variationist approach within which the Constant Rate Hypothesis is couched is
theory-neutral. However, CREs are a useful tool in the armoury of diachronic syntac-
ticians who wish to argue for the controlling effect of abstract grammatical analyses
on patterns in usage data (Kroch 1989:239).
CREs offer a fresh perspective on the causation of changes. Kroch (1989:238)
criticizes the approach to causation in which the finding that a given context is most
favourable to the use of an innovation is taken to show that the innovation is an ac-
commodation to the linguistic functionality of that context. Where there is a disparity
between contexts that share the same rate of change, this reflects functional effects,
discourse and processing, on the choices speakers make among the alternatives avail-
able to them in the language as they know it; and the strength of these effects remains
constant as the change proceeds (Kroch 1989:238). In other words: surface changes
are to be thought of as reflexes of underlying grammatical changes; the discrepancies
in frequencies seen at the surface level are due to extra-grammatical factors, or con-
textual effects, which are independent of the underlying change itself and constant
across time.
Fig. 1 A classical example of a

CRE: the emergence of
do-support in Early Modern
English in negative declarative
and four types of interrogative
sentences; data from Kroch
(1989:224, Table 3). Periphrastic
do is adopted at slightly different
times in the different contexts,
but the rate of adoption appears
to be similar across contexts.
Kroch (1989) identified loss of
V-to-T movement as the
underlying parametric change
responsible for this constant rate
The usual procedure for detecting a CRE in some diachronic data is to fit a logistic
curve (1) to each of the contexts separately and then to compare the growth rates of
these curves against each other.
es(tk) 1
pt = = (1)
1 + es(tk) 1 + es(tk)
Here, pt is the frequency of either the innovatory or the receding variant (or parame-
ter value) in a given context at time t, and s is the (time-independent) rate of change
in that context. The k parameter serves to translate the curve along the time axis,
indicating the point of greatest growth, or the tipping point, of pt (Fig. 2). With this
operationalization, we have the following procedure for establishing a CRE: a logistic
curve of the form (1) is first fit to each of the contexts of interest separately and in-
dependently. Then, if variation among the s or slope parameters for these curves is
found to fall within a reasonable confidence interval, the change is said to proceed at
the same (constant) rate in all contexts. Variation among the k or intercept param-
eters, on the other hand, is allowed and is where the contextual effects, independent
of the underlying grammatical change, are thought to manifest themselves. This is the
procedure used in a number of studies that have sought to establish CREs in various
processes of change across a number of languages (e.g. Kroch 1989; Santorini 1993;
Pintzuk 1995; Kallel 2005; Pintzuk and Taylor 2006; Kallel 2007; Fruehwald et al.
2009; Postma 2010; Durham et al. 2012; Wallage 2013; Gardiner 2015). Henceforth,
we shall refer to it as the standard operationalization.1
1 There exist a number of methods to implement this procedure, such as nonlinear regression on bare fre-
quencies, linear regression on logit-transformed data, and multivariate regression. Which method is chosen
is a technical matter; conceptually, all of these implementations share the basic theoretical assumption that
the reflexes of one underlying change are described by a family of logistics agreeing in their s parameters
but possibly differing in their k parameters.
Fig. 2 Three logistic curves (1)

with identical s (slope)
parameters but differing k
(intercept) parameters
1.2 The non-linking problem
Initially, the logistic function (1) was adopted because of its practicability and its
success in other disciplines such as population genetics, not because it followed from
any established first principles:
[G]iven the mathematical simplicity and widespread use of the logistic, its use
in the study of language change seems justified, even though, unlike in the
population genetic case, no mechanism of change has yet been proposed from
which the logistic form can be deduced. (Kroch 1989:204)
The logistic has since been derived from mathematical models of language acquisi-
tion independently by Niyogi and Berwick (1997) and Yang (2000); Ingason et al.
(2013) provide a particularly clear illustration of how syntactic acquisition in succes-
sive generations can give rise to logistic change at population level. What has never
been explicated in detail, however, is why different contextual reflexes of a single
underlying change should be governed by logistics agreeing in their s parameters but
freely varying in their k parameters: even though this operationalization has proved
useful in gathering empirical support for the Constant Rate Hypothesis, it is not a
model of the CRE itself.2 In short, while the standard operationalization may ad-
equately describe historical data, it fails to explain it, suggesting no mechanism for
how contextual reflexes spring from underlying changes. The fact that under the stan-
dard formulation the independent contextual reflexes are not linked to each other, or
to anything else, in this stronger sense we call the non-linking problem, and there are
a number of reasons to believe that the problem is serious enough to warrant that the
standard operationalization of CREs should be rejected.
Firstly, note that fitting a number of independent logistics to a number of contexts
in some data leaves variation among the k parameters entirely unexplained, even if
we assume that the logistics agree in their s parameters as required by the standard
operationalization. In principle, it is possible for this variation in k to be arbitrar-
ily large, and it is therefore in principle possible to connect two clearly unrelated
2 Postma (2017) notes that the logistic function is the general solution of Verhulsts differential equation
and that a family of contextual curves results when this differential equation is solved for specific initial
conditions. While a true statement, this does not constitute a model of the CRE in the strict sense and does
not solve the problems we outline below.
changespossibly separated by millennia on the time axisas long as they happen

to share the same growth rate. In principle, then, it is possible to be led to the absurd
conclusion that a single change runs to completion in one context before it even takes
off in another (Fig. 2).
Secondly, there are reasons to think that not all instances of logistics agreeing
in their s parameters are in fact CREs in the sense that a single underlying gram-
matical change is being modulated in the usage of a speaker or group of speakers
by constant contextual (functional, discourse-related, etc.) effects. The relevant ev-
idence comes from studies in which the contextual effects are not within-speaker
but between-speaker effects or even outright contingencies. Wallenberg (2016) shows
that relative clause extraposition is a gradually declining option across the histories
of both English and Icelandic, and that the s parameters of the two curves do not dif-
fer significantly. Similarly, Willis (2017), in his study of the spread of the innovative
second-person pronoun chdi in the recent history of Welsh, finds that in different re-
gions of Wales the change is more or less advanced (i.e. different intercepts) but that
the slopes of the changes are not significantly different. Corley (2014) tests for a CRE
in the usage of negative concord between female and male speakers of Early Modern
English, using data from Nevalainen and Raumolin-Brunberg (2003), and again finds
no significant difference in slopes. What these case studies show is that the s param-
eters of two different changes may be similar for reasons other than being reflexes of
a single abstract grammatical pattern, and thus that identity of slope parameters is not
a sufficient condition for the assumption of a single underlying change. Wallenberg
(2016:e244), for instance, notes explicitly that these are different populations, and
suggests that the similarity of slopes may indicate that the same forces are under-
lying the change in both the English and the Icelandic populationsbut however
these forces are to be understood, we cannot be dealing with a CRE in the traditional
sense, as all these authors recognize.3
These two problems are in most cases only technical in the sense that a researcher
will usually have independent reasons for ruling out such fantastical hypotheses: in
particular, the inference that two apparently separate changes are reflexes of the same
underlying phenomenon is usually motivated by a particular structural analysis which
is arrived at on independent grounds. However, the theoretical importance of these
problems is great: they demonstrate that the standard independent logistics formula-
tion of the CRE can serve at most as a proxy to CREs, not as a model of them. If
the CRE is a phenomenonand the empirical support gathered for it over the last
three decades suggests it isthis means we have so far failed to model one of the
more well-established facts about language diachrony. Consequently, we have only a
very approximate understanding of the dynamics of language change in the presence
of contextual factors, and a number of questions remain wide open: if underlying
3 Paolillo (2011) raises a problem that may be related. The standard way of testing for the statistical signif-
icance of a putative CRE is to perform a chi-square test of independence on the s values of the regressions
for the different contexts (Kroch 1989; Santorini 1993; Pintzuk 1995). If the result is not statistically signif-
icant, then it is concluded that there is support for a CRE. However, it is not sound to treat a non-significant
value as evidence for the null hypothesis, since it was assumed to begin with. We acknowledge this prob-
lem and have no solution to it in the present paper, except insofar as our method of modelling CREs does
not rely on null-hypothesis significance testing at all.
changes are to be thought of as competition between two or more parametric options

or grammars, and if CREs are thought to appear because of some sort of performance
effects operating over that process of competition, how, exactly, do the two processes
interact? What role does the magnitude of the performance effects play in the over-
all change? Could contexts that favour the innovatory variant be so favouring as to
accelerate the change, and if so, can this accelerating effect be quantified and mea-
sured? Similarly, could disfavouring contexts slow the change down? Could they even
block change in certain cases? These questions can only be answered with the help of
mathematical models of change that accommodate mechanisms for both grammat-
ical competition and contextual effects, and also define, without equivocation, the
possible interactions between these two mechanisms over time.
The non-linking problem has, of course, not gone unnoticed in the literature. As
Roberts (2007) puts it:
One might wonder why [the CRE] should hold. It is unlikely to be a fact
about the grammars themselves. Instead, it is plausible that it may be a fact ei-
ther about speech communities or about the ways in which individuals choose
among grammars available to them. As such, it may be attributable to soci-
olinguistic factors or to the dynamics of populations, or both factors acting in
tandem. (Roberts 2007:313)
Our aim in this paper is to propose a solution to the non-linking problem, and our
concrete proposal is that the CRE occurs because of context-specific production bi-
ases which serve either to promote or to hinder an underlying change in progress.
That is to say, we will argue that the CRE is indeed a fact about the ways in which
individuals choose among grammars available to them, and propose a rigorous math-
ematical model of this kind of speaker behaviour. The result is a first step towards
a mechanistic model of the CRE that not only describes the diachronic phenomenon
but explains it by deriving it from independently plausible first principles of language
acquisition and use.
1.3 Plan
The paper is structured as follows. In Sect. 2, we augment Yangs (2000) mathemat-

ical model of grammar competition with production biases to account for variability
across contexts. This results in a dynamical system in which the evolution of the
underlying changea parameter switch for usand the evolution of the usage fre-
quencies in a number of linguistic contexts feed into each other iteratively. We then
derive analytical expressions for the time evolution of this system and show in Sect. 3
that in most cases it can be approximated by a constrained set of equations based on
one logistic. In Sect. 4, this approximation is used to derive a theorem concerning
the possible temporal separation between two reflexes of one underlying change: we
show analytically that under our proposed model the time separation between con-
texts always has a finite upper bound which is inversely proportional to the rate of
the underlying change. This solves the problem of unconstrained variation in the k
parameters: under our model, it is no longer possible for the curves of different con-
texts to be radically distant from each other in time. In the remainder of Sect. 4, we
proceed to test the model empirically from two complementary angles: (1) by in-
vestigating whether time separations observed in a number of previously established
CREs agree with the predictions of our time separation theorem, and (2) by testing
whether our model is able to distinguish actual CREs from pseudo-CREs, that is, sur-
face changes that proceed at similar rates accidentally but that are clearly not reflexes
of one and the same underlying change.
A side product of this investigation is an extension of some of the analytical re-
sults in Yang (2000). In Sect. 3, we uncover all possible outcomes of the dynamical
interplay between grammatical competition and production biases. A full bifurca-
tion analysis of the two-grammar case shows that production biases can both induce
change in settings where Yangs (2000) model outlaws change, and block change
in settings where Yangs (2000) model predicts change. On the assumption that a
model which incorporates the possibility of production biases is more realistic than
one that does not, then, the assumption that language change is driven (solely) by
distributional differences in the proportion of sentences parsed by different compet-
ing grammars is shown to be too simple. A theorem resulting from our analysis of
the extended model shows that, when production biases are in operation, such differ-
ences are neither necessary nor sufficient for change, though they continue to play
an important role in any given change process in a way that can be quantified ex-
actly. In Sect. 5, we offer a brief account of the nature of production biases; Sect. 6
concludes.
2 Grammar competition and production biases
2.1 Learning competing grammars
Empirical work on language variation and change has demonstrated the limitations of
the traditional view of parameter setting as a once-and-for-all process which leaves
the learner with a unique grammar at the point of maturation: speakers have, at least
during periods of change, access to more than one grammar (Kroch 1989, 1994,
2000; Santorini 1992; Pintzuk 2003). As pointed out by Santorini (1992:619), this
intra-individual co-existence of multiple grammatical systems is an ability for which
the phenomena of multilingualism, diglossia and intrasentential code-switching pro-
vide independent and incontrovertible evidence; see also the discussion in Roberts
(2007:319331). This notion has been formalized by Yang (2000, 2002) in his math-
ematical model of competition-driven change, on which our model of the CRE is
based. We therefore begin by reviewing the operating principles behind this model,
focussing on the presentation in Yang (2000).
This model construes language change as a learning process in a homogeneous,
well-mixing population with non-overlapping generations. At each iteration, we can
therefore think of the population as a single individual who sets parameters based
on the linguistic output of the previous generation, abstracting away entirely from
the social and geographical structure of that population. In the competing grammars
framework, each biologically possible grammar of human language Gi is associated
with a weight which gives the probability of an individual using that grammar. The
Fig. 3 Venn diagram of the

sentences generated by G1 (the
set L1 LX ) and by G2 (the set
L2 LX ). Here, L2 represents a
greater proportion of sentences
than L1 , and so the advantage of
G2 is greater than that of G1
framework allows any number of those grammars to compete; however, during well-
studied and relatively well-understood periods of language change, it usually seems
to be the case that two grammars are in competition. Since, additionally, this renders
the mathematics of the model particularly tractable, we focus on the two-grammar
case in all that follows.
Let G1 and G2 be these two grammars, and denote their weights with pt and
qt , respectively, indexed for generational time t.4 The basic insight behind Yangs
(2000) model is that each grammar has its time-independent (parsing) advantage,
which is simply the proportion of sentences the other grammar cannot parse (out of
all sentences generated, in abstracto, by either grammar). There are then fundamen-
tally three kinds of sentence: sentences of type L1 , which G1 but not G2 parses;
sentences of type L2 , which G2 but not G1 parses; and sentences of type LX , which
both grammars parse (Fig. 3). The language learner receives primary linguistic data
(PLD)the linguistic output of the generation at time step tand his task is to arrive
at weights pt+1 and qt+1 for the two competing grammars in his own generation. Let-
ting denote the advantage of G1 and that of G2 , then (assuming he samples his
environment uniformly) the learner is confronted with a number of sentences drawn
from the following distribution:
L1 L 2 LX
G1 pt 0 (1 )pt (2)
G2 0 qt (1 )qt
Based on this input, the learner is assumed to set parameters in accordance with linear
rewardpenalty learning, an off-the-shelf learning algorithm from mathematical psy-
chology (Bush and Mosteller 1951, 1958; Narendra and Thathachar 1989). Of crucial
importance here are the two quantities ct = qt and dt = pt = (1 qt ), known as
the penalty probabilities of the two grammars: ct is the probability of the learner en-
countering a sentence which G1 cannot parse and dt the probability of a sentence
which G2 cannot parse. It can be shown (Narendra and Thathachar 1989:162163)
that, if the learners training sample is large enough, eventually he ends up with a
weight qt+1 which is well approximated by
ct
qt+1 = . (3)
ct + dt
4 In what follows, we are mostly concerned with equations for q (the weight of G ) and consider the
t 2
conditions under which G2 will replace G1 . The corresponding value of pt can always be recovered from
the fact that in a two-grammar setting, pt + qt = 1.
Assuming ct = 0 and dt = 0 without loss of generality, this equation may be reduced

to the more useful form

dt 1 1 qt 1
qt+1 = 1 + = 1+ , (4)
ct qt
where we write = / for the ratio of the parsing advantages. Equation (4), then,
relates the grammar weights of the (t + 1)th generation to those of the tth genera-
tion, thereby defining the inter-generational or diachronic dynamics of a sequence of
(reliable) linear rewardpenalty learners.5
It follows that < , or < 1, is a sufficient condition for grammar G2 to over-
take grammar G1 :
Theorem 1 (The Fundamental Theorem of Language Change; Yang 2000:239) As-

sume reliable learners, so that (4) holds. Then qt 1 as t if < , and qt 0
as t if > .
In other words, the grammar with the greater parsing advantage will necessarily
win out in the long term. The difference equation (4) may in fact be solved for t to
yield

1 q0 1
qt = 1 + t , (5)
q0
where q0 is the weight of G2 at the point of actuation of the change (Appendix A.1,
Corollary 4): hence as soon as the value of q0 is known, the entire change trajectory
can be predicted. Furthermore, it is not difficult to show that this solution is equivalent
to
1
qt = 1 + es(tk) (6)
with s = log() and k = log()1 log(q01 1). Thus, assuming that learners
receive representative samples of their linguistic environments, a diachronic sequence
of such learners exhibits logistic evolution. In particular, the slope of the trajectory
is directly dependent on the advantage ratio such that the smaller (the more
advantageous G2 is), the faster the change from G1 to G2 , and vice versa.
5 For two competing grammars, the linear rewardpenalty learning algorithm assumes the following form
for learning rate 0 < < 1 (see Yang 2000 for more details). Assuming that the learners initial guess
for the weight of grammar G2 is Q0 = 0.5 (no a priori bias), then, for input sentence s = 1, . . . , N ,
the learner picks G2 with probability Qs1 (and G1 with probability 1 Qs1 ), attempts to parse the
sentence, and sets Qs = Qs1 + (1 Qs1 ) if G2 parses s, and Qs = (1 )Qs1 if G2 does not
parse s. Thus, Q is increased with successful parsing events and decreased with unsuccessful parsing
events. Finally, we set qt+1 = QN . Under the simplifying assumption that N , the learner does
not have to contend with a finite dataset or a critical period. It is of course false, but like much work in
learnability and modelling we adopt it here in order to derive analytical approximations such as (4) which
would otherwise be difficult, if not impossible, to derive. This approximation holds in the following sense:
qt+1 converges to a normal distribution with mean qt+1 = ct /(ct + dt ) and a variance which tends to 0
as 0 and N (Narendra and Thathachar 1989:162163). Assuming a finite learning sample
would introduce a stochastic component (noise) to the system, and exploring the consequences of this falls
beyond the scope of the present paper.
Fig. 4 Venn diagram of the

sentences generated by G1 and
G2 , partitioned by four contexts
This is the gist of the competing grammars model of language change; for more
details, see Kroch (1994), Yang (2002), Pintzuk (2003) and especially Heycock and
Wallenberg (2013), who apply the model to a concrete case study involving the loss
of verb movement in Scandinavian.
2.2 Competing grammars and contextual biases
To account for contextual effects and the CRE, we now assume the existence of K
linguistic contexts 1, . . . , K, with each sentence generated by G1 or G2 belonging to
one and only one of these contexts.6 Each context i is equipped with a context weight
i that gives the proportion of sentences that fall in that context (out of all sentences
generated by either G1 or G2 ); clearly, since we are dealing with proportions, we
require 1 + + K = 1 (Fig. 4). In addition to these weights, each context is
associated with a fixed (constant over time) production bias bi which can be positive,
negative or zero. In the first case, the context favours G2 ; in the second, it favours
G1 ; and in the third case, the context is neutral with respect to the two grammars.7
Now consider a language learner acquiring his grammar weights based on the out-
put of generation t of speakers. With Yang (2000), we assume that the tth generation
has internalized grammar weights pt and qt . Where our treatment diverges is the
effect these weights have on the language acquisition process of the (t + 1)th gener-
ation. Rather than assuming that pt and qt feed directly into the acquisition process
in the following generation, we assume that speakers of the tth generation may pro-
mote or demote the two weights pt and qt in different linguistic contexts in different
ways, subject to the context-specific production biases bi . It is then on the basis of
this usage, modulated by the contextual biases bi and the context weights i , that the
next generation of learners must infer their grammar weights.
(i)
Letting qt denote the probability with which a speaker of the tth generation uses
(i)
grammar G2 in context i, and similarly for pt and G1 , a general form of this biasing
is

pt(i) = pt + F (bi , pt )
(7)
q (i) = q + G(b , q )
t t i t
6 Formally, this means that the contexts constitute a partition of the set L L L in the usual set-
1 X 2
theoretic sense: the contexts are pairwise disjoint subsets of L1 LX L2 and their union equals the
whole of L1 LX L2 .
7 Associating positive production biases with a favour for G over G (rather than G over G ) is but a
2 1 1 2
convention and does not affect the dynamics of our model: reverting the biases would merely swap the
labels of the two grammars.
Fig. 5 The biasing functions F

and G have to satisfy the
requirement |F |, |G|
min{pt , qt } = min{qt , 1 qt }
(see Appendix A.2 for a proof)
and thus land in the shaded
region of this plot. The
parabolic curve shown here
gives the most parsimonious
such upper bound, the product
pt qt = qt (1 qt )
where F and G are some (yet undetermined) functions which modulate the effect of
the bias bi on production. These functions must satisfy two requirements:

pt(i) + qt(i) = 1 (as G1 and G2 are the only grammars)
(8)
0 p (i) , q (i) 1 (as the two quantities are probabilities)
t t
and the following is a theorem.
Theorem 2 Functions F = F (bi , pt ) and G = G(bi , qt ) satisfy the conditions (7)

and (8) if, and only if, they satisfy

F = G
(9)
|F |, |G| min{pt , qt }
Proof Appendix A.2.
Theorem 2 thus implies that the functions F and G are necessarily the additive
inverse of each other, and that their absolute value is necessarily bounded from above
by the minimum of pt and qt . Technically an infinite number of functions satisfy
this pair of conditions, so we need to ask what these functions actually are (Fig. 5).
The simplest, most parsimonious choice is to consider the product pt qt , which is
guaranteed to be bounded from above by both pt and qt whenever 0 pt , qt 1. In
other words, we suggest setting
F = bi pt qt and G = bi pt qt (10)
with 1 bi 1. The contextual usage probabilities in (7) then assume the definite
forms

pt(i) = pt bi pt qt = pt bi pt (1 pt )
(11)
q (i) = q + b p q = q + b q (1 q )
t t i t t t i t t
where the contextual biases bi range from 1 (maximally G1 -favouring) through 0

(neutral) to 1 (maximally G2 -favouring).
This choice for the functions F and G has a number of intuitively satisfying fea-
tures. For example, (10) implies that if either pt = 1 or qt = 1, then F = G = 0 (since
in the first case qt = 0 and in the second case pt = 0 and consequently pt qt = 0) and
no biasing will apply. Empirically, this means that if a grammar has been acquired
categorically, no contextual biases will be able to skew usage in the direction of the
other grammar. This is intuitively right: if a grammatical option has been acquired
categorically, then by definition the competing option does not exist for the speaker
and no grammar-external biasing ought to be able to apply. This behaviour of our
biasing mechanism in the limits qt 1 and qt 0 is just one manifestation of a
more general feature of the model: that while the biases bi themselves are constant
and do not change over time, the magnitude of the effect of these biases on usage
does depend on the state of the underlying change: the effect is the strongest midway
through the change (from Fig. 5, we see that the effect is the strongest when qt = 0.5)
and tails off to zero in the limits qt 1 (completion) and qt 0 (actuation). As we
will see in Sect. 4, this is what the empirical data also show.8
With (11) in place, it is possible to work out the diachronic, inter-generational
dynamics of our model (assuming, again, that learners receive large input samples).
What generation t outputs in this extended model is not the distribution given in (2),
but a combination of grammar advantages ( and ), grammar weights (pt and qt ),
context weights (i ) and context biases (bi ). The penalty probability for grammar G1
now becomes

K
(i)

K
K
ct = i qt = i (qt + bi pt qt ) = qt + i bi pt qt = (qt + Bpt qt ),
i=1 i=1 i=1
(12)
where
K the index i runs through the contexts i = 1, . . . , K and where we write B =
b for convenience. 9 The quantity B, which may be regarded as the net bias
i=1 i i
operating on the language acquisition process weighted by the context proportions i ,
turns out to be a decisive quantity in our model: from (12), we immediately see that
if B = 0, the penalty ct reduces to the Yangian penalty ct = qt . Our model, then,
generalizes Yangs (2000) model and reduces to the latter in the special case that the
contextual biases are in balanceif either all the biases are zero or if G2 -favouring
(positive) biases cancel out the effect of G1 -favouring (negative) biases.
Entirely symmetrically, the penalty for grammar G2 reads
K
(i)
dt = i pt = (pt Bpt qt ). (13)
i=1
Assuming reliable learners, we may now use these penalty probabilities to write down
the inter-generational difference equation that relates qt+1 to qt for the extended
8 We further elaborate on the empirical grounding of our biasing mechanism in Sect. 5.

9 Note that 1 B 1, since 0 1 and 1 b 1.
i i
Fig. 6 Inter-generational change in Yangs (2000) model (top) and our model (bottom). After parameter
setting, the learner ends up with a weight qt for grammar G2 (and pt for grammar G1 ). In our model,
this weight is then attenuated in production by the context-specific production biases bi so that the actual
(i)
probability of using G2 in the ith context is qt = qi +bi pt qt (see text for details). This biased probability,
together with the advantage ratio = / and the context weight i , then determines the PLD for the
following generation
model: equation (4) becomes

dt 1 (pt Bpt qt ) 1
qt+1 = 1 + = 1+ . (14)
ct (qt + Bpt qt )
Recalling that pt = 1 qt , this may be written as
1
1 qt
qt+1 = 1 + t , (15)
qt
where = / as before and
1 Bqt
t = . (16)
1 + B(1 qt )
2.3 The Constant Rate Effect
To summarize, we propose to augment Yangs (2000) model of grammar competition

with a set of production biases bi which modulate the grammar weights pt and qt in
actual linguistic production. This modulation is implemented by a mechanism which,
we have shown, has to operate within certain analytical bounds. Within those bounds,
we have suggested that the most parsimonious mechanism be adopted, correspond-
ing to our particular choice of the bias-modulating functions F and G, as explained
above. The diachronic behaviour of this extended model is characterized by equa-
tions (11) and (15): the difference equation (15) gives the evolution of the underlying
grammar weight qt , whilst equation (11) supplies the context-specific value of this
probability, modulated by the contextual production biases. The flowchart in Fig. 6 il-
lustrates the inter-generational dynamics that result from this mechanism, comparing
our extension of Yangs (2000) model to the original.
Before moving on to an empirical evaluation of our proposed model, we first ask
whether it produces, in broad qualitative terms, the right kind of behaviour. To this
Fig. 7 The behaviour of our model with two different sets of production biases, for advantage ratio
= 0.5 and initial value q0 = 0.01 (1% usage of G2 at the point of actuation): the evolution of both
(1) (2)
the underlying probability qt () as well as that of the contextual usage probabilities qt (#), qt ()
(3)
and qt () is shown up to qt = 1 q0 = 0.99. In each case, the context weights are set at 1 = 0.2,
2 = 0.4 and 3 = 0.4. (a) Here the biases are b1 = 1, b2 = 1 and b3 = 0.5. With these choices,
B = 1 b1 + 2 b2 + 3 b3 = 0, and consequently the positively-biased contexts (# and ) cancel out the
effect of the negatively-biased context (), resulting in logistic evolution of the underlying probability qt
(). (b) Here the biases are b1 = 1, b2 = 1 and b3 = 0.5. Now B = 1 b1 + 2 b2 + 3 b3 = 0.4 < 0.
The two negatively-biased contexts ( and ) outweigh the one positively-biased context (#), and as a
consequence, the change from G1 to G2 takes much longer than in (a). The trajectory of qt is also not
strictly logistic in this case, as is evident from the fact that it is not symmetric about the midpoint qt = 0.5:
passage from q0 = 0.01 to qt = 0.5 takes longer than passage from qt = 0.5 to the final value qt = 0.99
end, Fig. 7 shows the behaviour of our model in two different situations involving
three arbitrarily chosen contexts: in a situation in which the contextual biases are
in balance and cancel each other out (B = 0; Fig. 7a), and in a situation in which
the net effect of biases in favour of the conventional variant G1 conspire against
the propagation of the innovative variant G2 (B < 0; Fig. 7b). Impressionistically,
our model produces a CRE in both cases: the probability of use of G2 increases
roughly at the same rate in each context, with a characteristic temporal shift between
the propagation curves of the individual contexts. This suggests that our model is
able to replicate the central intuition of Kroch (1989) that different reflexes of one
underlying change ought to proceed at similar rates, and that the output of our model
can, in principle, approximate the empirical situations that have been suggested as
CREs in the literature.
In these two cases, the evolution of the underlying probability qt is different, how-
ever, because of the different biasing that applies in each case. In Fig. 7a the contexts
are in balance (B = 0), which by the preceding analysis implies that the evolution
of qt itself is logistic. In Fig. 7b, on the other hand, G1 -favouring biases outweigh
G2 -favouring biases (B < 0), hindering the propagation of G2 . This is reflected in
the fact that the evolution of the underlying qt is slowed down. Even though G2 still
overtakes G1 in the limit, the trajectory of qt is no longer strictly logistic (careful
examination shows that it is not symmetric about the midpoint qt = 0.5, but rather
exhibits slower change for qt < 0.5 and faster change for qt > 0.5). This motivates
us to consider extreme model parameter regimes, particularly the subspace where B

is negative, in more detail.
In equation (15), the factor t depends on qt whenever B = 0. This complicates
the analysis of the extended model significantly: while the Yangian equation (4) can
be solved for t to yield the logistic function, we are not aware of a closed-form so-
lution to the more complex nonlinear difference equation (15) except in the singu-
lar case B = 0, where the equation reduces to (4). This has the undesirable practi-
cal consequence that there is no trivial way of fitting our model to datalacking a
closed-form curve for the underlying probability qt from which to derive curves for
(i)
the contextual reflexes qt , there simply are no closed-form contextual curves to fit.
The best one can do is to iterate the model for various choices of model parameter
values and initial conditions and compare the resulting trajectories against empirical
data, an approach which soon becomes computationally prohibitive as the number of
logically possible model parameter combinations grows as a superlinear function of
the number of model parameters. To tackle this problem, we will in the next section
conduct a full analysis of the behaviour of our model in the limit t and show
that, under most empirically meaningful combinations of model parameter values, the
underlying trajectory qt is well approximated by a logistic curve. Thus, even though
we cannot write down the solution of qt for arbitrary times t, and even though we
know that for some parameter values (such as when B < 0) the evolution of qt is
not logistic, we can use logistic functions to approximate the true value of qt . This
will form the basis of our curve-fitting procedure in Sect. 4. A reader who is will-
ing to skip the technicalities of the logistic approximation may advance straight to
Sect. 4.
3 Dynamics of the extended model
3.1 Advantage versus bias
As we have noted above, the Fundamental Theorem of Yangs (2000) model is that
a more advantageous grammar will necessarily overtake a less advantageous one: if
< 1 ( < ) and learners are reliable, then qt 1 as t , and thus grammar
G2 overtakes G1 (Theorem 1). A nontrivial consequence of extending the model with
production biases is that this theorem no longer holds: a difference in the proportion
of input parsed by the two competing grammars is neither sufficient nor necessary
for language change. While this is a minor observation from the point of view of the
CRE, which is the main focus of the present paper, the failure of the Fundamental
Theorem under suitable combinations of grammar advantages and production biases
is an interesting finding from the vantage point of the theory of language change
in general, and we will therefore pursue it briefly in this section. The bifurcation
scenario here outlined will also play a role in the logistic approximation that we
develop in the following subsection for model evaluation purposes.
The production biases bi can be positive, negative or zero. In the first case, the
context in question favours G2 over G1 ; in the second case, G1 is favoured; and

in the third case, the context is neutral. The scalar product B = K i=1 i bi of context
weights and production biases turns out to play a critical role in determining how, and
if, change from G1 to G2 happens. If there are negatively biased (G2 -disfavouring)
contexts, and if their share of all sentences in the language learners PLD is large
enough, change from G1 to G2 can be blocked even if the advantage of G2 is greater
than the advantage of G1 . On the other hand, if there are sufficiently strong positively
biased (G2 -favouring) contexts, G2 may overtake G1 even if the latters advantage
exceeds that of the former. A critical value Bc of the net bias B in fact exists such
that change from G1 to G2 is guaranteed whenever B > Bc but is blocked whenever
B Bc :
Theorem 3 (The Extended Fundamental Theorem of Language Change) Assume re-

(15) holds. Let q0 be the weight of grammar G2 at the point
liable learners, so that
of actuation, let B = K i=1 i bi , and let
1
Bc = . (17)
1 + q0 ( 1)
Then
1. qt 1 as t , if B > Bc ;
2. qt = q0 for all t, if B = Bc ;
3. qt 0 as t , if B < Bc .
In other words, G2 overtakes G1 if, and only if, B > Bc .
Proof Appendix A.3.
In dynamical-systems terminology, the production bias mechanism induces a bi-

furcation in the parameter space of the extended model: small tweaks made to either
the biases (bi ) or to the proportion of input falling in each context (i ) can alter the
trajectory of language change entirely by determining which of the two grammars
will win out (Fig. 8). An immediate consequence of Theorem 3 is that if no context
is negatively biased, G2 will overtake G1 whenever < 1:
Corollary 1 If < 1 and bi 0 for all contexts i, qt 1 as t .
Proof Since Bc < 0 for any choice of q0 , if < 1.
Even though production biases, then, can induce or block change in parameter
regimes where such behaviour is impossible in Yangs (2000) original model, there
are limits to how much of an effect the biases can have over grammar advantages.
Briefly put, if G2 is much more advantageous than G1 (0 < 1), then no amount
of negative bias can block change, and, on the other hand, if G1 is much more ad-
vantageous than G2 ( 1), no amount of positive bias can make G2 overtake G1 .
How much is much depends on the boundary condition q0 :
Corollary 2 If < q0 /(1 + q0 ), then qt 1 as t , regardless of the value of

B. If > (2 q0 )/(1 q0 ), then qt 0 as t , regardless of the value of B.
Fig. 8 The outcome of language change for different combinations of advantage ratio and net bias B,
for two different initial weights for G2 : q0 = 0.001 (left) and q0 = 0.9 (right). The thick curve corresponds
to the subset of this parameter space where B = Bc , the critical bifurcation value. If B > Bc , grammar
G2 overtakes; if B < Bc , grammar G1 prevails; and if B = Bc , the system falls in an equilibrium where
qt = q0 for all t (Theorem 3). Yangs (2000) model corresponds to the dashed line running at B = 0 (no
contextual effects, or contexts wholly in balance) and thus predicts that G2 wins for any 0 < < 1 and
loses for any > 1. Since B has both a lower and an upper limit (1 B 1), the advantage ratio has
critical values, dependent on the boundary condition q0 , such that if lands beyond one of these values,
no amount of out-of-balance bias can overthrow the advantage-induced dynamical outcome (Corollary 2).
In the figure on the left, for instance, any ratio greater than about 2 guarantees G2 to fail. In the figure
on the right, any ratio smaller than about 0.5 guarantees G2 to succeed, no matter what the combination
and magnitude of contextual biases
Proof Clearly 1 B 1 since 1 bi 1 and 0 i 1. If < q0 /(1 + q0 ),

then Bc < 1. If > (2 q0 )/(1 q0 ), then Bc > 1.
Figure 8 illustrates.
3.2 Logistic approximation
The above results show that the outcome of grammar competition in the presence
of context-specific production biases is determined by a complicated interaction be-
tween these biases (bi ), the proportion of input that falls in each context (i ) and the
ratio of the parsing advantages of the two competing grammars (). This is because
in our model the language acquisition mechanism and the production bias mechanism
constitute a feedback loop across iterated applications over multiple generations of
language learners, the production biases modulating the acquisition of the grammati-
cal weights pt and qt . As we noted in Sect. 2.3 (Fig. 7), this feature of the model also
implies that when the production biases are particularly strong, they will cause the
evolution of the underlying grammar weights to be non-logistic. The feedback loop
gives rise to a nonlinear difference equation for which we have no solution in the
general case, and the following problem immediately arises: how can the predictions
of our model be tested against empirical data if there is no closed-form curve which
to fit?
Even though the evolution of qt is, strictly speaking, logistic only when B = 0,
eyeballing trajectories such as the one in Fig. 7b suggests that these trajectories are
Fig. 9 (a) Error of fit (sum of squared residuals; nonlinear least squares regression) of a logistic function to
trajectories of the underlying probability qt generated by our model for various combinations of advantage
ratio and net bias B, for initial condition q0 = 0.01 (1% usage of G2 at the point of actuation). The
dashed vertical lines give the critical value Bc of the bifurcation parameter for each selection of . We
find that trajectories of qt are closely approximated by logistics except in the immediate vicinity of the
bifurcation threshold Bc at which change from G1 to G2 is blocked. (bc) Best-fitting slope (s) and
intercept (k) values found by these regressions
still S-shaped and perhaps well approximated by logistics. To explore this possibility,
we performed a sweep across the model parameter space, generating trajectories of
qt from the initial condition q0 = 0.01 (1% usage of G2 at the point of actuation) in
the regime B > Bc (i.e. in the parameter regime where G2 is guaranteed to oust G1
by Theorem 3), until qt had reached the value qt = 1 q0 = 0.99. We then proceeded
to fit a logistic curve to each of these trajectories in order to investigate how well the
trajectory may be approximated by a logistic. Figure 9a gives the errors of these fits,
showing that the trajectories are closely approximated by logistics whenever is not
too large and B is not too close to the critical bifurcation threshold Bc .
Figures 9bc supply the best slope (s) and intercept (k) coefficients found by these
regressions. We find that s is a decreasing function of and an increasing function
of B: the more advantageous G2 is, and the more G2 is favoured by the production
biases, the steeper the underlying change, as one would expect. The intercept coeffi-
cient k, in turn, is an increasing function of and a decreasing function of B: the less
advantage G2 has and the more the production biases tend to disfavour G2 , the more
the curve of the underlying change is shifted towards positive time.
4 Evaluation
4.1 The Time Separation Theorem
Under the logistic approximation from Sect. 3, the usage of grammar G2 in the con-
texts i = 1, . . . , K is described by a set of K equations
(1)
qt = qt + b1 qt (1 qt )

.. (18)
.
(K)
qt = qt + bK qt (1 qt )
where qt is a logistic function approximating the true underlying probability qt . What

historical language corpora give us are usage frequencies in various contexts, and we
therefore wish to fit curves of the form (18) to such data. The fact that under the
logistic approximation all such curves are tied to qt , which itself has a closed-form
solution, now facilitates this empirical evaluation: even though the individual context
(i)
curves qt themselves are not logistic (unless bi = 0), they are easily derived from
one that is. In what follows, we will take a look at a number of case studies, fitting
context curves with the help of a nonlinear least squares optimization algorithm.
Estimating the goodness of fit of these regressions is one important goal of this
exercise. However, our main aim is to solve the non-linking problem identified in
Sect. 1.2. Specifically, we wish to demonstrate that our model does not allow ar-
bitrarily large time separations between contexts, operationalized as the difference
between the points in time at which different context curves reach their tipping point,
or the point in time at which the context frequency of the overtaking grammar equals
0.5 when (18) is generalized for real-valued t. The logistic approximation gives us a
straightforward proof of this.
Theorem 4 (The Time Separation Theorem) For any two contextual reflexes of an
underlying change from G1 to G2 approximated by a logistic qt with slope s, the
maximal time separation at tipping points is

2 1 1
(s) = log 1.76 . (19)
|s| 21 |s|
Proof Appendix A.4.

It is to be noted that (s) is inversely proportional to sthe slower the rate of

change, the more time separation is allowed between any two contexts and vice versa.
To fit the system (18) to a set of data points, we first define reasonable ranges of
variation for the s and k parameters of the logistic qt that we wish to probe. We then
loop through the values contained in these ranges, finding the best fitting bias param-
eters bi for each pair (s, k) using a nonlinear least squares optimization algorithm
such as the GaussNewton procedure (Bates and Watts 1988), bearing in mind the
bounds 1 bi 1. Finally, out of all these regressions, we pick the combination of
s, k and bi that provides the best fit to the data in question. The whole procedure is
detailed in pseudocode in Appendix A.5.10
We now proceed to an evaluation of the model by comparing its predictions against
three historical changes for which a CRE has been reported in the literature: the emer-
gence of periphrastic do in the history of English (Kroch 1989), the earliest stages of
the English Jespersen Cycle (Wallage 2013), and, to take a phonological example to
illustrate the generality of the procedure, the loss of final fortition in Early New High
German (Fruehwald et al. 2009). In Sects. 4.24.4 we first briefly summarize the lin-
guistics of each change, reproduce the relevant empirical data, and visualize the fit of
our model to the data when the regression is conducted using the procedure outlined
above. In Sect. 4.5, we take a more quantitative angle and report the numerical errors
of these fits, comparing them to the errors that an application of the standard proce-
dure based on individual logistics (cf. Sect. 1.1) would produce. Finally, in Sect. 4.6,
we take a look at a pseudo-CREa case where the standard independent logistics
operationalization reports a CRE but where this conclusion is patently absurd from
other considerations (cf. Sect. 1.2)in order to investigate whether or not our model,
too, is prone to report false positives in such cases.
4.2 Periphrastic do in English
The first case study we will consider is perhaps the best known instance of a CRE:
Krochs (1989) interpretation of Ellegrds (1953) data on the rise of periphrastic do
in Early Modern English. The variable in question is whether a form of do is used in
certain contexts, as in (20a), or not, as in (20b) (examples from Kroch 1989:216).
(20) a. Where doth the grene knyght holde hym?
Where does the Green Knight hold him?
b. How great and greuous tribulations suffered the Holy Appostyls . . . ?
How great and grievous tribulations did the Holy Apostles suffer?
In modern standard English, a form of do is required in a number of contexts, in-
cluding all interrogatives as well as negative declaratives. What Ellegrds (1953) data
show is that, on the surface, the use of do appears to take off in the different contexts
at different rates: for instance, between around 1500 and 1650, negative questions ex-
hibit a much higher proportion of do than affirmative wh-object questions or negative
declaratives, though the latter contexts eventually catch up (Table 1). Kroch (1989)
conducted a regression on these contexts and showed that logistic curves fitted to
10 R (R Core Team 2012) code implementing the procedure can be obtained from the authors.
Table 1 Proportion of do in five different contexts: negative declaratives, negative questions, affirma-
tive transitive questions, affirmative intransitive questions, affirmative wh-object questions. From Kroch
(1989:224, Table 3)
Period neg. dec. neg. q. aff. tr. q. aff. intr. q. aff. obj. q.
14001425 0.000 0.117 0.000 0.000 0.000

14251475 0.012 0.080 0.107 0.000 0.000
14751500 0.048 0.111 0.135 0.000 0.020
15001525 0.078 0.590 0.242 0.211 0.113
15251535 0.137 0.607 0.692 0.197 0.095
15351550 0.279 0.750 0.615 0.319 0.110
15501575 0.380 0.854 0.737 0.423 0.360
15751600 0.238 0.648 0.792 0.444 0.383
16001625 0.367 0.937 0.773 0.619 0.298
16251650 0.317 0.842 0.909 0.757 0.530
16501700 0.460 0.923 0.947 0.702 0.549
them individually did not differ much in their slope (s) parameters, although they did
differ in terms of the intercept (k) parameter. This particular example, while perhaps
the most celebrated instance of a CRE, is in fact not the most straightforward instance
found in the literature, primarily due to a dip in the later portion of the change in
some contexts, which makes the progression of do non-monotonic; see Warner (2005)
and Ecay (2015) for detailed discussion, concluding that other factors (and possibly
another grammar) are at play. Kroch (1989) also identifies the dip and consequently
focusses on the first seven data points of Ellegrds (1953) data only; we follow his
practice here.
When the algorithm from Appendix A.5 is used to fit a model of the form (18) to
these data, the picture in Fig. 10 emerges. On visual inspection, the fit is a good one.
Crucially, our model is constrained to allow only so much time separation between
any two contexts of one change (Theorem 4), illustrated in Fig. 10 as the horizon-
tal bar extending both ways from the tipping point of the curve of the underlying
grammar probability. We find that fitting the model to Ellegrds (1953) data drives
two of the context curves (negative questions and affirmative object questions) to the
very extremes of the range licensed by the modelin other words, for these contexts,
the production biases need to be maximal in order for the model to fit the data. Cru-
cially, however, the data are described well by the regression curves so obtained, an
observation which we back up quantitatively in Sect. 4.5.
4.3 English Jespersen Cycle
Our second case study, also from the history of English, involves the replacement
of preverbal ne/ni by postverbal not during the Middle English period. This change
Fig. 10 Fit of our model to the data on English periphrastic do (curves: model; points: data from Table 1,
first seven periods). On visual inspection, the fit to each context is a good one. Theorem 4 implies a
maximal time separation, illustrated here as the horizontal bar extending both ways from the tipping point
of the theoretical curve for the underlying grammatical change (no production biases). The best-fitting
parameters found by the regression are s = 0.031, k = 1547.677, with bias sizes bi as follows: 0.885
for negative declaratives, 1.000 for negative questions, 0.656 for affirmative transitive questions, 0.647
for affirmative intransitive questions, and 1.000 for affirmative wh-object questions. With s = 0.031, the
maximal time separation licensed by the model is roughly 56.5 years
involves an intermediate stage in which both ne and not co-occur. The three stages
are illustrated in (21a)(21c) (examples from Wallage 2008:644).
(21) a. We ne moten halden Moses e lichamliche.

we NEG need observe Moses law bodily
We need not observe Moses law in body.
b. ac of hem ne speke ic noht
but of them NEG spoke I not
but I did not speak of them
c. I know nat the cause.
I know not the cause
I do not know the cause.
This replacement of negators, a cross-linguistically common diachronic pathway,

is referred to as Jespersens Cycle; see Wallage (2008) and Ingham (2013) for de-
tailed discussion of the English development. For our purposes, the change that is
important is the replacement of Stage 1 of Jespersens Cyclenegation by ne alone,
as in (21a)with Stage 2, bipartite negation, as exemplified by (21b). Wallage (2013)
shows that Stage 2 is favoured with discourse-old propositions during the middle of
the change, but that a CRE obtains (Table 2). Again, on purely visual inspection,
our model fits the data well, and the variation observed between discourse-old and
discourse-new propositions falls, roughly, within the time bounds prescribed by the
Time Separation Theorem (Fig. 11).
Table 2 Proportion of Stage 1

negation (preverbal ne) in the Period Discourse-old Discourse-new
English Jespersen Cycle in
discourse-old and discourse-new 11501250 38/245 (0.155) 335/393 (0.852)
contexts. From Wallage 12501350 9/338 (0.027) 135/346 (0.390)
(2013:12, Table 1)
13501420 0/244 (0.000) 2/294 (0.007)
Fig. 11 Fit of our model to the data on the first two stages of the English Jespersen Cycle (curves: model;
points: data from Table 2). On visual inspection, the fit to each context is a good one, though the poor
time resolution of the data is a problem. Theorem 4 implies a maximal time separation, illustrated here
as the horizontal bar extending both ways from the tipping point of the theoretical curve for the under-
lying grammatical change (no production biases). The best-fitting parameters found by the regression are
s = 0.016, k = 1203.788, with bias sizes bi as follows: 1.000 for discourse-old propositions and 1.000
for discourse-new propositions. (Note that in a case like this where the slope s is negative, a negative con-
text bias means a preference for the overtaking grammar, whereas a positive bias indicates preference for
the receding one.) With s = 0.016, the maximal time separation licensed by the model is roughly 110
years
4.4 Loss of final fortition in Early New High German
CREs are not found only with syntactic variables. Fruehwald et al. (2009) reanalyse
data from Glaser (1985) on the loss of final fortition in (Bavarian) Early New High
German, which is observable in orthographic variation of the period, e.g. tak vs. tag
day (acc. sg.), rat vs. rad counsel (acc. sg.). They argue that the orthographic vari-
ation clearly represents a phonological change in progress rather than shifting scribal
tradition, and that fortition is the result of a single phonological rule whose loss is
visible to different degrees in different contexts during the period of the change: /d/
exhibits fortition the most and /g/ the least, with /b/ showing an intermediate pattern
(Table 3). Our model describes the data well, with the observed CRE again falling
within the time bounds implied by the model (Fig. 12).
This example illustrates that even though our model is based on the variational
learner in Yang (2000), which is essentially a hypothesis about parameter setting in
syntax, the logistic approximation (18) which underlies the curve-fitting procedure
Table 3 Proportion of fortition

of the three plosives /b/, /d/ Year /b/ /d/ /g/
and /g/ in Early New High
German. From Fruehwald et al. 1276 18/18 (1.00) 29/29 (1.00) 54/73 (0.74)
(2009:4, Table 1) 1373 10/18 (0.56) 24/29 (0.83) 17/76 (0.22)
1483 2/18 (0.11) 2/24 (0.08) 0/78 (0.00)
1523 2/16 (0.12) 3/9 (0.33) 0/73 (0.00)
Fig. 12 Fit of our model to the data on loss of final fortition in Early New High German (curves: model;
points: data from Table 3). On visual inspection, the fit to each context is a good one. Theorem 4 implies a
maximal time separation, illustrated here as the horizontal bar extending both ways from the tipping point
of the theoretical curve for the underlying grammatical change (no production biases). The best-fitting
parameters found by the regression are s = 0.019, k = 1374.747, with bias sizes bi as follows: 0.353 for
/b/, 1.000 for /d/, and 1.000 for /g/. With s = 0.019, the maximal time separation licensed by the
model is roughly 93 years
can legitimately be used to model CREs in any domain as long as the assumption
of an underlying logistic change is justifiable. As Fruehwald et al. (2009:9) point
out, [t]he discovery of the Constant Rate Effect in phonological change is perfectly
expected under normal generative theories of phonology when the mechanism of
change is grammar or rule competitionthe learning algorithm a language learner
uses in this case may (or may not) be different from the one assumed in Yangs (2000)
model, but this notwithstanding, as long as some sort of underlying representation
similar to the Yangian weights pt and qt can be assumed to exist, our production bias
mechanism may be applied.
4.5 Comparison with standard procedure
Sections 4.24.4 have adduced evidence to the effect that the model introduced in this
paper can account for the CRE: the model gives fair fits to historical data even though
it is constrained by the chronological bounds set by the Time Separation Theorem
(Theorem 4). To make this argument more quantitatively, in this section we compare
Fig. 13 Error (sum of squared residuals normalized by number of data points) of the fit of our model
(grey) and the standard procedure (black) for the three changes examined in Sects. 4.24.4: periphrastic do
in Early Modern English, Jespersen Cycle in Middle English, and loss of final fortition in Early New High
German. Generally speaking, the more constrained model defined in this paper does not fare worse than
the less constrained, theoretically unmotivated standard operationalization. We suspect that the exception-
ally good fit of the standard operationalization for the Jespersen Cycle is accounted for by sparsity of data
(6 data points only), which means that any model that is little constrained will be favoured disproportion-
ately
these three fits to the standard operationalization of the CRE: that is, logistic curves
agreeing in the s (slope) parameter but differing in their k (intercept) parameters. For
each of the three case studies considered in Sects. 4.24.4, then, we carry out two fits:
one for our model, and another one for a model consisting of a set of logistic curves
(one per context) where the s parameter is not allowed to vary between contexts but
where such variation is allowed for the k parameter.
We quantify the goodness of fit of these regressions in the usual way, by the nor-
malized sum of squared residuals: in other words, for each context of a given change,
for each time period, we calculate the displacement between the empirically attested
frequency and the value predicted by the model, square this displacement, sum over
all time periods and over all contexts, and divide by the number of data points. Thus,
the better the fit, the closer the sum of squared residuals is to zero. Some deviance
from zero is always to be expected because of the noisy nature of historical language
data. However, this measure is still able to capture the difference between models
which are good fits, but subject to noise, from models which are simply bad fits to
the data in question.
Figure 13 shows a comparison of the goodness of fit of the two models for each of
the three case studies, operationalized using the sum of squared residuals. The crucial
finding is that our model, which places more constraints on the shape and placement
of the regression curves, fares no worse than the standard procedure in two out of
three cases: in other words, a more constrained, theoretically motivated model which
generates empirical predictions (in the form of the Time Separation Theorem) per-
forms no worse than a less constrained, theoretically unmotivated model. The excep-
tion to this are the data on the English Jespersen Cycle, where the less constrained
standard formulation reports a very low error. This appears to be a result of the very
small number of data pointsjust three time periods and two contextsfor this par-
ticular case study. Low data resolution necessarily gives a disproportionate advantage
to the less constrained model over any model that incorporates more assumptions.
Fig. 14 An Anglo-Bavarian pseudo-CRE that attempts to combine Jespersens Cycle in Middle English
with loss of final fortition in Early New High German: data from Tables 2 and 3. The different contexts
exhibit similar rates of change across the two historical changes by accident: the slope of the underlying
change is 0.016 for the English Jespersen Cycle and 0.019 for Early New High German fortition (see
captions to Figs. 11 and 12). This means that the standard same slope, different intercepts procedure
for detecting CREs in historical data is liable to produce a false positive in this case. Our model, which
implies an upper bound on the time separation possible between any two contexts of one underlying change
(Theorem 4), can help to diagnose a change such as this as a pseudo-change
4.6 A pseudo-CRE
Above, we have shown that the proposed model can account for the CRE, in the sense
that it gives good fits to three historical changesfits which are, in two of these
cases, no worse than fits conducted using the standard operationalization of CREs.
It remains to be shown that introducing this more constrained model can actually
solve some of the underspecification issues the standard operationalization suffers
from. As discussed in Sect. 1.2, the method of same slopes, different intercepts is
susceptible to false positives: the fact that a number of logistics agree in their s or
slope parameters is insufficient evidence that a single underlying change is at hand
(see Corley 2014; Wallenberg 2016; Willis 2017 for examples where the contexts
cannot possibly be assumed to be evidence of underlying grammatical unity).
Here, we construct a pseudo-CRE by combining the two changes investigated in
Sections 4.3 and 4.4: the early stages of the English Jespersen Cycle and loss of
fortition in Early New High German. As it turns out, these two changes happen to
propagate at very similar paces by accident (Fig. 14). The standard operationalization
of CREs is, then, expected to report a CRE between changes to English sentential
negation and Bavarian phonology, a conclusion which is clearly absurd.
The fact that the two changes are separated in time, however, means that the more
constrained model introduced in this paper can correctly diagnose the pseudo-CRE.
Figure 15 gives the residual errors of both the present model and the standard one for
this pseudo-CRE, along with the errors for the actual CREs investigated above. The
Fig. 15 Error (sum of squared

residuals normalized by number
of data points) of the fit of our
model (grey) and the standard
procedure (black) for the three
changes examined in
Sects. 4.24.4, plus the
pseudo-CRE of Fig. 14. Our
model correctly distinguishes
the pseudo-CRE from the actual
CREs
pattern that emerges is striking: for each of the actual CREs, our model reports an
error on the order of 0.005, whereas for the Anglo-Bavarian pseudo-CRE the model
generates an error that surpasses 0.03. The standard operationalization, by contrast,
reports similar errors for all changes, failing to distinguish between the pseudo-CRE
and the actual CREs.
The pseudo-CRE thus sheds light on the manner in which our model constrains
variation in the time dimensiona constraint that is not built into the model as a
premise but that follows from first principles in the form of the Time Separation The-
orem. Ultimately, the amount of time separation allowed between any two contextual
reflexes of a single underlying change depends on s, the slope of the underlying logis-
tic qt . To obtain a more intuitive interpretation of the relationship between contextual
time separations and the rate of the underlying change, it is useful to convert the slope
parameter into a quantity that measures the time the change needs to go from actu-
ation to completion, using the time-to-completion calculations proposed by Ingason
et al. (2013:9697). Namely, it can be shown that for slope s,

2 1 q0
Tq0 (s) = log (22)
|s| q0
gives the time it takes for a change to proceed from initial frequency q0 to final fre-
quency 1 q0 , for any (small) q0 with 0 < q0 0.5. Choosing q0 = 0.01, a reason-
able choice corresponding to 1% usage of the new variant at the point of actuation,
this inverse slope then gives a time-to-completion of

2 0.99 2 1
T0.01 (s) = log = log(99) 9.2 (23)
|s| 0.01 |s| |s|
time units (e.g. years) for any slope s. Theorem 4, on the other hand, implies a maxi-
mal time separation of

2 1 1
(s) = log 1.8 (24)
|s| 21 |s|
units between any two contexts of a change proceeding at rate s. Since 1.8/9.2 0.2,
this means that the maximal time separation between any two contexts of a single
underlying change is roughly a fifth of the time it takes for the change to go to com-
pletion in any context individually.
This fact can be used as a heuristic to evaluate purported CREs. For the Anglo-
Bavarian pseudo-CRE, for instance, the time-to-completion for q0 = 0.01 is
2
T0.01 (0.0175) = log(99) 525 (25)
0.0175
years when calculated for slope s = 0.0175, which is the arithmetic mean of the
slopes found by our regressions for the two changes previously (see captions to
Figs. 11 and 12). This implies that the time separation between any two contexts
should be no more than 0.2 525 = 105 years. On visual inspection, however, the
empirical time separation between the discourse-old and /d/ contexts must be at
least 300 years (Fig. 14). This, essentially, is why the model is able to diagnose the
pseudo-CRE.
5 Discussion
In this paper, we have augmented Yangs (2000, 2002) variational learner with pro-
duction biases that vary by contextthe first time, to our knowledge, that this has
been done.11 We have used this model to make precise the important intuition of
Kroch (1989) that variation between contexts in the increasing use of a new variant
may, under certain circumstances, be due to the interaction of a single underlying
change with fixed contextual biases. Two important issues remain to be discussed:
the diachronic implications of the interaction between language acquisition and pro-
duction biases, and the nature of the production biases themselves.
5.1 Which grammar wins?
Yangs (2000) Fundamental Theorem of Language Change, given earlier as Theo-

rem 1, can be paraphrased as follows: when grammars compete, the one with the
greater parsing advantage will win. In Sect. 3 we have shown that this result does not
hold in our model. Instead, bias and advantage together determine which grammar
will triumph: the precise way in which this works is given in our Extended Funda-
mental Theorem (Theorem 3).
In one sense this result is unsurprising: one of the great virtues of Yangs (2000,
2002) model of the learner and of diachronic change is its simplicity, and our model
introduces additional complexity. It is therefore not a particular surprise that our more
complex model does not yield the same intuitive generalization. On the other hand,
it is not a necessary consequence of this complication that the Fundamental Theorem
fails to hold. As Fig. 8 shows, if we add to our model the stipulation that contextual
weights must always be precisely in balance (B = 0), then Yangs Fundamental The-
orem does hold. Such a stipulation would be wholly unmotivated, as far as we are
aware, and represents a more complex model than ours.
11 Clark et al. (2008) present a filtered version of Yangs model to account for typological skews. In this
model universal biases play a role, but crucially the biases apply to grammars as a whole rather than to
particular contexts of use.
Moreover, we have not, of course, shown that the Fundamental Theorem of Lan-
guage Change is falsemerely that it is false under the assumptions we make.
Whether or not it is false, empirically speaking, depends on how well our model,
and Yangs (2000, 2002) model, correspond to reality: specifically, whether a model
that incorporates the effect of contextual biases as ours does is more realistic than
one that does not, and more realistic than one that constrains the net bias. We think
that is right, but it is likely that a full consideration of the facts of real-life acquisition
and change will require a model that is substantially more complex than any that has
been proposed thus far. One feature of Yangs model, with or without our extension,
is that it is completely impossible for a grammar G2 to overtake and defeat another
grammar G1 if the weak generative capacity of G2 is a proper subset of that of G1 ;
yet a preference for exactly this kind of subset is often invoked in the context of
acquisition in the form of the Subset Principle (Berwick 1985; Manzini and Wexler
1987), and, in the domain of phonology at least, retreat to the subset is a frequently-
attested diachronic pathway, since unconditioned mergers are well-attested and have
precisely the effect of reducing the number of forms generated by the grammar (see
e.g. Labov 1994:551). Future work will need to address these questions of realism, as
well as pursuing further analytical consequences of simpler (and thus more tractable)
models like this one.
5.2 The nature of production biases
Up to now we have remained mute with respect to the ontology of production bi-
ases, beyond stating that they are biases that affect production. In principle, such
biases could assume a number of forms. In a word order change such as OV to
VO, for instance, one possibility for interpreting the fixed biases we have proposed
is as a reflection of performance pressures in the sense of Hawkins (1994, 2004).
Hawkinss (2004:38) principle of Minimize Domains states that [t]he human pro-
cessor prefers to minimise the connected sequences of linguistic forms and their con-
ventionally associated syntactic and semantic properties in which relations of com-
bination and/or dependency are processed. This general principle is made concrete
using a metric of Early Immediate Constituents (EIC), which serves to favour syntac-
tic structures with a uniform directionality of branching. Importantly, EIC does not
penalize right-branching (e.g. VO) or left-branching (e.g. OV) grammars directly, in-
stead disfavouring individual structures with a disharmonic directionality of branch-
ing, for instance when a head-final VP is embedded under a head-initial TP. This is
equivalent to a context-specific production bias in our sense. Hawkins conceptualizes
Minimize Domains and EIC as principles of parsing rather than of production, but
notes that there is evidence that EIC might be involved in production too (Hawkins
2004:106), and states that if EIC can be systematically generalized from a model
of comprehension to a model of production [. . . ] then so much the better (Hawkins
1994:427). Hawkinss principles have also been reformulated as principles of deriva-
tional/computational complexity by Mobbs (2008) and Walkden (2009).
In phonological change, meanwhile, the biases can be interpreted as well estab-
lished articulatory phonetic effects. Final fortition, for example, is known to be more
likely to apply to velar consonants than to coronal consonants and more likely to
apply to coronal consonants than labial consonants (Ohala 1983), and this order of
preference seems to be observed diachronically as final fortition emerged in the his-
tory of Frisian (Tiersma 1985).12 These two phonological and syntactic examples are
intended to give a flavour of how contextual production biases can be interpreted, not
to exhaust the range of possibilities. For other changes, other biases might be neces-
sary: for instance, in Wallages (2013) data, discourse-old propositions favour Stage
2 of the Jespersen Cycle during the change, and the biases here could plausibly reflect
Gricean maxims of cooperative communication.
The above-mentioned biasesconstraints on syntactic processing, articulatory
pressures, pragmatic principlesare plausibly innate in the sense that they are shared
by all speakers across all languages and are not subject to change. This is why in our
model definition we maintained that the biases bi be diachronically constant. Note
that the logic here is not just that constant biases imply Constant Rate Effectswe
are actually defending the stronger claim that Constant Rate Effects occur if, and only
if, diachronically constant biases impinge on an underlying change. Time-dependent
biases, or random biases, would result in change processes in which the trajectories
of different linguistic contexts are not parallel to each othersomething we might
refer to as an Inconstant Rate Effect. Having said that, it is not inconceivable that
some non-innate biases are constant on sufficiently long timescales so that they may
give rise to Constant Rate Effects: this will be the case when the underlying change
itself is fast enough to be carried to completion within the timeframe in which the
biases stay fixed. This could be true of certain sociolinguistic biases, and here the bi-
asing mechanism of our model is in agreement with sociolinguistic work (e.g. Labov
2001:ch. 9) which has found some types of sociolinguistic bias modulation to be
strongest midway through the change, just as in our model (cf. Fig. 5).13
6 Conclusion
Building on earlier work that derives logistic evolution as a population-level property

of language change (Niyogi and Berwick 1997; Yang 2000, 2002), we have pro-
vided a mechanism for the Constant Rate Effect proposed by Kroch (1989). We have
done this by enriching Yangs (2000, 2002) model of acquisition and change with a
contextual bias mechanism that links different context curves to a single underlying
change. The work also provides a method of testing for CREs that is demonstrably
superior to the traditional method of same slope, different intercepts, since it is a
consequence of the model that there is a fixed upper bound on the time separation of
contextual curves. We have shown that this enables us to distinguish certain types of
pseudo-CREs from instances in which a single underlying grammatical change is ac-
tually plausible. We have also shown that advantage, in the Yangian sense, is not the
only factor at play in determining the ultimate outcome of a situation of grammatical
competition, if the basic assumptions of our model hold true.
12 We might expect to find the reverse effect in the data from Fruehwald et al. (2009) discussed above, but,
curiously, we do not: /g/ is the most favouring context for the loss of devoicing.
13 We are grateful to Charles Yang for bringing this point to our attention.
The upshot of all this is that it is now possible to test whether divergent usage
frequencies in corpora across different contexts during the course of a change in
fact mask a deeper underlying grammatical homogeneity, and to do so in a more
restricted and principled way than has been possible to date. Crucially, the method
we propose is not only methodologically superior to the standard operationalization
of CRE testing: our model in fact derives the possibility of CREs, and sets tight
bounds on the kind of empirically observed situation that can be said to constitute a
CRE.
Acknowledgements We thank Ricardo Bermdez-Otero, Laurel MacKenzie, David Willis, Charles

Yang and the reviewers of this paper for comments that led to important improvements, as well as David
Denison, Aaron Ecay, Tobias Galla, Anthony Kroch, Gertjan Postma, Alexandra Simonenko and Richard
Zimmermann for valuable discussions on the subject. Early versions of the paper were delivered at the 22nd
International Conference on Historical Linguistics (Naples, 2731 July 2015), the workshop on Dealing
with Bad Data in Linguistic Theory (Amsterdam, 1718 March 2016), the 39th Generative Linguistics
in the Old World conference (Gttingen, 58 April 2016), New Ways of Analyzing Syntactic Variation 2
(Ghent, 1920 May 2016), and the 18th Diachronic Generative Syntax conference (Ghent, 29 June1 July
2016); we thank our audiences, conference organizing committees and reviewers. HK gratefully acknowl-
edges financial support from Emil Aaltonen Foundation and The Ella and Georg Ehrnrooth Foundation.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 Inter-
national License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribu-
tion, and reproduction in any medium, provided you give appropriate credit to the original author(s) and
the source, provide a link to the Creative Commons license, and indicate if changes were made.
Appendix: Derivations
A.1 Evolution of qt
Theorem 5 Let 0 = 1 and

1
1 qt
qt+1 = 1 + t . (A.1)
qt
Then, for all t,

q0
qt = , (A.2)
q0 + Lt t (1 q0 )
t1
where Lt = =0 for t 1 and L0 = 1.
Proof Induction. For t = 0,
q0 q0
q0 = = = q0 . (A.3)
q0 + L0 0 (1 q0 ) q0 + 1 q0
Now assume that the claim holds for t. Then

1 qt 1
qt+1 = 1 + t
qt
qt
=
qt + t (1 qt )
q0
q0 +Lt t (1q0 )
= q0 q0
q0 +Lt t (1q0 ) + t (1 q0 +Lt t (1q0 ) ) (A.4)
q0
=
q0 + t (q0 + Lt t (1 q0 ) q0 )
q0
=
q0 + t Lt t (1 q0 )
q0
=
q0 + Lt+1 t+1 (1 q0 )
as desired.
Corollary 3 Let B = 0. Then the evolution of qt is logistic.
Proof Let B = 0. Then from equation (16) t = 1 for all t, so that

1 qt 1
qt+1 = 1 + . (A.5)
qt
Theorem 5 now implies
q0
qt = . (A.6)
q0 + (1 q
t
0)
Now assume qt = 0 and divide both the numerator and the denominator by q0 :
1
t 1 q0
qt = 1 +
q0
1
1 q0
= 1 + exp log t
q0
1
t 1 q0
= 1 + exp log + log (A.7)
q0
1
1 q0
= 1 + exp t log() + log
q0
1
1 1 q0
= 1 + exp log() t + log .
log() q0
Hence, qt is logistic with s = log() and k = log()

1
log( 1q0
q0 ).
Corollary 4 In Yangs (2000) model, the weight qt evolves as

1
1 q0
qt = 1 + t
, (A.8)
q0
which is a logistic function of t.
Proof In this model, qt obeys (A.1) with t = 1 for all t. From Theorem 5,
q0
qt = . (A.9)
q0 + (1 q
t
0)
Assuming, without loss of generality, that q0 = 0, we derive

1 q0 1
qt = 1 + t . (A.10)
q0
This is a logistic function by Corollary 3.
A.2 The bias-modulating functions F and G
Theorem 6 Let

pt(i) = pt + F (bi , pt )
. (A.11)
q (i) = q + G(b , q )
t t i t
Then

(i)
pt + q t = 1
(i)
(A.12)
0 p (i) , q (i) 1
t t
if and only if

F = G
. (A.13)
|F |, |G| min{pt , qt }
Proof Writing F = F (bi , pt ), G = G(bi , qt ), p = pt and q = qt , the first require-

ment in (A.12) implies that
F = G, (A.14)
since
p +F +q +G = 1 only if p +F +1p +G = 1 only if F +G = 0. (A.15)
The second requirement in (A.12), on the other hand, implies F q and F p,

since
p+F 1 only if F 1p=q (A.16)
and
0p+F only if F p; (A.17)
hence
|F | min{p, q}, (A.18)
and an exactly symmetric argument shows that
|G| min{p, q}. (A.19)
Thus

F = G
. (A.20)
|F |, |G| min{p, q}
On the other hand, if (A.20) holds, then
pt(t) + qt(i) = p + F + q + G = p + F + q F = p + q = 1, (A.21)
and
(i)
|F | min{p, q} only if F p only if 0 p + F = pt . (A.22)
Similarly,
(i)
|F | min{p, q} only if F q only if F 1p only if pt 1,
(A.23)
(i)
and an analogous argument shows that 0 qt 1.
A.3 Proof of the Extended Fundamental Theorem
Let
1
Bc (, qt ) = . (A.24)
1 + qt ( 1)
To prove Theorem 3, we make use of the following auxiliary result:
Theorem 7 For all t:

1. qt+1 > qt if B > Bc (, qt );
2. qt+1 = qt if B = Bc (, qt );
3. qt+1 < qt if B < Bc (, qt ).
Proof From (14) and (16),
1
qt+1 = (A.25)
1 + t 1q
qt
t
with
1 qt B
t = . (A.26)
1 + (1 qt )B
To prove the first claim, we examine the difference qt+1 qt . Now qt+1 qt > 0 if
and only if
1 qt t (1 qt )
>0 iff
1 + t 1q
qt
t
1 qt t (1 qt ) > 0 iff
t (1 qt ) < 1 qt iff (A.27)
1
t < iff

1 qt B 1
< .
1 + (1 qt )B
Now, 1 B 1, so always 1 qt B > 0 and 1 + (1 qt )B qt > 0. Hence
1 qt B 1
< iff
1 + (1 qt )B
(1 qt B) < 1 + (1 qt )B iff
qt B < 1 + (1 qt )B iff
(A.28)
B(1 qt + qt ) > 1 iff

B 1 + qt ( 1) > 1 iff
1
B>
1 + qt ( 1)
as desired. Claims 2 and 3 are handled similarly.
Corollary 5 For all t:

1. qt+1 > qt if B > Bc (, q0 );
2. qt+1 = qt if B = Bc (, q0 );
3. qt+1 < qt if B < Bc (, q0 ).
Proof First, we note that Bc (, qt ) is a decreasing function of qt : if q < Q, then

Bc (, q) Bc (, Q).
Now let B > Bc (, q0 ). Then by Theorem 7 q1 > q0 . Since Bc is decreasing,
Bc (, q1 ) Bc (, q0 ) < B. Hence q2 > q1 by Theorem 7. By full induction, qt+1 >
qt for all t.
Let B = Bc (, q0 ). Then by Theorem 7 q1 = q0 . Then Bc (, q1 ) = Bc (, q0 ) =
B, so that q2 = q1 by Theorem 7. By full induction, qt+1 = qt for all t.
Finally, let B < Bc (, q0 ). Then by Theorem 7 q1 < q0 . Since Bc is decreasing,

Bc (, q1 ) Bc (, q0 ) > B, so that q2 < q1 by Theorem 7. By full induction, qt+1 <
qt for all t.
Theorem 3 now follows from Corollary 5: since qt is bounded between 0 and 1 and
q0 may be chosen arbitrarily close to 1 or 0, qt 1 if B > Bc (, q0 ) and qt 0 if
B < Bc (, q0 ) in the limit t .
A.4 Proof of the Time Separation Theorem
We shall now prove Theorem 4, reproduced here as Theorem 8:
Theorem 8 For any two contextual reflexes of an underlying change from G1 to G2

approximated by a logistic qt with slope s, the maximal time separation at tipping
points is

2 1
(s) = log . (A.29)
|s| 21
Proof Assume two contexts 1 and 2. The maximal separation will of course be at-
tained with maximal biases b1 = 1 and b2 = 1 (or vice versa). Assume this. Then
(2) (2)
qt = qt qt (1 qt ) = qt2 , and so qt = 1/2 when qt = 1/ 2. On the other hand,
qt is given by the logistic
1
qt = , (A.30)
1 + est
where we assume k = 0 because translation along the time axis obviously makes no
difference to the
argument here. A little algebra now shows that q t = 1/ 2 if andonly
(2)
if t = s log( 2 1). Thus, qt attains its tipping point at time t2 = 1s log( 2
1
1). Since the logistic qt itself attains its tipping point at t = 0 and since b 1 = b2 ,
symmetry implies that qt attains its tipping point at t1 = 0 t2 = 1s log( 2 1).
(1)
Hence
2
(s) = |t1 t2 | = |2t1 | =
| log( 2 1)|
|s|

2 2 1
= log( 2 1) = log (A.31)
|s| |s| 21
as wished.
A.5 Curve-fitting algorithm for the extended model
This appendix provides the curve-fitting algorithm used for model evaluation in
Sect. 4, in pseudocode. We assume the existence of three subroutines: R EGRESS,
PARAM and E RROR. The first of these can be any optimization algorithm that per-
forms nonlinear regression on the ith context for given s and k, fitting a curve of the
form (18) subject to the lower and upper bounds 1 bi 1. The second routine is
assumed to return the bias size bi found by this regression, and the third to give the
error (in terms of the normalized sum of squared residuals) of the fit.
1: procedure F IT-CRE
2: K number of contexts
3: S range of s values
4: K range of k values
5: s current best-fitting s
6: k current best-fitting k
7: b vector of length K to hold best-fitting bias sizes
8: E error of fit, initialized to a large value
9: for s S do
10: for k K do
11: E 0 (current error of fit)
12: b = (b1 , . . . , bK ) vector of length K to hold bias sizes
13: for i {1, . . . , K} do
14: F R EGRESS(s, k, i)
15: bi PARAM(F )
16: E E + E RROR(F )
17: end for
18: if E < E then
19: s s
20: k k
21: b b
22: E E
23: end if
24: end for
25: end for
26: return s , k , b , E
27: end procedure
References
Bates, Douglas M., and Donald G. Watts. 1988. Nonlinear regression analysis and its applications. New
York: Wiley.
Berwick, Robert C. 1985. The acquisition of syntactic knowledge. Cambridge: MIT Press.
Bush, Robert R., and Frederick Mosteller. 1951. A mathematical model for simple learning. Psychological
Review 68: 313323.
Bush, Robert R., and Frederick Mosteller. 1958. Stochastic models for learning. New York: Wiley.
Chomsky, Noam. 1981. Lectures on government and binding. Dordrecht: Foris.
Chomsky, Noam, and Howard Lasnik. 1993. The theory of principles and parameters. In Syntax: An inter-
national handbook of contemporary research, eds. Joachim Jacobs, Arnim von Stechow, Wolfgang
Sternefeld, and Theo Vennemann, Vol. 1, 506569. Berlin: De Gruyter.
Clark, Brady, Matthew Goldrick, and Kenneth Konopka. 2008. Language change as a source of word
order correlations. In Variation, selection, development: Probing the evolutionary model of language
change, eds. Regine Eckardt, Gerhard Jger, and Tonjes Veenstra, 75102. Berlin: De Gruyter.
Corley, Kerry. 2014. The constant rate hypothesis in syntactic change: Empirical fact or lies, damned lies,
and statistics? BA diss., University of Cambridge.
Durham, Mercedes, Bill Haddican, Eytan Zweig, Daniel Ezra Johnson, Zipporah Baker, David Cockeram,
Esther Danks, and Louise Tyler. 2012. Constant linguistic effects in the diffusion of be like. Journal
of English Linguistics 40: 316337.
Ecay, Aaron W. 2015. A multi-step analysis of the evolution of English do-support. PhD diss., University
of Pennsylvania. Available at http://repository.upenn.edu/edissertations/1049/. Accessed 27 August
2017.
Ellegrd, Alvar. 1953. The auxiliary do: The establishment and regulation of its use in English. Stockholm:
Almqvist and Wiksell.
Fruehwald, Josef, Jonathan Gress-Wright, and Joel C. Wallenberg. 2009. Phonological change: The con-
stant rate effect. In North East Linguistic Society (NELS) 40. Amherst: GLSA.
Gardiner, Shayna. 2015. Taking possession of the constant rate hypothesis: Variation and change in An-
cient Egyptian possessive constructions. University of Pennsylvania Working Papers in Linguistics
21: 6978.
Glaser, Elvira. 1985. Graphische Studien zum Schreibsprachwandel vom 13. bis 16. Jahrhundert. Heidel-
berg: Carl Winter.
Hawkins, John A. 1994. A performance theory of order and constituency. Cambridge: Cambridge Univer-
sity Press.
Hawkins, John A. 2004. Efficiency and complexity in grammars. Oxford: Oxford University Press.
Heycock, Caroline, and Joel Wallenberg. 2013. How variational acquisition drives syntactic change: The
loss of verb movement in Scandinavian. Journal of Comparative Germanic Linguistics 16: 127157.
Ingason, Anton Karl, Julie Anne Legate, and Charles Yang. 2013. The evolutionary trajectory of the Ice-
landic New Passive. University of Pennsylvania Working Papers in Linguistics 19: 91100.
Ingham, Richard. 2013. Negation in the history of English. In The history of negation in the languages of
Europe and the Mediterranean, eds. David Willis, Christopher Lucas, and Anne Breitbarth, Vol. 1:
Case studies, 119150. Oxford: Oxford University Press.
Kallel, Amel. 2005. The loss of negative concord and the constant rate hypothesis. University of Pennsyl-
vania Working Papers in Linguistics 10: 128142.
Kallel, Amel. 2007. The loss of negative concord in Standard English: Internal factors. Language Variation
and Change 19: 2749.
Kroch, Anthony. 1989. Reflexes of grammar in patterns of language change. Language Variation and
Change 1: 199244.
Kroch, Anthony. 1994. Morphosyntactic variation. In 30th annual meeting of the Chicago Linguistic Soci-
ety (CLS), ed. Katharine Beals et al., 180201. Chicago: Chicago Linguistic Society.
Kroch, Anthony. 2000. Syntactic change. In The handbook of contemporary syntactic theory, eds. Mark
Baltin and Chris Collins, 629739. Oxford: Blackwell.
Labov, William. 1994. Principles of linguistic change, Vol. 1: Internal factors. Oxford: Blackwell.
Labov, William. 2001. Principles of linguistic change, Vol. 2: Social factors. Malden: Blackwell.
Manzini, M. Rita, and Kenneth Wexler. 1987. Parameters, binding theory, and learnability. Linguistic
Inquiry 18: 413444.
Mobbs, Iain. 2008. Functionalism, the design of the language faculty, and (disharmonic) typology. MPhil
dissertation, University of Cambridge. Available at http://ling.auf.net/lingbuzz/000680. Accessed 27
August 2017.
Narendra, Kumpati S., and Mandayam A. L. Thathachar. 1989. Learning automata: An introduction. En-
glewood Cliffs: Prentice-Hall.
Nevalainen, Terttu, and Helena Raumolin-Brunberg. 2003. Historical sociolinguistics: Language change
in Tudor and Stuart England. London: Pearson.
Niyogi, Partha, and Robert C. Berwick. 1997. A dynamical systems model for language change. Complex
Systems 11: 161204.
Ohala, John J. 1983. The origin of sound patterns in vocal tract constraints. In The production of speech,
ed. Peter F. MacNeilage, 189216. New York: Springer.
Paolillo, John C. 2011. Independence claims in linguistics. Language Variation and Change 23: 257274.
Pintzuk, Susan. 1995. Variation and change in Old English clause structure. Language Variation and
Change 7: 229260.
Pintzuk, Susan. 2003. Variationist approaches to syntactic change. In The handbook of historical linguis-
tics, eds. Brian D. Joseph and Richard D. Janda, 509528. Oxford: Blackwell.
Pintzuk, Susan, and Ann Taylor. 2006. The loss of OV order in the history of English. In The handbook of
the history of English, eds. Ans van Kemenade and Bettelou Los, 249278. Malden: Blackwell.
Postma, Gertjan. 2010. The impact of failed changes. In Continuity and change in grammar, eds. Anne
Breitbarth, Christopher Lucas, Sheila Watts, and David Willis, 269302. Amsterdam: John Ben-
jamins.
Postma, Gertjan. 2017. Modelling transient states in language change. In Micro-change and macro-change
in diachronic syntax, eds. ric Mathieu and Robert Truswell. Oxford: Oxford University Press.
R Core Team. 2012. R: a language and environment for statistical computing. Vienna: R Foundation for
Statistical Computing. Available at http://www.R-project.org/. Accessed 27 August 2017.
Roberts, Ian. 2007. Diachronic syntax. Oxford: Oxford University Press.
Santorini, Beatrice. 1992. Variation and change in Yiddish subordinate clause word order. Natural Lan-
guage and Linguistic Theory 10: 595640.
Santorini, Beatrice. 1993. The rate of phrase structure change in the history of Yiddish. Language Variation
and Change 5: 257283.
Tiersma, Pieter M. 1985. Frisian reference grammar. Dordrecht: Foris.
Walkden, George. 2009. Deriving the Final-over-Final Constraint from third factor considerations. Cam-
bridge Occasional Papers in Linguistics 5: 6772.
Wallage, Phillip. 2008. Jespersens Cycle in Middle English: Parametric variation and grammatical com-
petition. Lingua 118: 643674.
Wallage, Phillip. 2013. Functional differentiation and grammatical competition in the English Jespersen
Cycle. Journal of Historical Syntax 2: 125.
Wallenberg, Joel C. 2016. Extraposition is disappearing. Language 92: e237e256.
Warner, Anthony. 2005. Why DO dove: Evidence for register variation in Early Modern English. Language
Variation and Change 17: 257280.
Willis, David. 2017. Investigating geospatial models of the diffusion of morphosyntactic innovations: The
Welsh strong second-person singular pronoun chdi. Journal of Linguistic Geography 5: 4166.
Yang, Charles D. 2000. Internal and external forces in language change. Language Variation and Change
12: 231250.
Yang, Charles D. 2002. Knowledge and learning in natural language. Oxford: Oxford University Press.

Kauhanen - Walkden - 2017 - Deriving The Constant Rate Effect

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Kauhanen - Walkden - 2017 - Deriving The Constant Rate Effect

Uploaded by

Copyright:

Available Formats

Nat Lang Linguist Theory

Deriving the Constant Rate Effect

Henri Kauhanen1 George Walkden2

Received: 10 December 2015 / Accepted: 24 April 2017

peting grammars is neither a sufficient nor a necessary condition of language change

Keywords Constant Rate Effect Language change Dynamical systems

1.1 The Constant Rate Effect

Fig. 1 A classical example of a

Fig. 2 Three logistic curves (1)

1.2 The non-linking problem

changespossibly separated by millennia on the time axisas long as they happen

changes are to be thought of as competition between two or more parametric options

The paper is structured as follows. In Sect. 2, we augment Yangs (2000) mathemat-

2 Grammar competition and production biases

2.1 Learning competing grammars

Fig. 3 Venn diagram of the

Assuming ct = 0 and dt = 0 without loss of generality, this equation may be reduced

Theorem 1 (The Fundamental Theorem of Language Change; Yang 2000:239) As-

Fig. 4 Venn diagram of the

2.2 Competing grammars and contextual biases

Fig. 5 The biasing functions F

and the following is a theorem.

Theorem 2 Functions F = F (bi , pt ) and G = G(bi , qt ) satisfy the conditions (7)

Proof Appendix A.2.

where the contextual biases bi range from 1 (maximally G1 -favouring) through 0

8 We further elaborate on the empirical grounding of our biasing mechanism in Sect. 5.

model: equation (4) becomes

2.3 The Constant Rate Effect

To summarize, we propose to augment Yangs (2000) model of grammar competition

us to consider extreme model parameter regimes, particularly the subspace where B

3 Dynamics of the extended model

3.1 Advantage versus bias

Theorem 3 (The Extended Fundamental Theorem of Language Change) Assume re-

Proof Appendix A.3.

In dynamical-systems terminology, the production bias mechanism induces a bi-

Corollary 1 If < 1 and bi 0 for all contexts i, qt 1 as t .

Proof Since Bc < 0 for any choice of q0 , if < 1.

Corollary 2 If < q0 /(1 + q0 ), then qt 1 as t , regardless of the value of

Proof Clearly 1 B 1 since 1 bi 1 and 0 i 1. If < q0 /(1 + q0 ),

3.2 Logistic approximation

4.1 The Time Separation Theorem

where qt is a logistic function approximating the true underlying probability qt . What

Proof Appendix A.4.

It is to be noted that (s) is inversely proportional to sthe slower the rate of

4.2 Periphrastic do in English

14001425 0.000 0.117 0.000 0.000 0.000

4.3 English Jespersen Cycle

(21) a. We ne moten halden Moses e lichamliche.

This replacement of negators, a cross-linguistically common diachronic pathway,

Table 2 Proportion of Stage 1

4.4 Loss of final fortition in Early New High German

Table 3 Proportion of fortition

4.5 Comparison with standard procedure

Fig. 15 Error (sum of squared

5.1 Which grammar wins?

Yangs (2000) Fundamental Theorem of Language Change, given earlier as Theo-

5.2 The nature of production biases

Building on earlier work that derives logistic evolution as a population-level property

Acknowledgements We thank Ricardo Bermdez-Otero, Laurel MacKenzie, David Willis, Charles

Theorem 5 Let 0 = 1 and

Then, for all t,

Proof Induction. For t = 0,

Now assume that the claim holds for t. Then

It is to be noted that (s) is inversely proportional to sthe slower the rate of

Theorem 5 Let 0 = 1 and

Proof Let B = 0. Then from equation (16) t = 1 for all t, so that