All content following this page was uploaded by Volker Gast on 14 June 2016.
Volker Gast
Friedrich Schiller University Jena
This article argues for a type of corpus-based contrastive research that is item-
specific, predictive and hypothesis-driven. It reports on a programmatic study
of the ways in which impersonalization is expressed in English and German.
Impersonalization is taken to be epitomized by human impersonal pronouns like
German man (e.g. Man lebt nur einmal ‘You/one only live(s) once’). English does
not have a specialized impersonal pronoun like Germ. man and uses a variety
of strategies instead. The question arises what determines the choice of a given
impersonalization strategy in English. Drawing on relevant theoretical work and
using data from a translation corpus (Europarl), variables potentially affecting
the distribution of impersonalization strategies in English are identified, and
their influence on the choice of a strategy is determined. By testing hypotheses
derived from theoretical work and using multivariate quantitative methods of
analysis, the study is intended to illustrate how bridges can be built between
fine-grained semantic analyses, on the one hand, and more coarse-grained, but
empirically valid, corpus research, on the other.
* The present study grew out of a project on human impersonal pronouns, funded by the
German Science Foundation (Ga/1288–6). The financial support from this institution is grate-
fully acknowledged. Some of the data used were coded by members of this project, especially by
Lisa Deringer. I am also greatly indebted to the other members of the project, Florian Haas and
Olga Rudolf, for their cooperation and input. Moreover, I wish to thank the audience of the 7th
International Contrastive Linguistics Conference (Ghent, July 2013) as well as two anonymous
reviewers and the editor of this issue for helpful comments and criticism.
1. Introduction
1. The Oslo Multilingual Corpus was developed at the University of Oslo under the direction of
S. Johansson and C. Fabricius-Hansen between 1999 and 2008.
Given the ambitiousness of the undertaking, the present study can only be pro-
grammatic. In view of the wide range of factors influencing the choice of a given
structure (by a given speaker) to render a specific meaning, predictions concern-
ing the linguistic matter used in any given case can only be approximations. Let
us consider an example in order to illustrate the idea of an item-based, predic-
tive and hypothesis-driven problem of contrastive corpus linguistics. Consider the
German sentence (1) with its interlinear glosses (an English translation will be
provided below), which has been taken from the Europarl corpus (cf. Section 2
below). It is a German sentence that was translated from Spanish.²
(1) Herr Präsident!
Mr president
Es ist zweifellos nichts Neues, wenn man sagt,
it is without a doubt nothing new if imps says
dass sich die Europäische Union
that refl the European Union
an einem Scheideweg befindet.
at a crossroads stands [Europarl, v7]
The corpus also contains an English version of this sentence. Given this informa-
tion, what kind of prediction can we make with respect to the form of the English
sentence in the corpus?
As the possibilities of translating a sentence from one language into another
are manifold, this is of course an ambitious question to ask. We can be more mod-
est, however. Let us focus on the impersonal pronoun man, glossed as ‘imps’ in
(1). Man is used in a broad range of contexts. We will return to its distribution in
Section 4 below. For the time being, suffice it to say that there is no specific English
pronoun corresponding to Germ. man (for contrastive studies of impersonal pro-
nouns in Germanic languages, see for instance Johansson 2004, van der Auwera et
al. 2012). How can the meaning of (1) be rendered in English, then?
Let us rephrase this question from a theoretical point of view: What kind of
information do we need in order to determine the type of structure, or at least
a set of possible structures, to be used in an English equivalent of (1)? The most
2. “Señor Presidente, decir que la Unión Europea se encuentra en una encrucijada no es, sin
duda alguna, afirmar nada nuevo.”
As will be seen, sentences with the German impersonal pronoun man are only
rarely rendered in English using a similar pronoun (like one, for instance). What
we typically find are non-finite, sometimes nominal or adjectival constructions,
or passives (cf. Sect. 3.1). Rather than speaking about ‘impersonal pronouns’, we
should therefore use the functional concept of ‘impersonalization’ as a tertium
comparationis. It is defined in (3) (from Gast and van der Auwera 2013).
(3) Impersonalization is the process of filling an argument position of a
predicate with a variable ranging over sets of human participants without
establishing a referential link to any entity from the universe of discourse.
(Gast and van der Auwera 2013: 136)
Following this introductory section, some remarks about the corpus used for the
present study (Europarl) are made in Section 2. Section 3 provides a descriptive
overview of the data. In Section 4, the variables for the empirical study are de-
scribed and explained. Section 5 contains a multifactorial analysis using multino-
mial logistic regression. A model is built stepwise (bottom-up) (Sections 5.1–5.3),
and the predictions that the model makes are exemplified (Section 5.4). Section 6
contains the conclusions.
The Europarl corpus contains the proceedings of the European Parliament since
1996 (cf. Koehn 2005, Cartoni and Meyer 2012, Cartoni et al. 2013). The number
of languages rose from eleven in 1996 to 24 in 2014, reflecting the expansion of the
European Union during that time. As far as corpus design is concerned, it is impor-
tant to note that the texts were mostly translated manually and directly (i.e., n-to-n)
from 1996 to 2003. Since 2003, many texts have been translated into English first,
and then into the other languages of the European Union (cf. Cartoni et al. 2013: 35).
The corpus is freely accessible on the internet (http://www.statmt.org/europarl/).
The Europarl corpus contains planned spoken (political) speech and conse-
quently represents a highly formal register. In order to estimate the degree of reg-
ister specificity of the structures under investigation, I will therefore use some ad-
ditional data from the OpenSubtitles corpus (cf. http://opus.lingfil.uu.se), which
contains movie subtitles from 59 languages with a total number of 1,211 bitexts
(in April 2014).
For the present study, only data from English and German will be relevant.
Three types of pairs of texts (or sentences) can be distinguished, as far as the pa-
rameter ‘direction of translation’ is concerned: (i) English sentences and their
translations into German, (ii) German sentences and their translations into
English, and (iii) pairs of English and German sentences that were both translated
from some third language. For a contrastive study we obviously need to be able
to keep these cases apart. In other words, we need some terminology. As I am not
aware of any term for that concept, I will coin a neologism for any sentence s cor-
responding to some other sentence s’ from the translation corpus, and I will call s
and s’ heterophrases. Heterophrase is a relational term, i.e., a sentence si is always
a heterophrase of some other sentence sj. Heterophrases are pairs or sets of sen-
tences that are intended to render (approximately) the same meaning, in the same
context, irrespective of the source and direction of translation.
When a sentence s1 is an original, and another sentence s2 is a translation of s1,
s1 will be said to be the ‘original’ of s2, and s2 the ‘translation’ of s1. If two hetero-
phrases are translations from a third language, they will be metaphorically called
‘sisters’. Two sentences si and sj are sisters if they are heterophrases, but they were
not translated directly from one language into the other.
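The three-way distinction can be made concrete with a minimal sketch. It assumes, hypothetically, that every sentence pair carries an annotation of the language in which the sentence was originally uttered; the function name and the two-letter language codes are illustrative and are not taken from the Europarl markup.

```python
def relation(source_lang: str) -> str:
    """Classify an English/German heterophrase pair by direction of translation.

    `source_lang` is the language of the original utterance (hypothetical
    two-letter codes). The returned label describes the English sentence.
    """
    if source_lang == "en":
        return "original"      # English original; the German sentence is its translation
    if source_lang == "de":
        return "translation"   # English sentence translated from German
    return "sister"            # both versions were translated from a third language


# Example: a pair whose original was Spanish, as in example (1)
print(relation("es"))
```

Keeping the label as a single categorical value makes it straightforward to use the direction of translation as one predictor among others later on.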
The notion of ‘impersonalization’ was introduced in (3) above, and it was pointed
out that English does not have a dedicated impersonal pronoun corresponding to
German man (cf. also Johansson 2004, van der Auwera et al. 2012). The question
arises what types of strategies we find in English corresponding to this pronoun.
The present study addresses this question on the basis of a random sample of 399
examples of man (both originals and translations) with their English heterophrases.³
This section provides a descriptive overview of the English strategies of
expressing impersonalization, starting with the more general level of heterophrases
in Section 3.1, and moving on to translation effects in Section 3.2.
Let us first get an idea of the types of strategies that we find in the English versions
from the sample. Their frequencies are shown in Figure 1.⁴
We will not discuss every single strategy in the following but focus on the
most common types.⁵ As Figure 1 shows, the most frequent strategy is the passive,
illustrated in (4).
(4) man → passive
Ge: Ich gratuliere Ihnen zu dem, was man im deutschen Parlamentarismus
in Ihrem Fall unzulässigerweise eine Jungfernrede nennt.
En: I would like to congratulate you on what is referred to in German
parliamentary-speak, inappropriately in your case, as a maiden speech.
[Europarl, v7]
In second to fourth position, we find the first person plural pronoun we, nominal
constructions and infinitives. Examples of we and nominal expressions are given
in (5) and (6); an example of an infinitive was already given in (2) above.
Figure 1. English heterophrases of sentences with impersonal man in German (Europarl).
[Bar chart; y-axis: frequency, 0 to 50; x-axis: psv, we, nom, inf, one, ptcp, you, rphr,
they, people, I, 3sg, anyone, imp, some people.]
4. All diagrams were generated with the software environment R (R Core Team 2013). Some
of them were edited manually.
5. Here is a list of some perhaps not self-explanatory abbreviations used in Figure 1:
ptcp: participle; rphr: complete rephrasing; imp: imperative.
(5) man → we
Ge: Je lockerer man mit den Sitten umgeht, desto stärker hält man die Hand
auf der Brieftasche.
En: Basically, the more we lose our grip on morals, the more we tighten our
grip on our wallet. [Europarl, v7]
(6) man → nominal expression
Außerdem konzentrierte man sich in diesen vergangenen Jahren
grundsätzlich auf die Sicherheit im Passagierverkehr.
‘During these years, the emphasis has mainly been on the safety of passenger
transport.’ [Europarl, v7]
The English impersonal pronoun one is found in a relatively low number of cases
(cf. (7) for an example), considering the (high) register of the corpus. Impersonal
you is much more common in spoken English (cf. also below for the data from
the OpenSubtitles corpus), but it is also sporadically found in the Europarl cor-
pus (cf. (8)).
(7) man → one
Ge: Zur Beseitigung dieser Ursachen muß man sich die Situation in den
betreffenden Ländern vergegenwärtigen.
En: In order to remove these causes, one has to look at what is happening in
the countries in question. [Europarl, v7]
(8) man → you
Ge: Man kann sich nicht vorbereiten, wenn man hier eine Erklärung hört
und gar nicht weiß, was Inhalt einer solchen Erklärung ist.
En: You cannot prepare if you hear a statement in this House and have no
idea of its content. [Europarl, v7]
The question of what determines and conditions the use of any given strategy is
the central topic of this article, and we will turn to it in Sections 4 and 5. Before
going into some more detail, let us compare the frequencies of impersonalization
strategies found in Europarl to those in the OpenSubtitles corpus, to get an idea of
the register specificity of the distribution shown in Figure 1. Figure 2 summarizes
the frequencies of impersonalization strategies in the OpenSubtitles corpus, based
on a randomly extracted sample of 500 examples.
In the OpenSubtitles corpus, you is by far the most frequent strategy, followed
by passives and the third person plural pronoun they. Examples of impersonal you
and they are given in (11) and (12).
(11) Ge: Ich weiß, dass man die Flügel nicht wieder ankleben kann.
En: I know you just can’t glue the wings back on. [OpenSubtitles]
(12) Ge: So etwas kann man noch nicht bauen!
En: They cannot make things like that yet. [OpenSubtitles]
Figure 3 displays a comparison of the frequencies in the two corpora in the form
of a Cohen-Friendly association plot (cf. Cohen 1980, Friendly 1992). It shows the
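[Figure 2: bar chart of the frequencies of impersonalization strategies in the
OpenSubtitles sample; y-axis: 0 to 200; categories include you, psv, they, inf, nom, we,
zero, one, I, pst.ptcp, prs.ptcp, imp, people, anyone, anybody.]
[Figure 3: association plot comparing Europarl and OpenSubtitles; columns include any-x,
I, imp, inf, it, nom, one, people, prs.ptcp, pst.ptcp, psv, they, we, who, you, zero;
Pearson residuals range from –8.14 to 4.00; p-value < 2.22e-16.]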
Let us now have a closer look at the distribution of strategies, taking into account
whether a sentence is an English original (which was translated into German us-
ing man), an English translation of German man, or a translation from a third
language (where the German and English sentences are ‘sisters’, using the termi-
nology established in Section 2). The data is represented in the form of an associa-
tion plot in Figure 4 (note that the labels on the left indicate the original language;
English → original, German → translation from German, other → translation
from other language).
Figure 4. English originals, translations from German and from other languages.
[Association plot; rows: English, German, other; columns include psv, adj, I, imp, inf,
it, nom, one, people, pst.ptcp, prs.ptcp, they, they(anph), we, who, you, zero, anyone,
some people; Pearson residuals range from –3.04 to 2.38; p-value = 0.023452.]
The plot contains three shaded boxes. Passives are significantly overrepre-
sented in English originals, relative to translations. We is massively underrepre-
sented here. In sentences that were translated from German into English, you is
significantly overrepresented (i.e., man is relatively often translated into English
using you, in comparison to the other impersonalization strategies). The other dif-
ferences in the distributions are not statistically significant, but this is probably
at least partly due to the limited sample size, and by adding more data we could
certainly detect further interesting asymmetries. At this point we will not go into
any further detail with respect to translation effects, however. We will return to
the relation between heterophrases in the quantitative interpretation of the data.
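The shading in an association plot is driven by Pearson residuals, i.e. the difference between observed and expected counts divided by the square root of the expected count. The following sketch illustrates the computation for a small contingency table. The counts are hypothetical and are not the study's data; by a common convention, cells with an absolute residual of about 2 or more are the ones that get highlighted.

```python
from math import sqrt


def pearson_residuals(table):
    """Return (observed - expected) / sqrt(expected) for each cell.

    Expected counts are computed under independence of rows and columns.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            res_row.append((observed - expected) / sqrt(expected))
        residuals.append(res_row)
    return residuals


# Hypothetical counts (NOT the study's data): rows = source type, columns = strategy
counts = [
    [30, 10],  # e.g. English originals: passive vs. other strategies
    [20, 40],  # e.g. translations from German: passive vs. other strategies
]
res = pearson_residuals(counts)
chi2 = sum(r * r for row in res for r in row)  # Pearson's chi-square statistic
print(res, chi2)
```

Summing the squared residuals gives the χ2 statistic, which is why a few large residuals are enough to make the overall test significant.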
Finding an answer to the question in (13) is, first and foremost, a theoretical chal-
lenge. It implies that we understand the factors that determine the distribution of
German man as well as its English counterparts. In other words, it implies that we
are able to answer the question in (14).
For a corpus-based study, we thus have to find candidate variables that potentially
have an impact on the interpretation and distribution of the relevant
strategies. For each variable, we will formulate and test a hypothesis concerning its
impact. Building upon the results obtained in this section, we can then try to fit
a multinomial logistic regression model in Section 5. The independent variables
to be tested are the following:
– the type of quantification expressed (Section 4.1)
– semantic properties of the sentential context:
– generalizing vs. specific (Section 4.2)
– veridical vs. non-veridical (Section 4.3)
– modal(ized) vs. non-modal(ized) (Section 4.4)
– the clause type found in any given instance (Section 4.5)
– the source language (Section 4.6)
The semantic variables — the type of quantification expressed and the properties
of the sentential context — as well as the clause type were coded manually. The
information about the speaker was extracted automatically from the relevant tags
of the Europarl corpus. The dependent variable is the impersonalization strategy.
Given that some of the strategies are very rare and have moreover similar distri-
butional and semantic properties, I distinguish only eight levels for this variable
(unlike in Figure 1 above):
– a passivized predicate (psv), cf. (4)
– a nominal or adjectival construction (nom), cf. (6)
– a non-finite verbal construction (infinitives, imperatives) (nfin), cf. (2)
– the impersonal pronoun one, cf. (7)
– impersonal uses of the personal pronoun you, cf. (8)
– the first person plural pronoun we, cf. (5)
– pronouns other than you, one or we, e.g. anyone, they (pron), cf. the English
version of (12)
– a complete rephrasing of the sentence (rphr)
Examples have been provided for all the strategies except the last one (rephrasing).
It is illustrated in (15).⁶
(15) rephrasing of man-sentence
Ge: Man macht sich Sorgen um die “Verbuchung der Eigenmittel”, um ihre
“Bereitstellung” oder um die “Kontrolle” der der Kommission zur
Verfügung gestellten Beträge.
En: It shows a concern with the accounting involving own resources, the
process of making them available or with the monitoring of declared
amounts made available to the Commission. [Europarl, v7]
6. A reviewer points out that (15) could also be classified as ‘nominal’. I have chosen the
‘rephrasal’-option because the two sentences contain considerably different main predicates.
Given that the type of interpretation is an obvious candidate for being a parameter
that influences the choice of an impersonalization strategy in English, we can for-
mulate the hypothesis in (18) and test the corresponding null hypothesis in (19):
(18) H1: The type of quantification expressed by German man has an impact on
the use of impersonalization strategies in the English heterophrases.
(19) H0: The type of quantification expressed by German man has no impact on
the use of impersonalization strategies in the English heterophrases.
For the (manual) coding of the data we used a substitution test: Depending on
whether substitution with somebody or everybody led to approximately equivalent
sentences, the examples were classified as existential or universal. A χ2-test shows
that the null hypothesis in (19) can be rejected (χ2 = 20.5, df = 7, p = 0.005). The
data are shown in the form of an association plot in Figure 5.
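As a cross-check, the p-value can be recomputed from the reported test statistic and the degrees of freedom alone. The sketch below uses only the Python standard library; the helper name chi2_sf is an assumption of this illustration, and it relies on the standard recurrence for the regularized upper incomplete gamma function at integer and half-integer arguments rather than on any routine used in the study.

```python
from math import erfc, exp, gamma, sqrt


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) of a chi-square variable with integer df.

    Uses Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / gamma(a + 1) with h = x / 2,
    starting from Q(1, h) = exp(-h) or Q(1/2, h) = erfc(sqrt(h)).
    """
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, exp(-h)
    else:
        a, q = 0.5, erfc(sqrt(h))
    while a < df / 2.0:
        q += h ** a * exp(-h) / gamma(a + 1.0)
        a += 1.0
    return q


# Reported values for the quantification variable: chi-square = 20.5, df = 7
p = chi2_sf(20.5, 7)
print(round(p, 4))
```

The result is slightly below 0.005, in line with the rounded p-value given in the text; small deviations from the value printed in the figure are to be expected because the test statistic itself is rounded.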
[Figure 5: association plot of impersonalization strategy (strat.n: psv, nom, pron,
non-fin, one, you, rephr, we) by type of quantification (uni: 0, 1); Pearson residuals
range from –2.12 to 2.16; p-value = 0.004546.]
The first way of explaining the type of context sensitivity that characterizes imper-
sonal pronouns is to assume that the type of interpretive variability characteristic
of them is similar to, or even an instance of, what has been called ‘quantification
variability effects’ since Lewis (1975). An example is given in (20). The indefinite
NP a dog is here interpreted as being universally quantified over, as it occurs in the
scope of the adverb always.
(20) A dog is always smart. ≈ All dogs are smart.
One way of analysing this phenomenon is to assume that indefinites (of this type)
basically introduce variables, and that these variables are bound by some operator
in the sentential environment. A similar assumption can be made for impersonal
pronouns (see for instance Alonso-Ovalle 2002, Moltmann 2006). Accordingly,
the interpretation of man as an existential or universal quantifier can be derived
from the assumption that the quantifier binding the event variable also binds the
(otherwise) free variable introduced by man (or similar expressions).
We can thus formulate the following hypothesis H1, and the complementary null
hypothesis H0:
(27) H1: The status of a sentence as generalizing or specific has an impact on
the distribution of impersonalization strategies in English heterophrases of
German sentences with man.
(28) H0: The status of a sentence as generalizing or specific has no impact on
the distribution of impersonalization strategies in English heterophrases of
German sentences with man.
As an operational test we used the criterion of whether the adverbial generally could
be added salva veritate. The data show that the null hypothesis H0 cannot be rejected
(χ2 = 11.0, df = 7, p = 0.14). Somewhat surprisingly perhaps, the parameter ‘generalizing’
does not, in itself, seem to have a significant impact on the choice of an
impersonalization strategy in the dataset used for the present study. Note that this is
probably not primarily due to scarcity of data. The data are simply relatively evenly
distributed, specifically in the most frequent strategies, i.e., ‘passive’, ‘nominal’ and
‘non-finite’. The frequencies are shown in Table 1 and visualized in Figure 6.
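[Figure 6: association plot of impersonalization strategy (strat.n: psv, nom, pron,
n-fin, one, you, rephr, we) by the variable ‘generalizing’ (gen); Pearson residuals
range from –1.44 to 1.43; p-value = 0.13716.]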
Let us consider two attested examples from the corpus. (35) is classified as non-
veridical, (4) (repeated here) as veridical.
(35) Ge: In der Tat, wenn man die beiden Mitglieder, die sich gemeldet haben
hinzuzählt, dann ergibt sich als Ergebnis …
En: Indeed, if we add the two Members who have declared themselves, then
the result of the vote would be … [Europarl, v7]
(4) Ge: Ich gratuliere Ihnen zu dem, was man im deutschen Parlamentarismus
— in Ihrem Fall unzulässigerweise — eine Jungfernrede nennt.
En: I would like to congratulate you on what is referred to in German
parliamentary-speak, inappropriately in your case, as a maiden speech.
[Europarl, v7]
Note that the theory of non-veridicality seems to be incompatible with the fact
that one is licensed in generalizing contexts, even though it often sounds stilted
in such contexts (e.g. One only lives once). The present approach does not aim at
postulating categorical rules, but at finding parameters which, in interaction with
other factors, contribute to the grammaticality of some structure and determine
its distribution. Note, however, that generic sentences could in fact be argued to be
non-veridical relative to the minimal clause, as they can be regarded as conveying
an implicit conditional (cf. the representation of One only lives once in (21)).
We can, again, formulate two complementary hypotheses and test the null
hypothesis.
(36) H1: The status of the minimal clause containing an impersonalized
argument position as veridical or non-veridical has an impact on the
distribution of impersonalization strategies in English heterophrases of
German sentences with man.
(37) H0: The status of the minimal clause containing an impersonalized
argument position as veridical or non-veridical has no impact on the
distribution of impersonalization strategies in English heterophrases of
German sentences with man.
[Figure 7: association plot of impersonalization strategy (strat.n: psv, nom, pron,
n-fin, one, you, rephr, we) by veridicality (ver); Pearson residuals range from –1.69
to 1.93; p-value = 0.021054.]
On the basis of the dataset used for the present study, the null hypothesis in (39)
can be rejected (χ2 = 16.7, df = 7, p = 0.02). The data is visualized in Figure 8.
[Figure 8: association plot of impersonalization strategy (strat.n: psv, nom, pron,
n-fin, one, you, rephr, we) by modality (mod); Pearson residuals range from –1.68
to 1.92; p-value = 0.019309.]
So far, we have only considered semantic variables: the type of quantification ex-
pressed in a sentence, the property of expressing a generalization vs. a particu-
lar statement, the question of (non-)veridicality and the presence vs. absence of a
modal. In this section we will consider a formal, basically syntactic, variable, i.e.
the type of clause in which man occurs.
The clause type is potentially relevant because the entire ‘ecology’ of the gram-
matical system of English is different from that of German. English has a category
that German lacks — the gerund (cf. (40)) — and it makes ample use of participial
constructions (cf. König and Gast 2012: Ch.13 for a contrastive overview). It is
therefore to be expected that in many cases where German uses man (in a finite
clause), English will use some non-finite structure. (41) is an example where (the
first occurrence of) German man was translated into English using a participle
(the second occurrence was translated with one).
(40) Ge: Liest man beide Kommissionsdokumente, so war 1998 ein Jahr, in dem
die 1997 eingeleiteten Modernisierungsvorhaben fortgeführt und in
Teilen noch abgeschlossen wurden; …
En: On reading both Commission documents, one learns that 1998 was the
year in which the modernisation proposals introduced in 1997 were
pursued and even partially completed, … [Europarl, v7]
(41) Ge: Herr Präsident, Frau Kommissarin, liebe Kolleginnen und Kollegen!
Wenn man die Flugblätter sieht, die in den letzten Wochen verteilt
wurden, dann denkt man, wir diskutieren hier über den Öko-Gau oder
über den Tod der Automobilindustrie in Europa.
En: Mr President, Commissioner, reading the leaflets which have been
distributed in recent weeks, one would think that we were talking about
the ultimate environmental catastrophe or the death of the car industry
in Europe. [Europarl, v7]
We can, again, formulate two complementary hypotheses and test the null
hypothesis:
(42) H1: The type of clause in which man occurs has an impact on the type of
impersonalization strategy found in English.
(43) H0: The type of clause in which man occurs has no impact on the type of
impersonalization strategy found in English.
I have classified the clause types into four categories: main/declarative clause
(m.dec), (direct or indirect) questions (q), adjunct clauses, i.e., adverbial clauses
and relative clauses (adj), and complement clauses (comp). This classification is
relatively coarse-grained, and perhaps the classes have not been chosen in an ideal
way. A more fine-grained classification would have implied a high number of cells
with low counts, which compromises statistical analyses. On the basis of this four-
way classification the null hypothesis cannot be rejected. In fact, the p-value is
rather high (χ2 = 27.37, df = 21, p = 0.16). The data is visualized in Figure 9.
[Figure 9: association plot of impersonalization strategy (strat.n: psv, nom, pron,
n-fin, one, you, rephr, we) by clause type (typ.n: main/decl, question, adjunct, comp);
Pearson residuals range from –2.23 to 2.00; p-value = 0.15907.]
Finally, it is obvious that in a study of translated text we need to take the ‘source
type’ into account — i.e. we need to know whether a sentence is an original or a
translation. Moreover, I distinguish two types of translations, those from German
and those from other languages. This distinction is made because the entire study
is based on German man as a ‘flag’ or ‘anchor’ for the argument structure opera-
tion of impersonalization.
The hypotheses concerning the source type can be phrased as follows:
(44) H1: The type of source of any given example (original, translation) has an
impact on the choice of an impersonalization strategy in English.
(45) H0: The type of source of any given example (original, translation) has no
impact on the choice of an impersonalization strategy in English.
The data is visualized in Figure 10. The null hypothesis can be rejected (χ2 = 43.97,
df = 18, p < 0.001).
[Figure 10: association plot of impersonalization strategy (strat.n: psv, nom, pron,
n-fin, one, you, rephr, we) by source type (langrel: translation, original, sisters);
Pearson residuals range from –3.04 to 2.38; p-value = 0.0015869.]
4.7 Summary
The statistics for the variables explored in the present section are summarized
in Table 2, ordered by the p-values. For the multifactorial analysis presented in
Section 5 below, only those variables will be taken into account for which the null
hypothesis of statistical independence could be rejected, i.e., veridical, modal,
quantification and source type.
Table 2. Main statistics for the variables under consideration in the present study.
variable          χ2-value   df   p        significance
clause type       27.37      21   0.16
generalizing      11.0       7    0.14
veridical         16.5       7    0.02     *
modal             16.7       7    0.02     *
quantification    20.5       7    0.005    **
source type       43.97      18   <.001    **
increases with the number of variables used in a model, or with their levels. For
both scores it holds that ‘the lower the better’.
The mlogit-package provides the (negative) log-likelihood value and McFadden’s
R2 for each model (cf. Smith and McKenna 2013).⁷ The absolute value of the nega-
The model with the best statistics is the one with the variable ‘source type’. As
Table 3 shows, in 104 cases the model assigns the highest probability to the strat-
egy that was actually used, i.e. in 26% of cases. This is a considerable improvement
in comparison with a model that simply chooses the most frequent strategy (pas-
sives), which would make a correct prediction in 23.3% of cases. The source type-
model simply selects the strategy ‘passive’ for English originals and for translations
from languages other than German, and ‘nominal’ for translations from German,
and this is the right choice in 26% of cases.
A one-variable model with the variable ‘modality’ correctly predicts an even
higher number of cases, though the goodness-of-fit indicators are slightly worse.
This model chooses the strategy ‘nominalization’ for sentences with a modal, and
‘passive’ for sentences without a modal. In this way it selects the correct outcome
in 26.8% of cases.
7. McFadden’s R2, “(sometimes referred to as ‘deviance R2’), is one minus the ratio of the full-
model log-likelihood to the intercept-only log-likelihood … ” (Smith and McKenna 2013: 18).
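Written out as a formula, the definition quoted above reads as follows. The subscript labels are chosen here for readability and are not the notation of the cited source.

```latex
R^2_{\text{McFadden}} = 1 - \frac{\ln L_{\text{full}}}{\ln L_{\text{intercept-only}}}
```

Since both log-likelihoods are negative and the full model fits at least as well as the intercept-only model, the ratio lies between 0 and 1, and a larger value indicates a better fit.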
As these examples illustrate, one-variable models are very simple and do not
have much explanatory force. Still, they can give us a rough idea of the role that
each variable plays in determining the distribution of the English impersonaliza-
tion strategies.
There can be little doubt that the final model will contain the variable ‘source type’,
which was shown to be a relatively good predictor in Section 5.1, and which is
totally independent of the other (semantic) variables. We will therefore start with
combinations of the semantic variables, (non-)veridicality, modality and quanti-
fication. We will focus on combinations of quantification with (non-) veridicality
and modality. The statistics for these models (both with and without interactions)
are shown in Table 4.
The models with quantification and modality fare much better in all respects than
those with quantification and veridicality. As is shown by a likelihood ratio test,
quantification and modality interact, however. The model with interactions
(represented in the last row) is significantly better than the model without
interactions (LogLik_{q+m} = −730.9, LogLik_{q*m} = −719.13, df = 7, χ2 = 23.53, p = 0.001).
Both models are significantly better than the one-variable model with
‘quantification’ (LogLik_{q} = −738.3, LogLik_{q*m} = −719.13, df = 14, χ2 = 38.29, p < 0.001).
The quant*mod-model makes correct predictions for 28.8% of cases (= 115/399).
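The likelihood ratio statistic is twice the difference between the log-likelihoods of the two nested models, and it is evaluated against a χ2 distribution whose degrees of freedom equal the number of additional parameters. The sketch below recomputes the first comparison from the reported log-likelihoods; the function names are assumptions of this illustration, and the χ2 tail probability is obtained with a standard-library recurrence rather than with the routines used in the study.

```python
from math import erfc, exp, gamma, sqrt


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, exp(-h)
    else:
        a, q = 0.5, erfc(sqrt(h))
    while a < df / 2.0:
        q += h ** a * exp(-h) / gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(loglik_small, loglik_large, df):
    """Compare two nested models; df is the number of extra parameters."""
    stat = 2.0 * (loglik_large - loglik_small)
    return stat, chi2_sf(stat, df)


# Reported values: model without vs. with the quantification/modality interaction
stat, p = likelihood_ratio_test(-730.9, -719.13, df=7)
print(round(stat, 2), round(p, 3))
```

With the rounded log-likelihoods the statistic comes out within rounding distance of the reported χ2 value, and the p-value rounds to 0.001.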
Let us now combine the variable ‘source type’ with each of the three semantic
variables that turned out to be significant. As likelihood ratio tests have revealed
no significant interactions between the variables shown in Table 5, only the mod-
els without interactions are shown.
The model with the variables ‘source type’ and ‘quantification’ is the one that
fares best in terms of the model statistics so far, even though it only predicts 113
outcomes correctly (in comparison to the 115 outcomes correctly predicted by the
quant*mod-model). The number of correctly predicted outcomes is not, however,
a very reliable indicator of goodness of fit, as was pointed out above.
We do not have many options to fit a model with three variables. The two linguis-
tically most reasonable models are shown in Table 6 (translation effects will be
integrated below). Note that there is an interaction between the variables ‘quantifi-
cation’ and ‘modality’, but not between ‘quantification’ and ‘veridicality’.
The model with source type, quantification and modality, as well as the interaction
between the latter variables, is the best model so far and it fits the data significantly
better than any two-variable model (LogLik_{s+q} = −716.59, LogLik_{s+m+q} = −697.47,
df = 14, χ2 = 38.23, p < 0.001). While the model statistics have continually improved
in the course of ‘model building’ (within a generally rather modest goodness of
fit), the number of correctly predicted outcomes has not changed significantly and
even dropped somewhat from the two-variable models to the three-variable models.
A final improvement can be achieved by fitting a four-variable model with
the predictors ‘source’, ‘quantification’, ‘modality’ (interacting with quantification)
and ‘veridicality’. While the additional predictor does not improve the fit significantly
at the five percent level, it comes close to doing so, with p = 0.08
(LogLik_{s+q*m} = −697.47, LogLik_{s+v+q*m} = −691.13, df = 7, χ² = 12.68). The number of
correctly predicted outcomes increases considerably, to 123 (i.e., 30.8%). The
statistics for the four-variable model are summarized in Table 7.
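To put the numbers of correctly predicted outcomes into perspective, the counts mentioned in the text can be converted into rates. The sketch below assumes a sample of 399 cases (taken from the figure ‘115/399’ given above) and adds, purely as a reference point that is not part of the original analysis, the rate expected if one of the eight strategies were picked uniformly at random. The model labels and the helper `cpo_rate` are introduced here for illustration.

```python
# Sample size, taken from the figure "115/399" in the text.
N_CASES = 399

# Number of English strategies (outcomes) distinguished in the study.
N_STRATEGIES = 8

# Correctly predicted outcomes (CPO) as reported for three of the models.
cpo = {
    "source + quant": 113,
    "quant * mod": 115,
    "four-variable": 123,
}


def cpo_rate(correct, total=N_CASES):
    """Share of cases in which the most probable strategy was the observed one."""
    return correct / total


# Assumption for illustration only: a guesser choosing uniformly among strategies.
uniform_baseline = 1 / N_STRATEGIES

for name, correct in cpo.items():
    print(f"{name:15s} {correct:3d}/{N_CASES} = {100 * cpo_rate(correct):.1f}%")
print(f"{'uniform guess':15s} {100 * uniform_baseline:.1f}%")
```

The three models differ by at most ten correctly predicted cases, which illustrates why the CPO-value separates them far less clearly than the likelihood-based statistics do.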
As the discussion in the previous sections has shown, a multinomial logistic re-
gression model can be used to make predictions about impersonalization strate-
gies used in English. The accuracy of the models is of course quite limited, as can
be seen from the model statistics. However, the improvements resulting from the
stepwise addition of variables were significant (with the exception of the final
step, which fell just short of significance at p = 0.08) and, in my view, constitute
an interesting result, specifically insofar as they were tied to predictions made on the
basis of theoretical considerations, which thus received some support.
As was mentioned above, a model assigns probabilities to each outcome. Let
us consider one example in order to understand how this works. The German sen-
tence in (46) has the features in (47) (it is a translation from Dutch).
(46) Herr Präsident! Über die Bedeutung der Finanzhilfen für Zypern und Malta
ist man sich weitgehend einig. [Europarl, v7]
(47) Source: sister
Mod: 0
Ver: 1
Quant: univ
On the basis of this information, the four-variable model assigns the following
probabilities to each of the eight possible outcomes/strategies:
(48)  psv   nom   pron  nfin  one   you   rphr  we
      19%   27%   10%   7%    6%    5%    4%    22%
The strategy ‘nominal’ has the highest percentage assigned by the model and is
thus correctly predicted in this case, as is shown by the English heterophrase of
(46):
(49) Mr President, there is a general consensus of opinion that financial aid to
Cyprus and Malta is important. [Europarl, v7]
This example shows why the CPO-value is of limited use as an indicator of good-
ness of fit: It only assesses the strategy with the highest probability. The model con-
tains information about the probabilities of the other strategies as well, however.
Given that the other indicators of goodness of fit are hard to interpret, linguisti-
cally speaking, I have decided to provide a CPO-value in each case anyway.
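The difference between the CPO-value and the likelihood-based view can be made concrete with the probabilities in (48). The following sketch is illustrative only: the dictionary simply restates the percentages from (48), and the ‘counterfactual’ outcomes are hypothetical, introduced to show what the CPO-value ignores.

```python
import math

# Probabilities from (48) for the German sentence in (46).
probs = {
    "psv": 0.19, "nom": 0.27, "pron": 0.10, "nfin": 0.07,
    "one": 0.06, "you": 0.05, "rphr": 0.04, "we": 0.22,
}
observed = "nom"  # strategy used in the English version (49)

# CPO only asks whether the top-ranked strategy is the observed one.
predicted = max(probs, key=probs.get)
is_correct = predicted == observed

# The likelihood uses the probability assigned to the observed strategy itself.
log_contribution = math.log(probs[observed])

# Full ranking of the strategies, from most to least probable.
ranking = sorted(probs, key=probs.get, reverse=True)

# Hypothetical outcomes: both would count as plain misses for CPO,
# although the model treats 'we' as far more plausible than 'you'.
miss_we = math.log(probs["we"])
miss_you = math.log(probs["you"])

print(f"predicted: {predicted}, correct: {is_correct}")
print(f"ranking: {ranking}")
print(f"log p(observed) = {log_contribution:.3f}")
print(f"log p('we') = {miss_we:.3f}, log p('you') = {miss_you:.3f}")
```

A translation using we would be scored as an error by the CPO-value just like one using you, whereas the log-probabilities preserve the difference between a near miss and an unlikely outcome.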
In the present article I have argued for a corpus-based contrastive research design
that is based on the comparison of individual sentences in a translation corpus.
Such studies allow us to make much more specific predictions than could be made
on the basis of a text-level comparison of comparable corpora. While it is true, of
course, that translated language differs in systematic ways from original language
(as was also confirmed by the present study), the possibility of comparing
individual pairs of sentences is an invaluable advantage of translation corpora.
Clearly, both types of corpora are needed and can be used for
different types of research questions.
On a descriptive level, the study has shown that in the corpus under investi-
gation, the distribution of impersonalization strategies is most prominently in-
fluenced by the source type, i.e., the question of whether a sentence is an English
original, or whether it was translated from either German or some other language.
Among the semantic variables, ‘type of quantification’ (existential/universal),
(non-)veridicality and modality are the most useful ones. The variable ‘(non-)gen-
eralizing’ has not been found to be a powerful predictor. Similarly, no significant
impact on the distribution of impersonalization strategies has been determined
for the variable ‘clause type’. It needs to be mentioned, however, that only four
levels were distinguished for this variable. More fine-grained studies, and studies
based on larger samples of examples, may show this variable to be a significant
contributor to the distribution of impersonalization strategies as well.
This disclaimer basically applies to the entire study, which is programmatic in
many respects. The model statistics indicated a rather modest goodness of fit, to
put it mildly. This is not surprising in view of the fact that linguistic choices are
multi-dimensional decisions, and that the present study has focused on matters of
sentence semantics and syntax. It is likely that information structure plays a role,
and that there are speaker-specific preferences as well as lexical effects. Integrating
such variables into the model would certainly be a worthwhile undertaking. Even
on the basis of a broader range of data, we have to reckon with a certain remnant
of random variation, however.
Even though the aim of predicting translational equivalents has only partially
been achieved, I hope to have shown that an item-specific, predictive and hypothe-
sis-driven approach to contrastive linguistics can lead to results that would be hard
or impossible to achieve otherwise. In particular, this type of research allows us to
bridge the gap between fine-grained theoretical studies and quantitative corpus
research. The ambitiousness of the project has been pointed out repeatedly, and
it should be obvious that there is a lot of room for improvement, on all counts.
Most importantly, topics like impersonalization, which interact with various levels
References
Alonso-Ovalle, Luis. 2002. Arbitrary Pronouns are not that Indefinite. In C. Beyssade, R. Bok-
Bennema, F. Drijkoningen, and P. Monachesi (eds), Romance Languages and Linguistic
Theory 2000, 1–14. Amsterdam: John Benjamins. DOI: 10.1075/cilt.232.02alo
van der Auwera, Johan, Volker Gast and Jeroen Vanderbiesen. 2012. Human Impersonal
Pronouns in English, Dutch and German. Leuvense Bijdragen 98. 27–64.
Cartoni, Bruno and Thomas Meyer. 2012. Extracting Directional and Comparable Corpora
from a Multilingual Corpus for Translation Studies. In Proceedings 8th International
Conference on Language Resources and Evaluation (LREC). Istanbul, Turkey.
Cartoni, Bruno, Sandrine Zufferey and Thomas Meyer. 2013. Using the Europarl Corpus for
Cross-linguistic Research. Belgian Journal of Linguistics 27. 23–42.
DOI: 10.1075/bjl.27.02car
Cohen, A. 1980. On the Graphical Display of the Significant Components in a Two-way
Contingency Table. Communications in Statistics – Theory and Methods A. 1025–1041.
DOI: 10.1080/03610928008827940
Friendly, M. 1992. Graphical Methods for Categorical Data. In SAS User Group International
Conference Proceedings. 190–200.
Gast, Volker. 2012. Contrastive Linguistics: Theories and Methods. In B. Kortmann (ed.),
Dictionary of Linguistics and Communication Science: Linguistics Theory and Methodology.
Berlin: de Gruyter Mouton.
Gast, Volker and Johan van der Auwera. 2013. Towards a Distributional Typology of Human
Impersonal Pronouns, Based on Data from European Languages. In D. Bakker and M.
Haspelmath (eds), Languages across Boundaries. Studies in Memory of Anna Siewierska,
31–56. Berlin: de Gruyter Mouton.
Giannakidou, Anastasia. 1998. Polarity Sensitivity as (Non)Veridical Dependency. Amsterdam:
John Benjamins. DOI: 10.1075/la.23
Giannakidou, Anastasia. 2011. Negative and Positive Polarity Items. In Klaus von Heusinger
and Claudia Maienborn (eds), Semantics: An International Handbook, volume 33.2 of
Handbücher der Sprach- und Kommunikationswissenschaften, 1660–1712. Berlin: de
Gruyter Mouton.
Granger, Sylviane, Jacques Lerot and Stephanie Petch-Tyson (eds). 2003. Corpus-Based
Approaches to Contrastive Linguistics and Translation Studies. Amsterdam: Rodopi.
Johansson, Stig. 1998. On the Role of Corpora in Cross-linguistic Research. In S. Johansson and
S. Oksefjell (eds), Corpora and Cross-linguistic Research: Theory, Method, and Case Studies,
3–24. Amsterdam/Atlanta: Rodopi.
Johansson, Stig. 2000. Contrastive Linguistics and Corpora. SPRIKreports, Reports from the
project ‘Languages in Contrast’. University of Oslo. Available at http://www.hf.uio.no/ilos/
forskning/prosjekter/sprik/pdf/sj/johansson2.pdf [last accessed December 2014].
Johansson, Stig. 2004. Viewing Languages through Multilingual Corpora, with Special Reference
to the Generic Person in English, German and Norwegian. Languages in Contrast 4. 261–
280. DOI: 10.1075/lic.4.2.05joh
Koehn, Philipp. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. Phuket:
MT Summit X.
Lewis, David. 1975. Adverbs of Quantification. In E. Keenan (ed.), Formal Semantics of Natural
Language, 3–15. Cambridge: Cambridge University Press.
DOI: 10.1017/CBO9780511897696.003
Moltmann, Friederike. 2006. Generic one, Arbitrary PRO, and the First Person. Natural
Language Semantics 14. 257–281. DOI: 10.1007/s11050-006-9002-7
Montague, Richard. 1969. On the Nature of Certain Philosophical Entities. The Monist 53. 159–
194. DOI: 10.5840/monist19695327
R Core Team. 2013. R: A Language and Environment for Statistical Computing. R Foundation
for Statistical Computing, Vienna, Austria. Available at http://www.R-project.org/ [last ac-
cessed December 2014].
Schmied, Josef. 2008. Contrastive Corpus Studies. In A. Lüdeling and M. Kytö (eds), Corpus
Linguistics: An International Handbook, 1140–1159. Berlin: de Gruyter Mouton.
Smith, Thomas J. and Cornelius M. McKenna. 2013. A Comparison of Logistic Regression
Pseudo R2 Indices. Multiple Linear Regression Viewpoints 39. 17–26.
Zwarts, Frans. 1995. Nonveridical Contexts. Linguistic Analysis 25. 286–312.
Author’s address
Volker Gast
English Department
Friedrich Schiller University of Jena
Ernst-Abbe-Platz 8
07743 Jena
Germany
volker.gast@uni-jena.de